Expression Trees

In this chapter, we continue the discussion of recursive data types by way of a slightly more complex example related to the execution of programming languages. Specifically, we look into the evaluation of simple arithmetic expressions; in a subsequent chapter, we will extend this to cover variable bindings and function definitions as well. Computer languages —except for the machine code directly processed by the computer hardware— are executed in one of two possible ways: they are either compiled or interpreted. In the case of compilation, the original program (aka the source code) is translated into machine code by a compiler. Once compiled, the hardware can directly execute the program. When code is interpreted, the operations and instructions of the source code are directly executed by an interpreter by performing the respective computations, without generating machine code.

Alternatively, some language implementations choose a hybrid scheme, where the compiler, instead of machine code, generates a lower-level language, which in turn is interpreted. For example, the Haskell interpreter GHCi, uses such a hybrid scheme, where Haskell is translated into a bytecode (designed to facilitate fast interpretation), which is then interpreted by the Haskell runtime system. Similarly, Java compilers generate Java Bytecode, which is then interpreted by a Java Virtual Machine (JVM).

Evaluating Expressions

In More About Algebraic Data Types, we discussed how trees, implemented as recursive data types, can be used to achieve a more efficient representation for collections of items than ordinary lists. To evaluate arithmetic expressions, we will use trees to represent the expression structure. Similar, if more complex, tree representations are generally used to represent programs in interpreters and compilers.

If we enter an arithmetic expression, like 2 + 7 * 13 in a Haskell for Mac playground or in GHCi, this expression will be evaluated and the result, 93, will be displayed. The interpreter achieves this in several steps: (1) the input string is decomposed into individual symbols or tokens (by a process often called lexing), (2) the token stream is parsed and turned into a tree representation, (3) this expression tree gets evaluated by a recursive tree traversal. We will now look at these three steps in turn.

Lexing

We start with decomposing the string containing the arithmetic expression into a sequence of symbols, which we call tokens. For the example of 2 + 7 * 13, we expect the following token sequence: 2, +, 7, *, 13. During lexing (or tokenisation), we usually discard all whitespace and similar characters that do not contribute to the meaning of the tokenised expression — they simply serve to increase readability for human consumption. This process is carried out by a lexer function:1

lexer :: String -> [Token]

The type of Tokens needs to represent all symbols that may be produced when the lexer tokenises arithmetic expressions. In other words, we need a variant for each operator in the language —for this chapter, this is only addition and multiplication— as well as a representation for opening and closing parentheses and integer literals (represented by an Int value):

data Token
  = PlusTok
  | TimesTok
  | OpenTok
  | CloseTok
  | IntTok Int
  deriving (Show)

The only type constructor that requires an argument is IntTok, as it is not sufficient to know that we read an integer literal, we also need to know the value that it represents. To implement the lexer function, we traverse the input string character by character. We convert all single character tokens into their corresponding token:

lexer :: String -> [Token]
lexer []              = []
lexer ('+' : restStr) = PlusTok  : lexer restStr
lexer ('*' : restStr) = TimesTok : lexer restStr 
lexer ('(' : restStr) = OpenP    : lexer restStr 
lexer (')' : restStr) = CloseP   : lexer restStr

Moreover, we drop all whitespace characters (including space, newline, and tab characters), identifier by the isSpace :: Char -> Bool function from module Data.Char:

lexer (chr : restStr) | isSpace chr = lexer restStr

We identify the beginning of integer literals by using isDigit :: Char -> Bool. In contrast to tokens consisting of only one character, the handling of integer literals is more complex. We need to collect all successive digits, convert them into an integer value (to be used as an argument for the IntTok token), before recursively continuing to traverse the reminder of the input string. We do this by splitting the input string into the initial continuous sequence of digits and the remainder, which starts with a non-digit character. For example, given a string "312 + 4 * 5", we want to split it into the string "312" and the remainder " + 4 * 5". We could achieve this by writing a helper function

-- Split a string into leading digits and remaining string
splitDigits :: String -> (String, String)

from scratch or we use the Prelude function break :: (a -> Bool) -> [a] -> ([a], [a]). The function break, given a predicate, splits a list into the leading elements for which the predicate evaluates to False, and the rest of the list. For example break (< 5) [7,9,5,1,2,6,4] will return the pair ([7,9,5],[1,2,6,4]). To use break for our purpose, we need the opposite of isDigit to break the string at the first non-digit character. To this end, we might define another helper function

notIsDigit :: Char -> Bool
notIsDigit chr = not (isDigit chr)

This can alternatively be written using the function composition operator (.) :: (b -> c) -> (a -> b) -> (a -> b) as

notIsDigit :: Char -> Bool
notIsDigit chr = (not . isDigit) chr

or, equivalently, in point-free notation as

notIsDigit :: Char -> Bool
notIsDigit = not . isDigit  -- 'notIsDigit' is the composition of 'isDigit' with 'not'

We discussed this notation in some detail in Higher-order Functions. Given the compact, point-free representation, we can simply inline notIsDigit in the call to break and get break (not . isDigit). We use this in lexer as follows:

lexer str@(chr : _) | isDigit chr
  = IntTok (stringToInt digitStr) : lexer restStr
  where
    (digitStr, restStr) = break (not . isDigit) str
    stringToInt         = foldl (\acc chr -> 10 * acc + digitToInt chr) 0

We convert the initial sequence of digits, digitStr, into an integer value using the stringToInt function that we defined in Higher-order Functions. Finally, if the lexer encounters an unknown character, it fails with an error. Overall, we have got the following definition:

lexer :: String -> [Token]
lexer []              = []
lexer ('+' : restStr) = PlusTok  : lexer restStr
lexer ('*' : restStr) = TimesTok : lexer restStr 
lexer ('(' : restStr) = OpenP    : lexer restStr 
lexer (')' : restStr) = CloseP   : lexer restStr
lexer (chr : restStr) 
  | isSpace chr       = lexer restStr
lexer str@(chr : _) 
  | isDigit chr
  = IntTok (stringToInt digitStr) : lexer restStr
  where
     (digitStr, restStr) = break (not . isDigit) str
     -- local function to convert a string to an integer value
     stringToInt :: String -> Int
     stringToInt = foldl (\acc chr -> 10 * acc + digitToInt chr) 0
  -- runtime error for all other characters:
lexer (_ : restString) 
  = error ("lexer: unexpected character: '" ++ show chr ++ "'")

When we apply the lexer to the arithmetic expression string "2 + 7 * 13", we get the token sequence [IntTok 2, PlusTok, IntTok 7, TimesTok, IntTok 13]. The token sequence is much harder to read for humans, but it is a more convenient representation for further processing by our expression evaluator. The following screencasts demonstrates the use of the lexer function.

Parsing

The next step is to convert the sequence of tokens into a representation which exposes the fact that the expression "2 + 7 * 13" is the sum of 2 and the product of 7 and 13, or, if we draw it as a diagram:

expression tree for 2 + 7 * 13

This looks very much like the binary trees we discussed in More About Algebraic Data Types. The difference is that, here, the leaves contain the numbers and the nodes contain the operators. The purpose of the parsing function is to convert the list of tokens into such a tree structure:

parser :: [Token] -> Expr

Representing expressions

Recall our definition of binary trees:

data BinaryTree a
  = Node a (BinaryTree a) (BinaryTree a)
  | Leaf

We need to adapt this such that the nodes represent operators whose subtrees comprise the arguments of that operator. Moreover, the leaves of the tree contain the integer literals. Expression trees (as used in this example) are not polymorphic, so the type constructor Expr has no type argument.

data Expr
  = IntLit Int          -- integer constants, leaves of the expression tree
  | Add    Expr Expr    -- addition node
  | Mult   Expr Expr    -- multiplication node

With this definition, we can represent the arithmetic expression "2 + 7 * 13" as Add (IntLit 2) (Mult (IntLit 7) (IntLit 13)). But what happens with parentheses? They are not directly represented in the expression tree! That's fine, as we don't need them. Take the expression "(1 + 2) * 5" as an example: in this string, we need parentheses to express that we multiply the sum, despite multiplication having a higher precedence. With the expression tree representation, the parentheses are not necessary as the order of computation is implicit in the nesting of the subtrees: Mult (Add (IntLit 1) (IntLit 2)) (IntLit 5).

Now, we need to find a way to covert the list of Tokens into a value of type Expr. Depending on the language, parsing can be a complex problem. Luckily, it is by now quite well understood how to design a language in a way that parsing is not overly complicated. To simplify the parsing problem for our arithmetic expressions even further, we break with convention and define the operators to be right associative, instead of left associative; that is, we parse the expression "1 + 2 + 3" as "1 + (2 + 3)", and accordingly for multiplication.

Parsing integer literals

Let us start writing a parser which only works for the most basic subset of the language, integer constants. Given a list of token, it checks if the first is an IntTok. If so, it returns the corresponding IntLit node. But what should it do if the first token is not an IntTok? A runtime error would be one option, but that would make it harder to use as part of a more complex parser. Instead, we can use the Maybe type discussed previously and return Nothing. As a second result, we return the rest of the token list:

parseInt :: [Token] -> Maybe (Expr, [Token])
parseInt (IntTok n : restTokens)
  = Just (IntLit n, restTokens)
parseInt tokens
  = Nothing

Parsing products

Now, let's write a parser which can deal with IntLit, but also products. When this parser gets the token list, it doesn't yet know whether the list contains just one integer, or a sequence of integers interspersed with multiplication operators. But it knows that the first token has to be a literal, so it calls parseIntLit (which we just implemented), and then, checks if the first token in the list of remaining tokens is a multiplication operator. If not, it knows, its job is done and it returns the expression tree provided by parseInt. If there is a multiplication operator, it removes the operator from the list. The remaining list can, in turn, now either be just an IntTok or another multiplication, so we call the parser function on that list, and take the expression it returns as the second argument to the multiplication expression we ourselves return:

parseProdOrInt :: [Token] -> Maybe (Expr, [Token])
parseProdOrInt tokens
  = case parseInt tokens of
      Just (expr1, (TimesTok : restTokens1)) -> 
          case parseProdOrInt restTokens1 of
            Just (expr2, restTokens2) -> Just (Mult expr1 expr2, restTokens2)
            Nothing                   -> Nothing
      result -> result     -- could be 'Nothing' or a valid expression

This parser can deal with multiplication chains of any length. For example, parseProdOrInt (lexer "12 * 13 * 14 * 15") yields the value Just (t, []) where the term t is

Mult (IntLit 12) (Mult (IntLit 13) (Mult (IntLit 14) (IntLit 15)))

or, as displayed below as tree diagram, where the node labels are the names of the corresponding data constructors, and the subtrees (or the children) of a node are the arguments to the constructor. Since the IntLit constructor has only one argument, an Int value, we display that value in the node instead of as a separate subtree.

Leaf

Note that, if we would have defined multiplication to be left associative (as it should be), the code would not be quite as simple: we can only split the leading integer literal from the list by calling parseInt because of the left associativity. Otherwise, the leftmost subexpression could be another product. But if we would call the product parser on the argument list recursively straight away, it would result in sequence of recursive calls. Luckily, there is a standard solution to this problem that removes all left recursion. We just chose to omit this step here to keep the code simpler.

Parsing sums

Next, we throw sums into the mix and write a parser which can handle all the previous expressions, but also sums. The code is almost the same as for the previous parser, with the only difference that the first subexpression can be a product or an integer constant, so we call parseProdOrInt and we check for the PlusTok token instead of TimesTok:

parseSumOrProdOrInt :: [Token] -> Maybe (Expr, [Token])
parseSumOrProdOrInt tokens
  = case parseProdOrInt tokens of
      Just (expr1, (PlusTok : restTokens1)) -> 
          case parseProdOrInt restTokens1 of
            Just (expr2, restTokens2) -> Just (Add expr1 expr2, restTokens2)
            Nothing                   -> Nothing
      result -> result    -- could be 'Nothing' or a valid expression

Finally, we need to handle parenthesised expressions. They can appear anywhere an integer literal can appear, because, just like integer literals, the parser should not split them. That's why we need to extend the parseInt function to also handled parenthesised expressions. We do this by writing a new version under the name parseIntOrParenExpr. If OpenP is the first token in the list, then the subsequent tokens could be any expression: sum, product, literal or another parenthesised expression, followed by a closing paren. Finally, we adapt parseSumOrProdOrInt and parseProdOrIntExpr to call the new variant while also reflecting that change in their new name.

parseIntOrParenExpr :: [Token] -> Maybe (Expr, [Token])
parseIntOrParenExpr (IntTok n : restTokens)
  = Just (IntLit n,   restTokens)
parseIntOrParenExpr (OpenP : restTokens1)
  = case parseSumOrProdOrIntOrParenExpr restTokens1 of
       Just (expr, (CloseP : restTokens2)) -> Just (expr, restTokens2)
       Just _  -> Nothing -- no closing paren
       Nothing -> Nothing
parseIntOrParenExpr tokens
  = Nothing
      
parseProdOrIntOrParenExpr :: [Token] -> Maybe (Expr, [Token])
parseProdOrIntOrParenExpr tokens
  = case parseIntOrParenExpr tokens of
      Just (expr1, (TimesTok : restTokens1)) -> 
          case parseProdOrIntOrParenExpr restTokens1 of
            Just (expr2, restTokens2) -> Just (Mult expr1 expr2, restTokens2)
            Nothing                   -> Nothing
      result -> result   
              
parseSumOrProdOrIntOrParenExpr :: [Token] -> Maybe (Expr, [Token])
parseSumOrProdOrIntOrParenExpr tokens
  = case parseProdOrIntOrParenExpr tokens of
      Just (expr1, (PlusTok : restTokens1)) -> 
          case parseSumOrProdOrIntOrParenExpr restTokens1 of
            Just (expr2, restTokens2) -> Just (Add expr1 expr2, restTokens2)
            Nothing                   -> Nothing
      result -> result

parse :: [Token] -> Expr
parse tokens =
  case parseSumOrProdOrIntOrParenExpr tokens of
    Just (expr, []) -> expr    
    _               -> error "Could not parse input"

The wrapper function parse simply calls parseSumOrProdOrIntOrParenExpr and checks that it successfully consumed all input tokens — i.e., the result needs to be of the form Just (expr, []) where the empty list indicates that no tokens where left over; otherwise, we have got a parse error.

The expression trees of type Expr produced by the parse function are called abstract syntax trees (ASTs). They are abstract as they ignore elements of the concrete grammar that are not relevant for subsequent processing. In our case, we ignore parenthesis in the tree representation, as the tree structure is already sufficient to fully characterise the intended evaluation order. A large portion of the code in a compiler is concerned with analysing, optimising, and transforming ASTs, to bring them into a form suitable for code generation.

The following screencast illustrates the various parsing functions by way of an example and graphically renders the resulting abstract syntax trees.

Evaluator

As it turns out, parsing is by far the most complex aspect of evaluating arithmetic expressions. Once the expression is represented as an abstract syntax tree, evaluating that tree is almost trivial. As soon as a language contains more sophisticated features, such as variable bindings or even function definitions, evaluation gets more complex, but for our simple language of arithmetic expressions, we simply walk the tree, applying the mathematical operators after recursively evaluating all subtrees representing the arguments to that operator.

eval :: Expr -> Int
eval (IntLit n) = n
eval (Add expr1 expr2)
  = eval expr1 + eval expr2
eval (Mult expr1 expr2)
  = eval expr1 * eval expr2  

Download

To experiment with parsing and expression evaluation yourself, please download the Haskell for Mac project containing the code developed in this chapter: Expressions.hsproj. This project includes two Haskell modules: ShapeGraphics and Expressions. The former contains some of the graphics definitions we already used in previous chapters. The latter is where you'll find the code from this chapter.

Haskell for Mac

Haskell for Mac

Just open Expressions.hsproj in Haskell for Mac and proceed as in the screencasts.

Other Haskell

Command Line Haskell

Run GHCi inside the Expressions.hsproj directory after downloading and unpacking. Then, load the Expressions module. To visualise your drawings, you need to write them to disk as a PNG image file. To do so, use the writePng function as follows:

writePng FILENAME (drawPicture LINEWIDTH (renderTree BINARY_TREE))

Exercises

  1. If we are sure that we don't want to extend our language any further, the above representation is perfectly fine. However, if we extend the language with additional binary operators (.e.g, division, subtractions, and so on) we have to add a new tree constructor every single time. This becomes unwieldy and messy quite quickly. A better design is to have only one tree constructor for binary operators, which then carries an extra argument identifying the the actual operator (and while we're at it, we also add a constructor for the application of unary operators).

    data Expr
      = IntLit    Int               -- integer constants, leaves of the expression trees
      | BinaryApp BinOp   Expr Expr
      | UnaryApp  UnaryOp Expr
    
    data BinOp
      = MultOp
      | AddOp
      | DivOp
      | SubOp
    
    data UnaryOp
      = NegOp
      | AbsOp

    How do these changes affect the lexer, parser, and evaluator code? Rewrite the functions for this new definition of Expr.

  2. Our parser fails to produce helpful error messages - it mainly performs a single top level check to determine whether the token list has been consumed in its entirety. Change to parser such that, instead of returning Nothing, it returns a helpful error message (as a String) in the error cases, which the toplevel parse emits? Hint: Make use of the Either type instead of using Maybe.

  3. One way to fix the associativity problem with multiplication and addition, such that they are parsed as left associative operators, consists of writing a helper function that parses an product (much like parseProdOrIntOrParenExp), but that returns a list of all the factor expressions of a product, instead of returning merely a single expression:

    parseProdOrIntOrParenExprL :: [Token] -> Maybe ([Expr], [Token])
    parseProdOrIntOrParenExprL tokens
      = case parseIntOrParenExpr tokens of
          Just (expr, (TimesTok : restTokens1)) -> 
              case parseProdOrIntOrParenExprL restTokens1 of
                Just (exprs, restTokens2) -> Just (expr:exprs, restTokens2)
                Nothing                   -> Nothing
          Just (expr, restToken)   -> Just ([expr], restToken)
          Nothing                  -> Nothing

    For example, applying it to the token list for "1 * (2 + 3) * 4 * 5" would return a list with five elements, namely the expression trees for 1, 2+3, 4 and 5. Then, we can use this list to assemble the correct, left associative parse tree. For our example, that would be:

    Leaf

    Complete the implementation of parseProdOrIntOrParenExpr below in a way that products are parsed correctly.

    parseProdOrIntOrParenExpr :: [Token] -> Maybe (Expr, [Token])
    parseProdOrIntOrParenExpr tokens = 
      case parseProdOrIntOrParenExprL tokens of
        Just .. -> .......
        Nothing -> Nothing

    Hint: The solution is very short, but not necessarily obvious. Think about how you could use the fold functions we discussed in Higher-order Functions to construct expression trees.


  1. The term lexer stands for lexical analyser.