In this chapter, we continue the discussion of recursive data types by way of a slightly more complex example related to the execution of programming languages. Specifically, we look into the evaluation of simple arithmetic expressions; in a subsequent chapter, we will extend this to cover variable bindings and function definitions as well. Computer languages —except for the machine code directly processed by the computer hardware— are executed in one of two possible ways: they are either *compiled* or *interpreted*. In the case of compilation, the original program (aka the *source code*) is translated into machine code by a *compiler*. Once compiled, the hardware can directly execute the program. When code is interpreted, the operations and instructions of the source code are directly executed by an *interpreter* by performing the respective computations, without generating machine code.

Alternatively, some language implementations choose a hybrid scheme, where the compiler, instead of machine code, generates a lower-level language, which in turn is interpreted. For example, the Haskell interpreter GHCi, uses such a hybrid scheme, where Haskell is translated into a bytecode (designed to facilitate fast interpretation), which is then interpreted by the Haskell runtime system. Similarly, Java compilers generate Java Bytecode, which is then interpreted by a Java Virtual Machine (JVM).

In More About Algebraic Data Types, we discussed how trees, implemented as recursive data types, can be used to achieve a more efficient representation for collections of items than ordinary lists. To evaluate arithmetic expressions, we will use trees to represent the expression structure. Similar, if more complex, tree representations are generally used to represent programs in interpreters and compilers.

If we enter an arithmetic expression, like `2 + 7 * 13`

in a Haskell for Mac playground or in GHCi, this expression will be evaluated and the result, `93`

, will be displayed. The interpreter achieves this in several steps: (1) the input string is decomposed into individual symbols or *tokens* (by a process often called *lexing*), (2) the token stream is parsed and turned into a tree representation, (3) this expression tree gets evaluated by a recursive tree traversal. We will now look at these three steps in turn.

We start with decomposing the string containing the arithmetic expression into a sequence of symbols, which we call tokens. For the example of `2 + 7 * 13`

, we expect the following token sequence: `2`

, `+`

, `7`

, `*`

, `13`

. During lexing (or tokenisation), we usually discard all whitespace and similar characters that do not contribute to the meaning of the tokenised expression — they simply serve to increase readability for human consumption. This process is carried out by a lexer function:^{1}

`lexer :: String -> [Token]`

The type of `Token`

s needs to represent all symbols that may be produced when the lexer tokenises arithmetic expressions. In other words, we need a variant for each operator in the language —for this chapter, this is only addition and multiplication— as well as a representation for opening and closing parentheses and integer literals (represented by an `Int`

value):

```
data Token
= PlusTok
| TimesTok
| OpenTok
| CloseTok
| IntTok Int
deriving (Show)
```

The only type constructor that requires an argument is `IntTok`

, as it is not sufficient to know that we read an integer literal, we also need to know the value that it represents. To implement the `lexer`

function, we traverse the input string character by character. We convert all single character tokens into their corresponding token:

```
lexer :: String -> [Token]
lexer [] = []
lexer ('+' : restStr) = PlusTok : lexer restStr
lexer ('*' : restStr) = TimesTok : lexer restStr
lexer ('(' : restStr) = OpenP : lexer restStr
lexer (')' : restStr) = CloseP : lexer restStr
```

Moreover, we drop all whitespace characters (including space, newline, and tab characters), identifier by the `isSpace :: Char -> Bool`

function from module `Data.Char`

:

`lexer (chr : restStr) | isSpace chr = lexer restStr`

We identify the beginning of integer literals by using `isDigit :: Char -> Bool`

. In contrast to tokens consisting of only one character, the handling of integer literals is more complex. We need to collect all successive digits, convert them into an integer value (to be used as an argument for the `IntTok`

token), before recursively continuing to traverse the reminder of the input string. We do this by splitting the input string into the initial continuous sequence of digits and the remainder, which starts with a non-digit character. For example, given a string `"312 + 4 * 5"`

, we want to split it into the string `"312"`

and the remainder `" + 4 * 5"`

. We could achieve this by writing a helper function

```
-- Split a string into leading digits and remaining string
splitDigits :: String -> (String, String)
```

from scratch or we use the Prelude function `break :: (a -> Bool) -> [a] -> ([a], [a])`

. The function `break`

, given a predicate, splits a list into the leading elements for which the predicate evaluates to `False`

, and the rest of the list. For example `break (< 5) [7,9,5,1,2,6,4]`

will return the pair `([7,9,5],[1,2,6,4])`

. To use `break`

for our purpose, we need the opposite of `isDigit`

to break the string at the first non-digit character. To this end, we might define another helper function

```
notIsDigit :: Char -> Bool
notIsDigit chr = not (isDigit chr)
```

This can alternatively be written using the function composition operator `(.) :: (b -> c) -> (a -> b) -> (a -> b)`

as

```
notIsDigit :: Char -> Bool
notIsDigit chr = (not . isDigit) chr
```

or, equivalently, in point-free notation as

```
notIsDigit :: Char -> Bool
notIsDigit = not . isDigit -- 'notIsDigit' is the composition of 'isDigit' with 'not'
```

We discussed this notation in some detail in Higher-order Functions. Given the compact, point-free representation, we can simply inline `notIsDigit`

in the call to `break`

and get `break (not . isDigit)`

. We use this in `lexer`

as follows:

```
lexer str@(chr : _) | isDigit chr
= IntTok (stringToInt digitStr) : lexer restStr
where
(digitStr, restStr) = break (not . isDigit) str
stringToInt = foldl (\acc chr -> 10 * acc + digitToInt chr) 0
```

We convert the initial sequence of digits, `digitStr`

, into an integer value using the `stringToInt`

function that we defined in Higher-order Functions. Finally, if the lexer encounters an unknown character, it fails with an error. Overall, we have got the following definition:

```
lexer :: String -> [Token]
lexer [] = []
lexer ('+' : restStr) = PlusTok : lexer restStr
lexer ('*' : restStr) = TimesTok : lexer restStr
lexer ('(' : restStr) = OpenP : lexer restStr
lexer (')' : restStr) = CloseP : lexer restStr
lexer (chr : restStr)
| isSpace chr = lexer restStr
lexer str@(chr : _)
| isDigit chr
= IntTok (stringToInt digitStr) : lexer restStr
where
(digitStr, restStr) = break (not . isDigit) str
-- local function to convert a string to an integer value
stringToInt :: String -> Int
stringToInt = foldl (\acc chr -> 10 * acc + digitToInt chr) 0
-- runtime error for all other characters:
lexer (_ : restString)
= error ("lexer: unexpected character: '" ++ show chr ++ "'")
```

When we apply the lexer to the arithmetic expression string `"2 + 7 * 13"`

, we get the token sequence `[IntTok 2, PlusTok, IntTok 7, TimesTok, IntTok 13]`

. The token sequence is much harder to read for humans, but it is a more convenient representation for further processing by our expression evaluator. The following screencasts demonstrates the use of the lexer function.

The next step is to convert the sequence of tokens into a representation which exposes the fact that the expression `"2 + 7 * 13"`

is the sum of `2`

and the product of `7`

and `13`

, or, if we draw it as a diagram:

This looks very much like the binary trees we discussed in More About Algebraic Data Types. The difference is that, here, the leaves contain the numbers and the nodes contain the operators. The purpose of the *parsing* function is to convert the list of tokens into such a tree structure:

`parser :: [Token] -> Expr`

Recall our definition of binary trees:

```
data BinaryTree a
= Node a (BinaryTree a) (BinaryTree a)
| Leaf
```

We need to adapt this such that the nodes represent operators whose subtrees comprise the arguments of that operator. Moreover, the leaves of the tree contain the integer literals. Expression trees (as used in this example) are not polymorphic, so the type constructor `Expr`

has no type argument.

```
data Expr
= IntLit Int -- integer constants, leaves of the expression tree
| Add Expr Expr -- addition node
| Mult Expr Expr -- multiplication node
```

With this definition, we can represent the arithmetic expression `"2 + 7 * 13"`

as `Add (IntLit 2) (Mult (IntLit 7) (IntLit 13))`

. But what happens with parentheses? They are not directly represented in the expression tree! That's fine, as we don't need them. Take the expression `"(1 + 2) * 5"`

as an example: in this string, we need parentheses to express that we multiply the sum, despite multiplication having a higher precedence. With the expression tree representation, the parentheses are not necessary as the order of computation is implicit in the nesting of the subtrees: `Mult (Add (IntLit 1) (IntLit 2)) (IntLit 5)`

.

Now, we need to find a way to covert the list of `Token`

s into a value of type `Expr`

. Depending on the language, parsing can be a complex problem. Luckily, it is by now quite well understood how to design a language in a way that parsing is not overly complicated. To simplify the parsing problem for our arithmetic expressions even further, we break with convention and define the operators to be right associative, instead of left associative; that is, we parse the expression `"1 + 2 + 3"`

as `"1 + (2 + 3)"`

, and accordingly for multiplication.

Let us start writing a parser which only works for the most basic subset of the language, integer constants. Given a list of token, it checks if the first is an `IntTok`

. If so, it returns the corresponding `IntLit`

node. But what should it do if the first token is not an `IntTok`

? A runtime error would be one option, but that would make it harder to use as part of a more complex parser. Instead, we can use the `Maybe`

type discussed previously and return `Nothing`

. As a second result, we return the rest of the token list:

```
parseInt :: [Token] -> Maybe (Expr, [Token])
parseInt (IntTok n : restTokens)
= Just (IntLit n, restTokens)
parseInt tokens
= Nothing
```

Now, let's write a parser which can deal with `IntLit`

, but also products. When this parser gets the token list, it doesn't yet know whether the list contains just one integer, or a sequence of integers interspersed with multiplication operators. But it knows that the first token has to be a literal, so it calls `parseIntLit`

(which we just implemented), and then, checks if the first token in the list of remaining tokens is a multiplication operator. If not, it knows, its job is done and it returns the expression tree provided by `parseInt`

. If there is a multiplication operator, it removes the operator from the list. The remaining list can, in turn, now either be just an `IntTok`

or another multiplication, so we call the parser function on that list, and take the expression it returns as the second argument to the multiplication expression we ourselves return:

```
parseProdOrInt :: [Token] -> Maybe (Expr, [Token])
parseProdOrInt tokens
= case parseInt tokens of
Just (expr1, (TimesTok : restTokens1)) ->
case parseProdOrInt restTokens1 of
Just (expr2, restTokens2) -> Just (Mult expr1 expr2, restTokens2)
Nothing -> Nothing
result -> result -- could be 'Nothing' or a valid expression
```

This parser can deal with multiplication chains of any length. For example, `parseProdOrInt (lexer "12 * 13 * 14 * 15")`

yields the value `Just (t, [])`

where the term `t`

is

`Mult (IntLit 12) (Mult (IntLit 13) (Mult (IntLit 14) (IntLit 15)))`

or, as displayed below as tree diagram, where the node labels are the names of the corresponding data constructors, and the subtrees (or the children) of a node are the arguments to the constructor. Since the `IntLit`

constructor has only one argument, an `Int`

value, we display that value in the node instead of as a separate subtree.

Note that, if we would have defined multiplication to be left associative (as it should be), the code would not be quite as simple: we can only split the leading integer literal from the list by calling `parseInt`

because of the left associativity. Otherwise, the leftmost subexpression could be another product. But if we would call the product parser on the argument list recursively straight away, it would result in sequence of recursive calls. Luckily, there is a standard solution to this problem that removes all left recursion. We just chose to omit this step here to keep the code simpler.

Next, we throw sums into the mix and write a parser which can handle all the previous expressions, but also sums. The code is almost the same as for the previous parser, with the only difference that the first subexpression can be a product or an integer constant, so we call `parseProdOrInt`

and we check for the `PlusTok`

token instead of `TimesTok`

:

```
parseSumOrProdOrInt :: [Token] -> Maybe (Expr, [Token])
parseSumOrProdOrInt tokens
= case parseProdOrInt tokens of
Just (expr1, (PlusTok : restTokens1)) ->
case parseProdOrInt restTokens1 of
Just (expr2, restTokens2) -> Just (Add expr1 expr2, restTokens2)
Nothing -> Nothing
result -> result -- could be 'Nothing' or a valid expression
```

Finally, we need to handle parenthesised expressions. They can appear anywhere an integer literal can appear, because, just like integer literals, the parser should not split them. That's why we need to extend the `parseInt`

function to also handled parenthesised expressions. We do this by writing a new version under the name `parseIntOrParenExpr`

. If `OpenP`

is the first token in the list, then the subsequent tokens could be any expression: sum, product, literal or another parenthesised expression, followed by a closing paren. Finally, we adapt `parseSumOrProdOrInt`

and `parseProdOrIntExpr`

to call the new variant while also reflecting that change in their new name.

```
parseIntOrParenExpr :: [Token] -> Maybe (Expr, [Token])
parseIntOrParenExpr (IntTok n : restTokens)
= Just (IntLit n, restTokens)
parseIntOrParenExpr (OpenP : restTokens1)
= case parseSumOrProdOrIntOrParenExpr restTokens1 of
Just (expr, (CloseP : restTokens2)) -> Just (expr, restTokens2)
Just _ -> Nothing -- no closing paren
Nothing -> Nothing
parseIntOrParenExpr tokens
= Nothing
parseProdOrIntOrParenExpr :: [Token] -> Maybe (Expr, [Token])
parseProdOrIntOrParenExpr tokens
= case parseIntOrParenExpr tokens of
Just (expr1, (TimesTok : restTokens1)) ->
case parseProdOrIntOrParenExpr restTokens1 of
Just (expr2, restTokens2) -> Just (Mult expr1 expr2, restTokens2)
Nothing -> Nothing
result -> result
parseSumOrProdOrIntOrParenExpr :: [Token] -> Maybe (Expr, [Token])
parseSumOrProdOrIntOrParenExpr tokens
= case parseProdOrIntOrParenExpr tokens of
Just (expr1, (PlusTok : restTokens1)) ->
case parseSumOrProdOrIntOrParenExpr restTokens1 of
Just (expr2, restTokens2) -> Just (Add expr1 expr2, restTokens2)
Nothing -> Nothing
result -> result
parse :: [Token] -> Expr
parse tokens =
case parseSumOrProdOrIntOrParenExpr tokens of
Just (expr, []) -> expr
_ -> error "Could not parse input"
```

The wrapper function `parse`

simply calls `parseSumOrProdOrIntOrParenExpr`

and checks that it successfully consumed all input tokens — i.e., the result needs to be of the form `Just (expr, [])`

where the empty list indicates that no tokens where left over; otherwise, we have got a parse error.

The expression trees of type `Expr`

produced by the `parse`

function are called *abstract syntax trees (ASTs)*. They are abstract as they ignore elements of the concrete grammar that are not relevant for subsequent processing. In our case, we ignore parenthesis in the tree representation, as the tree structure is already sufficient to fully characterise the intended evaluation order. A large portion of the code in a compiler is concerned with analysing, optimising, and transforming ASTs, to bring them into a form suitable for code generation.

The following screencast illustrates the various parsing functions by way of an example and graphically renders the resulting abstract syntax trees.

As it turns out, parsing is by far the most complex aspect of evaluating arithmetic expressions. Once the expression is represented as an abstract syntax tree, evaluating that tree is almost trivial. As soon as a language contains more sophisticated features, such as variable bindings or even function definitions, evaluation gets more complex, but for our simple language of arithmetic expressions, we simply walk the tree, applying the mathematical operators after recursively evaluating all subtrees representing the arguments to that operator.

```
eval :: Expr -> Int
eval (IntLit n) = n
eval (Add expr1 expr2)
= eval expr1 + eval expr2
eval (Mult expr1 expr2)
= eval expr1 * eval expr2
```

To experiment with parsing and expression evaluation yourself, please download the Haskell for Mac project containing the code developed in this chapter: Expressions.hsproj. This project includes two Haskell modules: `ShapeGraphics`

and `Expressions`

. The former contains some of the graphics definitions we already used in previous chapters. The latter is where you'll find the code from this chapter.

Just open `Expressions.hsproj`

in Haskell for Mac and proceed as in the screencasts.

Run GHCi inside the `Expressions.hsproj`

directory after downloading and unpacking. Then, load the `Expressions`

module. To visualise your drawings, you need to write them to disk as a PNG image file. To do so, use the writePng function as follows:

`writePng FILENAME (drawPicture LINEWIDTH (renderTree BINARY_TREE))`

If we are sure that we don't want to extend our language any further, the above representation is perfectly fine. However, if we extend the language with additional binary operators (.e.g, division, subtractions, and so on) we have to add a new tree constructor every single time. This becomes unwieldy and messy quite quickly. A better design is to have only one tree constructor for binary operators, which then carries an extra argument identifying the the actual operator (and while we're at it, we also add a constructor for the application of unary operators).

`data Expr = IntLit Int -- integer constants, leaves of the expression trees | BinaryApp BinOp Expr Expr | UnaryApp UnaryOp Expr data BinOp = MultOp | AddOp | DivOp | SubOp data UnaryOp = NegOp | AbsOp`

How do these changes affect the lexer, parser, and evaluator code? Rewrite the functions for this new definition of

`Expr`

.Our parser fails to produce helpful error messages - it mainly performs a single top level check to determine whether the token list has been consumed in its entirety. Change to parser such that, instead of returning

`Nothing`

, it returns a helpful error message (as a`String`

) in the error cases, which the toplevel`parse`

emits? Hint: Make use of the`Either`

type instead of using`Maybe`

.One way to fix the associativity problem with multiplication and addition, such that they are parsed as left associative operators, consists of writing a helper function that parses an product (much like

`parseProdOrIntOrParenExp`

), but that returns a list of all the factor expressions of a product, instead of returning merely a single expression:`parseProdOrIntOrParenExprL :: [Token] -> Maybe ([Expr], [Token]) parseProdOrIntOrParenExprL tokens = case parseIntOrParenExpr tokens of Just (expr, (TimesTok : restTokens1)) -> case parseProdOrIntOrParenExprL restTokens1 of Just (exprs, restTokens2) -> Just (expr:exprs, restTokens2) Nothing -> Nothing Just (expr, restToken) -> Just ([expr], restToken) Nothing -> Nothing`

For example, applying it to the token list for

`"1 * (2 + 3) * 4 * 5"`

would return a list with five elements, namely the expression trees for`1`

,`2+3`

,`4`

and`5`

. Then, we can use this list to assemble the correct, left associative parse tree. For our example, that would be:Complete the implementation of

`parseProdOrIntOrParenExpr`

below in a way that products are parsed correctly.`parseProdOrIntOrParenExpr :: [Token] -> Maybe (Expr, [Token]) parseProdOrIntOrParenExpr tokens = case parseProdOrIntOrParenExprL tokens of Just .. -> ....... Nothing -> Nothing`

*Hint*: The solution is very short, but not necessarily obvious. Think about how you could use the`fold`

functions we discussed in Higher-order Functions to construct expression trees.

The term lexer stands for

*lexical analyser*.↩