In this chapter, we continue the discussion of recursive data types by way of a slightly more complex example related to the execution of programming languages. Specifically, we look into the evaluation of simple arithmetic expressions; in a subsequent chapter, we will extend this to cover variable bindings and function definitions as well. Computer languages —except for the machine code directly processed by the computer hardware— are executed in one of two possible ways: they are either compiled or interpreted. In the case of compilation, the original program (aka the source code) is translated into machine code by a compiler. Once compiled, the hardware can directly execute the program. When code is interpreted, the operations and instructions of the source code are directly executed by an interpreter by performing the respective computations, without generating machine code.
Alternatively, some language implementations choose a hybrid scheme, where the compiler, instead of machine code, generates a lower-level language, which in turn is interpreted. For example, the Haskell interpreter GHCi, uses such a hybrid scheme, where Haskell is translated into a bytecode (designed to facilitate fast interpretation), which is then interpreted by the Haskell runtime system. Similarly, Java compilers generate Java Bytecode, which is then interpreted by a Java Virtual Machine (JVM).
In More About Algebraic Data Types, we discussed how trees, implemented as recursive data types, can be used to achieve a more efficient representation for collections of items than ordinary lists. To evaluate arithmetic expressions, we will use trees to represent the expression structure. Similar, if more complex, tree representations are generally used to represent programs in interpreters and compilers.
If we enter an arithmetic expression, like
2 + 7 * 13 in a Haskell for Mac playground or in GHCi, this expression will be evaluated and the result,
93, will be displayed. The interpreter achieves this in several steps: (1) the input string is decomposed into individual symbols or tokens (by a process often called lexing), (2) the token stream is parsed and turned into a tree representation, (3) this expression tree gets evaluated by a recursive tree traversal. We will now look at these three steps in turn.
We start with decomposing the string containing the arithmetic expression into a sequence of symbols, which we call tokens. For the example of
2 + 7 * 13, we expect the following token sequence:
13. During lexing (or tokenisation), we usually discard all whitespace and similar characters that do not contribute to the meaning of the tokenised expression — they simply serve to increase readability for human consumption. This process is carried out by a lexer function:1
lexer :: String -> [Token]
The type of
Tokens needs to represent all symbols that may be produced when the lexer tokenises arithmetic expressions. In other words, we need a variant for each operator in the language —for this chapter, this is only addition and multiplication— as well as a representation for opening and closing parentheses and integer literals (represented by an
data Token = PlusTok | TimesTok | OpenTok | CloseTok | IntTok Int deriving (Show)
The only type constructor that requires an argument is
IntTok, as it is not sufficient to know that we read an integer literal, we also need to know the value that it represents. To implement the
lexer function, we traverse the input string character by character. We convert all single character tokens into their corresponding token:
lexer :: String -> [Token] lexer  =  lexer ('+' : restStr) = PlusTok : lexer restStr lexer ('*' : restStr) = TimesTok : lexer restStr lexer ('(' : restStr) = OpenP : lexer restStr lexer (')' : restStr) = CloseP : lexer restStr
Moreover, we drop all whitespace characters (including space, newline, and tab characters), identifier by the
isSpace :: Char -> Bool function from module
lexer (chr : restStr) | isSpace chr = lexer restStr
We identify the beginning of integer literals by using
isDigit :: Char -> Bool. In contrast to tokens consisting of only one character, the handling of integer literals is more complex. We need to collect all successive digits, convert them into an integer value (to be used as an argument for the
IntTok token), before recursively continuing to traverse the reminder of the input string. We do this by splitting the input string into the initial continuous sequence of digits and the remainder, which starts with a non-digit character. For example, given a string
"312 + 4 * 5", we want to split it into the string
"312" and the remainder
" + 4 * 5". We could achieve this by writing a helper function
-- Split a string into leading digits and remaining string splitDigits :: String -> (String, String)
from scratch or we use the Prelude function
break :: (a -> Bool) -> [a] -> ([a], [a]). The function
break, given a predicate, splits a list into the leading elements for which the predicate evaluates to
False, and the rest of the list. For example
break (< 5) [7,9,5,1,2,6,4] will return the pair
([7,9,5],[1,2,6,4]). To use
break for our purpose, we need the opposite of
isDigit to break the string at the first non-digit character. To this end, we might define another helper function
notIsDigit :: Char -> Bool notIsDigit chr = not (isDigit chr)
This can alternatively be written using the function composition operator
(.) :: (b -> c) -> (a -> b) -> (a -> b) as
notIsDigit :: Char -> Bool notIsDigit chr = (not . isDigit) chr
or, equivalently, in point-free notation as
notIsDigit :: Char -> Bool notIsDigit = not . isDigit -- 'notIsDigit' is the composition of 'isDigit' with 'not'
We discussed this notation in some detail in Higher-order Functions. Given the compact, point-free representation, we can simply inline
notIsDigit in the call to
break and get
break (not . isDigit). We use this in
lexer as follows:
lexer str@(chr : _) | isDigit chr = IntTok (stringToInt digitStr) : lexer restStr where (digitStr, restStr) = break (not . isDigit) str stringToInt = foldl (\acc chr -> 10 * acc + digitToInt chr) 0
We convert the initial sequence of digits,
digitStr, into an integer value using the
stringToInt function that we defined in Higher-order Functions. Finally, if the lexer encounters an unknown character, it fails with an error. Overall, we have got the following definition:
lexer :: String -> [Token] lexer  =  lexer ('+' : restStr) = PlusTok : lexer restStr lexer ('*' : restStr) = TimesTok : lexer restStr lexer ('(' : restStr) = OpenP : lexer restStr lexer (')' : restStr) = CloseP : lexer restStr lexer (chr : restStr) | isSpace chr = lexer restStr lexer str@(chr : _) | isDigit chr = IntTok (stringToInt digitStr) : lexer restStr where (digitStr, restStr) = break (not . isDigit) str -- local function to convert a string to an integer value stringToInt :: String -> Int stringToInt = foldl (\acc chr -> 10 * acc + digitToInt chr) 0 -- runtime error for all other characters: lexer (_ : restString) = error ("lexer: unexpected character: '" ++ show chr ++ "'")
When we apply the lexer to the arithmetic expression string
"2 + 7 * 13", we get the token sequence
[IntTok 2, PlusTok, IntTok 7, TimesTok, IntTok 13]. The token sequence is much harder to read for humans, but it is a more convenient representation for further processing by our expression evaluator. The following screencasts demonstrates the use of the lexer function.
The next step is to convert the sequence of tokens into a representation which exposes the fact that the expression
"2 + 7 * 13" is the sum of
2 and the product of
13, or, if we draw it as a diagram:
This looks very much like the binary trees we discussed in More About Algebraic Data Types. The difference is that, here, the leaves contain the numbers and the nodes contain the operators. The purpose of the parsing function is to convert the list of tokens into such a tree structure:
parser :: [Token] -> Expr
Recall our definition of binary trees:
data BinaryTree a = Node a (BinaryTree a) (BinaryTree a) | Leaf
We need to adapt this such that the nodes represent operators whose subtrees comprise the arguments of that operator. Moreover, the leaves of the tree contain the integer literals. Expression trees (as used in this example) are not polymorphic, so the type constructor
Expr has no type argument.
data Expr = IntLit Int -- integer constants, leaves of the expression tree | Add Expr Expr -- addition node | Mult Expr Expr -- multiplication node
With this definition, we can represent the arithmetic expression
"2 + 7 * 13" as
Add (IntLit 2) (Mult (IntLit 7) (IntLit 13)). But what happens with parentheses? They are not directly represented in the expression tree! That's fine, as we don't need them. Take the expression
"(1 + 2) * 5" as an example: in this string, we need parentheses to express that we multiply the sum, despite multiplication having a higher precedence. With the expression tree representation, the parentheses are not necessary as the order of computation is implicit in the nesting of the subtrees:
Mult (Add (IntLit 1) (IntLit 2)) (IntLit 5).
Now, we need to find a way to covert the list of
Tokens into a value of type
Expr. Depending on the language, parsing can be a complex problem. Luckily, it is by now quite well understood how to design a language in a way that parsing is not overly complicated. To simplify the parsing problem for our arithmetic expressions even further, we break with convention and define the operators to be right associative, instead of left associative; that is, we parse the expression
"1 + 2 + 3" as
"1 + (2 + 3)", and accordingly for multiplication.
Let us start writing a parser which only works for the most basic subset of the language, integer constants. Given a list of token, it checks if the first is an
IntTok. If so, it returns the corresponding
IntLit node. But what should it do if the first token is not an
IntTok? A runtime error would be one option, but that would make it harder to use as part of a more complex parser. Instead, we can use the
Maybe type discussed previously and return
Nothing. As a second result, we return the rest of the token list:
parseInt :: [Token] -> Maybe (Expr, [Token]) parseInt (IntTok n : restTokens) = Just (IntLit n, restTokens) parseInt tokens = Nothing
Now, let's write a parser which can deal with
IntLit, but also products. When this parser gets the token list, it doesn't yet know whether the list contains just one integer, or a sequence of integers interspersed with multiplication operators. But it knows that the first token has to be a literal, so it calls
parseIntLit (which we just implemented), and then, checks if the first token in the list of remaining tokens is a multiplication operator. If not, it knows, its job is done and it returns the expression tree provided by
parseInt. If there is a multiplication operator, it removes the operator from the list. The remaining list can, in turn, now either be just an
IntTok or another multiplication, so we call the parser function on that list, and take the expression it returns as the second argument to the multiplication expression we ourselves return:
parseProdOrInt :: [Token] -> Maybe (Expr, [Token]) parseProdOrInt tokens = case parseInt tokens of Just (expr1, (TimesTok : restTokens1)) -> case parseProdOrInt restTokens1 of Just (expr2, restTokens2) -> Just (Mult expr1 expr2, restTokens2) Nothing -> Nothing result -> result -- could be 'Nothing' or a valid expression
This parser can deal with multiplication chains of any length. For example,
parseProdOrInt (lexer "12 * 13 * 14 * 15") yields the value
Just (t, ) where the term
Mult (IntLit 12) (Mult (IntLit 13) (Mult (IntLit 14) (IntLit 15)))
or, as displayed below as tree diagram, where the node labels are the names of the corresponding data constructors, and the subtrees (or the children) of a node are the arguments to the constructor. Since the
IntLit constructor has only one argument, an
Int value, we display that value in the node instead of as a separate subtree.
Note that, if we would have defined multiplication to be left associative (as it should be), the code would not be quite as simple: we can only split the leading integer literal from the list by calling
parseInt because of the left associativity. Otherwise, the leftmost subexpression could be another product. But if we would call the product parser on the argument list recursively straight away, it would result in sequence of recursive calls. Luckily, there is a standard solution to this problem that removes all left recursion. We just chose to omit this step here to keep the code simpler.
Next, we throw sums into the mix and write a parser which can handle all the previous expressions, but also sums. The code is almost the same as for the previous parser, with the only difference that the first subexpression can be a product or an integer constant, so we call
parseProdOrInt and we check for the
PlusTok token instead of
parseSumOrProdOrInt :: [Token] -> Maybe (Expr, [Token]) parseSumOrProdOrInt tokens = case parseProdOrInt tokens of Just (expr1, (PlusTok : restTokens1)) -> case parseProdOrInt restTokens1 of Just (expr2, restTokens2) -> Just (Add expr1 expr2, restTokens2) Nothing -> Nothing result -> result -- could be 'Nothing' or a valid expression
Finally, we need to handle parenthesised expressions. They can appear anywhere an integer literal can appear, because, just like integer literals, the parser should not split them. That's why we need to extend the
parseInt function to also handled parenthesised expressions. We do this by writing a new version under the name
OpenP is the first token in the list, then the subsequent tokens could be any expression: sum, product, literal or another parenthesised expression, followed by a closing paren. Finally, we adapt
parseProdOrIntExpr to call the new variant while also reflecting that change in their new name.
parseIntOrParenExpr :: [Token] -> Maybe (Expr, [Token]) parseIntOrParenExpr (IntTok n : restTokens) = Just (IntLit n, restTokens) parseIntOrParenExpr (OpenP : restTokens1) = case parseSumOrProdOrIntOrParenExpr restTokens1 of Just (expr, (CloseP : restTokens2)) -> Just (expr, restTokens2) Just _ -> Nothing -- no closing paren Nothing -> Nothing parseIntOrParenExpr tokens = Nothing parseProdOrIntOrParenExpr :: [Token] -> Maybe (Expr, [Token]) parseProdOrIntOrParenExpr tokens = case parseIntOrParenExpr tokens of Just (expr1, (TimesTok : restTokens1)) -> case parseProdOrIntOrParenExpr restTokens1 of Just (expr2, restTokens2) -> Just (Mult expr1 expr2, restTokens2) Nothing -> Nothing result -> result parseSumOrProdOrIntOrParenExpr :: [Token] -> Maybe (Expr, [Token]) parseSumOrProdOrIntOrParenExpr tokens = case parseProdOrIntOrParenExpr tokens of Just (expr1, (PlusTok : restTokens1)) -> case parseSumOrProdOrIntOrParenExpr restTokens1 of Just (expr2, restTokens2) -> Just (Add expr1 expr2, restTokens2) Nothing -> Nothing result -> result parse :: [Token] -> Expr parse tokens = case parseSumOrProdOrIntOrParenExpr tokens of Just (expr, ) -> expr _ -> error "Could not parse input"
The wrapper function
parse simply calls
parseSumOrProdOrIntOrParenExpr and checks that it successfully consumed all input tokens — i.e., the result needs to be of the form
Just (expr, ) where the empty list indicates that no tokens where left over; otherwise, we have got a parse error.
The expression trees of type
Expr produced by the
parse function are called abstract syntax trees (ASTs). They are abstract as they ignore elements of the concrete grammar that are not relevant for subsequent processing. In our case, we ignore parenthesis in the tree representation, as the tree structure is already sufficient to fully characterise the intended evaluation order. A large portion of the code in a compiler is concerned with analysing, optimising, and transforming ASTs, to bring them into a form suitable for code generation.
The following screencast illustrates the various parsing functions by way of an example and graphically renders the resulting abstract syntax trees.
As it turns out, parsing is by far the most complex aspect of evaluating arithmetic expressions. Once the expression is represented as an abstract syntax tree, evaluating that tree is almost trivial. As soon as a language contains more sophisticated features, such as variable bindings or even function definitions, evaluation gets more complex, but for our simple language of arithmetic expressions, we simply walk the tree, applying the mathematical operators after recursively evaluating all subtrees representing the arguments to that operator.
eval :: Expr -> Int eval (IntLit n) = n eval (Add expr1 expr2) = eval expr1 + eval expr2 eval (Mult expr1 expr2) = eval expr1 * eval expr2
To experiment with parsing and expression evaluation yourself, please download the Haskell for Mac project containing the code developed in this chapter: Expressions.hsproj. This project includes two Haskell modules:
Expressions. The former contains some of the graphics definitions we already used in previous chapters. The latter is where you'll find the code from this chapter.
Expressions.hsproj in Haskell for Mac and proceed as in the screencasts.
Run GHCi inside the
Expressions.hsproj directory after downloading and unpacking. Then, load the
Expressions module. To visualise your drawings, you need to write them to disk as a PNG image file. To do so, use the writePng function as follows:
writePng FILENAME (drawPicture LINEWIDTH (renderTree BINARY_TREE))
If we are sure that we don't want to extend our language any further, the above representation is perfectly fine. However, if we extend the language with additional binary operators (.e.g, division, subtractions, and so on) we have to add a new tree constructor every single time. This becomes unwieldy and messy quite quickly. A better design is to have only one tree constructor for binary operators, which then carries an extra argument identifying the the actual operator (and while we're at it, we also add a constructor for the application of unary operators).
data Expr = IntLit Int -- integer constants, leaves of the expression trees | BinaryApp BinOp Expr Expr | UnaryApp UnaryOp Expr data BinOp = MultOp | AddOp | DivOp | SubOp data UnaryOp = NegOp | AbsOp
How do these changes affect the lexer, parser, and evaluator code? Rewrite the functions for this new definition of
Our parser fails to produce helpful error messages - it mainly performs a single top level check to determine whether the token list has been consumed in its entirety. Change to parser such that, instead of returning
Nothing, it returns a helpful error message (as a
String) in the error cases, which the toplevel
parse emits? Hint: Make use of the
Either type instead of using
One way to fix the associativity problem with multiplication and addition, such that they are parsed as left associative operators, consists of writing a helper function that parses an product (much like
parseProdOrIntOrParenExp), but that returns a list of all the factor expressions of a product, instead of returning merely a single expression:
parseProdOrIntOrParenExprL :: [Token] -> Maybe ([Expr], [Token]) parseProdOrIntOrParenExprL tokens = case parseIntOrParenExpr tokens of Just (expr, (TimesTok : restTokens1)) -> case parseProdOrIntOrParenExprL restTokens1 of Just (exprs, restTokens2) -> Just (expr:exprs, restTokens2) Nothing -> Nothing Just (expr, restToken) -> Just ([expr], restToken) Nothing -> Nothing
For example, applying it to the token list for
"1 * (2 + 3) * 4 * 5" would return a list with five elements, namely the expression trees for
5. Then, we can use this list to assemble the correct, left associative parse tree. For our example, that would be:
Complete the implementation of
parseProdOrIntOrParenExpr below in a way that products are parsed correctly.
parseProdOrIntOrParenExpr :: [Token] -> Maybe (Expr, [Token]) parseProdOrIntOrParenExpr tokens = case parseProdOrIntOrParenExprL tokens of Just .. -> ....... Nothing -> Nothing
Hint: The solution is very short, but not necessarily obvious. Think about how you could use the
fold functions we discussed in Higher-order Functions to construct expression trees.
The term lexer stands for lexical analyser.↩