50.054 - Top-down recursive parsing using Parser Combinators
Learning Outcome
By the end of this class, you should be able to
- Implement a top-down recursive parser with backtracking
- Implement a top-down recursive parser with on-demand backtracking
(Recap) Top-down parsing
In this cohort problem we are going to focus on Top-down parsing.
Abstract Syntax Tree
To implement top-down parsing, we first consider how to represent a parse tree in Scala.
It's natural to implement the parse trees in terms of some algebraic datatype.
The Grammar 4 can be encoded with the following Scala enum type.
| enum Exp {
case TermExp(t:Term)
case PlusExp(t:Term, e:Exp)
}
enum Term {
case FactorTerm(f:Factor)
case MultTerm(t:Term, f:Factor)
}
case class Factor(v:Int)
|
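As a concrete illustration, here is a standalone sketch (restating the enum definitions above so it runs on its own) of the value representing the expression 1 + 2 * 3:

```scala
// Self-contained sketch: the AST types from above, plus the value
// representing the expression 1 + 2 * 3 (with * binding tighter than +).
enum Exp {
  case TermExp(t: Term)
  case PlusExp(t: Term, e: Exp)
}
enum Term {
  case FactorTerm(f: Factor)
  case MultTerm(t: Term, f: Factor)
}
case class Factor(v: Int)

import Exp.*
import Term.*

// 1 + 2 * 3
val onePlusTwoTimesThree: Exp =
  PlusExp(FactorTerm(Factor(1)), TermExp(MultTerm(FactorTerm(Factor(2)), Factor(3))))
```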
Left Recursion Elimination
Recall Grammar 4 defined above contains some left recursion.
To eliminate the left recursion, we apply the same trick by rewriting left recursive grammar rules
\[
\begin{array}{rcl}
N & ::= & N\alpha_1 \\
& ... & \\
N & ::= & N\alpha_n \\
N & ::= & \beta_1 \\
& ... & \\
N & ::= & \beta_m
\end{array}
\]
into
\[
\begin{array}{rcl}
N & ::= & \beta_1 N' \\
& ... & \\
N & ::= & \beta_m N' \\
N' & ::= & \alpha_1 N' \\
& ... & \\
N' & ::= & \alpha_n N' \\
N' & ::= & \epsilon
\end{array}
\]
Grammar 4 can be rewritten into
| E ::= T + E
E ::= T
T ::= FT'
T' ::= *FT'
T' ::= epsilon
F ::= i
|
None of the production rules above contain common leading terminal symbols, hence there is no need to apply left-factorization.
Note that in Grammar 4 with left recursion eliminated, the only rules affected are those with the non-terminal T. Hence we only need to add the following enum types.
Additional Abstract Syntax Tree
We can model it using an algebraic data type.
| case class TermLE(f:Factor, tp:TermLEP)
enum TermLEP {
case MultTermLEP(f:Factor, tp:TermLEP)
case Eps
}
|
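For instance, here is a standalone sketch (restating the types above) of how the term 2 * 3 is encoded in the left-recursion-free representation:

```scala
// Sketch: encoding the term 2 * 3 as a TermLE.
case class Factor(v: Int)
case class TermLE(f: Factor, tp: TermLEP)
enum TermLEP {
  case MultTermLEP(f: Factor, tp: TermLEP)
  case Eps
}
import TermLEP.*

// 2 * 3 follows T ::= F T' with T' ::= * F T' and a final T' ::= epsilon
val twoTimesThree: TermLE = TermLE(Factor(2), MultTermLEP(Factor(3), Eps))
```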
The main idea is that when parsing a Term, instead of parsing it directly, we parse a TermLE and then convert it back to a Term. For example, given the input
| List(IntTok(1), PlusTok, IntTok(2), AsterixTok, IntTok(3))
|
a parser method parse should generate
| PlusExp(FactorTerm(Factor(1)),TermExp(MultTerm(FactorTerm(Factor(2)),Factor(3))))
|
where
- the sub term IntTok(1) was first parsed as TermLE(Factor(1), Eps) and then converted to FactorTerm(Factor(1)), and
- the sub term IntTok(2), AsterixTok, IntTok(3) was first parsed as TermLE(Factor(2), MultTermLEP(Factor(3), Eps)) and then converted to MultTerm(FactorTerm(Factor(2)), Factor(3)).
Parser Combinator with Backtracking
We consider implementing the naive top-down recursive parser in Scala.
Let's start with the simplest case: a parsing function that takes a list of lexical tokens, "consumes" one token, and returns the rest.
| enum Result[A] {
case Failed(msg:String)
case Ok(v:A)
}
def item(toks:List[LToken]):Result[(LToken, List[LToken])] = toks match {
case Nil => Failed("item() is called with an empty input")
case (t::ts) => Ok((t, ts))
}
|
We define a variant of the Option datatype, Result, which is either a failure with an error message or an "Ok" result. The item function is what we would like to implement: it returns the extracted leading token together with the rest of the input if the input is non-empty, and signals a failure otherwise.
Applying the same idea, we can define a conditional parsing function.
| def sat(toks:List[LToken])(p:LToken => Boolean, err:String = ""):Result[(LToken, List[LToken])] = toks match {
case Nil => Failed(s"sat() is called with an empty input. ${err}")
case (t::ts) if p(t) => Ok((t, ts))
case (t::ts) => Failed(s"sat() is called with an input that does not satisfy the given predicate. ${err}")
}
|
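Here is a runnable sketch of these two functions in action. The LToken enum below is a minimal stand-in for the lexer's token type, with only the constructors needed for the example:

```scala
enum Result[A] {
  case Failed(msg: String)
  case Ok(v: A)
}
import Result.*

// Minimal stand-in token type, for illustration only.
enum LToken {
  case IntTok(v: Int)
  case PlusTok
}
import LToken.*

def item(toks: List[LToken]): Result[(LToken, List[LToken])] = toks match {
  case Nil => Failed("item() is called with an empty input")
  case t :: ts => Ok((t, ts))
}

def sat(toks: List[LToken])(p: LToken => Boolean): Result[(LToken, List[LToken])] = toks match {
  case Nil => Failed("sat() is called with an empty input")
  case t :: ts if p(t) => Ok((t, ts))
  case _ => Failed("sat() is called with an input that does not satisfy the predicate")
}

val toks = List(IntTok(1), PlusTok)
val r1 = item(toks)                                             // consumes IntTok(1)
val r2 = sat(toks) { case IntTok(_) => true; case _ => false }  // head is an IntTok: succeeds
val r3 = sat(toks) { case PlusTok => true; case _ => false }    // head is not PlusTok: fails
```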
We may want to combine these basic parsing functions to form a larger parsing task, e.g.
| def aBitMoreComplexParser(toks:List[LToken]):Result[(LToken,List[LToken])] =
item(toks) match {
case Failed(msg) => Failed(msg)
case Ok((_, toks2)) => sat(toks2)(t => t match {
case AsterixTok => true
case _ => false
}, "Expecting an asterisk.")
}
|
In the above, we define a parsing task which skips the first token and searches for the following asterisk. We can imagine that to build a practical parser, we need many basic parsing functions like sat and item, and ways to combine them.
What we can observe from the above is that there is some similarity between sat and item: they both take a list of tokens and return the remaining tokens. If we view the lists of tokens as states, we can make use of the State Monad. We also need this top-down parser to be able to backtrack in case of parsing failures. Recall the MonadError type class:
| trait Monad[F[_]] extends Applicative[F] {
def bind[A, B](fa: F[A])(f: A => F[B]): F[B]
def pure[A](v: A): F[A]
def ap[A, B](ff: F[A => B])(fa: F[A]): F[B] =
bind(ff)((f: A => B) => bind(fa)((a: A) => pure(f(a))))
}
trait MonadError[F[_], E] extends Monad[F] with ApplicativeError[F, E] {
override def pure[A](v: A): F[A]
override def raiseError[A](e: E): F[A]
override def handleErrorWith[A](fa: F[A])(f: E => F[A]): F[A]
}
|
We include one extra method, handleErrorWith, which takes a computation fa and executes it; in case an error is raised in fa, it applies f to the error to recover from it.
We define the Parser case class as follows, similar to the State case class in the State Monad, except that we fix the state to be List[T], where T is a parametric type for tokens, for instance LToken.
| case class Parser[T, A](p: List[T] => Result[(A, List[T])]) {
def map[B](f: A => B): Parser[T, B] = this match {
case Parser(ea) =>
Parser(env =>
ea(env) match {
case Failed(err) => Failed(err)
case Ok((a, env1)) => Ok((f(a), env1))
}
)
}
def flatMap[B](f: A => Parser[T, B]): Parser[T, B] = this match {
case Parser(ea) =>
Parser(env =>
ea(env) match {
case Failed(err) => Failed(err)
case Ok((a, env1)) =>
f(a) match {
case Parser(eb) => eb(env1)
}
}
)
}
}
def run[T, A](
parser: Parser[T, A]
)(env: List[T]): Result[(A, List[T])] = parser match {
case Parser(p) => p(env)
}
type ParserM = [T] =>> [A] =>> Parser[T, A]
|
Next we define an instance of MonadError[ParserM[T], Error].
| given parsecMonadError[T]: MonadError[ParserM[T], Error] =
new MonadError[ParserM[T], Error] {
override def pure[A](a: A): Parser[T, A] =
Parser(cs => Ok((a, cs)))
override def bind[A, B](
fa: Parser[T, A]
)(f: A => Parser[T, B]): Parser[T, B] = fa.flatMap(f)
override def raiseError[A](e: Error): Parser[T, A] =
Parser(env => Failed(e))
override def handleErrorWith[A](
fa: Parser[T, A]
)(f: Error => Parser[T, A]): Parser[T, A] = fa match {
case Parser(ea) =>
Parser(env =>
ea(env) match {
case Failed(err) => run(f(err))(env)
case Ok(v) => Ok(v)
}
)
}
}
|
With the Monad instance in place, we can re-define the item() and sat() methods in monadic style.
| def item[T]: Parser[T, T] =
Parser(env => {
val toks = env
toks match {
case Nil =>
Failed(s"item() is called with an empty token stream")
case (c :: cs) => Ok((c, cs))
}
})
def sat[T](p: T => Boolean, err:String=""): Parser[T, T] = Parser(toks =>
toks match {
case Nil => Failed(s"sat() is called with an empty token stream. ${err}")
case (c :: cs) if p(c) => Ok((c, cs))
case (c :: cs) =>
Failed(s"sat() is called with an unsatisfied predicate with ${c}. ${err}")
}
)
|
In the sat combinator, we pass in two arguments: the predicate p and an optional error-message parameter err, which defaults to the empty string.
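As a standalone sketch (restating Result, Parser, run, item, and sat from above, with a minimal stand-in LToken type), here is how the monadic item and sat compose in a for-comprehension, mirroring the earlier aBitMoreComplexParser:

```scala
enum Result[A] {
  case Failed(msg: String)
  case Ok(v: A)
}
import Result.*

case class Parser[T, A](p: List[T] => Result[(A, List[T])]) {
  def map[B](f: A => B): Parser[T, B] =
    Parser(env => p(env) match {
      case Failed(err) => Failed(err)
      case Ok((a, env1)) => Ok((f(a), env1))
    })
  def flatMap[B](f: A => Parser[T, B]): Parser[T, B] =
    Parser(env => p(env) match {
      case Failed(err) => Failed(err)
      case Ok((a, env1)) => f(a).p(env1)
    })
}

def run[T, A](parser: Parser[T, A])(env: List[T]): Result[(A, List[T])] = parser.p(env)

def item[T]: Parser[T, T] = Parser {
  case Nil => Failed("item() is called with an empty token stream")
  case c :: cs => Ok((c, cs))
}

def sat[T](p: T => Boolean, err: String = ""): Parser[T, T] = Parser {
  case Nil => Failed(s"sat() is called with an empty token stream. ${err}")
  case c :: cs if p(c) => Ok((c, cs))
  case c :: _ => Failed(s"sat() is called with an unsatisfied predicate with ${c}. ${err}")
}

// Minimal stand-in token type, for illustration only.
enum LToken { case IntTok(v: Int); case AsterixTok }
import LToken.*

// Skip one token, then expect an asterisk: the monadic version of
// aBitMoreComplexParser.
def skipThenAsterix: Parser[LToken, LToken] = for {
  _ <- item[LToken]
  t <- sat[LToken](_ == AsterixTok, "Expecting an asterisk.")
} yield t
```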
More importantly, we can define more generic and useful combinators:
| def choice[T, A](p: Parser[T, A])(q: Parser[T, A])(using
m: MonadError[ParserM[T], Error]
): Parser[T, A] = m.handleErrorWith(p)(e => q)
|
The choice combinator takes two parsers p and q. It tries to run p; in case p fails, it backtracks (by restoring the original input) and runs q.
Now we can make use of choice to define an optional combinator:
| def optional[T, A](pa: Parser[T, A]): Parser[T, Either[Unit, A]] = {
val p1: Parser[T, Either[Unit, A]] = for (a <- pa) yield (Right(a))
val p2: Parser[T, Either[Unit, A]] = Parser(toks => Ok((Left(()), toks)))
choice(p1)(p2)
}
|
optional takes a parser pa and tries to execute it with the current input. If pa fails, it restores the original state and returns Left(()).
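Below is a standalone sketch of choice and optional in action (restating Result, Parser, and run so it runs on its own). For brevity, choice is written here directly against Result rather than via the MonadError instance; it behaves the same way:

```scala
enum Result[A] { case Failed(msg: String); case Ok(v: A) }
import Result.*

case class Parser[T, A](p: List[T] => Result[(A, List[T])]) {
  def map[B](f: A => B): Parser[T, B] =
    Parser(env => p(env) match {
      case Failed(err) => Failed(err)
      case Ok((a, env1)) => Ok((f(a), env1))
    })
  def flatMap[B](f: A => Parser[T, B]): Parser[T, B] =
    Parser(env => p(env) match {
      case Failed(err) => Failed(err)
      case Ok((a, env1)) => f(a).p(env1)
    })
}
def run[T, A](parser: Parser[T, A])(env: List[T]): Result[(A, List[T])] = parser.p(env)

// choice written directly against Result:
// run p; on failure, restore the original input and run q.
def choice[T, A](p: Parser[T, A])(q: Parser[T, A]): Parser[T, A] =
  Parser(toks => run(p)(toks) match {
    case Failed(_) => run(q)(toks) // backtrack: q sees the original toks
    case ok => ok
  })

def optional[T, A](pa: Parser[T, A]): Parser[T, Either[Unit, A]] = {
  val p1: Parser[T, Either[Unit, A]] = for (a <- pa) yield Right(a)
  val p2: Parser[T, Either[Unit, A]] = Parser(toks => Ok((Left(()), toks)))
  choice(p1)(p2)
}

def sat[T](p: T => Boolean): Parser[T, T] = Parser {
  case c :: cs if p(c) => Ok((c, cs))
  case _ => Failed("sat() failed")
}

val digitOne = sat[Int](_ == 1)
val r1 = run(optional(digitOne))(List(1, 2)) // succeeds: Right(1) with List(2) left over
val r2 = run(optional(digitOne))(List(2))    // digitOne fails, so Left(()) and the input is untouched
```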
Let's now write a parser for the simple arithmetic expression language. Recall the nice property of a top-down parser: it corresponds to a top-down traversal of the production rules. Recall Grammar 4 of math expressions, with left recursion:
| E ::= T + E
E ::= T
T ::= T * F
T ::= F
F ::= i
|
| def parseExp:Parser[LToken, Exp] =
choice(parsePlusExp)(parseTermExp)
def parsePlusExp:Parser[LToken, Exp] = for {
t <- parseTerm
plus <- parsePlusTok
e <- parseExp
} yield PlusExp(t, e)
def parseTermExp:Parser[LToken, Exp] = for {
t <- parseTerm
} yield TermExp(t)
|
Up to this point we are fine, as the production rules with E on the LHS are not left recursive.
It gets tricky when parsing T, which involves left recursion. Recall the modified grammar of T with the left recursion eliminated.
| T ::= FT'
T' ::= *FT'
T' ::= epsilon
|
In terms of Scala enum types, we refer to them as TermLE and TermLEP.
Hence the parser parseTerm has to be defined in terms of parseTermLE, then convert the resulting TermLE back to a Term.
| def parseTerm:Parser[LToken, Term] = for {
tle <- parseTermLE
} yield fromTermLE(tle)
|
where parseTermLE can be implemented using parsec:
| def parseTermLE:Parser[LToken, TermLE] = for {
f <- parseFactor
tp <- parseTermP
} yield TermLE(f, tp)
def parseTermP:Parser[LToken, TermLEP] = for {
omt <- optional(parseMultTermP)
} yield { omt match {
case Left(_) => Eps
case Right(t) => t
}}
def parseMultTermP:Parser[LToken, TermLEP] = for {
asterix <- parseAsterixTok
f <- parseFactor
tp <- parseTermP
} yield MultTermLEP(f, tp)
def parseFactor:Parser[LToken, Factor] = for {
i <- parseIntTok
f <- someOrFail(i)( itok => itok match {
case IntTok(v) => Some(Factor(v))
case _ => None
})("parseFactor() fail: expect to parse an integer token but it is not an integer.")
} yield f
def parsePlusTok:Parser[LToken, LToken] = sat ((x:LToken) => x match {
case PlusTok => true
case _ => false
}, "Expecting a + symbol")
def parseAsterixTok:Parser[LToken, LToken] = sat ((x:LToken) => x match {
case AsterixTok => true
case _ => false
}, "Expecting a * symbol")
def parseIntTok:Parser[LToken, LToken] = sat ((x:LToken) => x match {
case IntTok(v) => true
case _ => false
}, "Expecting an int token")
|
Finally, the TermLE to Term conversion is an inverted in-order traversal, as the parse tree of a TermLE has the shape
|     T
     / \
    f   T'
       /|\
      * f T'
         /|\
        * f T'
            |
           eps
|
and the parse tree of the corresponding Term is
|       T
       /|\
      T * f
     /|\
    T * f
    |
    f
|
The implementation is as follows.
| def fromTermLE(t:TermLE):Term = t match {
case TermLE(f, tep) => fromTermLEP(FactorTerm(f))(tep)
}
def fromTermLEP(t1:Term)(tp1:TermLEP):Term = tp1 match {
case Eps => t1
case MultTermLEP(f2, tp2) => {
val t2 = MultTerm(t1, f2)
fromTermLEP(t2)(tp2)
}
}
|
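For example, here is a standalone sketch (restating the types and the two conversion functions) that converts the TermLE for 2 * 3 into the left-leaning Term:

```scala
// Self-contained sketch of the TermLE -> Term conversion on the term 2 * 3.
case class Factor(v: Int)
enum Term {
  case FactorTerm(f: Factor)
  case MultTerm(t: Term, f: Factor)
}
case class TermLE(f: Factor, tp: TermLEP)
enum TermLEP {
  case MultTermLEP(f: Factor, tp: TermLEP)
  case Eps
}
import Term.*
import TermLEP.*

def fromTermLE(t: TermLE): Term = t match {
  case TermLE(f, tep) => fromTermLEP(FactorTerm(f))(tep)
}
// Accumulate the left-leaning Term t1 while walking down the T' spine.
def fromTermLEP(t1: Term)(tp1: TermLEP): Term = tp1 match {
  case Eps => t1
  case MultTermLEP(f2, tp2) => fromTermLEP(MultTerm(t1, f2))(tp2)
}

// 2 * 3 parsed as a TermLE ...
val tle = TermLE(Factor(2), MultTermLEP(Factor(3), Eps))
// ... converts to the left-leaning Term
val t = fromTermLE(tle)
```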
And here is a test case:
| test("test_parse") {
// val s = "1+2*3"
val toks = List(IntTok(1), PlusTok, IntTok(2), AsterixTok, IntTok(3))
val result = BacktrackParsec.run(parseExp)(toks)
val expected = PlusExp(FactorTerm(Factor(1)),TermExp(MultTerm(FactorTerm(Factor(2)),Factor(3))))
result match {
case Ok((t, Nil)) => assert(t == expected)
case _ => assert(false)
}
}
|
Parser Combinator without backtracking (in the spirit of LL(1))
Let's extend our parser combinator to support LL(1)
parsing without backtracking.
We introduce the following algebraic datatype to label an (intermediate) parsing result
| enum Progress[+A] {
case Consumed(value: A)
case Empty(value: A)
}
|
A partial result is Consumed when some input tokens have been consumed; otherwise, it is Empty.
We adjust the definition of the Parser case class as follows:
| case class Parser[T, A](p: List[T] => Progress[Result[(A, List[T])]]) {
def map[B](f: A => B): Parser[T, B] = this match {
case Parser(p) =>
Parser(toks =>
p(toks) match {
case Empty(Failed(err)) => Empty(Failed(err))
case Empty(Ok((a, toks1))) => Empty(Ok((f(a), toks1)))
case Consumed(Failed(err)) => Consumed(Failed(err))
case Consumed(Ok((a, toks1))) =>
Consumed(Ok((f(a), toks1)))
}
)
}
def flatMap[B](f: A => Parser[T, B]): Parser[T, B] = this match {
case Parser(p) =>
Parser(toks =>
p(toks) match {
case Consumed(v) => {
lazy val cont = v match {
case Failed(err) => Failed(err)
case Ok((a, toks1)) =>
f(a) match {
case Parser(p2) =>
p2(toks1) match {
case Consumed(x) => x
case Empty(x) => x
}
}
}
Consumed(cont)
}
case Empty(v) =>
v match {
case Failed(err) => Empty(Failed(err))
case Ok((a, toks1)) =>
f(a) match {
case Parser(p2) => p2(toks1)
}
}
}
)
}
}
def run[T, A](
parser: Parser[T, A]
)(toks: List[T]): Progress[Result[(A, List[T])]] = parser match {
case Parser(p) => p(toks)
}
|
Let's look at flatMap, which eventually serves as the monadic bind. It takes the first parser (this) and pattern-matches it against Parser(p). In the output parser, we first apply p to the tokens toks.
- If the result's progress is Empty(v), we check whether v is an error or Ok. When it is an error, it is propagated; otherwise, we apply f to a, the output of p(toks). That gives us the second parser, Parser(p2), to continue with. We then apply p2 to toks1, which should be the same as toks.
- If the result's progress is Consumed(v), some part of toks has been consumed. The parser's behaviour here is similar to the previous case, except that the progress of p2(toks1) is always reported as Consumed, regardless of whether p2 has consumed anything. The lazy declaration of the immutable value cont is an optimization which allows us to return the progress information Consumed without actually executing p2(toks1) when its result is not needed.
The definition of the MonadError type class instance for Parser is as follows:
| type ParserM = [T] =>> [A] =>> Parser[T, A]
given parsecMonadError[T]: MonadError[ParserM[T], Error] =
new MonadError[ParserM[T], Error] {
override def pure[A](a: A): Parser[T, A] =
Parser(cs => Empty(Ok((a, cs))))
override def bind[A, B](
fa: Parser[T, A]
)(f: A => Parser[T, B]): Parser[T, B] = fa.flatMap(f)
override def raiseError[A](e: Error): Parser[T, A] =
Parser(toks => Empty(Failed(e)))
override def handleErrorWith[A](
fa: Parser[T, A]
)(f: Error => Parser[T, A]): Parser[T, A] = fa match {
case Parser(p) =>
Parser(toks =>
p(toks) match {
case Empty(
Failed(err)
) => // only backtrack when it is empty
{
f(err) match {
case Parser(p2) => p2(toks)
}
}
case Empty(
Ok(v)
) => // LL(1) parser will also attempt to look at f if fa does not consume anything
{
f("faked error") match {
case Parser(p2) =>
p2(toks) match {
case Empty(_) =>
Empty(
Ok(v)
) // if f also fail, we report the error from fa
case consumed => consumed
}
}
}
case Consumed(v) => Consumed(v)
}
)
}
}
|
The interesting part lies in the handleErrorWith method, in which the error-recovery function f is applied only when the progress of the parsing so far is Empty. In other words, given a choice of two parsers choice(p1)(p2), it will not backtrack to p2 if p1 has consumed some tokens. This is in sync with predictive parsing.
All other combinators, such as choice, item, sat, and optional, can be adjusted in the same fashion.
We know that not all languages are LL(1). In general, it is undecidable to find the k for which a given language is LL(k). Thus, the above parser might not be very useful, since it only works if the given language is LL(1).
Thanks to the monadic design, it is easy to extend the parser to accept non-LL(1) languages by supporting backtracking on demand. That is, the parser does not backtrack by default, but it can when we explicitly ask it to.
| // explicit try and backtrack if fails
def attempt[T, A](p: Parser[T, A]): Parser[T, A] =
Parser(toks =>
run(p)(toks) match {
case Consumed(Failed(err)) => Empty(Failed(err))
case otherwise => otherwise
}
)
|
The attempt combinator takes a parser p and runs it. If its result is Consumed but Failed, it resets the progress to Empty. The full implementation of the on-demand backtracking parser combinator is given in the stub project Parsec.scala.
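As a small standalone sketch of the effect of attempt (only the pieces needed for the demonstration are restated; consumeThenFail is a contrived parser that commits to a branch before failing):

```scala
// Sketch of how attempt resets the progress label of a failed parse.
enum Result[A] { case Failed(msg: String); case Ok(v: A) }
enum Progress[+A] { case Consumed(value: A); case Empty(value: A) }
import Result.*
import Progress.*

case class Parser[T, A](p: List[T] => Progress[Result[(A, List[T])]])
def run[T, A](parser: Parser[T, A])(toks: List[T]): Progress[Result[(A, List[T])]] =
  parser.p(toks)

def attempt[T, A](p: Parser[T, A]): Parser[T, A] =
  Parser(toks => run(p)(toks) match {
    case Consumed(Failed(err)) => Empty(Failed(err))
    case otherwise => otherwise
  })

// A parser that consumes its first token and then fails, as if the first
// branch of a choice committed before discovering a mismatch.
def consumeThenFail: Parser[Int, Int] = Parser {
  case _ :: _ => Consumed(Failed("mismatch after consuming a token"))
  case Nil => Empty(Failed("empty input"))
}

val r1 = run(consumeThenFail)(List(1, 2))          // Consumed(Failed(...)): choice will NOT try the alternative
val r2 = run(attempt(consumeThenFail))(List(1, 2)) // Empty(Failed(...)): choice WILL try the alternative
```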