Recently I came across a nice Lisp-based Monadic Parser Combinator library written by Massung. Unlike traditional parser generators such as ANTLR, this library lets us embed the parser directly in Lisp. Similar libraries are available for other languages too (see, for example, ParsecJ for Java). The original idea of monadic parsing comes from Haskell’s Parsec library.
For a very readable discussion on Functor, Applicative, and Monadic approaches, see this article. There is a lot of material on the internet about this topic, so I am not going to get into the basics here.
The important point to remember about the library (and those that fall in this category) is that it follows a top-down, recursive-descent strategy, and hence cannot handle left-recursive grammars (such grammars can, however, be rewritten to eliminate the left recursion).
The main usage scenario for this type of parser would be mini embedded Domain Specific Languages (DSLs). Because of their inherent ability to backtrack, these parsers can have performance issues unless used carefully. Memoization can help somewhat.
Massung gives nontrivial parsing examples (JSON, XML, etc.) using this library. I played with the library for a while, and as part of trying to understand it, applied it to build a simple English grammar. In today’s post I would like to discuss my example, and along the way, give you a glimpse of the nice library.
The main entry point is the parse function. It takes two arguments, a parser and a lexer. The task of the lexer is to keep supplying tokens to the parser as long as the input stream is non-empty.
Take a look at the read-words function below.
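A minimal sketch of it looks like this (the version shown here assumes the get-categories lexicon lookup described later, and that the lexer simply returns NIL once the input runs out):

(defun read-words (words)
  ;; Return a zero-argument lexer closure over the list WORDS. On each
  ;; call it pops the next word and returns two values: the token class
  ;; (the word's list of possible POS categories) and the token itself.
  (lambda ()
    (when words
      (let ((word (pop words)))
        (values (get-categories word) word)))))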
It takes a list argument and returns a function of no arguments. This returned function is the lexer that the parser will use. As per the library’s documentation, the lexer must return two values: the class of the incoming token and the token itself (the latter is optional). However, because an English word can have multiple parts of speech (for example, sleep can be a noun or a verb depending on the context), my lexer returns a list of potential POS categories as the class, not just one.
Another point worth mentioning is that the lexer could read the input from any source and not just from a list, as is the case in my example. As good design dictates, the parser is shielded from this logic.
Let us turn our attention to the parser. With parser combinators, we build complex parsers out of simpler ones. Perhaps the simplest parser in the library is .any. It matches any token and returns its value.
PARSE-TEST 1 > (parse (.any) (read-words '(the dog is sleeping)))
THE
T
The first value printed is the value of the token and the second value indicates whether the parse was successful or not.
Another parser is .many1. It takes a parser as its single argument and applies that parser one or more times. Let us combine .any and .many1 thus:
PARSE-TEST 2 > (parse (.many1 (.any)) (read-words '(the dog is sleeping)))
(THE DOG IS SLEEPING)
T
As expected, .many1 keeps applying .any as long as there is input, so the complete input is read and returned by the parser.
Let us look at a parser combinator I wrote. .is-cat? takes a POS type as its argument and checks whether the incoming token belongs to that POS category.
PARSE-TEST 3 > (parse (.is-cat? :pron) (read-words '(my cat)))
MY
T
The above command checks if the first word is of type :pron (pronoun), and since my is a pronoun, the parser succeeds. We could also give a list of possible POS types, in which case the input token must match one of them.
PARSE-TEST 4 > (parse (.is-cat? '(:verb :noun)) (read-words '(sleeping is good)))
SLEEPING
T
You may be wondering how the lexer knows about the POS category of the different words it encounters. The answer is simple: I have a toy lexicon for this purpose (shown later).
By the way, the .is-cat? parser I have written is slightly different from the library-provided .is, in that the latter only expects a single token type from the lexer, whereas .is-cat? permits multiple types (in our case, POS types).
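To show the essential logic without depending on the library's internals, here is a toy, self-contained version of the idea: a token is represented as a (pos-list . word) pair, and a "parser" as a function from a token list to three values, namely the result, the remaining tokens, and a success flag. The real .is-cat? is written against the library's parse state, but the membership test is the same.

(defun toy-is-cat? (cats)
  ;; CATS may be a single keyword or a list of keywords. Succeed if the
  ;; current token's POS list has at least one category in common with CATS,
  ;; returning the word and advancing past it; otherwise fail.
  (let ((cats (if (listp cats) cats (list cats))))
    (lambda (tokens)
      (let ((tok (first tokens)))
        (if (and tok (intersection cats (car tok)))
            (values (cdr tok) (rest tokens) t)
            (values nil tokens nil))))))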
I wrote another parser combinator, a logical extension of the library-provided .do parser. Whereas .do applies multiple parsers in sequence and returns the result of the last one, my .seq parser applies multiple parsers in sequence, collects their individual results, and returns them as a list. As you will see later, this makes some parsing tasks easy.
PARSE-TEST 5 > (parse (.seq (.is-cat? '(:det)) (.is-cat? '(:adj)) (.is-cat? '(:noun)))
(read-words '(the cute cat)))
(THE CUTE CAT)
T
In the above example, we are trying to match three tokens, a determiner, an adjective and a noun, in that order.
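In the same toy representation as before, the essence of .seq is to run each parser in turn, collecting the results and failing as soon as any one of them fails:

(defun toy-seq (&rest parsers)
  ;; Apply PARSERS in order, threading the remaining tokens along and
  ;; collecting each result; if any parser fails, the whole sequence fails.
  (lambda (tokens)
    (let ((input tokens)
          (results '()))
      (dolist (p parsers (values (nreverse results) tokens t))
        (multiple-value-bind (val rest okp) (funcall p tokens)
          (unless okp
            (return (values nil input nil)))
          (push val results)
          (setf tokens rest))))))

;; For example:
;;   (funcall (toy-seq (toy-is-cat? :det) (toy-is-cat? :adj) (toy-is-cat? :noun))
;;            '(((:det) . the) ((:adj) . cute) ((:noun) . cat)))
;;   => (THE CUTE CAT), NIL, T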
The library has a macro called define-parser that allows us to define a parser combinator by stitching together other parsers. This comes in quite handy.
With this background, let us turn our attention to the English parser I have put together for this post. To be clear, the grammar is extremely simplified in order to illustrate the use of the library. Besides, there are no semantic checks or actions associated with the grammar.
Let us start with the Lexicon, which is used by the Lexer:
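The sketch below gives its general shape, with just enough entries for the sentences in this post (treat the exact entries and category names as illustrative):

(defparameter *lexicon*
  ;; Each entry is (WORD . POS-LIST).
  '((the :det) (a :det)
    (my :pron) (i :pron)
    (dog :noun) (cat :noun) (son :noun) (friend :noun)
    (john :propn)
    (big :adj) (small :adj) (cute :adj) (good :adj)
    (is :verb) (am :verb)
    (sleeping :verb :noun) (studying :verb) (playing :verb) (talking :verb)
    (and :conj) (with :prep) (to :prep)))

(defun get-categories (word)
  ;; Look up WORD in the lexicon; return its POS list, or NIL if absent.
  (rest (assoc word *lexicon*)))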
The lexicon associates each word with a list of its possible parts of speech. The get-categories function, called by the lexer, looks up a word in the lexicon and, if it is found, returns the corresponding POS list.
To follow a uniform convention, I use define-parser to define a parser combinator for each of the POS categories, and use these to build the next level of non-terminals.
I have defined noungroup to encompass Noun, Pronoun and Proper Noun.
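In sketch form (the parser names and the :propn tag for proper nouns are illustrative), the category-level parsers and noungroup look something like this:

;; One named parser per POS category.
(define-parser det (.is-cat? :det))
(define-parser adj (.is-cat? :adj))
(define-parser noun (.is-cat? :noun))
(define-parser pron (.is-cat? :pron))
(define-parser propn (.is-cat? :propn))
(define-parser verb (.is-cat? :verb))
(define-parser prep (.is-cat? :prep))
(define-parser conj (.is-cat? :conj))

(define-parser noungroup
  ;; a noun, a pronoun, or a proper noun
  (.either+ 'noun 'pron 'propn))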
The .either+ parser combinator is a simple extension of Massung’s built-in .either, in that it allows more than two parsers to be specified.
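Here is a sketch of how .either+ can be written (assuming .either is an ordinary function of two parsers; it simply chains the alternatives):

(defun .either+ (p &rest ps)
  ;; Try P first; if it fails, fall back to the remaining parsers in order.
  (if ps
      (.either p (apply #'.either+ ps))
      p))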
Here are the other elements of our toy grammar of English:
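(What follows is a rough sketch of their shape; the rule names, and details such as how the collected results get flattened into the flat word lists shown in the transcripts below, are illustrative rather than exact.)

(define-parser nounphrase
  ;; an optional determiner or pronoun, any number of adjectives, then a
  ;; noungroup ("the big dog", "my son", "I")
  (.either+ (.seq (.either+ 'det 'pron) (.many 'adj) 'noungroup)
            (.seq (.many 'adj) 'noungroup)))

(define-parser verbgroup
  ;; one or more verbs ("is sleeping", "am talking")
  (.many1 'verb))

(define-parser prepphrase
  ;; a preposition followed by a noun phrase ("to my friend")
  (.seq 'prep 'nounphrase))

(define-parser clause
  ;; subject, verb group, and any trailing noun or prepositional phrases
  (.seq 'nounphrase 'verbgroup (.many (.either+ 'nounphrase 'prepphrase))))

(define-parser sentence
  ;; one or more clauses joined by conjunctions
  (.seq 'clause (.many (.seq 'conj 'clause))))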
.many invokes a parse combinator zero or more times, while .many1 invokes it one or more times.
Now that our grammar is defined, let us try to parse some sentences.
PARSE-TEST 6 > (parse 'sentence (read-words '(my son is studying)))
(MY SON IS STUDYING)
T
PARSE-TEST 7 > (parse 'sentence (read-words '(my son is studying and I am talking to my friend)))
(MY SON IS STUDYING AND I AM TALKING TO MY FRIEND)
T
PARSE-TEST 8 > (parse 'sentence (read-words '(the big dog is playing with the small cute cat)))
(THE BIG DOG IS PLAYING WITH THE SMALL CUTE CAT)
T
What happens if we give it an ungrammatical sentence? By default, the parse function signals an error on parse failure.
PARSE-TEST 9 > (parse 'sentence (read-words '(all men are mortal)))
Error: Parse failure
1 (abort) Return to top loop level 0.
Type :b for backtrace or :c <option number> to proceed.
Type :bug-form "<subject>" for a bug report template or ":?" for other options.
We can use the :errorp and :error-value keyword arguments to control this behavior.
PARSE-TEST 10 > (parse 'sentence (read-words '(all men are mortal)) :errorp nil :error-value nil)
NIL
NIL
The way we have defined our parsers, no parse tree is built during parsing. In a real parsing scenario, however, the parser needs to construct an intermediate representation of the parsed sentences in order to facilitate subsequent analysis. Massung’s library has several functions for keeping track of parsed state at every stage, but to keep the discussion simple, I have not used them in this example. If time permits, I will discuss this part in a future post.
You can download the source for my example here.
Have a great weekend!