In making Janet, I was inspired by the REBOL Parse module and LPeg to include a versatile library for parsing text and any sequence of bytes based on Parsing Expression Grammars, or PEGs. I was also willing to exclude other forms of parsing tools such as regular expressions from the core library, as PEGs are more general than regular expression, or easier to read, and can capture error locations when matching text. The development of PEGs in Janet went particularly smoothly, made possible by the simple PEG formalism, Janet's basic data structures, and a very straight-forward subroutine threaded interpreter. While much of Janet's PEG module could be implemented in any language, Janet's C API and rich set of core data structures made it easy to implement PEGs in Janet.
This post is about how the Janet PEG engine is implemented; it is not a guide on using pegs. Although the code is written in Janet, It can be translated directly to most programming languages as we make no special use of any data structures besides arrays and hashtables. For information about using Janet's PEG module, see the Janet PEG documentation.
A Quick PEG Overview
The Parsing Expression Grammar formalism has it's roots in a
2004 Paper by Bryan Ford, who
presents both the model and an algorithm for recognizing any PEG grammar in
linear time. Our PEG engine will not parser all patterns in linear time, but
will use similar semantics for defining our grammar. One difference between
PEGs and more traditional parser generators, such as those used by YACC and
Bison, is the ordered choice operator. For the purposes of this post, +
is the ordered choice operator. For example, using a pseudo-syntax for PEG
grammars where the MAIN
rule is the entry point of the grammar, we can
write a grammar that matches any text that starts with a, b, or c.
MAIN: "a" + "b" + "c"
This grammar will match "apple pie", "banana split", and "coconut cluster", but not "orange soda", "watermelon taffy", or "lemon snap".
It's also easy to write redundant grammars with the ordered choice operator. The following grammar will match on the first "a" only, and never "ab" or "abc". Anything that starts with "ab" or "abc" also starts with "a", so they could be removed from the grammar without changing its behavior.
MAIN: "a" + "ab" + "abc"
The ordered property of the +
operator means that all PEG grammars are
deterministic - if they match a string, they will only match it one way. This
also means there is a direct correspondence between a PEG and a
parser. This is very convenient when writing PEGs, as your specification
and parser can often be one and the same.
Traditional parser generators are non-deterministic, and additional
rules are needed to resolve ambiguity.
To complete our PEG model, we only need a few more operators. Below are all operators and features needed for the PEG formalism to be as general as possible.
"..." - string literal
+ - choice operator
* - sequence operator (implicit in our pseudo-syntax)
! - Boolean inversion
The above features, along with recursion introduced by a top level grammar, can express all PEGs. This is a trimmed down version of the what's presented in the original paper, but other operators can be substituted for combinations of the above operators. The other very important feature needed for matching context free grammars is recursion, or otherwise our model would be limited to matching regular expressions.
Using this formalism, we can write a simple grammar that matches dates in ISO 8601 format. Such a date looks like
2019-06-10
A naive grammar that matches dates (but does not validate dates - 9999-99-99 would match) is presented below. This also requires years to be specified with 4 digits.
DIGIT: "0" + "1" + "2" + "3" + "4" + "5" + "6" + "7" + "8" + "9"
YEAR: DIGIT DIGIT DIGIT DIGIT
MONTH: DIGIT DIGIT
DAY: DIGIT DIGIT
MAIN: YEAR "-" MONTH "-" DAY
This looks very similar to a context free grammar, and in simple cases, the grammar is identical.
A simple PEG engine in code
Using our PEG definition above, we can write a simple
match-peg
function that takes a rule and a string to match against
and tests if string matches the rule. If there is a match, the number of bytes matched will
be returned to help us find where to begin the next match. Otherwise, we will
return nil, or some token to indicate that there is no match. This code will be
written in Janet, but could be very easily written in any language. We are using
Janet's match macro here, which does pattern matching on arguments.
(defn match-peg
[peg text]
(match peg
@[:! x] (unless (match-peg x text) 0)
@[:+ x y] (or (match-peg x text) (match-peg y text))
@[:* x y] (if-let [lenx (match-peg x text)]
(if-let [leny (match-peg y (string/slice text lenx))]
(+ lenx leny)))
# default is peg is a string literal
(and (string/has-prefix? peg text) (length peg))))
This is a very simple but effective peg implementation. It does not support captures, various other operators, but captures the basics of PEGs in about 10 lines of code, and can even be used for recursive grammars.
(def digits [:+ "0" [:+ "1" [:+ "2" [:+ "3"
[:+ "4" [:+ "5" [:+ "6" [:+ "7" [:+ "8" "9"]]]]]]]]])
(def year [:* digits [:* digits [:* digits digits]]])
(def month [:* digits digits])
(def day month)
(def iso-date [:* year [:* "-" :* month [:* "-" day]]])
(match-peg iso-date "2019-06-10") # -> 10
(match-peg iso-date "201-06-10") # -> nil
This implementation is very similar to a tree walk interpreter, and is the simplest way to implement a PEG. Janet's peg implementation does something very similar to this, although the syntax tree is first compiled to a bytecode form and validate for better speed.
Making Operators Variadic
Although not strictly necessary, we would like operators to look and behave more like lisp operators that are variadic. This will make long sequences or chains of choices easier to write.
(defn match-peg
[peg text]
(match peg
@[:! x] (unless (match-peg x text) 0)
@[:+] (some (fn [x] (match-peg x text)) (tuple/slice peg 1))
@[:*] (do
(var len 0)
(var subtext text)
(var ok true)
(loop [x :in (tuple/slice peg 1)
:let [lenx (match-peg x subtext)
_ (set ok lenx)]
:while ok]
(set subtext (string/slice subtext lenx))
(+= len lenx))
(if ok len))
# default is peg is a string literal
(and (string/has-prefix? peg text) (length peg))))
Our ISO-date grammar from can be modified to look a bit more natural.
(def digits [:+ "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"])
(def year [:* digits digits digits digits])
(def month [:* digits digits])
(def day month)
(def iso-date [:* year "-" month "-" day])
(match-peg iso-date "2019-06-10") # -> 10
(match-peg iso-date "201-06-10") # -> nil
Adding Recursion
Earlier I stated that recursion is important to make PEGs useful, and although
we can do recursion in ad hoc manner, we can make our recursive grammars look
more like our original grammar using Janet tables. We can create a table
mapping keywords to PEG expressions, and if peg
is a keyword, we lookup
its definition in the grammar table. We can add an extra argument to
match-peg
that will be our grammar table.
(defn match-peg
[peg text grammar]
(match peg
@[:! x] (unless (match-peg x text grammar) 0)
@[:+] (some (fn [x] (match-peg x text grammar)) (tuple/slice peg 1))
@[:*] (do
(var len 0)
(var subtext text)
(var ok true)
(loop [x :in (tuple/slice peg 1)
:let [lenx (match-peg x subtext grammar)
_ (set ok lenx)]
:while ok]
(set subtext (string/slice subtext lenx))
(+= len lenx))
(if ok len))
(x (keyword? x)) (match-peg (grammar x) text grammar)
# default is peg is a string literal
(and (string/has-prefix? peg text) (length peg))))
With our grammar table, we can define our grammar in a single structure.
(def grammar
{:digit [:+ "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"]
:year [:* :digit :digit :digit :digit]
:month [:* :digit :digit]
:day [:* :digit :digit]
:main [:* :year "-" :month "-" :day]})
(match-peg (grammar :main) "2019-06-10" grammar) # -> 10
Adding Rules
So far, our match-peg
function has not been very expressive. Where
are our character classes, optional matches, and other fancy features from regular
expressions?
While we could simply add terms to our grammar for each character class
(:d for digits, :w for alphanumerics), there are other features we would like as
as well, such as the Kleene star, character sets, etc.
Fortunately, it is very easy to add functionality to tree-walk interpreters.
We can add as many operators as we like. For example, we can add the :set
operator
that will match 1 character if it occurs in a set.
(defn match-peg
[peg text grammar]
(match peg
@[:set chars] (if (string/check-set chars text) 1)
@[:! x] (unless (match-peg x text grammar) 0)
@[:+] (some (fn [x] (match-peg x text grammar)) (tuple/slice peg 1))
@[:*] (do
(var len 0)
(var subtext text)
(var ok true)
(loop [x :in (tuple/slice peg 1)
:let [lenx (match-peg x subtext grammar)
_ (set ok lenx)]
:while ok]
(set subtext (string/slice subtext lenx))
(+= len lenx))
(if ok len))
(x (keyword? x)) (match-peg (grammar x) text grammar)
# default is peg is a string literal
(and (string/has-prefix? peg text) (length peg))))
Our date grammar will then become clearer as well.
(def grammar
{:digit [:set "0123456789"]
:year [:* :digit :digit :digit :digit]
:month [:* :digit :digit]
:day [:* :digit :digit]
:main [:* :year "-" :month "-" :day]})
(match-peg (grammar :main) "2019-06-10" grammar) # -> 10
Most PEG libraries have several more rules to make it easier to write useful
grammars, but all rules can be written pretty much the same way, as cases in
our top level match expression. Our PEG engine is also missing captures, is
rather slow, does not do tail-call optimization, and lacks many features that
are present in other PEG engines such as those in Janet or REBOL. But for all
its issues, match-peg
is a good start to a general purpose pattern
matching function.