How Janet's PEG module works

2019-06-10


In making Janet, I was inspired by the REBOL Parse module and LPeg to include a versatile library for parsing text and any sequence of bytes based on Parsing Expression Grammars, or PEGs. I was also willing to exclude other forms of parsing tools such as regular expressions from the core library, as PEGs are more general than regular expression, or easier to read, and can capture error locations when matching text. The development of PEGs in Janet went particularly smoothly, made possible by the simple PEG formalism, Janet's basic data structures, and a very straight-forward subroutine threaded interpreter. While much of Janet's PEG module could be implemented in any language, Janet's C API and rich set of core data structures made it easy to implement PEGs in Janet.

This post is about how the Janet PEG engine is implemented; it is not a guide on using pegs. Although the code is written in Janet, It can be translated directly to most programming languages as we make no special use of any data structures besides arrays and hashtables. For information about using Janet's PEG module, see the Janet PEG documentation.

A Quick PEG Overview

The Parsing Expression Grammar formalism has it's roots in a 2004 Paper by Bryan Ford, who presents both the model and an algorithm for recognizing any PEG grammar in linear time. Our PEG engine will not parser all patterns in linear time, but will use similar semantics for defining our grammar. One difference between PEGs and more traditional parser generators, such as those used by YACC and Bison, is the ordered choice operator. For the purposes of this post, + is the ordered choice operator. For example, using a pseudo-syntax for PEG grammars where the MAIN rule is the entry point of the grammar, we can write a grammar that matches any text that starts with a, b, or c.

MAIN: "a" + "b" + "c"

This grammar will match "apple pie", "banana split", and "coconut cluster", but not "orange soda", "watermelon taffy", or "lemon snap".

It's also easy to write redundant grammars with the ordered choice operator. The following grammar will match on the first "a" only, and never "ab" or "abc". Anything that starts with "ab" or "abc" also starts with "a", so they could be removed from the grammar without changing its behavior.

MAIN: "a" + "ab" + "abc"

The ordered property of the + operator means that all PEG grammars are deterministic - if they match a string, they will only match it one way. This also means there is a direct correspondence between a PEG and a parser. This is very convenient when writing PEGs, as your specification and parser can often be one and the same. Traditional parser generators are non-deterministic, and additional rules are needed to resolve ambiguity.

To complete our PEG model, we only need a few more operators. Below are all operators and features needed for the PEG formalism to be as general as possible.

"..." - string literal
+ - choice operator
* - sequence operator (implicit in our pseudo-syntax)
! - Boolean inversion

The above features, along with recursion introduced by a top level grammar, can express all PEGs. This is a trimmed down version of the what's presented in the original paper, but other operators can be substituted for combinations of the above operators. The other very important feature needed for matching context free grammars is recursion, or otherwise our model would be limited to matching regular expressions.

Using this formalism, we can write a simple grammar that matches dates in ISO 8601 format. Such a date looks like

2019-06-10

A naive grammar that matches dates (but does not validate dates - 9999-99-99 would match) is presented below. This also requires years to be specified with 4 digits.

DIGIT: "0" + "1" + "2" + "3" + "4" + "5" + "6" + "7" + "8" + "9"
YEAR: DIGIT DIGIT DIGIT DIGIT
MONTH: DIGIT DIGIT
DAY: DIGIT DIGIT
MAIN: YEAR "-" MONTH "-" DAY

This looks very similar to a context free grammar, and in simple cases, the grammar is identical.

A simple PEG engine in code

Using our PEG definition above, we can write a simple match-peg function that takes a rule and a string to match against and tests if string matches the rule. If there is a match, the number of bytes matched will be returned to help us find where to begin the next match. Otherwise, we will return nil, or some token to indicate that there is no match. This code will be written in Janet, but could be very easily written in any language. We are using Janet's match macro here, which does pattern matching on arguments.

(defn match-peg
 [peg text]
 (match peg
   @[:! x] (unless (match-peg x text) 0)
   @[:+ x y] (or (match-peg x text) (match-peg y text))
   @[:* x y] (if-let [lenx (match-peg x text)]
                (if-let [leny (match-peg y (string/slice text lenx))]
                 (+ lenx leny)))
    # default is peg is a string literal
   (and (string/has-prefix? peg text) (length peg))))

This is a very simple but effective peg implementation. It does not support captures, various other operators, but captures the basics of PEGs in about 10 lines of code, and can even be used for recursive grammars.

(def digits [:+ "0" [:+ "1" [:+ "2" [:+ "3"
   [:+ "4" [:+ "5" [:+ "6" [:+ "7" [:+ "8" "9"]]]]]]]]])
(def year [:* digits [:* digits [:* digits digits]]])
(def month [:* digits digits])
(def day month)
(def iso-date [:* year [:* "-" :* month [:* "-" day]]])

(match-peg iso-date "2019-06-10") # -> 10
(match-peg iso-date "201-06-10") # -> nil

This implementation is very similar to a tree walk interpreter, and is the simplest way to implement a PEG. Janet's peg implementation does something very similar to this, although the syntax tree is first compiled to a bytecode form and validate for better speed.

Making Operators Variadic

Although not strictly necessary, we would like operators to look and behave more like lisp operators that are variadic. This will make long sequences or chains of choices easier to write.

(defn match-peg
 [peg text]
 (match peg
   @[:! x] (unless (match-peg x text) 0)
   @[:+] (some (fn [x] (match-peg x text)) (tuple/slice peg 1))
   @[:*] (do
       (var len 0)
       (var subtext text)
       (var ok true)
       (loop [x :in (tuple/slice peg 1) 
              :let [lenx (match-peg x subtext)
                    _ (set ok lenx)]
              :while ok]
          (set subtext (string/slice subtext lenx))
          (+= len lenx))
       (if ok len))
    # default is peg is a string literal
   (and (string/has-prefix? peg text) (length peg))))

Our ISO-date grammar from can be modified to look a bit more natural.

(def digits [:+ "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"])
(def year [:* digits digits digits digits])
(def month [:* digits digits])
(def day month)
(def iso-date [:* year "-" month "-" day])

(match-peg iso-date "2019-06-10") # -> 10
(match-peg iso-date "201-06-10") # -> nil

Adding Recursion

Earlier I stated that recursion is important to make PEGs useful, and although we can do recursion in ad hoc manner, we can make our recursive grammars look more like our original grammar using Janet tables. We can create a table mapping keywords to PEG expressions, and if peg is a keyword, we lookup its definition in the grammar table. We can add an extra argument to match-peg that will be our grammar table.

(defn match-peg
 [peg text grammar]
 (match peg
   @[:! x] (unless (match-peg x text grammar) 0)
   @[:+] (some (fn [x] (match-peg x text grammar)) (tuple/slice peg 1))
   @[:*] (do
       (var len 0)
       (var subtext text)
       (var ok true)
       (loop [x :in (tuple/slice peg 1) 
              :let [lenx (match-peg x subtext grammar)
                    _ (set ok lenx)]
              :while ok]
          (set subtext (string/slice subtext lenx))
          (+= len lenx))
       (if ok len))
   (x (keyword? x)) (match-peg (grammar x) text grammar)
    # default is peg is a string literal
   (and (string/has-prefix? peg text) (length peg))))

With our grammar table, we can define our grammar in a single structure.

(def grammar
 {:digit [:+ "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"]
  :year [:* :digit :digit :digit :digit]
  :month [:* :digit :digit]
  :day [:* :digit :digit]
  :main [:* :year "-" :month "-" :day]})

(match-peg (grammar :main) "2019-06-10" grammar) # -> 10

Adding Rules

So far, our match-peg function has not been very expressive. Where are our character classes, optional matches, and other fancy features from regular expressions? While we could simply add terms to our grammar for each character class (:d for digits, :w for alphanumerics), there are other features we would like as as well, such as the Kleene star, character sets, etc. Fortunately, it is very easy to add functionality to tree-walk interpreters. We can add as many operators as we like. For example, we can add the :set operator that will match 1 character if it occurs in a set.

(defn match-peg
 [peg text grammar]
 (match peg
   @[:set chars] (if (string/check-set chars text) 1)
   @[:! x] (unless (match-peg x text grammar) 0)
   @[:+] (some (fn [x] (match-peg x text grammar)) (tuple/slice peg 1))
   @[:*] (do
       (var len 0)
       (var subtext text)
       (var ok true)
       (loop [x :in (tuple/slice peg 1) 
              :let [lenx (match-peg x subtext grammar)
                    _ (set ok lenx)]
              :while ok]
          (set subtext (string/slice subtext lenx))
          (+= len lenx))
       (if ok len))
   (x (keyword? x)) (match-peg (grammar x) text grammar)
    # default is peg is a string literal
   (and (string/has-prefix? peg text) (length peg))))

Our date grammar will then become clearer as well.

(def grammar
 {:digit [:set "0123456789"]
  :year [:* :digit :digit :digit :digit]
  :month [:* :digit :digit]
  :day [:* :digit :digit]
  :main [:* :year "-" :month "-" :day]})

(match-peg (grammar :main) "2019-06-10" grammar) # -> 10

Most PEG libraries have several more rules to make it easier to write useful grammars, but all rules can be written pretty much the same way, as cases in our top level match expression. Our PEG engine is also missing captures, is rather slow, does not do tail-call optimization, and lacks many features that are present in other PEG engines such as those in Janet or REBOL. But for all its issues, match-peg is a good start to a general purpose pattern matching function.