Parsing expression grammar

In computer science, a parsing expression grammar (PEG) is a type of analytic formal grammar, i.e. it describes a formal language in terms of a set of rules for recognizing strings in the language. The formalism was introduced by Bryan Ford in 2004 and is closely related to the family of top-down parsing languages introduced in the early 1970s.

Syntactically, PEGs also look similar to context-free grammars (CFGs), but they have a different interpretation: the choice operator selects the first match in PEG, while it is ambiguous in CFG. This is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser.

Unlike CFGs, PEGs cannot be ambiguous; a string has exactly one valid parse tree or none. It is conjectured that there exist context-free languages that cannot be recognized by a PEG, but this is not yet proven.

Definition

A parsing expression is a kind of pattern that each string may either match or not match. In case of a match, there is a unique prefix of the string (which may be the whole string, the empty string, or something in between) which has been consumed by the parsing expression; this prefix is what one would usually think of as having matched the expression. However, whether a string matches a parsing expression may (because of look-ahead predicates) depend on parts of it which come after the consumed part. A parsing expression language is a set of all strings that match some specific parsing expression. can be closer to using a programming-language native encoding of abstract syntax parsing expressions as their concrete syntax.

Atomic parsing expressions

The two main kinds of parsing expressions not containing another parsing expression are individual terminal symbols and nonterminal symbols. In concrete syntax, terminals are placed inside quotes (single or double), whereas identifiers not in quotes denote nonterminals:

"terminal" Nonterminal 'another terminal'

</syntaxhighlight>

In the abstract syntax there is no formalised distinction, instead each symbol is supposedly defined as either terminal or nonterminal, but a common convention is to use upper case for nonterminals and lower case for terminals.

The concrete syntax also has a number of forms for classes of terminals:

A <code>.</code> (period) is a parsing expression matching any single terminal.
Brackets around a list of characters <code>[abcde]</code> form a parsing expression matching one of the numerated characters. As in regular expressions, these classes may also include ranges <code>[0-9A-Za-z]</code> written as a hyphen with the range endpoints before and after it. (Unlike the case in regular expressions, bracket character classes do not have <code>^</code> for negation; that end can instead be had via not-predicates.)
Some dialects have further notation for predefined classes of characters, such as letters, digits, punctuation marks, or spaces; this is again similar to the situation in regular expressions.

In abstract syntax, such forms are usually formalised as nonterminals whose exact definition is elided for brevity; in Unicode, there are tens of thousands of characters that are letters. Conversely, theoretical discussions sometimes introduce atomic abstract syntax for concepts that can alternatively be expressed using composite parsing expressions. Examples of this include:

the empty string ε (as a parsing expression, it matches every string and consumes no characters),
end of input E (concrete syntax equivalent is <code>!.</code>), and
failure <math>\bot</math> (matches nothing).

In the concrete syntax, quoted and bracketed terminals have backslash escapes, so that "line feed or carriage return" may be written <code>[\n\r]</code>. The abstract syntax counterpart of a quoted terminal of length greater than one would be the sequence of those terminals; <code>"bar"</code> is the same as <code>"b" "a" "r"</code>. The primary concrete syntax assigns no distinct meaning to terminals depending on whether they use single or double quotes, but some dialects treat one as case-sensitive and the other as case-insensitive.

Composite parsing expressions

Given any existing parsing expressions e, e1, and e2, a new parsing expression can be constructed using the following operators:

Sequence: e1 e2
Ordered choice: e1 / e2
Zero-or-more: e*
One-or-more: e+
Optional: e?
And-predicate: &e
Not-predicate: !e
Group: (e)

Operator priorities are as follows, based on Table 1 in: but the primary concrete syntax instead has the implicit rule that the first nonterminal defined is the starting expression.

It is worth noticing that the primary dialect of concrete syntax parsing expression grammars does not have an explicit definition terminator or separator between definitions, although it is customary to begin a new definition on a new line; the <code>LEFTARROW</code> of the next definition is sufficient for finding the boundary, if one adds the constraint that a nonterminal in an <code>Expression</code> must not be followed by a <code>LEFTARROW</code>. However, some dialects may allow an explicit terminator, or outright require

<syntaxhighlight lang="peg">S ← &(A !('a'/'b')) 'a'* B !.

A ← ('a' A 'b')?

B ← ('b' B 'c')?</syntaxhighlight>

Implementing parsers from parsing expression grammars

Any parsing expression grammar can be converted directly into a recursive descent parser. Due to the unlimited lookahead capability that the grammar formalism provides, however, the resulting parser could exhibit exponential time performance in the worst case.

It is possible to obtain better performance for any parsing expression grammar by converting its recursive descent parser into a packrat parser, which always runs in linear time, at the cost of substantially greater storage space requirements. A packrat parser

This analysis of the performance of a packrat parser assumes that enough memory is available to hold all of the memoized results; in practice, if there is not enough memory, some parsing functions might have to be invoked more than once at the same input position, and consequently the parser could take more than linear time.

It is also possible to build LL parsers and LR parsers from parsing expression grammars, with better worst-case performance than a recursive descent parser without memoization, but the unlimited lookahead capability of the grammar formalism is then lost. Therefore, not all languages that can be expressed using parsing expression grammars can be parsed by LL or LR parsers.

Bottom-up PEG parsing

A pika parser uses dynamic programming to apply PEG rules bottom-up and right to left, which is the inverse of the normal recursive descent order of top-down, left to right. Parsing in reverse order solves the left recursion problem, allowing left-recursive rules to be used directly in the grammar without being rewritten into non-left-recursive form, and also confers optimal error recovery capabilities upon the parser, which historically proved difficult to achieve for recursive descent parsers.

Advantages

No compilation required

Many parsing algorithms require a preprocessing step where the grammar is first compiled into an opaque executable form, often some sort of automaton. Parsing expressions can be executed directly (even if it is typically still advisable to transform the human-readable PEGs shown in this article into a more native format, such as S-expressions, before evaluating them).

Compared to regular expressions

Compared to pure regular expressions (i.e., describing a language recognisable using a finite automaton), PEGs are vastly more powerful. In particular they can handle unbounded recursion, and so match parentheses down to an arbitrary nesting depth; regular expressions can at best keep track of nesting down to some fixed depth, because a finite automaton (having a finite set of internal states) can only distinguish finitely many different nesting depths. In more theoretical terms, <math> \{a^n b^n\}_{n \geqslant 0} </math> (the language of all strings of zero or more <math>a</math>'s, followed by an equal number of <math>b</math>s) is not a regular language, but it is easily seen to be a parsing expression language, matched by the grammar

start ← AB !.

AB ← ('a' AB 'b')?

</syntaxhighlight>

Here <code>AB !.</code> is the starting expression. The <code>!.</code> part enforces that the input ends after the <code>AB</code>, by saying “there is no next character”; unlike regular expressions, which have magic constraints <code>$</code> or <code>\Z</code> for this, parsing expressions can express the end of input using only the basic primitives.

The <code>*</code>, <code>+</code>, and <code>?</code> of parsing expressions are similar to those in regular expressions, but a difference is that these operate strictly in a greedy mode. This is ultimately due to <code>/</code> being an ordered choice. A consequence is that something can match as a regular expression which does not match as parsing expression:

: <code>[ab]?[bc][cd]</code>

is both a valid regular expression and a valid parsing expression. As regular expression, it matches <code>bc</code>, but as parsing expression it does not match, because the <code>[ab]?</code> will match the <code>b</code>, then <code>[bc]</code> will match the <code>c</code>, leaving nothing for the <code>[cd]</code>, so at that point matching the sequence fails. "Trying again" with having <code>[ab]?</code> match the empty string is explicitly against the semantics of parsing expressions; this is not an edge case of a particular matching algorithm, instead it is the sought behaviour.

Even regular expressions that depend on nondeterminism can be compiled into a parsing expression grammar, by having a separate nonterminal for every state of the corresponding nondeterministic finite automaton (NFA for short) and encoding its transition function into the definitions of these nonterminals —

A ← 'x' B / 'x' C / 'y' D

</syntaxhighlight>

is effectively saying "from state A transition to state B or C if the next character is x, or to state D if the next character is y". It would not make use of the parsing expression variants of the repetition operations.

For accepting states of the NFA, the definition of the nonterminal should be surrounded by <code>?</code>. Concerning the entry states of the NFA, they should all be listed, separated by <code>/</code>, in the definition of the starting non-terminal.

Note that while it's possible to transform a regular expression into a deterministic finite automaton (DFA for short) instead of an NFA, this should be avoided because the DFA equivalent of a regular expression can be exponentially larger. In fact, there is a sequence of regular expressions for which all of the DFA equivalents are exponentially larger.

Compared to context-free grammars

PEGs can comfortably be given in terms of characters, whereas context-free grammars (CFGs) are usually given in terms of tokens, thus requiring an extra step of tokenisation in front of parsing proper. An advantage of not having a separate tokeniser is that different parts of the language (for example embedded mini-languages) can easily have different tokenisation rules.

In the strict formal sense, PEGs are likely incomparable to CFGs, but practically there are many things that PEGs can do which pure CFGs cannot, whereas it is difficult to come up with examples of the contrary. In particular PEGs can be crafted to natively resolve ambiguities, such as the "dangling else" problem in C, C++, and Java, whereas CFG-based parsing often needs a rule outside of the grammar to resolve them. Moreover any PEG can be parsed in linear time by using a packrat parser, as described above, whereas parsing according to a general CFG is asymptotically equivalent to boolean matrix multiplication (thus likely between quadratic and cubic time).

One classical example of a formal language which is provably not context-free is the language <math> \{ a^n b^n c^n : n \ge 0 \} </math>: an arbitrary number of <math>a</math>'s are followed by an equal number of <math>b</math>'s, which in turn are followed by an equal number of <math>c</math>'s. This, too, is a parsing expression language, matched by the grammar: to eliminate redundant parsing steps. Packrat parsing requires internal storage proportional to the total input size, rather than to the depth of the parse tree as with LR parsers. Whether this is a significant difference depends on circumstances; if parsing is a service provided as a function then the parser will have stored the full parse tree up until returning it, and already that parse tree will typically be of size proportional to the total input size. If parsing is instead provided as a generator then one might get away with only keeping parts of the parse tree in memory, but the feasibility of this depends on the grammar. A parsing expression grammar can be designed so that only after consuming the full input will the parser discover that it needs to backtrack to the beginning, which again could require storage proportional to total input size.

For recursive grammars and some inputs, the depth of the parse tree can be proportional to the input size, so both an LR parser and a packrat parser will appear to have the same worst-case asymptotic performance. However in many domains, for example hand-written source code, the expression nesting depth has an effectively constant bound quite independent of the length of the program, because expressions nested beyond a certain depth tend to get refactored. When it is not necessary to keep the full parse tree, a more accurate analysis would take the depth of the parse tree into account separately from the input size.

Computational model

In order to attain linear overall complexity, the storage used for memoization must furthermore provide amortized constant time access to individual data items memoized. In practice that is no problem — for example a dynamically sized hash table attains this – but that makes use of pointer arithmetic, so it presumes having a random-access machine. Theoretical discussions of data structures and algorithms have an unspoken tendency to presume a more restricted model (possibly that of lambda calculus, possibly that of Scheme), where a sparse table rather has to be built using trees, and data item access is not constant time. Traditional parsing algorithms such as the LL parser are not affected by this, but it becomes a penalty for the reputation of packrat parsers: they rely on operations of seemingly ill repute.

Viewed the other way around, this says packrat parsers tap into computational power readily available in real life systems, that older parsing algorithms do not understand to employ.

Indirect left recursion

A PEG is called well-formed For example, the following left-recursive CFG rule:

string-of-a ← string-of-a 'a' | 'a'

</syntaxhighlight>

can be rewritten in a PEG using the plus operator:

string-of-a ← 'a'+

</syntaxhighlight>

The process of rewriting indirectly left-recursive rules is complex in some packrat parsers, especially when semantic actions are involved.

With some modification, traditional packrat parsing can support direct left recursion, but doing so results in a loss of the linear-time parsing property

The class of parsing expression languages is closed under set intersection and complement, thus also under set union.

The jq programming language uses a formalism closely related to PEG.

The Lua authors created LPeg, a pattern-matching library that uses PEG instead of regular expressions, as well as the re module which implements a regular-expression-like syntax utilizing the LPeg library.

References

External links

Converting a string expression into a lambda expression using an expression parser
The Packrat Parsing and Parsing Expression Grammars Page
The constructed language Lojban has a fairly large PEG grammar allowing unambiguous parsing of Lojban text.
An illustrative implementation of a PEG in Guile scheme

Definition

Atomic parsing expressions

Composite parsing expressions

Implementing parsers from parsing expression grammars

Bottom-up PEG parsing

Advantages

No compilation required

Compared to regular expressions

Compared to context-free grammars

Computational model

Indirect left recursion

See also

References

External links