Canonical LR parser

A canonical LR parser (also called a LR(1) parser) is a type of bottom-up parsing algorithm used in computer science to analyze and process programming languages. It is based on the LR parsing technique, which stands for "left-to-right, rightmost derivation in reverse."

Formally, a canonical LR parser is an LR(k) parser for k=1, i.e. with a single lookahead terminal. The special attribute of this parser is that any LR(k) grammar with k>1 can be transformed into an LR(1) grammar. However, back-substitutions are required to reduce k and as back-substitutions increase, the grammar can quickly become large, repetitive and hard to understand. LR(k) can handle all deterministic context-free languages. HYACC, and LRSTAR.

History

In 1965 Donald Knuth invented the LR(k) parser (Left to right, Rightmost derivation parser) a type of shift-reduce parser, as a generalization of existing precedence parsers. This parser has the potential of recognizing all deterministic context-free languages and can produce both left and right derivations of statements encountered in the input file. Knuth proved that it reaches its maximum language recognition power for k=1 and provided a method for transforming LR(k), k > 1 grammars into LR(1) grammars. LALR(1) parsers have been the most common implementations of the LR Parser.

However, a new type of LR(1) parser, some people call a "Minimal LR(1) parser" was introduced in 1977 by David Pager who showed that LR(1) parsers can be created whose memory requirements rival those of LALR(1) parsers. Recently, some parser generators are offering Minimal LR(1) parsers, which not only solve the memory requirement problem, but also the mysterious-conflict-problem inherent in LALR(1) parser generators. In addition, Minimal LR(1) parsers can use shift-reduce actions, which makes them faster than Canonical LR(1) parsers.

Overview

The LR(1) parser is a deterministic automaton and as such its operation is based on static state transition tables. These codify the grammar of the language it recognizes and are typically called "parsing tables".

The parsing tables of the LR(1) parser are parameterized with a lookahead terminal. Simple parsing tables, like those used by the LR(0) parser represent grammar rules of the form

: A1 → A B

which means that if we go have input A followed by B then we will reduce the pair to A1 regardless of what follows. After parameterizing such a rule with a lookahead we have:

: A1 → A B, a

which means that the reduction will now be performed only if the lookahead terminal is a. This allows for richer languages where a simple rule can have different meanings depending on the lookahead context. For example, in a LR(1) grammar, all of the following rules perform a different reduction in spite of being based on the same state sequence.

: A1 → A B, a

: A2 → A B, b

: A3 → A B, c

: A4 → A B, d

The same would not be true if a lookahead terminal was not being taken into account. Parsing errors can be identified without the parser having to read the whole input by declaring some rules as errors. For example,

: E1 → B C, d

can be declared an error, causing the parser to stop. This means that the lookahead information can also be used to catch errors, as in the following example:

: A1 → A B, a

: A1 → A B, b

: A1 → A B, c

: E1 → A B, d

In this case A B will be reduced to A1 when the lookahead is a, b or c and an error will be reported when the lookahead is d.

The lookahead can also be helpful in deciding when to reduce a rule. The lookahead can help avoid reducing a specific rule if the lookahead is not valid, which would probably mean that the current state should be combined with the following instead of the previous state. That means in the following example

Input sequence: A B C
Rules:

:: A1 → A B

:: A2 → B C

the sequence can be reduced to

: A A2

instead of

: A1 C

if the lookahead after the parser went to state B wasn't acceptable, i.e. no transition rule existed. Reductions can be produced directly from a terminal as in

: X → y

which allows for multiple sequences to appear.

LR(1) parsers have the requirement that each rule should be expressed in a complete LR(1) manner, i.e. a sequence of two states with a specific lookahead. That makes simple rules such as

: X → y

requiring a great many artificial rules that essentially enumerate the combinations of all the possible states and lookahead terminals that can follow. A similar problem appears for implementing non-lookahead rules such as

: A1 → A B

where all the possible lookaheads must be enumerated. That is the reason why LR(1) parsers cannot be practically implemented without significant memory optimizations.