In previous chapters we have seen several parsers, evaluators and translators. All used recursive descent, one of the top down methods. This chapter introduces one very simple form of bottom up parsing and translating.
Consider the following fragment of a Pascal program:
The basic idea is that just as expressions can be evaluated by several passes which evaluate subexpressions, so sequences of symbols can be parsed by several passes which parse subexpressions. First, consider the following equivalent formulas in prefix, in fully parenthesised infix, and in postfix notation: All three are to be evaluated, by rewriting leftmost subformulas by their values when possible; note that time flows from top to bottom:
F to indicate that they are formulas, like this:
As described, the method would require us to always start at the beginning of the symbol sequence and perform the first replacement, just as we did for evaluation by replacement. But from our truth table program you will remember that a very good method for evaluating uses an auxiliary stack which contains whatever needs to be remembered about the values computed so far. We could even imagine a method which is less efficient than the one we used there, but which is interesting; as we pass through the postfix formula, we push all symbols onto the stack, operands and operators. If at any time the top few elements of the stack consist of a few operands topped by an operator, we interrupt the pushing and replace the operator and its operands by the result value. For comparison, here are the two evaluation methods; note that time flows from left to right:
F (for formula),
we would have a parsing method which avoids having to
restart at the beginning of the formula every time.
This is how it will look:
The less efficient method lends itself to becoming a very general parsing method which can be used for the other notations, too. Here it is used for fully parenthesised infix notation:
1..5.
Here is a brief formulation of the algorithm:
IF part.
To improve the algorithm,
we can make partial reductions
and also make the applicability of reductions depend on the next
input symbol.
This idea is used in most bottom up parsers,
including the two very specialised ones to follow.
In the remainder of this chapter we shall look at two related bottom up parsing methods which are very simple and efficient, but are applicable only to a rather narrow class of languages.
Consider arithmetic or logical expressions in minimally parenthesised notation as written in the first line and their fully parenthesised form in the second line.
* binds more strongly than +,
and that & binds more strongly than v.
Hence y is an operand to * and not to +,
and q is an operand to & and not to v.
It helps to think of the operators on the left and right of y and q
tugging at it,
and * and & win.
The technical notion is that * has precedence over +,
and that & has precedence over v.
For binary operators that are semantically associative
it does not matter whether they have precedence over themselves,
but for others it is important to distinguish those that are
syntactically left-associative (such as subtraction and implication),
and those that are syntactically right-associative
(such as exponentiation in arithmetic).
We shall now design a parser for logical formulas in
minimally parenthesised infix notation.
In the following table,
the row headings are for the operator on the left of the disputed operand,
and the column headings are for the operator on the right of the operand.
The table entries > indicate that the row operator, when on the left,
has precedence over the column operator, when on the right.
The table entries < indicate the reverse.
.
The notion of a symbol having precedence over another now has to
be extended to cover these other symbols.
It so turns out that apart from < and >
a third entry, =, is needed.
Finally, for illegal combinations,
the entries can be error codes.
A minor problem arises about v being used as an infix connective
and as an atomic formula.
In the case of recursive descent parsing there was no problem,
even a formula such as v v v v v is correctly recognised
as the three-fold disjunction of v with itself.
However, for operator precedence parsing this will not work,
and henceforth we shall use # as the disjunction symbol.
That larger table can then be used to drive an algorithm which
is a descendant of the one outlined at the end of the last section.
Instead of pushing symbols from the input onto the stack
and then comparing them with what is below,
the comparison is done between the current input symbol
and the top element.
Instead of reducing only when the full right hand side
of a production of the grammar has been seen,
partial reductions take place.
Instead of reducing non-terminals (such as F in the example),
the only symbols on the stack are terminals.
The net effect of the three changes is a very simple and very efficient
bottom up parsing method:
LET part of the loop.
Alternatively, the algorithm can be left quite general
and the LET part implemented as a lookup of a two dimensional array.
In this way it is this array or table which specifies the language;
such table driven parsing is a popular implementation technique
for bottom up parsers.
No matter which implementation is chosen,
the hard part is to specify the precedence relations between
symbols other than the infix operators.
A very minor addition to the program turns it into an infix
to postfix translator.
The following is the record of an interaction with the program.
The program does not echo its input,
which is ideal for interactive use.
After each prompt ready,
it reads a formula typed by the user.
If the formula is preceded by ?,
then the program goes into tracing mode.
Note that the first three formulas are being parsed and translated
in tracing mode,
where the stack is now written horizontally.
For the remaining formulas tracing is switched off.
The last formulas contain intentional errors.
The following is the Pascal source program
for the operator precedence parser/translator
based on precedence relations.
Note that the VALUE declaration is not standard Pascal,
it initialises the (two dimensional) ARRAY of precedence relations.
Note that it this ARRAY which specifies the syntax of the input language,
the remainder of the program is quite general.
(The VALUE declaration could be replaced by some other
initialisation mechanism;
one possibility is 64 assignment statements,
another is 8 calls of a procedure with 8 parameters,
another is reading the matrix from a file.)
This section describes a variant of the algorithm which replaces the two dimensional (and hence potentially large) array of precedence relations by two one-dimensional arrays.
The precedence relations between binary operators need not be
transitive.
It is quite possible to have three infix operators O1, O2 and O3
such that O1 on the left has precedence over O2 on the right,
that O2 on the left has precedence over O3 on the right,
and yet O3 on the left has precedence over O1 on the right.
But for most purposes this possibility is not required at all,
indeed users would probably find it confusing.
If the precedence relation can be made transitive and even linear,
then it becomes possible to assign
to each symbol a numeric precedence strength.
More precisely, for each symbol s two numbers are needed,
conventionally called f(s) for the precedence on the left,
and g(s) for the precedence on the right.
The algorithm now becomes:
0 and 1
into the postfix constants T and F.)
The following is a record of an interaction with the program.
Like the preceding program,
this one does not echo its input.
After each prompt ready,
it reads a formula typed by the user.
If the formula is preceded by ?,
then the program goes into tracing mode.
The following is the standard Pascal source program
for the parser/translator based on precedence functions.
Note that the input language and the translation
to the output language are specified by the four string constants
in the CONST declaration.
The remainder of the program is completely general.
Ad hoc error detection:
The precedence function algorithm cannot detect errors,
because for any symbol on top of the stack
and any input symbol,
one of the three relations >, =, > must hold
for their f- and g-values.
Hence there is no way of encoding error conditions,
as was done with the matrix of precedence relations.
For example, the algorithm will happily shift two adjacent
atoms on the stack.
Find a way of adding extra ad hoc code
to the algorithm so that errors can be handled correctly.
You might experiment with a further function, say h.
Parsing without tables: Both programs are general, they are driven by tables which specify the input language and the translation. It would be possible to write both programs without the tables. Essentially this amounts to building the information which is now in the tables into the programs. Rewrite either the precedence relations program or the precedence functions program in this way.
Changing the tables:
Experiment with changing the table in the VALUE part of the
precedence relations program,
or the tables in the CONST part of the precedence functions program.
Can you change them so that
the input language is in fully parenthesised infix
or in prefix? If not, why not?
Evaluator:
For either the precedence relations or the precedence functions program,
modify the algorithm and then the program and the tables
so as to make an evaluator for formulas without variables
but only with constants 0 and 1.
Write a parser which uses the method described towards the end of the first section of this chapter. Your program could be, but need not be, driven by some kind of data structure playing the role of the tables used here. Augment your parser so that it becomes a translator.
Non-recursive truth table program: Our truth table program had essentially three tasks to perform: 1) to translate from infix to postfix, 2) to generate all combinations of truth values, and 3) to evaluate the postfix. Our program used recursion for 1) and 2), but used an explicit stack for 3). Now this chapter has shown how to do 1) without recursion. Can you think of a non-recursive way of doing 2)? This would give you a truth table program which does not use recursion at all. It should even be possible to write the entire program without any procedures; but many people would argue that such a style is unclear.
Reading: For a different exposition of operator precedences and a translator from arithmetic infix to postfix which also uses a two dimensional table but with slightly different stack operations, see Collins (1986, pp 117 - 145). For an operator precedence program which translates from infix to postfix but which is not table driven, see McCracken (1987, pp 111 - 120). For a detailed theoretically oriented exposition of operator precedence parsing, see Aho and Ullman (1977, Section 5.3). A short but usable exposition of LR parsing is given in Capon and Jinks (1988, p 95). A more detailed exposition is in Sudkamp (1988, Chapters 15 and 16). For a very comprehensive treatment of LR-parsing, see Chapman (1987).