Operator precedence parsing

In previous chapters we have seen several parsers, evaluators and translators. All used recursive descent, one of the top down methods. This chapter introduces one very simple form of bottom up parsing and translating.

Bottom up parsing

Consider the following fragment of a Pascal program:

IF x = 0 THEN y := 1 You can see immediately that this is a statement, and not a program or an expression or a factor or anything else. A recursive descent compiler for Pascal will recognise it as a statement, but only if the compiler is currently expecting a statement. If it is currently expecting a program or a type or an expression or something else, then it will respond with error messages. Note that you were able to recognise the fragment because you had no expectations. This is an essential difference between a top down parsing method like recursive descent, and the bottom up method you were using in your head. In this chapter we shall study bottom up methods applied to parsing and translating. In all of these the parser does not start off with definite expectations but reads symbols as they come along and makes sense of them.

The basic idea is that just as expressions can be evaluated by several passes which evaluate subexpressions, so sequences of symbols can be parsed by several passes which parse subexpressions. First, consider the following equivalent formulas in prefix, in fully parenthesised infix, and in postfix notation: All three are to be evaluated, by rewriting leftmost subformulas by their values when possible; note that time flows from top to bottom:

prefix: infix: postfix: > - = 0 - & 1 1 0 (-(0 = -(1 & 1)) > 0) 0 1 1 & - = - 0 > > - = 0 - 1 0 (-(0 = - 1 ) > 0) 0 1 - = - 0 > > - = 0 0 0 (-(0 = 0 ) > 0) 0 0 = - 0 > > - 1 0 (- 1 > 0) 1 - 0 > > 0 0 (0 > 0) 0 0 > 1 1 1 If we ignore the difference in meaning between the various operands and operators and merely consider syntax, then we could rewrite all well-formed subformulas by F to indicate that they are formulas, like this: prefix: infix: postfix: > - = 0 - & 1 1 0 (-(0 = -(1 & 1)) > 0) 0 1 1 & - = - 0 > > - = F - & 1 1 0 (-(F = -(1 & 1)) > 0) F 1 1 & - = - 0 > > - = F - & F 1 0 (-(F = -(F & 1)) > 0) F F 1 & - = - 0 > > - = F - & F F 0 (-(F = -(F & F)) > 0) F F F & - = - 0 > > - = F - F 0 (-(F = - F ) > 0) F F - = - 0 > > - = F F 0 (-(F = F ) > 0) F F = - 0 > > - F 0 (- F > 0) F - 0 > > F 0 (F > 0) F 0 > > F F (F > F) F F > F F F

As described, the method would require us to always start at the beginning of the symbol sequence and perform the first replacement, just as we did for evaluation by replacement. But from our truth table program you will remember that a very good method for evaluating uses an auxiliary stack which contains whatever needs to be remembered about the values computed so far. We could even imagine a method which is less efficient than the one we used there, but which is interesting; as we pass through the postfix formula, we push all symbols onto the stack, operands and operators. If at any time the top few elements of the stack consist of a few operands topped by an operator, we interrupt the pushing and replace the operator and its operands by the result value. For comparison, here are the two evaluation methods; note that time flows from left to right:

efficient method: less efficient method: 0 1 1 & - = - 0 > 0 1 1 & - = - 0 > & 1 1 1 - = > 1 1 1 0 0 1 1 1 1 1 0 0 - 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 1 For either method, if instead of 0 and 1 on the stack we always had F (for formula), we would have a parsing method which avoids having to restart at the beginning of the formula every time. This is how it will look: efficient method: less efficient method: 0 1 1 & - = - 0 > 0 1 1 & - = - 0 > & F F F - = > F F F F F F F F F F F F - F F F F F F F F F F F F F F F F F F F F F F F F F

The less efficient method lends itself to becoming a very general parsing method which can be used for the other notations, too. Here it is used for fully parenthesised infix notation:

( - ( 0 = - ( 1 & 1 ) ) > 0 ) ) F F & & & F F F F ( ( ( ( ( F ) - - - - - - - F F = = = = = = = = = = ) F F F F F F F F F F F F F ( ( ( ( ( ( ( ( ( ( ( ( F > > > - - - - - - - - - - - - - - F F F F ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( F 1 2 3 4 5 Note that most steps push the current input symbol onto the stack, but there are five reduction steps, marked 1..5. Here is a brief formulation of the algorithm: start with an empty stack REPEAT IF the top few symbols on the stack can be replaced by a single one in accordance with the grammar THEN do the replacement ELSE push the next symbol onto the stack UNTIL the end of the string has been reached At the end of this, the stack should contain a single non-terminal of the grammar, and this is what has been recognised. As described, the algorithm is not very efficient, because for example in the case of fully parenthesised infix formulas up to five symbols on the stack have to be examined in the IF part. To improve the algorithm, we can make partial reductions and also make the applicability of reductions depend on the next input symbol. This idea is used in most bottom up parsers, including the two very specialised ones to follow.

Precedence relations

In the remainder of this chapter we shall look at two related bottom up parsing methods which are very simple and efficient, but are applicable only to a rather narrow class of languages.

The algorithm

Consider arithmetic or logical expressions in minimally parenthesised notation as written in the first line and their fully parenthesised form in the second line.

x + y * z p v q & r x + (y * z) p v (q & r) The usual convention is that * binds more strongly than +, and that & binds more strongly than v. Hence y is an operand to * and not to +, and q is an operand to & and not to v. It helps to think of the operators on the left and right of y and q tugging at it, and * and & win. The technical notion is that * has precedence over +, and that & has precedence over v. For binary operators that are semantically associative it does not matter whether they have precedence over themselves, but for others it is important to distinguish those that are syntactically left-associative (such as subtraction and implication), and those that are syntactically right-associative (such as exponentiation in arithmetic).

We shall now design a parser for logical formulas in minimally parenthesised infix notation. In the following table, the row headings are for the operator on the left of the disputed operand, and the column headings are for the operator on the right of the operand. The table entries > indicate that the row operator, when on the left, has precedence over the column operator, when on the right. The table entries < indicate the reverse.

"&" "v" ">" +---------------- "&" | > > > "v" | < > > ">" | < < < This table is part of a larger table which has rows and columns for symbols that are not binary operators: atomic formulas, negation, the two parentheses, and a terminator, say the period . The notion of a symbol having precedence over another now has to be extended to cover these other symbols. It so turns out that apart from < and > a third entry, =, is needed. Finally, for illegal combinations, the entries can be error codes.

A minor problem arises about v being used as an infix connective and as an atomic formula. In the case of recursive descent parsing there was no problem, even a formula such as v v v v v is correctly recognised as the three-fold disjunction of v with itself. However, for operator precedence parsing this will not work, and henceforth we shall use # as the disjunction symbol.

That larger table can then be used to drive an algorithm which is a descendant of the one outlined at the end of the last section. Instead of pushing symbols from the input onto the stack and then comparing them with what is below, the comparison is done between the current input symbol and the top element. Instead of reducing only when the full right hand side of a production of the grammar has been seen, partial reductions take place. Instead of reducing non-terminals (such as F in the example), the only symbols on the stack are terminals. The net effect of the three changes is a very simple and very efficient bottom up parsing method:

Push a marker onto an otherwise empty stack WHILE stack contains more than one symbol OR the current input symbol is not the terminator DO LET R be the precedence relation between the symbol on top of the stack and the next input symbol IF R is < OR R is = THEN shift the next input symbol onto the stack ELSE IF R is > THEN REPEAT pop the top element off the stack UNTIL the precedence relation between the symbol on top of the stack and the most recently popped symbol is < ELSE abort with error R With the information given so far, it should not be too difficult to write the program. The precedence relations can be built into the algorithm at the LET part of the loop. Alternatively, the algorithm can be left quite general and the LET part implemented as a lookup of a two dimensional array. In this way it is this array or table which specifies the language; such table driven parsing is a popular implementation technique for bottom up parsers. No matter which implementation is chosen, the hard part is to specify the precedence relations between symbols other than the infix operators. A very minor addition to the program turns it into an infix to postfix translator.

The following is the record of an interaction with the program. The program does not echo its input, which is ideal for interactive use. After each prompt ready, it reads a formula typed by the user. If the formula is preceded by ?, then the program goes into tracing mode. Note that the first three formulas are being parsed and translated in tracing mode, where the stack is now written horizontally. For the remaining formulas tracing is switched off. The last formulas contain intentional errors.

$ RUN 15OPPRER.EXE ready ? a & b. stack infix-char postfix-char a a & a & & b &b . b & . & POSTFIX CODE : ab& ready ? (a & b) # (c & d). stack infix-char postfix-char ( ( a (a & a ( & (& b (&b ) b (& ) & ( ) () # # # ( #( c #(c & c #( & #(& d #(&d ) d #(& ) & #( ) #() . # . # POSTFIX CODE : ab&cd&# ready -a & -b # -(c > d) > e > f. POSTFIX CODE : a-b-&cd>-#ef>> ready a&b&c&d&e&f&g&h&i&j. POSTFIX CODE : ab&c&d&e&f&g&h&i&j& ready a>b>c>d>e>f>g>h>i>j. POSTFIX CODE : abcdefghij>>>>>>>>> ready ((a=b) # (c>d)) & -(e=f). POSTFIX CODE : ab=cd>#ef=-& ready ? a & b b. stack infix-char postfix-char a a & a & & b &b b error 0 ready (((((((a&-b)). error 2 ready .

The program

The following is the Pascal source program for the operator precedence parser/translator based on precedence relations. Note that the VALUE declaration is not standard Pascal, it initialises the (two dimensional) ARRAY of precedence relations. Note that it this ARRAY which specifies the syntax of the input language, the remainder of the program is quite general. (The VALUE declaration could be replaced by some other initialisation mechanism; one possibility is 64 assignment statements, another is 8 calls of a procedure with 8 parameters, another is reading the matrix from a file.)

PROGRAM opprer(input,output); (* OPerator PREcedence Relations *) LABEL 1; CONST marker = '.'; maxstack = 100; maxcode = 300; s_header = 'stack '; i_header = 'infix-char '; p_header = 'postfix-char '; big_tab = ' '; TYPE symbol = (atomsy,notsy,andsy,orsy,implsy,lpar,rpar,period); VAR p : ARRAY[symbol,symbol] OF char; (* relations *) s : ARRAY[0..maxstack] OF char; (* parsing stack *) t : integer; (* top of stack *) sy : ARRAY[char] OF symbol; ch : char; tracing,finished : boolean; i : integer; code : ARRAY[1..maxcode] OF char; cx : integer; VALUE p := ( (* atm not and or imp lp rp per *) (* atm *) ('0','>','>','>','>','1','>','>'), (* not *) ('<','<','>','>','>','<','>','>'), (* and *) ('<','<','>','>','>','<','>','>'), (* or *) ('<','<','<','>','>','<','>','>'), (* imp *) ('<','<','<','<','<','<','>','>'), (* lp *) ('<','<','<','<','<','<','=','2'), (* rp *) ('3','>','>','>','>','4','>','>'), (* per *) ('<','<','<','<','<','<','5','6') ); PROCEDURE getch; BEGIN REPEAT read(ch) UNTIL ch > ' ' END; (* getch *) PROCEDURE gen(c:char); BEGIN (* gen *) IF NOT (c IN ['(',')']) THEN BEGIN IF tracing THEN writeln(big_tab,big_tab,c); cx := cx + 1; code[cx] := c END END; (* gen *) BEGIN (* main *) (* initialise table SY *) FOR ch := chr(0) TO chr(255) DO IF ch IN ['A'..'Z','a'..'z','0','1'] THEN sy[ch] := atomsy ELSE CASE ch OF '-' : sy[ch] := notsy; '&' : sy[ch] := andsy; '#' : sy[ch] := orsy; '>','=' : sy[ch] := implsy; '(' : sy[ch] := lpar; ')' : sy[ch] := rpar; '.' : sy[ch] := period; OTHERWISE sy[ch] := period END; (* CASE *) 1: finished := false; REPEAT writeln('ready'); tracing := false; cx := 0; getch; IF ch = '.' THEN finished := true ELSE BEGIN IF ch = '?' THEN BEGIN tracing := true; getch END; (* BEGIN operator precedence parsing algorithm *) IF tracing THEN writeln(s_header,i_header,p_header); s[0] := marker; t := 0; WHILE (t > 0) OR (ch &lt;> marker) DO BEGIN IF tracing THEN BEGIN FOR i := 1 TO t DO write(s[i]); FOR i := t+1 TO 15 DO write(' '); writeln(' ',ch) END; (* IF *) CASE p[sy[s[t]],sy[ch]] OF '<', '=' : BEGIN (* shift *) t := t+1; s[t] := ch; getch END; '>' : BEGIN (* reduce *) REPEAT gen(s[t]); t := t-1 UNTIL p[sy[s[t]],sy[s[t+1]]] = '<' END; OTHERWISE BEGIN writeln('error ',p[sy[s[t]],sy[ch]]); readln; GOTO 1 END END (* CASE *) END; (* WHILE *) (* END operator precedence parsing algorithm *) write('POSTFIX CODE : '); FOR i := 1 TO cx DO write(code[i]); writeln END (* ELSE *) UNTIL finished END. (* main *)

Precedence functions

This section describes a variant of the algorithm which replaces the two dimensional (and hence potentially large) array of precedence relations by two one-dimensional arrays.

The algorithm

The precedence relations between binary operators need not be transitive. It is quite possible to have three infix operators O1, O2 and O3 such that O1 on the left has precedence over O2 on the right, that O2 on the left has precedence over O3 on the right, and yet O3 on the left has precedence over O1 on the right. But for most purposes this possibility is not required at all, indeed users would probably find it confusing. If the precedence relation can be made transitive and even linear, then it becomes possible to assign to each symbol a numeric precedence strength. More precisely, for each symbol s two numbers are needed, conventionally called f(s) for the precedence on the left, and g(s) for the precedence on the right. The algorithm now becomes:

Push a marker onto an otherwise empty stack WHILE stack contains more than one symbol OR the current input symbol is not the terminator DO IF f(the symbol on top of the stack) &lt;= g(the next input symbol) THEN shift the next input symbol onto the stack ELSE REPEAT pop the top element off the stack UNTIL f(the symbol on top of the stack) < g(the symbol most recently popped) The most difficult part is to find, for each input symbol, numeric values for the f and g functions. When suitable values have been found, it is easy enough to put them into two arrays, say f and g. Thus, just as the precedence relations program can be driven by a two dimensional array of relations, so the precedence functions program can be driven by two numeric arrays f and g. There is no way to build error detection into the algorithm, see Aho and Ullman (1977, p 171), except by ad hoc methods. But it is again an easy matter to extend the program just slightly so that it becomes a translator into postfix. It is not essential that the postfix version uses exactly the same symbols, it is easy enough translate into a different symbolism. (The sample program translates the infix constants 0 and 1 into the postfix constants T and F.)

The following is a record of an interaction with the program. Like the preceding program, this one does not echo its input. After each prompt ready, it reads a formula typed by the user. If the formula is preceded by ?, then the program goes into tracing mode.

$ SET VERIFY $ RUN 15OPPREF.EXE ready ? a & b. stack infix-char postfix-char a a & a & & b &b . b & . & POSTFIX CODE : ab& ready ? (a & b) # (c & d). stack infix-char postfix-char ( ( a (a & a ( & (& b (&b ) b (& ) & ( ) () # # # ( #( c #(c & c #( & #(& d #(&d ) d #(& ) & #( ) #() . # . # POSTFIX CODE : ab&cd&# ready -a & -b # -(c > d) > e > f. POSTFIX CODE : a-b-&cd>-#ef>> ready a&b&c&d&e&f&g&h&i&j. POSTFIX CODE : ab&c&d&e&f&g&h&i&j& ready a>b>c>d>e>f>g>h>i>j. POSTFIX CODE : abcdefghij>>>>>>>>> ready ((a=b) # (c>d)) & -(e=f). POSTFIX CODE : ab=cd>#ef=-& ready (0 # 1) & (--1 > 0) = 1 # 0 & 1. POSTFIX CODE : FT#T--F>&TFT&#= ready .

The program

The following is the standard Pascal source program for the parser/translator based on precedence functions. Note that the input language and the translation to the output language are specified by the four string constants in the CONST declaration. The remainder of the program is completely general.

PROGRAM oppref(input,output); (* OPerator PREcedence Functions *) (* CANNOT HANDLE ERRORS - SEE Aho and Ullman (1977, p 171) *) CONST numterminals = 36; i0 = '=>#&-()abcdefghijklmnopqrstuvwxyz01.'; o0 = '=>#&- abcdefghijklmnopqrstuvwxyzFT.'; f0 = '114660666666666666666666666666666660'; g0 = '223577077777777777777777777777777770'; marker = '.'; maxstack = 100; maxcode = 300; s_header = 'stack '; i_header = 'infix-char '; p_header = 'postfix-char '; big_tab = ' '; VAR o : ARRAY[char] OF char; (* outputs *) f,g : ARRAY[char] OF integer; (* precedences *) s : ARRAY[0..maxstack] OF char; (* parsing stack *) t : integer; (* top of stack *) ch : char; code : ARRAY[1..maxcode] OF char; cx : integer; tracing,finished : boolean; i : integer; PROCEDURE initialise; VAR i : integer; i1,o1,f1,g1 : PACKED ARRAY [1..numterminals] OF char; BEGIN i1 := i0; o1 := o0; f1 := f0; g1 := g0; (* this nonsense was necessary because silly PASCAL does not allow indexed access into constant strings *) FOR ch := chr(0) TO chr(255) DO BEGIN f[ch] := -1; g[ch] := -1 END; FOR i := 1 TO numterminals DO BEGIN f[i1[i]] := ord(f1[i]) - ord('0'); g[i1[i]] := ord(g1[i]) - ord('0'); o[i1[i]] := o1[i] END; END; (* initialise *) PROCEDURE getch; BEGIN REPEAT read(ch) UNTIL ch > ' ' END; (* getch *) PROCEDURE putch; VAR c : char; BEGIN c := o[s[t]]; IF c &lt;> ' ' THEN BEGIN cx := cx + 1; code[cx] := c; IF tracing THEN writeln(big_tab,big_tab,c); END END; (* putch *) BEGIN (* main *) initialise; finished := false; REPEAT writeln('ready'); tracing := false; getch; IF ch = '.' THEN finished := true ELSE BEGIN IF ch = '?' THEN BEGIN tracing := true; getch END; IF tracing THEN writeln(s_header,i_header,p_header); (* BEGIN precedence function algorithm *) s[0] := marker; t := 0; cx := 0; WHILE (s[t] &lt;> marker) OR (ch &lt;> marker) DO BEGIN IF tracing THEN BEGIN FOR i := 1 TO t DO write(s[i]); FOR i := t+1 TO 15 DO write(' '); writeln(' ',ch) END; IF f[s[t]] &lt;= g[ch] THEN BEGIN (* shift *) t := t+1; s[t] := ch; getch END ELSE REPEAT (* reduce *) putch; t := t-1 UNTIL f[s[t]] < g[s[t+1]] END; (* WHILE *) write('POSTFIX CODE : '); FOR i := 1 TO cx DO write(code[i]); writeln (* END precedence function algorithm *) END (* ELSE *) UNTIL finished; END. (* main *)

Exercises and reading

Ad hoc error detection: The precedence function algorithm cannot detect errors, because for any symbol on top of the stack and any input symbol, one of the three relations >, =, > must hold for their f- and g-values. Hence there is no way of encoding error conditions, as was done with the matrix of precedence relations. For example, the algorithm will happily shift two adjacent atoms on the stack. Find a way of adding extra ad hoc code to the algorithm so that errors can be handled correctly. You might experiment with a further function, say h.

Parsing without tables: Both programs are general, they are driven by tables which specify the input language and the translation. It would be possible to write both programs without the tables. Essentially this amounts to building the information which is now in the tables into the programs. Rewrite either the precedence relations program or the precedence functions program in this way.

Changing the tables: Experiment with changing the table in the VALUE part of the precedence relations program, or the tables in the CONST part of the precedence functions program. Can you change them so that the input language is in fully parenthesised infix or in prefix? If not, why not?

Evaluator: For either the precedence relations or the precedence functions program, modify the algorithm and then the program and the tables so as to make an evaluator for formulas without variables but only with constants 0 and 1.

Write a parser which uses the method described towards the end of the first section of this chapter. Your program could be, but need not be, driven by some kind of data structure playing the role of the tables used here. Augment your parser so that it becomes a translator.

Non-recursive truth table program: Our truth table program had essentially three tasks to perform: 1) to translate from infix to postfix, 2) to generate all combinations of truth values, and 3) to evaluate the postfix. Our program used recursion for 1) and 2), but used an explicit stack for 3). Now this chapter has shown how to do 1) without recursion. Can you think of a non-recursive way of doing 2)? This would give you a truth table program which does not use recursion at all. It should even be possible to write the entire program without any procedures; but many people would argue that such a style is unclear.

Reading: For a different exposition of operator precedences and a translator from arithmetic infix to postfix which also uses a two dimensional table but with slightly different stack operations, see Collins (1986, pp 117 - 145). For an operator precedence program which translates from infix to postfix but which is not table driven, see McCracken (1987, pp 111 - 120). For a detailed theoretically oriented exposition of operator precedence parsing, see Aho and Ullman (1977, Section 5.3). A short but usable exposition of LR parsing is given in Capon and Jinks (1988, p 95). A more detailed exposition is in Sudkamp (1988, Chapters 15 and 16). For a very comprehensive treatment of LR-parsing, see Chapman (1987).