In previous chapters we have always designed top down, in this chapter we shall design at least in part bottom up. We are given a machine with just three instructions, callv>, return and one other having to do with matching an input character with one of a set of allowed characters. Our task is to design a high level front end for this machine. The result is a fairly general parser which reads a grammar to be translated into machine code. The machine then reads strings and determines whether they are in the language defined by the grammar.
One aim of this chapter is to make the notion of procedure call and return explicit. We have used procedures in all our programs, and we have implemented languages in which recursion was possible. But in almost all our implementations we have always relied on the recursion facility offered by our implementation language. The one exception was the predictive parser used in the semantic tableaux program in Chapter~10. In this chapter we shall see how recursion really works, by studying a system in which there is call and return and almost nothing else. Of course there has to be something else, otherwise there would be no point at all to the calls and returns. The third instruction will be for matching characters which are being read from a file. If the currently visible character in the file is one of a set of allowed characters, then execution continues at one place, otherwise at another place.
In detail, the instructions consist of up to four fields: an operation, two addresses and a set of characters. Only the matching instruction uses all four fields.
The input string from the file is accepted if the program counter has a special accept value at termination and the end of the input string has been reached. So, unlike the parser of Chapter~11, acceptance requires the whole input string to be well-formed.
Our task is to design a high-level BNF-like language and to write a compiler from this high level language to the machine language of three instructions. The general parser of Chapter~11 was able to handle any context free grammar. The one to be developed in this chapter is less general, it can only handle deterministic grammars in which there is no need to backtrack. This restriction results from the inability of the parsing machine to go back to a previous position in the input file. The machine has another potential shortcoming: there is no way for a call to return and to signal to the callee whether it was successful. For the grammar this has the consequence that in a group of alternatives all but possibly the last has to start with a terminal. In the version to be developed here, such terminals even have to be explicitly given at the point of choice. (See one of the exercises for the design of a less restrictive version.)
What follows is a record of two runs of the program LL1GEN.
In the first run a grammar for logical expressions is given;
for variety negation is not handled by recursion.
The first version of that grammar contains an intentional
context sensitive error which is promptly reported.
The second version of the same grammar is acceptable.
It is followed by a listing of the internal code
that has been generated.
Note that some match instructions have an empty
characterset, and since no current input character can match,
these instructions are effectively GOTOs.
Then follow some input strings to be parsed,
first some acceptable input strings and then some
unacceptable input strings.
For the first acceptable string and for the first
unacceptable string the tracing facility is turned on.
The tracing shows the current character from the input file,
the top of the stack indicating the depth of the calls,
the value of the program counters
and then the relevant fields of the current instruction:
the operation field
and the remaining three fields as required.
In the second run a grammar for a simple imperative language is given. Note that statements are optional, this allows semicolons to be followed by an empty statement --- and hence semicolons can be used as statement terminators instead of separators. Since the internal code for this grammar is quite large, it is not shown. Then follow three programs, two with an error and one without. Even though the programs are quite short, the trace of their parses would be quite long, so it is not shown.
The implementation owes a great deal to Wirth's (1976) general parser (pp 304 - 307) and to a later variant (Wirth 1977). The parser presented here differs from Wirth's in the following respects: As seen by the user, Wirth's has single letters as non-terminals for the grammars, whereas the one here has multi-letter terminals. Both are for languages in which the symbols are single characters, but Wirth's uses single letter terminals in the grammar, whereas the one here allows sets of single letter terminals in the grammar. Internally, not seen by the user, Wirth's parsing machine is recursive, the one here is not. This one can also write the internal representation of the grammar because it uses an array rather than pointers, and it can trace execution.
The program consists of a number of procedures for reading the grammar and generating the internal code, and one procedure which implements the parsing machine described at the beginning. The main program is very simple: it calls a procedure to read a grammar, and then it repeatedly prompts for input strings to be parsed by the machine.
The scanner:
The individual symbols to be recognised by the scanner
are almost the same as the ones that were required for the
general parser for context-free languages in Chapter~11.
They are identifiers for the non-terminals of the grammar,
together with a few single character symbols.
Note again that the non-terminal identifiers are looked up
and, if they are new, entered into the symbol table.
In addition this scanner has to recognise sets of terminal characters
listed within double quotes.
So, when the opening double quote is seen,
any following characters are collected into a global variable
of type SET OF char, up to the closing double quote.
That global variable needs to be available
to procedure factor for the case when the current symbol
is a set of terminals.
Reading the Grammar: The syntax of the input grammar is so similar to that in Chapter~11 that only a brief description of the parsing procedures is necessary. The productions for the non-terminals factor, term and expression are recursive, so they have to be handled by procedures of the same name. It would be possible to handle productions and the grammar entirely in the main program, as it was done in Chapter~11. However, several special purpose variables are required which are best left hidden inside procedures which handle productions and the grammar. Visibility requirements are satisfied by the following nesting pattern:
Procedure grammar has to initialise the symbol table and then read
one or more productions.
Following that it has to check that all non-terminals are indeed
defined.
Procedure production expects a leading identifier,
it is a good idea to let it check that the identifier has not
been declared before --- even if it might have been used before.
One way to do the check
is to see that in the symbol table no code for it has been recorded.
Following the identifier, an equal sign
= is expected and then an expression.
Procedures expression, term and factor
present no difficulties.
In all our previous programs code generation was a relatively simple matter
because the conceptual distance between the external source language
and the internal target language was so small.
This is no longer true here,
because so many constructions of the source language
have no counterpart in the target language.
In particular,
the two binary operations of alternation and concatenation,
and the two unary operations of repetition and option
have no counterparts.
Instead they will have to be implemented
by explicit GOTOs in the instruction for matching sets of terminals.
The situation is similar for translating high level imperative
languages containing IF and WHILE statements
into low level machine languages that only have GOTOs.
The IF statement requires a GOTO which will cause execution to skip over
some code in case the condition is false,
and the WHILE statement requires an additional backwards GOTO
to create a loop.
This is not particularly difficult,
since there is a fairly simple correspondence
between the GOTOs and the required target addresses.
We shall encounter an example in Chapter~14.
In the present case, however, the calculation of the target addresses is not so simple. It is best to conceive of code generation to consist of two steps: 1) generating the op-codes and 2) calculating the addresses.
Generating op-codes:
If addresses are ignored for the time being,
generating op-codes is the easiest part.
At the end of every production a return instruction
has to be generated,
and since this does not need an address,
it need not be changed or fixed up later.
The other three instructions
are generated inside procedure factor.
The case for an identifier,
representing a non-terminal,
requires a call instruction to be generated,
together with the position of the identifier in the symbol table.
The case of a set of terminal symbols
requires a match instruction to be generated,
with the two address fields left unspecified.
For the repetition and option cases, enclosed in brackets or braces,
an odd match instruction is generated ---
one in which the set of characters is empty,
and which hence will never match.
Hence such instructions are in effect unconditional jump instructions.
The purpose of this strange instruction is to become
the target address of some of the implicit GOTOs
that will be generated by the expression enclosed in the brackets or braces.
Calculating addresses: The method employed here borrows heavily from Wirth (1976, pp 302 - 307) and a later version in Wirth (1977). When the entire grammar has been read and the internal code has been generated, a single pass through the entire code is made. Its purpose is to change the second address of the strange match instructions to the first address which by now contains the next success address to be continued with. (Note that Wirth's general parser manages without this oddity.)
Remember, when a call instruction is initially generated, the non-terminal being called may not yet have been defined; so at most the address in the symbol table can be recorded in the instruction. In the general parsing program of Chapter~11 the fix-up pass replaces the address in the symbol table recorded in the instruction by the start address of the code for the non-terminal which is now known. We could do the same here, to make the interpreter slightly more efficient. But to do so would interfere with one of the exercises.
The interpreter is not recursive, so it could easily be made part of the main program. However, it contains a few variables only used here, so for modularity it is better to have it as a separate procedure.
The necessary variables are an array of integers serving as the stack
of return addresses,
and two integers for the top of stack and for the program counter.
It is also useful to have a procedure which reads characters
and skips non-printing characters.
For tracing, if the first character of the input string is a ?
then the header line has to be written out.
The machine is then initialised by setting the program counter to 1 and by pushing a dummy return address -1 onto an otherwise empty stack. The purpose of this dummy is to signal successful termination, provided the end of the input string has been reached. Otherwise failure is reported, together with the last character seen.
The principal REPEAT loop of the machine
(in the middle of procedure parse) has already been described
at the beginning.
When tracing is on,
every step through the REPEAT loop
has to write out the current input character,
the top of the stack,
the program counter
and the relevant fields of the current instruction.
The following is the standard Pascal source program for LL1GEN:
Another grammar:
The grammar in the second example used LOGICAL
instead of BOOLEAN
because the B of BOOLEAN would
interfere with the B of BEGIN.
Also, it used PRINT instead of WRITE because the W of WRITE
would interfere with the W of WHILE.
Can you rewrite the grammar using BOOLEAN and WRITE?
Manual: Write a user manual. You need not mention anything strictly internal to the program, and you need not mention the tracing facility either.
More generality:
The scanner cannot handle the double quote character
as a terminal because it uses it as a delimiter.
Can you find a way of allowing double quotes
as terminals, preferably as a member of a set of terminals?
Similarly for . which terminates the input string,
and the ? which switches on tracing if it precedes the input string.
Post Mortem: Some programming language implementations give a dump of the run-time stack when an error has occurred. The dump lists the procedure and function calls that were accumulated on the stack. A similar facility would be useful for our parsing machine. To implement this, the call instruction should push onto the stack not the address of the next instruction to be executed, but its own address. This is the reason why the call instruction has been given as its second address field not the code address as fixed up after the entire grammar has been read but the original symbol table address. Then, for the post mortem the stack will contain a sequence of addresses into the code left there by executing call instructions. >From each address one can determine the code record to which it points, and the second address field of that record is an address into the symbol table where the name of the non-terminal is to be found. Clearly the return instruction will have to be changed so that it sets the program counter not to the code address popped from the stack but to the first address field in the code at the address popped. See also Setzer (1979) for some ideas.
Eliminating redundant instructions: The match instructions with an empty character set were really just unconditional jumps. It is possible to make a further pass through the code, and if any address in a proper match instruction points to such a dummy match instruction, it could be made to point to what that dummy points to. The replacement should be done with a loop to catch those cases in which a dummy points to a dummy and so on. Note that a fix up of this kind will save execution time, but it will not shorten the code because it does not eliminate the instructions. An optimisation which also shortens the code is much more difficult (optimising compilers do it routinely).
Cleaner code generation: There is something quite unsatisfactory about the way the code is generated and then fixed up. It may well be that it is better to generate tree code first and then produce an optimised linear code from that.
Reducing the number of instructions:
(A machine with just three kinds of instructions is rather minimal,
but there are machines with less.
The extreme is a machine with only one instruction
which always consists of two addresses,
but some special addresses are used for gaining diversity.)
Delete the return instruction:
if next pc = 0 then get it from the stack,
but if that is empty then exit.
Multi-character words:
Modify the translator and the interpreter
so that the program can handle symbols consisting
of several characters.
Then the second grammar could use
"BEGIN" instead of
"B" "E" "G" "I" "N",
"IF" instead of "I" "F",
and so forth.
But this redesign should not interfere with the facility
for sets of single characters, such as "abc".
So you will need another matching operation.
Note also that it will then be possible
to have WRITE rather than PRINT,
because the W will not interfere with WHILE.
But you will need a buffer in the parsing machine.
Status return: The parsing machine has no way for a call to return and to signal to the callee whether the call was successful. Consequently in the second example given, the set of lower case letters (as names of potential variables) had to be specified literally in four different places in the grammar. It would have been neater to be able to use a defined non-terminal here, but since variables are only one way of beginning statements or factors, the calls to that non-terminal have to return a status signal to the callee. Implement such an addition to the parsing machine.
Reading: Wirth's general parser is described in (1976, pp 295 - 307), and a later version is in his (1977, in German). It is implemented recursively and uses a VAR parameter for a status return. Setzer (1979) discusses a quite different method of using the original internal code from Wirth's parser to drive a non-recursive parsing machine.
A Translating Machine: This is an exercise in the design of a specialised pseudo machine language and a software interpreter that can be used for simple translations from and to single character languages. To keep it as simple as possible, restrict yourself to translation from prefix to fully parenthesised infix, or from fully parenthesised infix to postfix.
One way to do this is to follow the prefix to infix translator
of Chapter~2 by implementing instructions for each of the basic
steps which that translator needed.
About 10 or 12 instructions are needed:
1) write instructions for literal characters,
for a local variable and for messages,
2) read instructions for a local variable
and for skipping the remainder of the input line on error,
3) flow of control instructions such as an unconditional GOTO,
a call and a return instruction which also have
to take care of a local variable,
and a conditional GOTO which tests whether
the local variable is in a given set,
4) an instruction to clear the stack initially and on errors, and
5) an instruction to do character translations.
The same instruction set should be adequate for translating
from fully parenthesised infix to postfix.
However, for translation from minimally parenthesised infix
a few different instructions will be needed.
There are at least three ways in which the required program can be placed into the code array before it is executed. One way is to write the assembly program by hand and to use an (undoubtedly ugly) initialisation section to fill the required fields of the code. Another way is to write an assembler which takes a readable assembly language program from a file and places it into an array for execution. A third way is to elaborate the grammar translator written in this chapter by getting procedure factor to recognise special action symbols which are needed for the translations. Of course all of the above can also be done just for parsing without any translation.