This chapter aims to give more fluency in the use of block structure and mutual recursion, and more confidence in writing larger programs than those in Part One of this book. The program described here is a compiler interpreter for a very small imperative language. None of the concepts of the language are new, so the language is first described by a user manual.
The language TYPROC is a very small imperative language. It has assignment statements, I-O statements and structured flow of control statements including recursive calls of previously declared procedures. Variables are global, and they have to be declared as being of either type Boolean or type integer. Expressions are built from variables and literal constants by means of a small number of inbuilt operators. Strict type checking is maintained throughout. The language is not intended for serious use but as an illustration in language implementation and documentation.
The language is described by the following BNF productions for the context free syntax, and by the accompanying text for the context sensitive syntax and for the semantics.
BEGIN and END.
A statement sequence consisting of one or more statements
separated by semicolons is executed by
executing the statements one after another.
v = e,
where v is a variable that has been declared
and e is an expression of the same type.
Execution has the effect of evaluating e and
assigning its value to v.
Procedure calls are of the form P,
where P is a procedure that has been declared;
execution has the effect of executing the body of P.
Compound statements provide a means of executing
a sequence of statements in IF and WHILE statements.
In IF and WHILE statements the expression must be of type Boolean.
In an IF statement the THEN part is executed only if the expression
evaluates to true.
In a WHILE statement the DO part is executed zero or more times,
as long as the expression evaluates to true.
In READ statements the variable has to have been declared,
execution has the effect of reading from
the input file either an integer or a Boolean value
and assigning this to the variable.
WRITE statements will evaluate the expression,
which may be of either type,
and write its value to the output file.
LINE statements start a new line,
TAB statements write a tab.
- is overloaded: as an infix operator
it means subtraction, as a prefix operator it means negation.
A factor can also be one of the two Boolean constants,
giving result type Boolean.
Finally, a factor can be the ORD function applied to a factor
of any type, the result type is integer;
if the contained factor is of type integer,
then ORD is the identity function,
and if the contained factor is of type Boolean,
then ORD evaluates to 0 for false and to 1 for true.
Comments are written just inside curly braces like this:
{ this is a comment }.
The grammar as given does not explain this,
and indeed most documentations for languages
leave such a detail to the informal explanations.
Comments are generally treated like white space.
For the implementation of this language we shall distinguish syntax and semantics rather thoroughly.
In all our previous programs the basic symbols were single
printing characters.
In this language symbols consist of either
single letters for user declared variables and procedures,
or inbuilt single letter special characters,
or inbuilt multi capital letter words.
The procedure for reading symbols, called getsym,
has to be able to detect these multi character symbols.
After skipping any non-printing characters,
it has to examine the current character:
If it is a capital and the next character waiting in the input
is also a capital,
then the two capitals and any following capitals are collected
in a short string.
If the current character is an opening brace for a comment,
characters are skipped up to the closing brace
and the procedure starts again.
In all other cases the procedure does not read beyond
the current single character.
The language was designed to make this simple technique possible.
The procedure getsym is an unusually simple scanner
for a language with multi-character symbols.
Indeed, it does not have to know much about the language
to be parsed.
Other scanners that we shall see in later chapters
are much more complex, and some even make use of auxiliary tables.
But often this will have the advantage that the parser
does not need to know much about how symbols are written.
The simple scanner given here has to delegate most of the
recognition of symbols to the parser.
Consequently, much surface detail has to be spread
throughout the program.
For example, the maximum length of the string for
multi-character symbols is spread all over the parser.
Hence there cannot be a constant declaration
to set the maximum length (to 8 or whatever).
For the parsing procedures we now distinguish various steps in the design. If you are writing the program yourself, you are advised to maintain a discipline of steps similar to the ones used here. The steps continue in the next section on semantics.
Step 1:
Visibility requirements:
A detailed examination of the grammar shows that the productions are
already ordered in a convenient way:
each non-terminal needs access either to one defined earlier,
or to itself,
or to the one immediately following.
Hence all visibility requirements will be satisfied
if all the parsing procedures are nested successively
as in the grammar, programme outermost, factor innermost.
Step 2:
Bodies of parsing procedures:
Most of this should be routine by now,
except that a CASE statement can no longer be used
for the non-terminals which have alternatives.
Because some symbols are single characters
and some are short strings,
a cascade of IF-THEN-ELSE-IF-.. has to be used
in the parsing procedures for factor and for statement.
Wrong symbols produce a call to the by now familiar error handler.
Step 3:
Declarations:
Inside programme, whenever a variable is declared,
then this fact should be noted, and the type also needs recording.
In this simple language we can manage by entering declared variables
into an appropriate set ---
one for the integers, and one for the Booleans.
Whenever a procedure is declared,
it is entered into a set of procedures.
Then inside statement,
for the variable in an assignment statement
it has to be checked that it is in one of the two sets
of declared variables.
Also inside statement,
calls of procedures have to be checked.
Finally, inside factor any variables have to be checked.
Step 4:
Type checking:
Inside factor all variables, the numeric constants
and the Boolean constants have one of two types,
and these types have to be transmitted to the procedure
which called factor.
Also inside factor,
a check has to be made that the negand was of type Boolean,
and the result will be of type Boolean.
For ORD no type check is made,
but the result will be integer.
For terms, simple expressions and expressions
containing infix operators
a check has to be made that the operands are appropriate
for the infix operator.
For terms and simple expressions the result type
is the same as that of the operands,
for expressions the result type Boolean is returned.
The appropriate implementation mechanism is similar
to the one for returning values in the infix evaluator
of Chapter 3:
each of the above procedures needs a VAR parameter
whose possible values are of two types,
and each of the procedures for infix operators
need a similar local variable for the second operand.
A local variable is also needed inside statement
for calls of expression.
This occurs inside assignment statements
for which type agreement of the variable and the expression
has to be checked.
It also occurs in IF and WHILE statements for which the
expression must be Boolean.
In WRITE statements no type check is needed for the expression,
but the type of the expression will determine what code
is to be generated later.
Up to this point we have been concerned with syntax only.
Some earlier authors would have called steps 3 and 4
static semantics,
but this was based on the misconception that syntax has to be
context free.
Semantics assigns meanings to the symbols of the language
and to the more complex constructions.
A simple example is that & means AND.
Assignment of meaning was done directly in the infix evaluator
of Chapter 3, but here it has to be done indirectly.
Loops may result in their bodies being executed several times,
so for the same reason as for the truth table generator
of Chapter 5, the code has to be stored internally.
So the semantics requires an indirect assignment of meaning
to the language seen by the user,
by first specifying a translation scheme
and then assigning meaning to translations.
For example, & is translated internally to conj which means AND.
Step 5: Designing the internal code: Whereas postfix code as in the truth table generator is good for evaluating expressions, something else is needed for the execution of loops or of conditionals. Essentially we should like to say something like this:
F1 & F2 and WHILE E DO S,
as they occur in the first lines,
belong to the object language whose meaning is being explained;
we are not expected to know the meaning of either.
Of course we and the Pascal compiler are supposed to know
what AND, WHILE and DO mean
where they occur unquoted near the end.
We now have to think of a way of representing such translations.
A uniform way of doing this is to think of conj and while
as operators with two arguments which are pointers to other,
possibly complex things.
Those other things can also consist of operators with
two pointers,
but ultimately there must be operands which do not point to anything.
Pointers indicate an address where something is to be found,
we can use integer pointers into an ARRAY of three-part RECORDs
with an operator and two integers.
The integers are also used to hold addresses
of variables or the code of procedures,
and to represent numeric and Boolean constants.
Step 6:
Translating into internal code:
Once the internal code has been designed, it is an easy matter
to produce the translations during the parsing.
The translation is very similar to translation into postfix
in the truth table generator.
Each translation step is handled by a call to a code generating
procedure, which is similar to the one in the truth table generator.
But instead of one parameter,
it has three: an operator and two integers,
and it deposits the values of the three in another RECORD.
The two integers have to be known at the time the call is made.
For variables and constants in factor this is no problem,
for the arithmetic and Boolean operators it has to be done
when the address of the code for the operands is known,
after the second operand has been processed.
So the order of generation is the same as that for postfix code.
The address of the code for the second operand will be
the most recently generated code, so that is easily accessible.
But the address of the code for the first operand will be
lost by now; so that has to be saved in a local variable
when the infix operator is seen.
Since the procedures for infix operators can each process
several, in fact three, different operators each,
the source operators also have to be saved
in another local variable so that the correct internal code can
be generated.
The mechanism is already familiar from the truth table generator
in procedure formula.
Essentially the same mechanism is used in the parsing
procedure for statements and for statement sequences;
for the latter it should be noted that semicolons
produce code in the same way that + does.
Inside programme no code is generated for declarations,
but for a procedure the last code generated by its body
has to be attached to the name of the procedure,
so that a call to this code can be generated
when the procedure is called inside statement.
Step 7:
Interpreting the internal code:
The code for the body of the (main) programme
will be the last code that has been generated.
This can now be interpreted by either real hardware
or by a software interpreter.
Both would need space for the variables of the program;
the way it has been designed here all 26 potential variables
need to be given space.
For a hardware interpreter the internal code
would have to be loaded into another space,
but for our software interpreter it is already there.
So when the entire programme has been read, checked for errors
and translated,
the main program should pass it onto an interpreter
which has an ARRAY of 26 integers for the potential variables.
Then the body of the programme has to be executed.
During execution expressions will have to be evaluated,
both Boolean and integer.
So a natural way to think of the entire interpretation
is that it involves recursive calls of
a procedure for executing statements
and two functions for evaluating expressions.
The procedure and the functions each take an integer parameter
which is the index of a RECORD in the code ARRAY,
and each makes a CASE decision
based on the operator part of the RECORD.
The cases resemble those in the infix evaluator
and the truth table generator,
and when there are recursive calls
then one or both of the integer parts of the RECORD
are used as parameters of the call.
If a program is preceded by ?,
then the code for every body will be displayed
when the body has been read.
If a program is not terminated by . but by ?,
then the execution will be traced.
These two facilities were not mentioned in the user manual,
they are supposedly secret information,
known only to the implementor who is checking
the internal workings.
The following is a sample run from TYPROC:
The following is the standard Pascal source program for the imperative language TYPROC.
The exercises below are divided into two groups: those that leave the language as it is and merely change the implementation, and those that change the language.
Changing the implementation:
Improve the error reporting
so that when an error has occurred,
a marker is placed under the currently visible symbol
and the error message is written next.
You will need to use an input line buffer,
so that the whole line is written out when an error
has been seen.
This affects procedure getch ---
it now has to maintain this buffer,
extract characters from it sequentially,
and read a whole new line when the buffer is empty.
Rewrite the scanner so that it recognises
reserved words like BEGIN and IF
rather than leaving the recognition to the parser.
This will mean that procedure getsym
reports to the parsing procedures that it has
recognised a begin-symbol, or an if-symbol, and so on.
You will need to define an enumeration type for these symbols.
This enumeration type should include symbols
for the single characters, too.
As a consequence,
the parsing procedures do not have to know anything about the
surface syntax of the language.
Study the notion of error recovery, and implement it in this compiler. See Wirth (1976, p 320 - 322) for a description of error recovery.
Instead of implementing the binary tree code
in ARRAY, use pointers.
Do not forget to dispose of unwanted pointers
when a programme has been executed.
In the interpreter, add a check which prevents attempted division by zero.
Rewrite the interpreter without using recursion.
Redesign the internal code so that it is closer to a conventional machine language. The simplest kind is code for a stack machine for evaluating expressions and for holding return addresses for procedures that have been called. You will have to write a completely new interpreter.
Rewrite the program in a different language such as C or Lisp or Basic.
Changing the language: Just 26 lowercase variables and 26 uppercase procedures do not make programs very readable. Change this so that variables and procedures can be any (perhaps lowercase) identifier --- starting with a letter optionally followed by further letters, digits (and perhaps underscores). You will need to implement a symbol table in which such identifiers are stored when they are being declared.
The third little program in the sample run uses procedures, but they are not recursive. Write a little program in the language which uses recursion. Without actually running it, but by inspecting the Pascal program in the previous section, determine whether your program would work correctly. If yes, explain how; if no, fix it.
Add the type CHAR to the language.
Define a method which allows users to write character strings.
This is particularly useful for obtaing readable output.
Do not attempt to implement string variables
of arbitrary length --- this is quite difficult.
Implement ELSE, REPEAT, FOR and CASE statements.
Note that the latter two are much harder than the first two.
Add declarations for ARRAYs of integers
or of Booleans or of characters.
The declaration should specify their size,
do not attempt to implement dynamically varying sizes.
Add local variables to procedures. Since procedures should allow recursion, the local variables will have to be allocated on a stack.
Allow procedures to be called before they have been defined, but ensure that they have been defined before the body of the main program.
Add a facility for defining (parameterless) functions. In one way functions are like procedures in that they have an executable code body. In another way functions are like variables in that they have a type. The type is given in the declaration, and it has to be recorded in the symbol table for later checking.
Implement value parameters for procedures (and functions). Such parameters are just like local variables except that they are being initialised at the time of call. So they will have to live on the stack, too.
Reading: Allison (1986, pp 52 - 59) gives a denotational semantics for a small imperative language, and (pp 120 - 127) a Pascal interpreter for the language. Note that the interpreter is a close, almost literal, translation of the semantics into Pascal. The closeness of the translation is intentional, it is bought at the price of efficiency. You might like to write a more efficient version, but do try to understand why Allison made his translation so close.
If you wish to pursue the topic of compilers, you may wish to skip to Chapter 14 to study a somewhat more complex but still quite small compiler for another, more useable language. Alternatively, you may wish to pursue the reading given in that chapter.