Datbas is a small language based on classical predicate logic, with predicates of any number of parameters. It allows users to describe a finite interpretation and then to ask the system questions about this interpretation. The system incorporates features of the Datalog system of Chapter 16 and the theorem prover for monadic logic developed in Chapter 15. The implementation makes heavy use of the utilities developed in Chapter 17.
In this section we design a language that is to be based on classical predicate calculus. In this respect it is similar to the monadic logic system of Chapter 15. But this system will allow predicates not with at most one parameter but with any number of parameters. Whereas the monadic logic system determined whether a formula is a logical truth (or true in all interpretations), this system determines whether a formula is true in a particular interpretation.
The idea is that the user first specifies a finite interpretation and then repeatedly asks the system whether particular formulas are true in the interpretation or which individuals in the interpretation satisfy particular formulas. In this respect the system resembles Prolog (and its small cousin, Datalog, of Chapter 16). However, the system to be designed here is entirely classical and hence its treatment of negation is quite different from that of Prolog's. In particular, it will be possible to ask which individuals do not satisfy an open formula. Such a query would not be possible in Prolog because the domain is infinite (technically, it is the Herbrand universe of all terms), and hence the list of individuals which do not satisfy a formula will generally be infinite. To make this list finite, the system to be developed here will require all individuals in the domain to be specifically listed. Hence the domain of individuals and the extensions of predicates are relatively fixed, in a way that is similar to the Datalog system, and unlike the monadic logic system.
One decision that had to be made concerned whether all individuals should be of the same undifferentiated kind or sort, or whether there should be a system of sorts, somewhat similar to the typing system of languages such as Pascal or C. Most of the familiar logics have just one kind or sort of individual, and hence quantifiers range over all individuals. But there are advantages in having several sorts of individuals, and making all quantified variables range over some specified sort.
1. One advantage lies in additional security: the system can check whether the constraints imposed by the sorts are violated and then produce error messages.
2. The other advantage lies in execution speed. Consider an unsorted and a sorted system in which there are many things which are not people. Now consider the query whether all people are healthy, wealthy and wise. In the unsorted system this question would be put as a universally quantified implication formula. To determine the answer, the program would have to check through all the individuals in the domain to ascertain whether they are either not persons or are healthy, wealthy and wise. In the sorted system the same question would be expressed as a universally quantified formula, with the variable of quantification being typed to range over persons. The remainder of the formula is just the consequent of the implication formula needed for the unsorted system. So the antecedent of the unsorted formula becomes absorbed into the simultaneous quantification and typing of the variable. To determine whether the formula is true, the program would merely have to check through the individuals of sort person to determine whether they are healthy, wealthy and wise. If there are many individuals that are not persons, this difference in search space can be substantial.
The introduction of sorts requires that the variables in open formulas
be typed in a way that is similar to the typing in quantifiers.
One solution is to introduce quasi-binders
which are like quantifiers in that they assign a type to a
variable, but which leave the variable unquantified and hence free.
One such quasi-binder is WHICH,
and to a question beginning with WHICH
the system should then respond with a list of those
persons which satisfy the open formula.
Another quasi-binder is FIRST, used in a similar way to obtain not
all individuals having a particular property but just one,
the first that can be found.
3. The third advantage is minor, it has to do with finding individuals which do not have a particular property, i.e. individuals which satisfy a negated formula. Consider the query
4. Finally, the introduction of sorts can save memory. Happiness is a property of some persons, but never of cities, rivers or planets. So it is never necessary to store such information about those other things.
Sorts are very similar to types, and often the two terms are used synonymously. But there are good reasons to distinguish them. In the literature one sometimes speaks of individuals in the domain having one or the other sort, and expressions, which refer to individuals, as having one or the other type. This makes the notion of sort model-theoretic and the notion of type syntactic. I shall speak of individuals having a particular sort, and variables and formal parameters as having a particular type.
The usual predicates of logic are often called truth functions. This is because when supplied with parameters they yield a truth value. For example, a unary predicate is either true of an individual or it is not --- there is no third possibility. The resultant truth value can then become an operand to the truth functional operators.
For many purposes it is useful to have a family of mutually exclusive and jointly exhaustive predicates. One might want to speak of men, women and children, and one may want it understood that every person belongs to exactly one of these categories. What is being understood here can be expressed explicitly by meaning postulates which state that the three categories are mutually exclusive and, within the sort person, jointly exhaustive. Then from the fact that somebody is neither a man nor a woman we may infer that we are dealing with a child. But the meaning postulates can complicate the deduction mechanism considerably.
In the case of unary predicates like the above, another solution is to define the three categories as sorts, and then to define the type person as the union of these three sorts. If sort names can also be used as predicates, this method takes care of the required deductions without explicit meaning postulates.
But the method cannot work with families of relations, predicates taking more than one parameter. For example, one might want to categorise the attitude of one person to another as loving, friendly, indifferent, disdainful or hostile. As we all learnt when young, often to our sorrow, the relationships are not symmetric. No system of sorts of individuals can cope with this, since sorts are classes of individuals, and not classes of ordered pairs of individuals. Hence families of mutually exclusive and jointly exhaustive predicates require a separate treatment. Indeed, one can have families without sorts, or vice versa, or one can have neither or both.
Another possible way of dealing with the problem is to have a sort comprising these five attitudes and to regard human relationships not as a binary relation between two people, but as a ternary relation between two persons and an attitude. But this turns the five attitudes into individuals, and this is at least philosophically suspect.
So it seems that families of predicates need a separate treatment. Following an older usage, I shall call the human relationship a determinable, and it has five determinate values. An ordinary predicate has just two values, the truth values, and a determinable predicate can have any positive number of values. The determinates of a given determinable are just two-valued predicates, but they are automatically mutually exclusive and jointly exhaustive.
Once there are determinables,
it seems natural to allow variables to range over
the determinates of a determinable.
These variables can be universally or existentially
quantified, or they can occur in WHICH or FIRST queries.
The variables have to be typed just like individuals variables,
and the type of such a predicate variable is the determinable
over whose determinates the variable is to range.
Syntactically there need not be anything to distinguish
the declaration of a predicate variable from the
declaration of an individual variable.
But within the scope of such a declaration,
the predicate variable behaves just like a determinate predicate.
Semantically an existential quantification using a predicate
variable behaves just like a disjunction of several copies of the
formula being quantified,
with each of the determinate predicates substituted
for the predicate variable.
Universal quantifications behave just like conjunctions.
For WHICH queries the system should
respond by listing the determinates
that make the formula true,
and for FIRST queries it should list only
the first such determinate that it can find.
Syntax: The predicate language to be used will need the usual truthfunctional connectives, predicate constants and individual constants, individual variables and quantifiers. In addition there will have to be sort constants, which are similar to predicate constants except that they are defined by enumerations of the individuals they comprise, and later used in quantifications and questions. To anticipate the concrete syntax to be used later, here are some examples:
x and y to be used
with any type name, and the variable acquires its type
at the point of quantification and retains this type
throughout the scope of that quantification except where
overridden by more local quantifications.
Another method is to type the variables once and for all,
as in, say,
x and y to range over these persons.
Then one would write:
x and y would be automatically typed.
But this method will not be used here.
All predicates will have to be typed by the types of their parameters. Assuming that persons and cities have been declared as sorts, one might declare happy to be a predicate which takes a person as a parameter, lives-in as a predicate which takes two parameters, a person and a city, and bigger as a predicate which takes two cities as parameters. In the concrete syntax these declarations might appear as:
Semantics --- Interpretation: An interpretation consists of a domain of individuals and for each predicate an extension of the predicate. The domain is subdivided into one or more exclusive subdomains or sorts. Each sort is a set of individuals, and each individual belongs to just one sort. The domain comprises the union of all the sorts. A type is a union of one or more sorts. Each predicate has a fixed number of formal parameters, and each parameter is of a particular type. Each predicate of n parameters has as its extension a set of n-tuples of individuals, where the sort of each individual is a subset of the type of the corresponding formal parameters.
The system to be implemented here does not use individual constants that are different from the names of the individuals. So, every individual constant has as its extension the individual which it names, and every individual has exactly one individual constant of which it is the extension.
Semantics --- Satisfaction: This is more complicated than for monadic logic because there can be any number of variables in a formula. These variables have to be given extensions, and hence satisfaction is not a relation between an individual and a formula but a relation between a sequence of individuals and a formula. Such a sequence of individuals is always used to interpret variables in a formula. More formally, a sequence is a mapping from the variables in a formula to individuals in the domain. (To anticipate the implementation, the sequence is the content of the stack.)
First, the definition of satisfaction between a sequence of individuals
and an atomic formula.
In the notation to be used here,
the formula might be p(a,y,b,x),
which is formed from a predicate,
here the 4-place predicate p,
followed by a parenthesised 4-tuple,
of constants, here a and b, and variables, here x and y.
First we form a 4-tuple of individuals:
For each constant such as a and b we use the individual
which it denotes.
For each variable such as x and y we use the individual
which the sequence assigns to them ---
they might be a for x and c for y.
This gives a 4-tuple of individuals of individuals.
Then we check whether this 4-tuple of individuals
is in the extension of the predicate p.
If it is, then the sequence satisfies the atomic formula,
otherwise it does not.
There is also the special case of an atomic formula,
say s(x) or s(a)
in which s is a sort.
Then a given sequence satisfies s(x)
if and only if it assigns to x an individual of sort s,
and any sequence satisfies s(a) if an only if
the constant a denotes an individual of sort s.
Next, satisfaction for truth-functionally compound formulas: A sequence satisfies a negation if and only if it does not satisfy the negand, it satisfies a conjunction if and only if it satisfies both conjuncts, it satisfies a disjunction if and only if it satisfies at least one disjunct, it satisfies a conditional if and only if either it does not satify the antecedent or it satisfies the consequent, and it satisfies an equivalence if and only if either it satisfies both or satisfies neither parts.
Finally, satisfaction for universally or existentially quantified
formulas.
These are of the form ALL v:t F or SOME v:t F
where t is a type and hence v is being declared to
be a variable of type t.
For a given sequence we have to form
the set of v-variants
of the sequence,
the set of those sequences which are like the given sequence
except that they may differ from the given one
in what they assign to the variable v.
A given sequence satisfies the universally quantified formula
if and only if all v-variants of the sequence satisfy F.
A given sequence satisfies the existential quantified formula
if and only if at least one v-variant of the sequence satisfies F.
The following is a sample run of the program. The first part consists of a very small database about people and countries, and some queries. The second consists of a larger database concerning the geography of Australia, and some queries.
The DATBAS system is a database and reasoning system based on classical logic over many-sorted domains.
The system permits declarations of sorts and the individuals they comprise, of predicates and the types of their parameters, and of extensions of predicates. The system does not distinguish between individuals and individual constants. Hence in a declaration of a sort the names listed are simultaneously the metalanguage names of the individuals in the interpretation and the object language individual constants which have those individuals as their extension.
The grammar of the input language is as follows:
SORTs or of PREDICATEs or
it is a specification of EXTENSIONs,
or it is a request to SHOW the declared identifiers
or to QUIT the session,
or it is a query.
Each declaration consists of one of three keywords
followed in each case by zero or more identifiers being declared.
Any identifier being declared here
must not have been previously declared.
For declarations of SORTs,
each identifier is followed by = and then a type.
For a specification of PREDICATEs,
each identifier is optionally followed
by a parenthesised comma-separated list of types.
For declarations of EXTENSIONs
each identifier is followed by = and then an expression;
this is then followed by the specification of a set
enclosed in braces.
Such a specification is given either intensionally
or extensionally.
An intensional specification consists
of the word THE followed by a sequence
of (lambda-bound) typed variables enclosed
in corners, and then an expression.
An extensional specification consists
of any number of tuples of individuals
enclosed in corners.
| and another type,
but that has to be of the second kind.
WHICH
(or FIRST) question or just an expression.
A WHICH question such as WHICH x:person likes(x,john)
produces those persons who like John.
A WHICH (or FIRST) question binds a variable to a type
and otherwise consists of a query.
These bindings are similar to the quantifier bindings produced by
ALL and SOME.
The syntax of the two kinds of bindings is chosen
so that any combination is allowed
except that a WHICH binding cannot occur in the scope
of a quantifier binding and it cannot occur in a truth functionally
compound formula.
The reason for this restriction is semantic:
there seems to be no sensible meaning one could assign
to aberrant expressions such as
WHICH and FIRST bindings occur outermost,
as it is enforced by the above production.
: and a type.
The identifier will be bound to the type.
IMP or IFF,
for material implication and material equivalence.
A simple expression consists of one or more terms separated by
the truth functional OR operator for inclusive disjunction.
A term consists of one or more factors separated by
the truth functional AND operator for conjunction.
ALL, SOME or THE, followed by a binding and a factor.
A factor may also consist of curly braces
enclosing zero or more tuples each consisting
of angle corners enclosing a sequence.
Finally,
a factor may consist of the symbols TRUE or FALSE,
or NOT followed by a factor,
or a parenthesised expression.
The scopes of bindings are as follows:
In declarations of EXTENSIONs any THE-binding
extends to the end of the expression.
In a query any WHICH- or FIRST-binding extends
to the end of the query.
In a factor, any ALL- or SOME-binding
extends to the end of the factor.
The context free syntax is so simple that it needs little discussion except for error recovery. The context sensitive syntax and the semantics is described in detail.
Utilities: Just as the program in the previous chapter, this program makes use of the utilities almost everywhere. So again the utilities have to be processed by the Pascal compiler before the program proper can be processed. Since the utilities are not entirely stand alone, several declarations have to occur first. Most of these declarations are sufficiently similar to those in the previous chapter that it is not necessary to discuss them here.
Main: The main program begins by calling an initialisation procedure whose body consists of calls to procedures in the utilities: one call to initialise the scanner, several calls to enter the reserved symbols, and several calls to enter the standard identifiers. The main program then enters its principal read-execute loop.
Context free parsing and error recovery: As in most programs before, the recursive descent parsing procedures are modelled on the productions of the grammar. Visibility requirements are met by the the following nesting structure:
..
For every procedure which parses infix symbols,
any calls in that procedure augment the set by precisely the infix symbols
parsed by that procedure.
For other procedures,
such as those for parsing factors,
a test is made at the end to check
that the factor that has been parsed
is indeed followed by a legal symbol ---
if not, input symbols are skipped until a legal one is found.
Note that the parsing procedures for infix operators
again do not follow the grammar verbatim,
this helps to generate right linear trees for the code.
The parser has to do more context sensitive checking than
was necessary in previous programs.
All of this will be handled by a symbol table
and another table for the formal parameters of predicates.
To summarise from the manual:
1: In declarations of sorts, individuals and predicates
any identifiers must not have been previously declared.
Note that this does not apply to variables
introduced by queries or by quantifications.
In use, i.e. outside declarations,
any identifier must have been previously declared.
2: In use any identifier must be the right kind of object ---
a constant or a variable, a type or a predicate.
3: In use a predicate must have the right number of
actual parameters,
and each actual parameter must be compatible with its
corresponding formal parameter.
Similarly, in extensions of predicates given by
THE declarations and an expression,
the type of each declaration must be compatible
with the formal parameters of the predicate.
These requirements are rather standard for many programming languages. Requirements 1 and 2 we have encountered earlier, they are most readily implemented by means of a symbol table. Requirement 3 will be implemented by a further table of parameters. The parts of the program needed to satisfy the three requirements are best described in three separate steps.
Step 1 - Symbol Table:
For the handling of user defined identifiers it is necessary
to use a symbol table.
Two procedures are used to manipulate it:
one to enter new symbols into the table,
another to lookup symbols to retrieve information
associated with them.
When a symbol is being declared,
the enter procedure is used to enter the symbol itself.
When a symbol has been read which the scanner
(in the utilities) does not recognise
as a reserved word,
the lookup procedure is used to retrieve information
associated with the symbol.
Users can declare symbols in any order,
so it is not possible to implement the binary search
that works so well with fixed collections of symbols
such as reserved words.
In previous programs the table of user defined symbols
was always implemented as a simple linear list
(or, in the case of Datalog,
as three different linear lists).
The same method was used in the early stages
of the development of this program.
The table was an array with an associated index initially set to zero.
Procedure enter incremented this index and deposited relevant
information at the indexed position,
and procedure lookup performed a linear search (with sentinel)
starting at the most recent entry.
In any but the most trivial applications
the number of individuals in SORT declarations will be quite large.
Since each individual requires an entry in the symbol table,
the linear search method would become too slow.
To gain efficiency,
the linear method can be replaced by a hash method.
For some references, see Wirth (1976, pp 264 - 274),
Tremblay and Bunt (1979, pp 531 - 549),
or any good book on data structures or on compilers.
The method used here is called direct or separate chaining.
Essentially the identifier is used to compute a small
number in a range from 1 to N,
where N may be, say, 100 or 1000, but a prime number is best.
The method used here is dependent on the implementation of Pascal.
The hashing function first copies the identifier,
a string of 16 bytes, into a record which in one of its variants
is such a string.
In another variant the record consists of four integers,
each of four bytes.
The four integers are simply added,
producing a sum (implicitly modulo maxint)
and then the modulus of the prime number N is taken.
The resulting number is then used to index into an ARRAY [0 .. N],
containing the starting point of a (quite short)
linked list of those identifiers in the symbol table
which yield the same hash value.
That list can then be searched linearly,
by passing along link fields in the records.
The hash method can be implemented on top
of the earlier linear search method,
by adding the hash array as a new global data structure
and by adding the link field to the records of the table.
Only the enter and the lookup procedures
needed some small changes.
For debugging it was found useful to make
the initial size of the prime number N
and hence the hashtable absurdly small, say 13,
to maximise the chances of collisions.
The number of symbols that are declared as individuals, sorts and predicates can be large, and the scope of the declarations is always the remainder of the session. Importantly, these declarations never have to be undone, and in such a situation the hash method works very well. However, there are also variables that are introduced in queries and in quantifications. These always have a very limited scope, and at the end of that scope they are not wanted any further. Indeed, it would be good if the space they have used can be re-used. But removing symbols from a hash table can be tedious. Fortunately at any one time the number of active variables with limited scope is very small. Hence they can be searched linearly before the remainder of the symbol table is hash searched. The deletion of these symbols is done by simply resetting an index variable.
Step 2: The symbol table as described above can be used to enter symbols into the table and to lookup to see whether the symbol is there already. This takes care of the first requirement. Each record in the table can then be given an additional field to store what kind of object a symbol is: an individual, a type, a predicate or a bound variable. A parameter to the entering procedure handles assignments to this field, and after lookup the field can be inspected. This takes care of the second requirement.
Step 3: Further refinements are needed to check whether actual parameters of predicates match the formal parameters in number and in type. A similar problem arises in conventional programming languages such as Pascal. In these languages the formal parameters of procedures and functions have names, and these names are handled rather like the names of local variables: they are entered into the symbol table together with their type. Then, inside the body, the names of the formal parameters are visible, but outside the body, for a call, only their number and type is known, and the table is consulted to check agreement in number and type between the actual and the formal parameters. This method could also be used here, even though the formal parameters do not have names. However, one should be reluctant to clutter up the table for nameless entities. So it is probably best to introduce a separate table for parameters. For each predicate the entry in the symbol table contains a link to a sequence of formal parameter types in the parameter table. In that table the records contain, for each parameter, the type of the parameter and a boolean flag which for all but the last parameter is set to false. Thus when a sequence of actual parameters is to be checked for conformity with the formal parameter types, a simple stepping process can compare the actuals with the formals.
The types are just set unions of sorts. For every sort that is declared by enumeration, a counter of sorts is incremented, and the sort being declared has that new integer assigned to it. For every sort that is declared as a union, the union is computed from the constituent types.
The program essentially determines whether formulas are true
or false in an interpretation.
To do so, it has to determine whether atomic formulas are true
in the interpretation.
Hence, at the very least,
an interpretation is a function which for each closed atomic formula,
such as p(a,b,c),
returns one of the two truth values, or perhaps neither.
Predicates as truth functions:
As a first step, we may think of the interpretation
as consisting of several functions, one for each predicate.
For an n-ary predicate there would be an n-ary function.
Thus to compute the truth value of the atomic formula p(a,b,c)
we have to compute the value of the function
associated with the predicate p for the actual parameters (a,b,c).
Next, we consider how these functions are to be implemented.
The number of possible parameters is finite,
in fact quite small,
in particular because of the system of sorts.
So the most natural way to implement the function needed for
an n-ary predicate is as an n-dimensional array.
Thus for p(a,b,c) we need to look up a three dimensional array
of truth values and perhaps the undefined value.
The required value is to be found in this array
at a position which depends on the types of the formal parameters,
and on the position, within those types, of the actual parameters a,b,c.
It is important to note that for a different predicate, say q,
which also takes three parameters
and for which the parameters (a,b,c) are also legal,
the position of q(a,b,c) might be quite different from
what it is for p(a,b,c).
This will happen if one or more of the parameters
of the one predicate is a union
but for the other predicate it is a different union or a basic sort.
In general, each formal parameter to a predicate will be the union of basic sorts. Hence the size of the dimension corresponding to a formal parameter will be the sum of the sizes of the basic sorts in that union. To find the address of an actual parameter, we have to take its ordinal value within the basic sort in which it was declared. But in general its basic sort will not be the first in that union, so the ordinal value will have to added to the sum of the sizes of the preceding sorts in the union:
p(a,b,c),
a triple of addresses has to be computed,
each using the above formula.
Linearising the array: The addresses are for an imaginary array which has as many dimensions as the predicate takes parameters. In practice, of course, the array cannot be implemented like this, because the dimensions needed will vary for different runs of the program. Instead the implementation will have to do what is a routine technique in the implementation of standard programming languages: all arrays of whatever dimensions are taken as consecutive portions in one single long array, the computer memory. To do this, the addresses computed in the preceding paragraphs have to be adjusted so that they become addresses in a linear array.
For the time being we shall assume that the n-dimensional array
for the n-ary predicate is to be implemented as a separate
one-dimensional array starting at virtual address zero.
We must now translate each n-tuple of addresses in the n-dimensional
array into a single address in the one-dimensional array.
The total size of the one-dimensional array is the n-fold product
of the sizes of the dimensions of the n-dimensional array.
For example,
for n = 2
for a predicate p(x,y) in which x
can take 4 values and y can take 5,
the 2-dimensional array is 4 * 5,
and it has to be mapped onto an array of size 20.
To map a pair, say <2,3>, of addresses,
we add the address for the first ordinate, 2,
to the address of the second ordinate, 3,
multiplied by the size of the preceding ordinate,
the first ordinate, which is 4.
This gives as the address 2 + (4 * 3) = 14.
It might help to look at the entire mapping:
The row headings are for the x-ordinate,
the column headings are for the y-ordinate.
The matrix entries are the addresses in the linearised array.
z,
then addresses in this dimension would have to be multiplied
by the product of the sizes of the preceding dimensions, which is 20.
To generalise,
an n-tuple of addresses is translated into the sum
of the addresses of the ordinates,
each multiplied by the product of the sizes of the preceding dimensions.
We can now adjust the addresses for actual parameters given earlier:
the contribution of an actual parameter
to the final sum is given by
A single array for all predicates: The address contributions of all actual parameters eventually have to be added to yield the address in the linear array corresponding to the predicate. In practice it is not possible to have a separate linear array for each predicate, because the sizes of these arrays will vary from run to run. Instead the implementation will again have to do what is standard in the implementation of programming languages: all these linear arrays are taken from one single array. So the sum of the address contributions mentioned earlier will have to be given a further offset which is the beginning of the space for the predicate. Hence the actual address in the single array is given by:
Retracing these steps backwards,
we see that the following has to be done
for predicate declarations:
1: A cumulative sum, starting at 0,
of sizes of predicates has to be kept.
When a predicate is being declared,
the previous value of that sum has to be entered
into the symbol table as the start address of that predicate.
The sum is updated when the predicate has been fully declared,
to be used as the start address of the next predicate, if any.
The sum has to be a global variable
to survive different groups of top-level commands.
2: For each predicate a cumulative product, starting at 1,
of sizes of parameters has to be kept.
When a parameter is being declared,
the previous value of that product has to be entered
into the parameter table as the multiplier for the current parameter.
The product is updated when the parameter has been fully declared,
to be used as the multiplier for the next parameter, if any.
The product can be a variable local to top_level.
3: For each parameter of a predicate
a cumulative sum, starting at 0,
of sizes of sorts has to be kept.
When the type of the parameter is being analysed
into its constituent sorts,
for each sort the previous value of that sum has to be entered
into the parameter type table as the sum of sizes of preceding sorts.
The sum is updated when the sort has been analysed,
to be used for the next sort in the union, if any.
The sum can be a variable local to top_level.
Note that the three accumulating variables just described are being used for making entries in the three tables: the symbol table, the parameter table for predicates, and the sort table for parameters. The entries in the tables are needed for code generation.
The entries in the three tables are used for computing the addresses
required for atomic formulas such as p(a,b,c).
The calculation of an address can be done while the atomic formula
is being read,
and it results in a single code node being generated.
In this way the interpreter does not have to recalculate
the required address each time it needs to evaluate the formula.
The calculation of the address takes place in several steps:
one for the predicate p,
and one for each of the parameters (a,b,c).
The first step, when reading the predicate p,
looks up the symbol table for the start address of the predicate
and generates a code node with that start address.
The additional steps, one for each of the parameters,
perform a fixup on the code node,
by adding the address contributions of each actual parameter.
Extensions:
The method described
is used for generating code for atomic formulas
to be used by the interpreter for looking up the memory array
to determine the truth value of such atomic formulas.
Essentially the same method can be used
for assigning literal EXTENSIONs to predicates,
i.e. of the form
p = { ,
for setting the memory array in the first place.
For both tasks a tuple of actual parameters has to be processed;
for atomic formulas the tuple is of the form (a,b,c),
for extensions it is of the form .
The two different brackets are handled
by different procedures:
( and ) are handled by procedure factor,
{ and } are handled by procedure top_level.
at different places.
But the treatment of the parameters can be made identical
inside procedure tuple.
For atomic formulas p(a,b,c) this results in a fixup
of the node generated when the predicate p was being read.
But for EXTENSIONs the code node
has to be generated for each triple in the extension
of the predicate.
Each such node will be fixed up in a way that is specific
to the triple,
but the node has to be generated in the first place.
The other, non-literal method of assigning extensions,
i.e. those of the form p = {THE ,
is described at the end of the next section.
Requirements:
In this section we consider formulas containing variables
introduced by either quantifiers or by WHICH declarations.
For finite interpretations, but only for finite interpretations,
a universally quantified formula is logically equivalent to a conjunction,
an existentially quantified formula is logically equivalent to a disjunction.
The conjuncts or disjuncts have to be substitution instances
of the given formula in which all occurrences of the variable
that are bound by the quantifier are replaced by a reference
to an individual in the domain.
In the case of a sorted system the substitutions have to range over all
individuals having the type of the quantification.
Much the same holds for variables bound by WHICH declarations:
substitutions have to be made using al individuals
having the type in the declarations.
Then, for each substitution,
the remainder of the formula has to be evaluated.
If it is true, then the variable and its current substitution
have to be printed.
There are many ways in which variables and their binding can be implemented.
One naive implementation would use literal substitutions
to generate long conjunctions and disjunctions
for quantified formulas and use a similar expansion for WHICH
and FIRST queries.
Essentially this would treat quantification as a macro mechanism:
from the compiled internal code it would not be possible
to tell whether the user wrote a quantified formula
or a long conjunction or disjunction.
A similar treatment would be given to WHICH and FIRST queries.
The big disadvantage of this implementation method
is the large amount of internal code that has to be generated.
Another method would be this: instead of performing the substitutions at compile time, they could be postponed till run time. The interpreter would then create each of the disjuncts and conjuncts on the fly, possibly by modifying the code.
Implementing substitution:
But there is another way again,
by using a mechanism which does not perform any literal
substitution but something equivalent.
Again some code would be generated for the formula
and some additional code for the binding.
This latter code is then used by the interpreter
to perform actions equivalent to the substitution.
It may help to think of an analogy in procedural programming.
There a FOR loop generates code for the loop body
and some additional code for sequencing the FOR variable through
its values.
During each execution of the loop body any reference to the
FOR variable looks up its current value.
A similar method can be used to implement quantifier binding
and query binding.
The formula corresponds to the loop body,
and the additional code for the binding corresponds
to the sequencing code of the FOR variable.
In both cases the additional code has to do much the same:
A FOR variable has to be stepped from its low value to its high value,
or vice versa for the DOWNTO version.
A quantified variable or a query variable
also has to be stepped through all its values,
although there is no particular order required.
During each execution of the code for the formula,
any reference to the bound variable looks up its current value.
For an unsorted system
the stepping has to go through the entire domain.
For a sorted system the code has to be executed only for those
individuals whose sort is included in the binding.
The simplest way of implementing this
is to generate the additional code for the binding
in a form which contains the set of those sorts
that are included in the binding.
The interpreter then steps through all the individuals in the interpretation
and if its sort is a member of the sort included in the binding,
then it proceeds to execute the code for the formula.
The disadvantage of this implementation method is that
it fails to make use of one of the promised
efficiency advantages of the sorted system.
By stepping through all individuals in the interpretation
and checking their sort,
the sorted interpreter does much the same as what an
unsorted interpreter would do by testing an additional formula.
For example,
the sorted formula SOME x:t f(x)
would be processed in much the same way as
the unsorted formula SOME x (t(x) AND f(x))
would be processed:
stepping through all individuals and testing whether t(x) holds
before going on to f(x).
In short,
the expected gain in execution speed is largely lost.
It is not lost entirely,
because checking that a particular individual is of a particular
sort will still be faster than checking whether
the formula t(x) is true,
especially if type t is really a union of sorts s1, s2, ..sN,
and what has to be checked is a disjunction
s1(x) OR s2(x) OR .. sN(x).
There is a simple remedy to this. Instead of stepping through all the individuals in the interpretation, the interpreter can first step through all the sorts. If a particular sort is a member of the set of sorts included in the binding, then the interpreter steps through the individuals of that sort and for each of them executes the code for the formula. Since the number of sorts is substantially smaller than the number of individuals, the stepping through all sorts and ignoring those that are not required constitutes only a minimal overhead for the interpreter.
It might be thought
that it is possible to improve this even further.
The sorts that have to be stepped through are known at compile time,
so the stepping could be done then.
The compiler would then have to generate appropriate code
for each sort that is included in the binding.
This would save the interpreter from having to do the stepping,
and if the interpreter had to do the entire stepping
repeatedly, some saving would result.
However, nothing would be saved in the case of WHICH bindings
since they are always outermost and only interpreted once.
For database queries even quantifier bindings
tend not to be embedded deeply in truthfunctionally compound formulas.
Since the overhead of stepping through unnecessary sorts
repeatedly is likely to be small,
this optimisation will not be considered any further.
Verifying existentials:
The above design has the following implications for the interpreter:
For WHICH and FIRST queries,
and for making existentially quantified formulas true
and making universally quantified formulas false,
the interpreter enters a double FOR loop.
The outer loop steps through all the sorts that have been declared.
One of the fields of the instruction contains a pointer to
the symbol table for the bound variable.
The interpreter can now inspect the set of sorts for that variable.
If the sort of the outer FOR loop is in this set,
then the interpreter enters its inner FOR loop,
with the bounds obtained from the table of sorts for the current sort.
The inner FOR loop now steps through all the individuals in that sort.
Information about the current individual has to be put somewhere
to be picked up later when the individual variable is being referenced.
Since bindings can be nested,
the place to put this information is a run time stack.
The information that is needed is in part the same as what was
computed at compile time for atomic formulas with constants
(see preceding section):
The sum of sizes of preceding sorts in the union of sorts being bound,
plus the ordinal value of the current individual in its own sort.
A very simple mechanism
achieves this:
a field of the top of the stack is initialised to 0
and incremented at the end of the inner FOR loop.
It is useful,
mainly for tracing purposes,
to give the stack elements two other fields,
one to record the name of the variable being bound,
another to record the name of the individual that has been
made the value of the variable.
Instead of actual names,
it is better to record indices into the symbol table.
The main business of the inner FOR loop
is to call the interpreter recursively using the remainder
of the formulas a parameter.
If that call produced at least one success,
it is desirable to jump out of the two loops
except for WHICH queries.
Verifying universals:
For making universally quantified formulas true
and for making existentially quantified formulas false,
one cannot use the double FOR loop method.
Instead a method has to be used which we have already
encountered in Chapter 15 for the monadic logic theorem prover.
There we used continuations to obtain the effect
of stepping through all the individuals in the domain.
The situation is more complicated because we have to simulate
two nested FOR loops:
an outer one to step through all the sorts,
and, for those sorts that are included in the binding,
an inner loop to step through all the individuals in that sort.
Again it is not possible to use real loops,
because simulating the conjunction requires
implementation by continuations.
When the interpreter sees a universally quantified formula to be made true (or an existentially quantified formula to be made false), then it has to set up what in effect is the outer loop for stepping through the sorts, and then call a parameterless procedure for finding the next sort, initially the first sort. That procedure will be called repeatedly, but some later calls are indirect, as continuations. The body of that procedure has to check whether all sorts have been stepped through, and in that case it can pop the stack and call the continuation procedure to the interpreter. Otherwise it conducts a (sentinel) search through the sorts to find the next sort included in the binding. If there are no further sorts to be found, then it can pop the stack and call the continuation. But if there is a further sort, then it can set up the simulation of the inner loop and call a procedure for stepping through the individuals of that sort. That procedure takes a continuation procedure, and for this initial call the actual continuation has to be the procedure for stepping through the sorts, itself. The body of that procedure has to check whether there is another individual in the current sort. If not, it directly calls the procedure for taking the next sort. Otherwise it sets up the binding of the variable on the top of the stack to the current individual, and then calls the interpreter recursively. For this call it uses the remainder of the formula as the one parameter. The other parameter has to be a parameterless continuation. Ideally it should be the procedure to take the next individual, but since that procedure takes a continuation this method is not possible. Instead the other actual parameter can be made a local parameterless procedure which when called will call the procedure to take the next individual.
Atoms with variables:
For atomic formulas such as p(a,x,c),
the address contributions of the predicate p
and the two constants a and c have been computed at compile time
and form part of the instruction for the predicate.
But the contribution of the variable x
can only be known at run time,
when (nominally) a constant, say b, has been substituted.
However the contribution of that constant also depends
on the fact that in this atomic formula it is the second parameter,
and hence the address contribution has to be multiplied by the product
of the sizes of preceding parameters.
This product is known at compile time,
and it has to be made one part of the instruction.
The other part is the location in the stack,
which also has to be known at compile time,
and where the additive part for the current constant
is to be found.
Since an atomic formula can contain any number
of occurrences of variables,
it is not possible to fit the code for an atomic formula
into a fixed length instruction.
Instead it is necessary to generate an additional instruction
for each actual parameter that is a bound variable.
So, when the interpreter sees an instruction for a predicate, it places the compile time address contribution into a global accumulator. Then, for each following instruction for a variable it multiplies the product part of the instruction with the contents of the stack at the address part of the instruction and adds this to the accumulator. At this point the accumulator contains the required address in memory. If the memory at that address contains the required value, then the continuation procedure is called.
THE declarations:
In declarations of extensions of predicates,
when the extension is given by a THE-formula,
the binding is treated like an existential.
The interpreter is then run to find those tuples
which make the following expression true.
But instead of showing the tuples,
the interpreter has to set the extension of the predicate.
Logical operations:
For the two constants TRUE and FALSE
and for the unary operator NOT
both the code generation and its interpretation
are entirely obvious.
For the binary operators the code generation is also quite
straightforward.
However, their interpretation requires a small elaboration
that we have not yet encountered in previous programs.
To prevent unnecessary branching when making disjunctions true,
it is desirable to prevent the execution of the second
disjunct if there are no WHICH variables
and the execution of the first disjunct produced at least one solution.
Sorts Used As Predicates:
Atomic formulas can be of the form s(i),
where s is a sort and i is either a constant ---
perhaps not very common ---
or a variable.
If i is a constant,
then the code to be generated has to be a TRUE or FALSE node,
depending whether i is in the sort s.
If i is a variable, then a special node can be generated:
Its left field is the stackaddress of the variable,
its right field is a pointer to the sort in the symbol table.
When this node is interpreted later,
it will succeed just in case the current sort of the stack variable
is a subsort of the sort indicated by the right field.
Identity Statements:
These are of the form i = j,
where i and j are either constants or variables.
In the uninteresting case where both are constants,
the code that has to be generated is a TRUE or FALSE node
depending on whether they are the same constant.
If one is a constant and the other is a variable,
then a special node can be generated:
Its left field is the stack address of the variable,
its right field is the pointer to the constant in the symbol table.
When the code is interpreted,
it will succeed just in case the current value of the stack variable
is identical with the constant.
If both parts of the identity statement are variables,
then another kind of node has to be generated
in which both the left and the right field
are stack addresses.
At run time such a node succeeds just in case
the current values of the two variables are identical.
The internal code used is mainly binary tree code,
generated by calls to a code generating procedure.
The code for the truth functional connectives is entirely familiar.
The code for the quantifiers ALL
and SOME and for the query binders WHICH
and FIRST is like that for the unary operator NOT,
except that information about the variable being bound has to be stored
somewhere, such as the vacant left field of the code nodes.
The really difficult part of the code generation concerns the
calculation of addresses into memory,
as described in the previous section.
The following is the Pascal source file.
It is not quite standard,
because it uses the utilities of the previous chapter
in an INCLUDE file.
If your Pascal does not allow included files,
you will have to physically include that
file at the point where the INCLUDE directive occurs,
about half a page down.
If your Pascal insists that declarations of labels, types, variables
and procedures and functions occur strictly in this order ---
then the various declarations will have to be merged.
>From the utilities of Chapter~17
procedure check is used extensively to produce
an error message in case a required symbol
is not seen.
Importantly, the procedure will then accept
some reasonable approximation to the required symbol ---
a likely typing error ---
to keep the parser in step with the input.
Note that for small values (1..3)
of the utility variable writelisting
additional information is written to both the output file
and the listing file.
For larger values (10..15)
voluminous but sometimes indispensible
tracing information will be written to the listing file only.
Background Reading :
Most intermediate textbooks on logic define the notions of
interpretations,
satisfaction of a formula,
and truth of a formulas in an interpretation.
However, all the ones I know only use a single domain,
because this makes the metatheory simpler.
Also, textbooks on logic invariably use just ordinary predicates,
and no determinable and determinate predicates.
But if you have understood the semantics of vanilla predicate calculus,
then the addition of sorts and the addition of
determinables and determinates should not present any problems.
For some advanced references on sorts and types in logic see Eisinger and Ohlbach (1989, pp 48 - 49). For the use of sorts to make Prolog searches more efficient, see Buekert (1989, pp 194 - 196). Another use of types in a Prolog-like setting is described in Van Hentenryck (1989) and in Ait-Kaci and Podelski (1991). Pletat (1991) gives a detailed description of sorts used in a natural language reasoning system and many recent references.
Syntactic Sugar - condensed binding: If one has several variables, all being bound in the same way, and all of the same type, then it is cumbersome to have to repeat the quantifier and the type for each of them. It would be convenient to be able to write a list of comma-separated variables where currently only one variable is allowed. For example, one could then write
Syntactic Sugar - non-identity: It is easy enough to say that two individuals are not identical, even that three individuals are not identical, but already for four individuals it becomes tedious, since one has to write
a,b,c,d can be individual constants
or, more often, individual variables.
Closures: The following concerns the extensions of binary relations. Find a way of expressing that one relation is the reflexive closure of another, or the symmetric closure of another, or the transitive closure of another. Then implement these additions.
Functions: A function is a special kind of relation. More precisely, an $n$-ary function is an $(n+1)$-ary relation in which no $(n+1)$-tuples differ in just their last ordinate. Find a way of implementing functions in DATBAS. There are two aspects to this: one is syntactic, having to allow formulas such as
hasfather
at for all possible values of person f.
Instead it would look up the value
of the unary function father
at .
Other kinds of formulas that become possible are
Determinables as predicates:
Allow determinables as predicates,
in formulas such as climate(x) = climate(y).
Sparse Memory : The memory array does not have to be implemented as a Pascal array. Indeed, in many applications most of the entries in the array are going to be undefined or false, few will be true. So the array may well be sparse, and then other implementation methods will be preferable to save on memory at the expense of computation time. However, considerations of memory will only become important for large applications, and the topic is left as an exercise.
Constraints instead of Sorts :
Bueckert (1990) uses a logic in which the variables
are typed not by sorts but by constraints which can be
arbitrary formulas.
For example (p 180, but changing the notation just minimally),
ALL x,y : (parent(x) AND child-of(y,x)) loves(x,y)
symbolises All parents love their children.
The parenthesised conjunction following the colon is the constraint.