
Copyright (c) 1993 by Nimble Computer Corporation This material is based upon work supported by the National Science Foundation under Grant No. III-9261682.


We also tested versions of FRPOLY in which squarings were *not* used for
exponentiation. This non-linear FRPOLY does 38,780 conses, and takes only 59%
of the time of the non-linear squaring FRPOLY. The linear non-squaring FRPOLY
takes only 62% of the time of the linear squaring FRPOLY, and cuts the new
consing to 3988 cells. A slightly slower version cuts the new consing to 2590
cells--only 567 cells (28%) more than are used in the result.

"Impure" Lisp is frowned on in beginning Lisp classes, both because the student should know how much can be done without "side-effects"--e.g., assignment and list modification--and because such code can be difficult to understand [Mason86]. However, professional Lispers sometimes mangle lists in the pursuit of time or space efficiency. We were drawn to the FRPOLY benchmark because it utilizes such list "mangling". We wanted to understand its operation, to see whether its speed and space efficiency were achievable in a "linear" fragment of Lisp.

Before discussing "linear" Lisp, we first note that there are two errors in the
FRPOLY code [Gabriel85,3.20.1] that cause the polynomial multiplication routine
`ptimes` to return incorrect results. The first error occurs in the
`pzerop` macro, which evaluates its argument twice (a linearity checker
would have complained about the multiple use of this macro's argument!). The
`pzerop` macro is almost always called with a trivial argument, but in
the routine `ptimes3`, two lines down from the label `b`, it is
called with an argument expression that has a side-effect--namely an
assignment. There are several ways to fix this problem, the simplest being the
following:

```lisp
(defmacro pzerop (x) `(let ((x ,x)) (if (numberp x) (zerop x) nil)))
```

The second error occurs in the routine `ptimes3` itself, where an `if` must be replaced by a `cond`. The two lines at the label `b`:

```lisp
b  (if (or (null (cdr u)) (< (cadr u) e))           ; These 2 lines should
       (rplacd u (cons e (cons c (cdr u)))) (go e)) ; be replaced.
```

should be replaced by:

```lisp
b  (cond ((or (null (cdr u)) (< (cadr u) e))        ; The if should be a
          (rplacd u (cons e (cons c (cdr u))))      ; cond.
          (go e)))
```

A few remarks about the programming style used in the standard FRPOLY code are
in order. The style of programming used for most of the functions is
traditional recursive Lisp programming, as found in any beginning book on Lisp.
The polynomial *addition* routines do not utilize any side-effects, nor do
they take advantage of traditional Lisp's fluid/dynamically-scoped variables.

The style of programming used within the polynomial *multiplication*
routines, however, is dramatically different--it relies heavily on both
dynamically-scoped variables and side-effects to variables and data structures.
According to Gabriel: "[this benchmark] is programmed in an unusual and
somewhat unpleasant programming style" [Gabriel85,3.20.2], an analysis with
which almost every professional Lisp programmer would agree. The
`ptimes3` routine is a nightmare for either a person or a compiler to
understand.

The `ptimes3` code was written by Bill Martin in about 1968 when the
available list space was severely limited by the PDP-10's 256Kword address
space, and before optimizing Lisp compilers became available [Fateman91].
Bill's goal was the fastest execution consistent with minimum space [Fateman91].
Minimizing the maximum number of cons cells in use at any instant (the "high
water mark") seemed more of a concern than minimizing the number of calls to
the `cons` routine. Unfortunately, measuring the exact "high water
mark" of a non-linear Lisp computation is problematical, since it theoretically
requires performing a garbage collection for every single call to the
`cons` routine
[Baker92BB].
Thus, we cannot tell for sure whether Bill's `ptimes3` actually
achieved his goals.

The `identity` function is already linear, but `five` must
dispose of its argument before returning the value `5`:

```lisp
(defun identity (x) x)

(defun five (x) (kill x) 5) ; a true Linear Lisp would use "x" instead of "(kill x)"
```

The `square` function requires *two* occurrences of its argument,
and is therefore also non-linear. A second copy can be obtained by use of the
`dup` function, which accepts one argument and returns *two*
values--i.e., two copies of its argument. (See
[Baker92LLL]
for a definition of `dup`). The `square` function
follows:

```lisp
(defun square (x)
  (multiple-value-bind (x x-prime) (dup x) ; Dylan bind syntax [Shalit92] would be prettier.
    (* x x-prime)))
```

Conditional expressions such as `if` execute only one of their two arms, so a name may occur once within *each* arm and still satisfy the "occur-once" condition.[2]

The boolean expression part of an `if`-expression requires more
sophistication. Strict linearity requires that any name used in the boolean
part of an `if`-expression be counted as an occurrence. However, many
predicates are "shallow", in that they examine only a small (i.e., shallow)
portion of their arguments (e.g., `null`, `zerop`), and therefore
a modified policy is required. We have not yet found the best syntax to solve
this problem, but provisionally use several new `if`-like expressions:
`if-atom`, `if-null`, `if-zerop`, etc. These
`if`-like expressions require that the boolean part be a simple name,
which does not count towards the "occur-once" linearity condition. This
modified rule allows for a shallow condition to be tested, and then the name
can be reused within the arms of the conditional.[3]
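As footnote [3] notes, this rule is equivalent to having the shallow predicate return *two* values: the truth value and the unmodified argument. The following Python sketch (our own illustration, not FRPOLY code) shows the two-value style, with lists modeled as nested cons pairs:

```python
def if_null(x):
    # Shallow predicate in the two-value style: return the truth value
    # *and* the unmodified argument, so the caller regains ownership of x.
    return x is None, x

def length(x):
    # Linear-style length: each name is consumed exactly once per branch.
    is_null, x = if_null(x)
    if is_null:
        return 0                  # x was nil; nothing left to consume.
    head, tail = x                # destructure: the single use of x.
    return 1 + length(tail)

# Lists as nested cons pairs: ("a", ("b", ("c", None)))
print(length(("a", ("b", ("c", None)))))   # → 3
```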

We require a mechanism to *linearly* extract both components of a Lisp
cons cell, since a use of `(car x)` precludes the use of `(cdr
x)`, and vice versa, due to the requirement for a single occurrence of
`x`. We therefore introduce a "destructuring let" operation
`dlet*`, which takes a series of binding pairs and a body, and binds the
names in the binding pairs before executing the body. Each binding pair
consists of a pattern and an expression; the expression is evaluated to a
value, and the result is matched to the pattern, which consists of list
structure with embedded names. The list structure must match to the value, and
the names are then bound to the portions of the list structure as if the
pattern had been *unified* with the value. Linearity requires that a name
appear only once within a particular pattern. Linearity also requires that
each name bound by a `dlet*` binding pair must occur either within an
expression in a succeeding binding pair, or within the body of the
`dlet*` itself. Using these constructs, we can now program the
`append` and factorial (`fact`) functions:

```lisp
(defun append (x y)
  (if-null x (progn (kill x) y)              ; trivial kill.
    (dlet* (((carx . cdrx) x))
      (cons carx (append cdrx y)))))         ; cons will be optimized.

(defun fact (n)
  (if-zerop n (progn (kill n) 1)             ; trivial kill.
    (multiple-value-bind (n n-prime) (dup n) ; trivial dup.
      (* n (fact (1- n-prime))))))
```

We "warm up" for FRPOLY by programming a simple univariate "dense" polynomial multiplication routine in a linear fragment of Lisp, in which the polynomials are simply lists of integer coefficients:

```lisp
(defun pplus (x y)                              ; Return x+y.
  (if-null x (progn (kill x) y)                 ; trivial kill.
    (if-null y (progn (kill y) x)               ; trivial kill.
      (dlet* (((x0 . x) x) ((y0 . y) y))        ; decompose 1 cell from both x and y.
        (cons (+ x0 y0) (pplus x y))))))        ; cons will be optimized.

(defun pctimes (x0 y)                           ; Return x0*y.
  (if-null y (progn (kill x0) y)                ; trivial kill.
    (multiple-value-bind (x0 x0-prime) (dup x0) ; trivial dup.
      (dlet* (((y0 . y) y))                     ; decompose 1 cell from y.
        (cons (* x0 y0) (pctimes x0-prime y)))))) ; cons will be optimized.

(defun ptimes (x y)                             ; Return x*y.
  (if-null x (progn (kill y) x)                 ; possibly nontrivial kill.
    (dlet* (((x0 . x) x))                       ; decompose 1 cell from x.
      (if-null x (progn (kill x) (pctimes x0 y)) ; trivial kill.
        (multiple-value-bind (y y-prime) (dup y) ; nontrivial dup.
          (pplus (pctimes x0 y)
                 (cons 0 (ptimes x y-prime)))))))) ; cons will be optimized.

(defun psquare (x)
  (multiple-value-bind (x x-prime) (dup x)
    (ptimes x x-prime)))

(defun pexptsq (x n)
  (if-zerop n (progn (kill x) (kill n) (cons 1 nil)) ; nontrivial kill.
    (if-evenp n
        (psquare (pexptsq x (floor n 2)))
        (multiple-value-bind (x x-prime) (dup x)
          (ptimes x (psquare (pexptsq x-prime (floor n 2))))))))
```
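As a sanity check on these routines, here is a non-linear Python transliteration of the same recursions (our own sketch; the `dup`/`kill` bookkeeping is unnecessary in a garbage-collected setting). Coefficients are stored low-order first, as in the Lisp code:

```python
def pplus(x, y):                  # x + y, coefficients low-order first
    if not x: return y
    if not y: return x
    return [x[0] + y[0]] + pplus(x[1:], y[1:])

def pctimes(c, y):                # scalar c times polynomial y
    return [c * y0 for y0 in y]

def ptimes(x, y):                 # x * y
    if not x: return []
    if len(x) == 1: return pctimes(x[0], y)
    return pplus(pctimes(x[0], y), [0] + ptimes(x[1:], y))

def psquare(x):
    return ptimes(x, x)

def pexptsq(x, n):                # x^n by repeated squaring
    if n == 0: return [1]
    if n % 2 == 0: return psquare(pexptsq(x, n // 2))
    return ptimes(x, psquare(pexptsq(x, n // 2)))

# (1 + x)^4 = 1 + 4x + 6x^2 + 4x^3 + x^4
print(pexptsq([1, 1], 4))         # → [1, 4, 6, 4, 1]
```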

We also had to program the basic `dup`, `kill`, `dlet*`
and `cons` routines for our linear fragment of Common Lisp. The
basic intuition behind Linear Lisp is that *all arguments to a function are
consumed in the production of results from the function*. This means that
`dlet*` not only *binds* names to portions of a substructure, but
also *recycles cons cells that it has destructured*. We provided a
private freelist for these recycled cons cells, and a modified `cons`
routine utilizes these recycled cells first, prior to calling the system
`cons` routine. The `kill` routine dissolves a structure down to
its atoms, and recycles all of its constituent cons cells. The `dup`
routine copies its argument using the modified `cons` routine, and
returns both the original and the copy as results. By utilizing these four
primitives and the linearity restriction, one is guaranteed that no cons cells
can be lost.[4] This is because linearity is
essentially the same as reference counting [Collins60], which additionally
enforces a no-cycles policy. Of course, our reference counts are always
exactly 1, so that no actual reference count updating is ever needed![5]

For our FRPOLY benchmark, we measured `(pexptsq r n)`, for `n`
from `2` to `15`. Consider now `n=15`. We counted the
number of conses in the input r=x+y+z+1 (15 cons cells), the number of conses
in the output r^15 (2038 cons cells), the number of calls to the *system*
`cons` routine, and the number of recycled cells left on the private
freelist after FRPOLY has finished. The numbers tallied: #output - #input +
#free - #consed = 0, indicating that all cells were accounted for.
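This tally can be illustrated with a toy Python model of the private freelist (the names and structure here are our own illustration, not the FRPOLY implementation): cells recycled by destructuring are reused by `cons` before any "system" allocation, so #output - #input + #free - #consed comes out to exactly zero.

```python
freelist, system_conses = [], 0

def cons(car, cdr):
    # Use a recycled cell from the private freelist first,
    # falling back on the "system" allocator.
    global system_conses
    if freelist:
        cell = freelist.pop()
        cell[0], cell[1] = car, cdr
    else:
        system_conses += 1
        cell = [car, cdr]
    return cell

def dlet(cell):
    # Linear destructuring: hand back (car, cdr) and recycle the cell itself.
    car, cdr = cell
    freelist.append(cell)
    return car, cdr

def append_linear(x, y):
    if x is None:
        return y                  # trivial kill of x (it is nil).
    car, cdr = dlet(x)
    return cons(car, append_linear(cdr, y))

def count_cells(cell):
    n = 0
    while cell is not None:
        n, cell = n + 1, cell[1]
    return n

x = cons("a", cons("b", cons("c", None)))   # 3 input cells
y = cons("d", cons("e", None))              # 2 input cells
inputs = count_cells(x) + count_cells(y)
system_conses = 0                           # start the tally fresh
z = append_linear(x, y)
print(count_cells(z) - inputs + len(freelist) - system_conses)   # → 0
```

Note that each cell recycled by `dlet` is immediately reused by the following `cons`, so this run performs no "system" conses at all.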

In our first linear version of FRPOLY with `n=15`, the system
`cons` routine was called 8375 times and 6352 recycled cells were left
on the private freelist at the end. Although 8375 was only a factor of 4
larger than the 2038 cells in the output structure, we felt that we could do
better. We subsequently located several places in the code where `dup`
was called just before a recursive call which performed a test and terminated
the recursion. In the process of termination, it called `kill` on the
same structure which it had just `dup`'ed. By moving the recursion
termination test up into the previous recursion, the extra `dup` and
`kill` were eliminated, which significantly reduced the calls to the
system `cons` routine. After these changes, FRPOLY called `kill`
on a non-atom only once--for the original 15 input cells. The other 112,617
reused cells were recycled by `dlet*`. FRPOLY called the system
`cons` routine only 4821 times, and left 2798 recycled cells on the
private freelist. There were 3831 calls to `dup`, with a total of
47,692 cells copied. The `dup` per-call statistics were: mean 12.45
cells, standard deviation 35.1 cells and max 1717 cells.

By simply interchanging the arguments to the last call to `ptimes` in
the `pexptsq` routine, the following statistics were observed. The
total number of new conses was still 4821, `kill` was called once for
the 15 input cells, and `dlet*` recycled 114,468 cells. There were now
4448 calls to `dup`, with a total of 48,926 cells copied. The
`dup` per-call statistics were: mean 11.0 cells, standard deviation
20.06 cells, and max 346 cells. The interchanged time was 1.8% slower.

In order to put these cell numbers in perspective, we must examine the workings
of polynomial manipulation a bit more closely. In the course of computing r^15,
`pexptsq` computes r^14=r^7*r^7. Although space for one of the copies
of r^7 is reused to compute r^14, the space for the other copy is not recycled
until the end. Therefore, `pexptsq` must require space for at least r,
r^7 and r^14, all at the same time. Although these space requirements still do
not account for the total space used, they indicate that we are getting near
the lower bound.

Our first timings were significantly worse than the standard FRPOLY benchmark. However, we quickly isolated a major cause of this inefficiency--the manipulations of the global variable which held the private freelist. Through the use of machine-specific "subprimitives" of our Lisp system, we were able to reduce the cost of accessing a global variable to a single machine instruction. Since this freelist variable was extensively accessed, we achieved a major improvement through this change. Obviously, a true Linear Lisp implementation would utilize a fast register to point to the top of the freelist, rather than a global memory cell.

Another major improvement in performance was achieved through the replacement
of the `rplaca` and `rplacd` instructions by additional
machine-specific "subprimitives". This change allowed the `cons` from
our private freelist to be slightly faster than the system `cons`.

We exploited a valuable optimization suggested by [Wakeling91]--to have the
destructuring pattern-match "compiler" defer the generation of code to put
cells back onto the global free-list until the end of the body of this form.
This laziness allows any consing seen within the body to utilize these garbage
cells immediately rather than allocating them from the freelist at run-time.
This optimization saved about 20% of the running time. Unfortunately, it
works only within a single function, so larger functions give it more
opportunities. For this reason, we inlined `pcoefadd` and
`psimp`.

Minor improvements in performance were obtained through various optimizations
in the `dup` routine, which may do extensive work, even though it is
seldom called. Additional improvements came from making sure that `dup`
and `kill` were called only for non-atoms, and from other similar
restructurings of the code.

The net result of these improvements is that our latest linear version of FRPOLY runs only 6% slower than the standard benchmark on our machine (a 4 Mbyte Apple Macintosh Plus with a 16MHz Radius 68020 processor card) running Coral Common Lisp version 1.2. We believe that this timing is very competitive with the standard benchmark, since our linear FRPOLY timings include garbage collection, while the standard FRPOLY timings do not.

We also ran tests with an exponentiation routine that did *not* use
squaring--i.e., it used straightforward successive multiplications [Fateman74]
[Fateman91]. We considered versions in which the smaller argument to
`ptimes` came first ("normal") or second ("reversed"). Both the "normal" and "reversed"
versions of the non-linear FRPOLY did 38,780 new conses, or about 79% of the
conses of the squaring non-linear FRPOLY. The running time on the non-linear
FRPOLY was 59% of the squaring non-linear FRPOLY for both "normal" and
"reversed" versions ([Fateman91] got 50% on a similar test).

Using multiplications in our linear FRPOLY instead of squarings reduced the
space requirements. The total number of new conses dropped to 3988 for
"normal" and 2590 for "reversed". In the "normal" run, there were *no*
calls to `kill` non-atoms, and 72,611 cells were recycled by
`dlet*`, while in the "reversed" run, there were also no calls to
`kill`, and 79,709 cells were recycled by `dlet*`. In the
"normal" run, there were 1358 calls to `dup`, with a total of 25,823
cells copied. The per-call statistics for `dup` in the "normal" run
were: mean 19.02 cells, standard deviation 82.92 cells and max 1717 cells. In
the "reversed" run, there were 3724 calls to `dup`, with a total of
30,555 cells copied. The per-call statistics for `dup` in the
"reversed" run were: mean 8.2 cells, standard deviation 16.2 cells, and max 283
cells. Although the "normal" run required more space (3988 new conses) than the
"reversed" run (2590 new conses), it was 10% faster. This "normal" linear run took
only 65% of the time of the squaring non-linear FRPOLY!

The most surprising thing to us about all of these timings was not that the
linear timings were slower than the standard FRPOLY timings, but that the
linear timings were even remotely competitive. After all, the general
presumption is that polynomial manipulations require a lot of sharing, and that
extensive copying should be significantly slower. Furthermore, the linear
FRPOLY cannot simply *examine* a data structure--it must actually
*destroy* it and then *reconstruct* it. That this examination by
destruction and reconstruction is at all competitive was quite a surprise to
us. On newer machines with a true data cache, such examination by destruction
and reconstruction should be even more competitive than on our cacheless
machine.

Unlike the suggestions of some [Wakeling91], we feel that integers themselves
*should* be linear--especially in a computer algebra system. This is
because the manipulation of large-precision integers (i.e., *bignums*)
constitutes a significant fraction of the work of these systems, and it is
essential that any garbage created be quickly recovered. In fact, we believe
that a bignum package would be one of the easiest and most efficacious uses of
linearity in a symbolic algebra system.

While an efficient linear coding style occasionally copies a large structure all at once, it rarely lets a large structure go all at once. The vast majority of the recycled cells are reclaimed one at a time by the primitive which performs a destructuring binding of new names to portions of a structure. (This behavior may shed some light on the question of the efficiency of "generational" garbage collection, as well as on the best mechanisms to update reference counts in a reference-counting system.) This copy-all-at-once but deallocate-one-at-a-time behavior sheds light on the maximum space requirement for a symbolic computation which does not utilize sharing, since the elimination of all explicit copying can reduce this maximum space requirement. A non-copying coding style differs from the usual style in that many functions reconstruct and return some of their arguments as additional results, so that explicit copies will not be required.

We feel that the particular list data structure used in FRPOLY may not be optimum for a computer algebra system. A sparse polynomial representation should probably use larger nodes which can hold an exponent, a coefficient, a reference count, and a pointer to the next node. While true reference counting on Lisp's small 2-element nodes is probably too expensive, it might be efficient for 4-element nodes. Destructuring these larger nodes would make good use of a computer's "load multiple" capability, an optimization Lisp implementations have so far not utilized.
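A hypothetical sketch of such a 4-element node, in Python for illustration (the field names are our own, not taken from FRPOLY):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PolyNode:
    # One term of a sparse polynomial: exponent, coefficient, reference
    # count, and next-term pointer in a single node instead of two cons cells.
    exponent: int
    coeff: int
    refcount: int = 1             # always exactly 1 under the linear discipline
    next: Optional["PolyNode"] = None

def terms(p: Optional[PolyNode]):
    # Walk the linked terms, yielding (exponent, coefficient) pairs.
    while p is not None:
        yield p.exponent, p.coeff
        p = p.next

# x^2 + 3, highest exponent first:
p = PolyNode(2, 1, next=PolyNode(0, 3))
print(list(terms(p)))             # → [(2, 1), (0, 3)]
```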

A somewhat surprising result was our conclusion that an exponentiation
algorithm using successive squarings is *more* expensive in both time and
space than an exponentiation algorithm which simply does repeated
multiplications. Exponentiation by squaring may eventually be cheaper, but the
break-even exponent is larger than 15.
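The break-even claim can be explored with a toy cost model (our own back-of-the-envelope sketch, ignoring coefficient growth and addition costs). For r = x+y+z+1, r^k has C(k+3,3) terms, and a naive sparse multiply of a p-term polynomial by a q-term polynomial touches every term pair:

```python
from math import comb

def terms(k):
    # Number of monomials in (x+y+z+1)^k.
    return comb(k + 3, 3)

def mul_cost(p, q):
    # Naive sparse multiplication cost: one coefficient multiply per term pair.
    return terms(p) * terms(q)

def cost_repeated(n):
    # r^n by successive multiplications by r.
    return sum(mul_cost(1, k) for k in range(1, n))

def cost_squaring(n):
    # r^n by the pexptsq recursion (square the half power,
    # plus one extra multiply by r when n is odd).
    if n <= 1:
        return 0
    h = n // 2
    c = cost_squaring(h) + mul_cost(h, h)
    if n % 2:
        c += mul_cost(1, n - 1)
    return c

print(cost_repeated(15), cost_squaring(15))   # → 12236 17912
```

In this model, repeated multiplication costs roughly 68% of squaring for n=15, which is at least qualitatively consistent with the timings reported above.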

[Baker92BB]
Baker, H.G. "The Buried Binding and Dead Binding Problems of Lisp 1.5: Sources
of Incomparability in Garbage Collector Measurements". ACM *Lisp Pointers
V*,2 (Apr.-June 1992), 11-19.

[Baker92LLL]
Baker, H.G. "Lively Linear Lisp -- 'Look Ma, No Garbage!'". ACM *Sigplan
Notices* *27*,8 (Aug. 1992), 89-98.

[Chirimar92]
Chirimar, J., *et al*. "Proving Memory Management Invariants for a
Language Based on Linear Logic". *Proc. ACM Conf. Lisp & Funct.
Prog*., San Francisco, CA, June 1992; also ACM *Lisp Pointers V*,1
(Jan.-Mar. 1992), 139.

[Collins60]
Collins, G.E. "A method for overlapping and erasure of lists". *CACM
3*,12 (Dec. 1960), 655-657.

[Fateman74]
Fateman, R. "On the Computation of Powers of Sparse Polynomials". *Studies
Appl. Math. 53*,2 (1974), 145-155.

[Fateman91]
Fateman, R. "FRPOLY: A Benchmark Revisited". *Lisp & Symbolic Comput.
4* (1991), 155-164.

[Gabriel85]
Gabriel, R.P. *Performance and Evaluation of Lisp Systems*. MIT Press,
Camb., MA, 1985.

[Girard87]
Girard, J.-Y. "Linear Logic". *Theoretical Computer Sci. 50*
(1987), 1-102.

[Mason86]
Mason, I.A. *The Semantics of Destructive Lisp*. CSLI LN 5, Stanford,
CA, 1986.

[Shalit92]
Shalit, A. *Dylan(TM): An object-oriented dynamic language*. Apple
Computer, Camb., MA, 1992.

[Strom83]
Strom, R.E. "Mechanisms for Compile-Time Enforcement of Security". *Proc.
10th ACM POPL*, Jan. 1983.

[Wadler91]
Wadler, P. "Is there a use for linear logic?". *Proc. ACM PEPM'91*, New
Haven, June 1991, 255-273.

[Wakeling91]
Wakeling, D., and Runciman, C. "Linearity and Laziness". *Proc. Funct.
Progr. & Computer Arch.*, LNCS 523, Springer-Verlag, Aug. 1991, 215-240.

[1] The number of conses (79,140) listed in the "Meter for (Bench 15)" table of [Gabriel85,p.249] may be incorrect. These additional conses may be used in bignum arithmetic in the expansion of (100000*(x+y+z+1))^15.

[2] Any use of parallel or *speculative* execution of the
arms of the conditional would require strict linearity, however.

[3] Although this rule seems a bit messy, it is equivalent to
having the shallow predicate return *two* values: the predicate itself and
the unmodified argument. This policy is completely consistent with linear
semantics.

[4] One cannot utilize non-local exits such as
`return-from`, `throw`, `go`, etc., however, because these
violate the "occur-once" linearity constraint for many of the bindings passed
on the stack. One could conceivably use `unwind-protect` to ensure the
linearity constraints, but it could be quite messy. Perhaps a new kind of
"protect" form is needed for Linear Lisp. (Schemers beware: a linear Scheme
can have only return-once continuations.)

[5] A "behind-the-scenes" reference count scheme can be used
to implement fast `dup`'s and `kill`'s
[Baker92LLL],
but we did not utilize such a scheme in our implementation of FRPOLY.