
Copyright (c) 1992 by Nimble Computer Corporation

We show how to speed up the Gabriel 'Triangle' Benchmark by more than two orders of magnitude (206X faster than the Cray-1) on a Common Lisp system running on the 80860-based OKIstation, by using better programming techniques. The resulting program fits nicely within next-generation on-chip caches and eliminates almost all potential parallelism, thus becoming worthless as a general-purpose Lisp benchmark.

We show how to speed up Triangle through a combination of memoizing [Bird80]
and bit-vector techniques
[Baker90],
to yield a time of 0.07 seconds on a
modern RISC workstation (the 80860-based OKIstation(TM)), which is *206 times
faster than the Cray-1* speed in [Gabriel85]. Our techniques eliminate
redundancy from the computation, which has the effect of eliminating potential
parallelism, as well.

There are some straightforward improvements in programming style that would be
welcome in a Common Lisp version of Triangle. The most obvious is to use
zero-origin addressing for the elements of the `*board*` vector instead
of one-origin addressing; this change has the effect of renumbering the holes
in the puzzle board to one smaller than the numbers shown in the text
[Gabriel85,p.218]. The second is to realize that the last "move" ("6 5 4") in
the arrays has been duplicated; i.e., there are only 36 distinct possible
moves. A third change in the name of "structured programming" would utilize a
single 36-element vector of "moves", each of which would be a 3-element
sequence.
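As a sketch of that third change, the three parallel arrays can be repackaged into a single vector of 3-element moves. (The function name `make-move-vector` is illustrative, not the paper's code; we assume the deduplicated, zero-origin `*a*`, `*b*`, and `*c*` vectors as inputs.)

```lisp
;; Illustrative sketch only: package the three parallel vectors of
;; zero-origin hole numbers into a single vector of 3-element moves,
;; so the move representation is hidden behind one object.
(defun make-move-vector (a b c)
  (let* ((n (length a))                 ;expect 36 distinct moves
         (moves (make-array n)))
    (dotimes (i n moves)
      (setf (aref moves i)
            (vector (aref a i) (aref b i) (aref c i))))))
```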

A truly "structured programming" form of this program would hide the representation of the board, the set of possible moves, and the moves themselves. Such a form would allow for quick changes of representation to find the best one.

Once the "board" is a bit-vector, then references to the board must be made
with `ldb` or `logbitp` instead of `aref`. However, a
little thought will show that the representation of the "moves" is sub-optimal,
since we must shift a 1-bit each time a board position is to be queried. A
better solution is to perform the shifts once during initialization, by
changing `*a*`, `*b*`, and `*c*` into vectors of
bit-*masks*, rather than vectors of bit-*positions*. Further
inspection of the `try` function reveals the fact that `*a*` and
`*b*` are always used together, so a single vector of masks containing
both the "a" and "b" bits will suffice.
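One way to build such mask vectors is to do all of the shifting once, at initialization. (A sketch under the assumption that `*a*`, `*b*`, and `*c*` start as vectors of zero-origin bit positions; `make-mask-vectors` is an illustrative name.)

```lisp
;; Illustrative sketch: shift each 1-bit once at setup time, producing
;; a vector of combined a+b masks and a vector of c masks.  try can then
;; test (logand board ab) and (logand board c) with no per-query shifts.
(defun make-mask-vectors (a b c)
  (let* ((n (length a))
         (ab (make-array n))
         (cmask (make-array n)))
    (dotimes (i n (values ab cmask))
      (setf (aref ab i)    (logior (ash 1 (aref a i)) (ash 1 (aref b i)))
            (aref cmask i) (ash 1 (aref c i))))))
```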

With these changes, `try` now looks like this:

```lisp
(defun try (board i depth)
  (if (= depth 14)
      (progn (pushnew (1- (integer-length board)) *final*)
             (push (cdr (coerce *sequence* 'list)) *answer*)
             t)
    (let* ((ab (aref *ab* i)) (c (aref *c* i)))
      (if (and (= (logand board ab) ab) (zerop (logand board c)))
          (progn
            (setf (aref *sequence* depth) i)
            (do ((j 0 (1+ j))
                 (nboard (logxor board ab c))
                 (ndepth (1+ depth)))
                ((or (= j 36) (try nboard j ndepth)) nil)))))))
```

However, before we can utilize a table of board positions, we must decide what
information should be stored into the table. We could store an entire set of
moves for each board position, but these sets are quite large themselves, so
merely constructing them will dominate the computation. Since most of these
sets will never be required, such constructions are a waste of effort. The
minimum amount of information required for each board position is a single
bit--whether the current board position can be extended into a "winning"
position. An efficient algorithm can be constructed which is based on this
implementation. We will use a slightly different representation, however. Our
table will store for each board position the list of *first* moves which
lead to winning positions; this list will be empty if there are no winning
moves at all. (Note that the number of first moves which lead to a winning
position can be much smaller than the number of moves possible in the current
board configuration.) This choice will lead to a slightly simpler algorithm
for reconstructing the winning move sequences.

We are thus led to a two-stage algorithm--determine the winning moves starting
from the initial board position, and then reconstruct the winning sequences.
By deferring the construction of the winning sequences until the second stage,
we avoid constructing any but the winning sequences.
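The memo table itself can be a vector with one entry per 15-bit board position, initialized to a marker distinct from any list, since `nil` is itself a legitimate entry meaning "no winning first moves". The marker symbol and the `empty` predicate below are one possible realization, our assumption rather than the paper's code:

```lisp
;; Sketch of the memo table: one entry per 15-bit board position.
;; Entries are lists of winning first moves; NIL means "no winning moves",
;; so the never-computed marker must be a non-list value such as a symbol.
(defvar *table* (make-array 32768 :initial-element 'empty)) ;2^15 boards

(defun empty (entry)
  "True iff ENTRY has never been computed (still holds the marker)."
  (eq entry 'empty))
```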

```lisp
(defun try (board depth &aux (entry (aref *table* board)))
  (if (not (empty entry))               ;emptiness is an arbitrary non-'list' value.
      entry
    (setf (aref *table* board)
          (if (= depth 14) (cons nil nil)
            (mapcan #'(lambda (i)
                        (let* ((ab (aref *ab* i)) (c (aref *c* i)))
                          (when (and (= (logand board ab) ab)
                                     (zerop (logand board c))
                                     (try (logxor board ab c) (1+ depth)))
                            (list i))))
                    *moves*)))))

(defun construct (board depth movelst &aux (entry (aref *table* board)))
  (if (= depth 14)
      (progn (pushnew (1- (integer-length board)) *final*)
             (push movelst *answer*))   ;answers are reversed.
    (dolist (i entry nil)
      (construct (logxor board (aref *ab* i) (aref *c* i))
                 (1+ depth)
                 `(,i ,@movelst)))))

(defun gogogo (i)
  (let* ((*answer* ()) (*final* ())
         (board (logxor (- 32767 16) (aref *ab* i) (aref *c* i))))
    (try board 2)
    (construct board 2 nil)
    *answer*))
```

In the above program, the same 775 solutions are still generated, but in a time of just 0.07 seconds--0.05 seconds for the `try` stage and the remainder for `construct`.

We also ran Triangle without choosing the first move ("22"); this run produces
1550 solutions, half of them mirror images of the other half. However, this
run took only 0.12=0.07+0.05 seconds--less than twice as long as the normal
run. The savings become clear when `*table*` is examined; after this
run, `*table*` has 1651 entries, of which 190 are non-null, with an
average list-length of about 2. The constructed winning lists use 9656 cons
cells in addition to the 1550 cells required for the list of winning lists.
Thus, we can generate all of the solutions to Triangle 120 times faster than
the Cray-1 can produce only half of them using the old algorithm.

The success of memoization shows that almost all of the parallel processes in the parallel implementations of Triangle are performing the identical computation, hence the Triangle benchmark is worthless for estimating parallelism in real (i.e., highly optimized) applications. In a sense, memoization is the most powerful of all parallel programming techniques, because one processor simulates the execution of many processes with a single execution!

[Baker90]
Baker, H.G. "Efficient Implementation of Bit-vector Operations in Common
Lisp". ACM *Lisp Pointers 3*,2-3-4 (April-June 1990), 8-22.

[Bird80]
Bird, R.S. "Tabulation Techniques for Recursive Programs". *ACM Comp. Surv.
12*,4 (Dec. 1980), 403-417.

[Gabriel85]
Gabriel, R.P. *Performance and Evaluation of Lisp Systems*. MIT Press,
Camb., MA, 1985.

[Keller86]
Keller, R.M., and Sleep, M.R. "Applicative Caching". *ACM TOPLAS 8*,1
(Jan. 1986), 88-108.

[Steele90]
Steele, Guy L. *Common Lisp, The Language; 2nd Ed*. Digital Press,
Bedford, MA, 1990,1029p.

[1] Modern RISC architecture performance numbers are closing in on the Cray-1, and better Cray-1 numbers are available [Anderson87], but these do not invalidate our conclusions.