%% Where should we discuss where this style of computation is
%% useful and interesting?

%% Theme: connection with object-oriented programming.
%% original techniques, Lucy, Janus.

%% Use for distributed computation.
%% Janus

%% The use of constraint languages for problem-solving.
%% Stefik, MOLGEN  planning

\documentstyle{article}
\input{vijay-macros}
\def\tell{{tell}}
\long\def\MetaNote#1{\relax}
\def\strand{{Strand}}
\def\div{{\tt div}}
\def\nil{{\bf 1}}
\def\rname#1\from#2\infer#3{{{\textstyle #2}\over{\textstyle #3}}{\ \textstyle(#1)}}
\def\axname#1\axiom#2{{\textstyle #2}\ \ {\textstyle(#1)}}
\def\ccfd{{\sf cc(FD)}}
\def\true{{\tt true}}
\def\imp{\hbox{${\tt \ :\!-\ }$}}
\def\rep{!\,}
\def\bs{\backslash}
\def\lcc{{\sf lcc}}
\def\cc{{\sf cc}}
\input{lmacros} 
\def\lcc{{\sf Lcc}}
\def\lj{{\sf Linear Janus}}
\def\janus{{\ccfont Janus}}
\def\bisim{\simeq}
\def\fix{{\bf fix}}
\def\defeq{\stackrel{d}{=}}
\def\rep{!\,}
\newcommand{\linarrow}{\;\mbox{$-\!\circ$}\;}
\newcommand{\linbiarrow}{\;\mbox{$\circ\!\!-\!\!\circ$}\;}
\def\hlcc{{\sf HLcc}}

\newcommand{\implies}{\mbox{ :-- }}
\newcommand{\commit}{\;|\;}
\newcommand{\aum}{${\cal A}'\,{\cal U}\!{\cal M}\ $} 
\newcommand{\aumc}{${\cal A}'\,{\cal U}\!{\cal M}$} 

\begin{document}
\title{A retrospective look at concurrent logic programming}
\author{Vijay Saraswat \\
(saraswat@parc.xerox.com)\\
Xerox Palo Alto Research Center \\
3333 Coyote Hill Road\\
Palo Alto, Ca 94304
\and 
Evan Tick \\
(tick@cs.uoregon.edu)\\
Department of Computer Science\\
University of Oregon\\
Eugene, Or 97403
}
\maketitle

\begin{abstract}
We survey the development of the field of concurrent logic
programming from the modernist stance of constraint programming. 
We discuss the basic conceptions of the field, the language
designs it has engendered, areas of applicability of these ideas,
the underlying semantic notions, and the implementation issues
that arise.  The paper assumes the reader is familiar with the
basic ideas of (constraint) logic programming.
\end{abstract}
\setcounter{tocdepth}{2}
\newpage\tableofcontents
\newpage\section{Introduction}

Concurrent logic programming arises in a very natural way from
considerations of parallel execution of (backward-chaining) logic
programs. Unbridled parallel execution very quickly leads to
wasteful nondeterministic computation. Goals that are wildly
(infinitely) non-deterministic (e.g., {$\tt S > 0$)} with {\tt S}
unknown) may be scheduled instead of others that may instantly
prune search (e.g. {\tt S = 1}).  It is not hard to see that
substantial gains in performance can be obtained if some
operational mechanism were to be provided to {\em suspend} the
execution of certain goals (e.g., $\tt S > 0$) until certain
bindings were obtained --- the better to quickly fail some
(possibly infinitely many) non-deterministic branches of that
goal. Concurrent logic programming languages arose from a desire
to explicitly introduce operations to provide this suspension,
and to indeterministically select one of one of many competing
clauses (commit, i.e., generalized ``cut'', operations).

From the time in the early 1980's when these ideas were initially
recognized by Keith Clark and Steve Gregory \cite{rel-lang} to
1987, the group of researchers developing these ideas were
primarily concerned with identifying various plausible
operational mechanisms for specifying these suspension
conditions. The nature of work in this area at that time
consisted of examining in detail the most general unification
(mgu) algorithm as it was used in executing logic programs, and
experimenting with ``control annotations'' to harness the
``eager'' execution it engendered. While some derided this
activity as ``hackers doing logic'', others were fascinated by
the obviously attractive paradigm for concurrent computation that
was thus opened up (particularly by the work of
\cite{CP}).\footnote{See the recent personal accounts by some of
the original participants in the field reported in
\cite{fgcs-cacm}.}

The introduction of constraints into logic programming in 1987
\cite{lassez-const} provided the first opportunity to establish a
more substantial basis for concurrent logic programming.
%  something more than a serendipitous hack. 
Michael Maher introduced in \cite{alps} a suspension condition in
the language ALPS that was based on checking that the constraints
in a clause were {\em entailed} (``validated'') by the
constraints that had been accumulated in the refutation.  This
opened the door to a {\em logical} treatment of the suspension
condition in stark contrast to the operational hacks that had
hitherto occupied the field.  As sometimes happens, by stepping
back to a more general framework (arbitrary constraint solving),
the details of the particular situation (unification ---
constraint solving over finite trees) fell away and the basic
underlying idea emerged in its essential simplicity and
generality.

This basic idea, simply put, is {\em constraint-based
computation}.  What concurrent logic programming provided the
first clear manifestation of was the idea that computation could
be conceptualized as occurring through the interaction of
simultaneously executing agents posting and checking constraints
(on a set of variables) in a shared store \cite{vj-thesis-book}.

But then this style of communication and computation can be
isolated from the context in which it was first discovered, and a
framework for programming languages built around it from ``first
principles'', as in the work on ``process algebras'' based on
similar lines of development due to Hoare, Milner and their
colleagues (\cite{csp-book,ccs} --- see \cite{Milner-handbook}
for an overview).  This was the origin of the conception of
concurrent constraint programming. The move enables the idea to
be treated in its own terms and allows all the standard apparatus
of modern day programming languages (recursion, block-structure,
scoping, modules, types etc.) to be used for developing the
paradigm, without being necessarily encumbered by a somewhat
ill-suiting parent paradigm.  In retrospect, though, it is rather
surprising that many of the natural operations one would use to
obtain a programming language organized around the ideas of
constraint-based communication are already available in logic
programming.

In this paper we shall review the field of concurrent logic
programming from this modernist stance of constraint programming:
constraint logic programming is essentially concurrent constraint
programming instantiated over the \Herbrand{} constraint system
over finite trees that underlies deduction in first-order logic
(see Section~\ref{herbrand}). 
We shall emphasize the connection with the work in theoretical
computer science on process algebras, rather than the view of
concurrent logic programming as a deviant version of
(backward-chaining, Horn clause) logic programming. (The latter
view is adopted, for example, in another survey of this area,
\cite{shapiro-survey}.) At the same time, we shall develop and
emphasize (see below) a major attraction of the CCP framework ---
a sound and {\em complete} view of computation as deduction, a
view that unfortunately is not possible if CCP is viewed as a
special case of logic programming. In particular, we shall
emphasize the three-way connection between the logical
interpretation, operational semantics and denotational semantics
of these languages. The work on algebraic axiomatizations of
various congruences of interest is not sufficiently mature for us
to do a proper survey; we shall, however point out results
wherever there are some.

A theme that runs through this paper is the separation of the
{\em determinate} CCP languages from the {\em in}determinate ---
the determinate languages already offer the power of
constraint-based communications and expose the basic new
conceptual and semantic ideas in this approach. In some sense,
the extension of these semantic ideas to the indeterminate
languages, while technically complicated, is not particularly
specific to constraint programming. Essentially the same ideas
used here have been shown to be applicable to other, different
models of computation \cite{kok-and-others}.

We also discuss in some detail the implementation issues that
arise in realizing this rather high-level computation model.
Over the last decade considerable practical experience has been
gained in developing implementations of concurrent logic
programming languages in uni-processors, shared-memory
multiprocessors and distributed networks. The basic concepts are
reviewed and the major results discussed.

A brief word about some themes and issues that are consciously
not being developed in this survey. First, we have not adopted
the point of view that concurrent logic programming languages
are reasonable ``machine'' or ``kernel'' languages for advanced
parallel and distributed computers. This is the view that
underlay work at ICOT and at the Weizmann Institute in this area
in the past decade, but its validity is still open to
considerable question. For a more sympathetic viewpoint, see
\cite{shapiro-survey} and \cite{fgcs-cacm}. 

We have also chosen not to treat here the
search-non-deterministic concurrent constraint programming
languages, such as \herbrand{} \cite{vj-thesis-book}, and
languages based on the Andorra principle \cite{dhd-andorra}
(e.g., AKL \cite{sverker-AKL}).  There are three primary reasons.
Space. Time. Substance -- the state of our understanding of the
possibilities, languages, and implementation tradeoffs for these
languages still seems immature.

\paragraph{Logical account of CCP languages.} We now describe
briefly the basic intuitions behind the logical account of
concurrent logic programming underlying this paper.  The
interpretation inherited from the view of these languages as
variants of backward-chaining Horn clauses is not particularly
satisfying, primarily because there is no good characterization
of deadlocked computations, and of indeterminate computations. We
shall review an approach that has recently been developed that
provides concurrent constraint programs with an interpretation
that is ``dual'' that of backward-chaining Horn clause logic
programs. Programs are viewed as providing {\em necessary} rather
than {\em sufficient} conditions for a predicate to hold:
defintions of relations (such as {\tt append}) specify what must
necessarily follow {\em if} the relation is known to hold between
some objects, about which some additional pieces of information
may be known. For example, we may state that if {\tt X}, {\tt Y}
and {\tt Z} are in the {\tt append} relation, and {\tt X=[]},
then it must be the case that {\tt Y=Z}.  Inference progresses
forward: from the knowledge that certain pieces of information
are true of the objects of interest, some other pieces of
information are deduced. Thus the operational interpretation of
predicates is that of information transducers: given certain
pieces of information, they produce (potentially) more pieces of
information.  It turns out that this characterization is sound
and {\em complete} for determinate concurrent constraint
programming computations in a precise and important sense.

Obtaining a logical characterization of indeterminate
computations, is however, a more tricky matter. Only a partial
solution --- that of the characterization of ``may'' properties
of indeterminate computations --- seems possible. Even for that,
we have to make the move to a more powerful notion of logic, {\em
linear logic}, recently introdued by \cite{girard-ll}.  The
underlying issue is this: if the programming language is
indeterminate, it must be possible to obtain two different
answers, say $b$ and $c$ from the same initial computational
state $a$. But then it must also be possible to obtain the answer
$b$ {\em and} $c$ from $a$, and this may not be acceptable.  For
example, suppose that the initial computational state describes a
situation in which Alice's account contains \$100, and she has
given a cheque to Bob for \$100, and a cheque to Charlie for
\$100 (and she has no other cheques outstanding).  There are
basically two evolutions of such a system: one in which Bob's
cheque is cashed and the other in which Charlie's cheque is
cashed.  However, if this indeterminate computation is
expressible in logic, then there should be yet a third
computation in which the {\em conjunction} of the two results is
possible: Bob's cheque is cashed, and Charlie's cheque is cashed.
This situation is clearly incoherent.\footnote{The first author
is indebted to Carl Hewitt for making this argument.}

The root of the problem lies in that (classical) logic is not
able to distinguish the case in which a situation $A$ can give
rise to a conclusion $B$ and it can give rise to a conclusion $C$
from the case in which $A$ can give rise to ($B$ and $C$). For,
from an operational perspective, the problem arises because in
logic one is freely allowed to {\em duplicate} formulas: $A
\wedge A$ is semantically identical to $A$. Therefore the
system of atoms 
\begin{ccprogram}
debit(Account, alice(100, AckA), debit(Account, bob(100, AckB),
account(Account, 100)
\end{ccprogram}
\noindent is semantically indistinguishable from 
\begin{ccprogram}
debit(Account, alice(100, AckA), debit(Account, bob(100, AckB),
account(Account, 100), account(Account, 100)
 \end{ccprogram}
The latter system can, however, give rise to a computation in
which {\em both} cheques are cashed. 

It is not suprising that indeterminism and concurrency thus bring
to the surface the fundamental notions of {\em object identity}
and duplication. It is precisely the commitment of classical
logic to deal with ``static'' notions like truth and falsehood
that let us down. What is required here is a system of logic
faithful to a dynamic notion of non-duplicable {\em resources}, a
logic that can distinguish between the two situations above. This
kind of flexibility has recently been placed on the table by the
advent of the so-called sub-structural logics, the most prominent
of which is linear logic.  Linear logic may be seen as arising
from classical logic by dropping certain structural rules that
allow formulas to be arbitrarily duplicated and eliminated during
the course of a derivation. This causes the familiar conjunction
and disjunction operations to split into two: the so-called {\em
additive} versions that copy formulas, and the {\em
multiplicative} versions which do not.  This gives formulas the
nature of {\em resources} that must be accounted for carefully
during proof (computation). In particular, multiplicative
conjunction ($\otimes$) is {\em not} idempotent: $A \otimes A$ is
not logically equivalent to $A$.  The lost power of the
structural rules can be recovered locally by means of the
``modalities'' $!$ and $?$. In particular, the formula $!A$ is
allowed to be duplicated any number of times, derelicted to
obtain $A$, and dropped altogether.\footnote{See
\cite{scedrov-ll,PDLsigact} for more motivation and tutorial
introductions to linear logic.}

Below we shall develop in some more detail the characterization
of indeterminate CCP in linear logic.  For the time being, we
note that linear logic allows one to distinguish the case in
which from $A$ we can get {\em both} $B$ and $C$ (this is
represented as $A \linarrow B \otimes C$) from the situation in
which from $A$ we can get {\em either} $B$ or $C$ (this is
represented as $A \linarrow B \with C$, where $\with$ is the
operator for additive conjunction). (And this distinction arises
precisely because the identification $A \otimes A = A$ is no
longer valid.)

\paragraph{Rest of this paper.} We first review the basic ideas
behind the CCP paradigm. The determinate CCP languages are then
presented --- their syntax and operational semantics, the logical
readings of programs, denotational semantics, and examples  of
computations in this style. A similar treatment is then carried
out for the indeterminate CCP languages. For historical purposes,
we then review the special features of  some of the major
concurrent logic programming languages that have been developed
in the last decade. This also offers an opportunity to discuss
some related sub-themes, such as atomic vs. eventual publicatio
of constraints, failure-free computations, etc. We then survey
the work on implementations of these languages, focusing on
different architectural issues  that arise. Finally we point to
areas of active current work and provide an assessment of the
future of the field. 

\newpage\section{The Basic Paradigm} 

The fundamental move underlying concurrent constraint programming
is to replace the notion of {\em store-as-valuation} behind
imperative programming languages with the notion of {\em
store-as-constraint}.  A constraint may conceptually be taken to
be a (possibly infinite) subset of the space of all possible
valuations in the variables of interest, e.g. {\tt X = f(a,Y)}
denotes the set of all pairs $\tuple{x,y}$ of finite trees such
that the first is a tree with root labelled {\tt f}, and with two
sons, the first of which is a tree with root labelled {\tt a} and
with no sons, and the second is the tree $y$.  Thus the store may
describe only partial pieces of information about the possible
values that variables can take. The situation in which the actual
denotation of a variable (say, $\tt X=a$) is permissible but the
exception rather than the rule.

This paradigm shift renders the usual notions of of (imperative)
``write'' and ``read'' incoherent.  There may be no single,
finitely-describable value left to return as the result of a
``read'' operation on a variable.  A ``write'' operation is only
capable of taking a fully-formed concrete value for a variable,
and updating the store to have this variable take on this new
value. But the new store no longer has any connection with the
old store --- constraints satisfied by the old store may not be
preserved at all by the new store. The underlying principle of
{\em information preservation} and {\em accumulation} we are
trying to build on is irretrievably compromised. 

Instead, \cite{my-thesis-book} proposes the replacement of read
with the notion of {\em ask} operations and write with the notion
of {\em tell} operations.  An ask operation takes a constraint
(say, $c$) and uses it to probe the structure of the store.  It
succeeds if the store contains enough information to entail $c$.
Tell takes a constraint and conjoins it to the constraints
already in place in the store.  That is, the set of valuations
describing the resultant store is the intersection of the set of
valuations describing the original store and those describing the
additional constraint.  Thus, as computation progresses, more and
more information is accumulated in the store---a basic step does
not {\em change} the value of a variable but rules out certain
values that were possible before. Thus, the store is {\em
monotonically refined} --- all the constraints that were true of
the store before the operation (that is, were entailed by the
store), continue to be true of the store {\em after} the
operation.

Central to the notions described above is the idea of constraint
systems as systems of partial information. Indeed, it turns out
to be natural to adopt Scott's theory of information systems 
~\cite{scott-info-sys} for this purpose. For the time being,
suffice to say that all the theory demands is a specification
of the pieces of information (in finitely many variables)
admissible in the system, and a specification of the entailment
relation that holds between them. 

The possibilities for concurrency are now quite apparent.
Instead of a single agent interacting with the store via ask and
tell operations, any number of agents can simultaneously interact
with the store in such a fashion.  Synchronization is achieved
via a blocking ask---an agent blocks if the store is not strong
enough to entail the constraint it wishes to check; it remains
blocked until such time as (if ever) some other concurrently
executing agents add enough information to the store for it to be
strong enough to entail the query.  Note, in particular, that
even though the paradigm is based on the notion of a shared
store, ideas such as ``read/write locks'' do not have to be
introduced for synchronization.  The basic reason is that only a
benign form of ``change''---accumulation of information---is
allowed in the system.\footnote{Which of course, is not to say
that systems with changable state are not describable in the \cc\
paradigm.  State change can be represented without compromising
the basic paradigm by adapting standard techniques from logic and
functional programming.  Namely, ``assignable variables'' are
embedded in the local state of a recursive agent---the agent
``changes'' the value of the variable by merely recurring with a
new value for one of its arguments.  In some languages in the
\cc\ framework (e.g., \janus \cite{janus}) there is considerable
hope that such mechanisms for representing state change can be
competitive in performance with traditional assignment-based
techniques.} If desired, indeterminacy can be introduced by
allowing an agent to block on multiple distinct constraints
simultaneously, and specify distinct actions which must be
invoked if the corresponding ask condition is satisfied by the
store.

Thus, in this view, an agent (``computing station'') is thought
of as an {\em information transducer}.  An agent can either add a
constraint to the store (tell), suspend until there is enough
information in the store to entail a given constraint, or
decompose into a number of other agents running in parallel,
possibly with hidden interconnections, and communicating and
synchronizing via the store.  Of course, as is standard,
computations of unbounded length can be achieved by using
recursion.

\paragraph{Compuational significance of the paradigm.}
While extremely simple, these ideas yield a powerful programming
paradigm which derives its strength from the versatility of its
communication mechanism. Essentially the entire range of ``data
structures'' in computer science and AI opens up to examination.
One can ask of graphs, binary trees, dictionaries, hash tables,
association lists, finite sets, temporal intervals, real-valued
temporal signals, etc.: what are coherent pieces of partial
information about these objects and
interrelationshps?\footnote{As indeed these questions have been
asked over the centuries for such ``classical'' data-structures
as continuous real-valued functions.} Indeed, a tremendous amount
of work in computer science and AI is concerned precisely with
elaborating such systems of partial information about objects of
interest, and investigating their cost vs. expressiveness
tradeoffs. Here, we shall just attempt to convey the basic idea,
and refer the reader for more details to documents such as
\cite{vj-thesis-book} and the review of constraint logic
programming in this volume. 

The essential idea is that variables serve as communication
channels between multiple concurrently executing agents.  The
computational framework does not insist that there be an {\em a
priori} partitioning of the agents into ``producers'' and
``consumers'' of information on the variables (as happens, for
example, in function application, where information flows from an
argument to the function body, or in CSP/CCS-style or actor-style
languages).  In fact, the {\em same} agent may simultaneously add
or check constraints on the same variables.

Also, the computational framework allows for the underlying
``network transport protocol'' to perform (potentially
arbitrarily sophisticated) inferences on the ``messages''
entrusted to them---these, of course, are the deductions
sanctioned by the entailment relation of the constraint system.
This allows each agent to state the constraints as it sees them
and frees up the programmer from having to put in agents for
explicitly collecting these messages and drawing the appropriate
inferences (``reformatting the data'').  This allows for very
``high-level'' languages in which essentially computation and
communication can be expressed merely by posting and checking
constraints.

\paragraph{The \Herbrand{} constraint system.}\label{herbrand}
For concreteness, we consider the \Herbrand{} constraint system
that underlies logic programming languages. Assume given some
underlying domain of denotations, a set $I$ of individuals. The
set of interest to us is the set of labelled, finitely branching,
finitely deep trees over $I$, ${\cal T}(I)$. (Individuals are
identified with the tree with $0$ sons labelled with the name of
the individual.)  We shall be
interested in two syntactic categories, Terms and Constraints,
specified by the  BNF grammar:
\begin{equation}
\begin{array}{llll}
\mbox{(Terms)} & s,t & ::= & X \alt a \alt f(t_1, \ldots, t_n) \\
\mbox{(Constraints)} & c & ::= & (s = t) \alt \exists X. c
\end{array}
\end{equation}
Here $X$ is a meta-variable ranging over some underlying domain
of variables, $a$ ranges over names for the individuals in $I$, 
and $f$ over the permissible labels. A constraint $s=t$ is
said to be satisfiable  if there is an assignment of values to
the variables in $s$ and $t$ such that the two terms denote the
same element of ${\cal T}(I)$. Existential quantification is
interpreted in the usual way. A collection of constraints $c_1,
\ldots, c_n$ entails $c$ if every valuation that satisfies each
of the $c_i$ also satisfies $c$. It is well known that the
satisfiability and entailment algorithms for this constraint
system can be solved essentially by the first-order most general
unification algorithms. (See, e.g. \cite{lassez-unify} for a
review of the basic ideas.)  The fundamental properties of this
constraint system are:
\begin{itemize}
\item two terms $f(t_1,\ldots, t_n)$ and
$g(s_1,\ldots, s_k)$ denote the same tree iff $f=g$, $n=k$, and
recursively, $t_i$ denotes the same tree as $s_i$, for $i=1,
\ldots,n$,
\item a constraint $X=t$ is satisfiable iff $t$ does not
contain $X$
\end{itemize}
As a result, consistency and entailment for this system can be
checked in time proportional to the size of the formulas
involved. In reality, this theoretical result is not of much use
--- rather what one is interested in is an assurance that the
commonly occurring unification problems are solvable in {\em
constant} time. As is the sad customary practice in logic
programming implementations, this is guaranteed by ignoring the
``occurrence check''. In the spirit of constraint programming,
this can be justified by claiming that the implementations are
faithful to a constraint solver that works over the domain of
rational rather than finite trees\footnote{Rational trees are
allowed to contain loops --- the requirement is that they must
have a finite number of {\em distinct} sub-trees.}. For the
purposes of this survey, however, we shall ignore the
complications that this compromise brings. 


Even a constraint system as mundane as \Herbrand{} already offers
powerful possibilities for communication, leading to interesting
idioms such as incomplete messages, short-circuits,
difference-lists, ``messages to the future'' etc.
\cite{my-thesis-book}---the techniques that have colloquially
been referred to as stemming from ``the power of the logical
variable''.  (These idioms shall be illustrated later in this
paper.)  For instance, it is possible to embed a communication
channel (variable) in a message as a ``first-class'' object, in
exactly the same way as data. A process $p$ may post the
constraint $\tt X = f(a, Producer)$. Simultaneously and
separately, another process $q$ may post the constraint $\tt X =
f(Item, Consumer)$.  And now $q$ can discover that in fact the
store implies {\tt Item = a} (thus, the value {\tt a} has been
communicated from $p$ to $q$ --- and in fact, to any other
process that has access to (contains) the variable $X$). Further
communication can occur on the channel accessed by $p$ as {\tt
Producer} and by $q$ as {\tt Consumer} since indeed these have
been identified (as a result of the two constraints on $\tt X$).

This kind of communication of a channel in a message is the heart
of the power of concurrent constraint programming. It allows for
an elegant and simple form of dynamic reconfigurability of
processes. This is difficult to achieve in a simple way within
frameworks such as those of CCS and CSP.

\newpage\section{Determinate concurrent constraint programming}
Kahn's theory of determinate data-flow \cite{kahn} remains the
dominant theory of determinate concurrent computation.  While
simple, elegant and widely applicable, the theory's emphasis on
computing with directional point-to-point communication channels
imposes some expressibility restrictions: channels carry only
fully formed values from the underlying data domain; networks are
not dynamically reconfigurable in that connections cannot
themselves be passed as data on communication channels; and each
channel has only one reader and one writer.  Because of these
restrictions, Kahn's theory cannot model such useful programming
idioms as incomplete messages, difference lists, recursive
doubling, multiple readers/writers of a communication channel,
dynamic reconfiguration, etc.  Although these techniques were
originally developed in the context of indeterminate concurrent
logic programming, they are all both inherently determinate and
usefully expressible in determinate contexts.
s
\subsection{Syntax and operational semantics}
What should a programming language for the paradigm of
store-as-constraints look like? We shall develop here a simple
concrete syntax influenced by the syntax of linear logic. For
the determinate languages, we could use a syntax from classical
logic, since it will turn out that the execution mechanism
(operational semantics) for determinate \cc{} programs is
complete for classical derivability (and hence also linear
derivability). This, however, will not be the case when we
move on to indeterminate languages: only completeness with
respect to linear derivability will be obtained. So we prefer 
to use linear syntax up front. This will involve using $\otimes$
for parallel composition instead of $\wedge$, $\linarrow$ instead
of $\rightarrow$, and explicitly using $!$ to indicate
reusability of formulas. 

First, there has to be a mechanism for posting constraints to the
store.   Constraints are to be regarded as formulas that, once
added to the store, are there forever:
\begin{equation}
\begin{array}{lllll}
(Agents) & A & ::= & !c & \mbox{-- Tell}
\end{array}
\end{equation}

Next, there has to be the possibility for checking that certain
pieces of information are present in the store, and, if so,
reducing to some other agent:
\begin{equation}
\begin{array}{lllll}
& A & ::= & !c \linarrow A & \mbox{-- Ask Method}
\end{array}
\end{equation}
In the following we will sometimes use the concrete syntax $c
\then A$ instead of $!c \linarrow A$ to emphasise that this usage
of $\linarrow$ is consistent with its interpretation as classical
implication. 

There should be a mechanism to execute more than one such agent in
parallel:
\begin{equation}
\begin{array}{lllll}
 & A & ::= & A \otimes A & \mbox{-- Parallel Composition}
\end{array}
\end{equation}
In concrete syntax we shall write $A \otimes B$ as $A,B$. 

It should be possible to introduce new communication channels
private to a scope. 
\begin{equation}
\begin{array}{lllll}
& A & ::= & \exists X. A & \mbox{-- Local Channels}
\end{array}
\end{equation}

Finally, there should be the possibility of {\em replication}, of
invoking an Agent an unbounded number of time.  For this, we
allow atoms with user-defined predicate symbols (called
user-defined atoms; let their set be ranged over by the
meta-variable $H$), and replicated definitions:
\begin{equation}
\begin{array}{lllll}
 & A & ::= &  H   & \mbox{-- Linear Message}\\
&&& \alt ! \forall \bar{X}. H \linarrow A & \mbox{-- Linear Method}
\end{array}
\end{equation}
We shall impose the syntactic restriction designed to avoid
indeterminacy in process invocation. 
\begin{restriction}\label{defin-convention}
We shall require (1) For any
given atom $G$, and linear method $! \forall \bar{X}. H \linarrow
A$, there should be at most one substitution $\theta$ such that
$G=\theta(H)$. (This is true if $G$ and $H$ are \Herbrand{}
terms, but may not be true if they contain interpreted terms,
e.g. $+,\cup$ introduced by an underlying constraint system.)  (2) If
an agent contains more than one linear method for a predicate
letter, then the heads of these two methods shall not be
unifiable, once existential variables are replaced with
distinct constants.
\end{restriction}

The above syntax is very flexible. It does not distinguish
between {\em programs } and {\em agents}, like logic programming
languages have traditionally done. This allows methods to be
defined local to the scope of existentially quantified variables
(e.q.v.'s) and is a power that we shall exploit subsequently.
Furthermore, methods are allowed to have free variables that are
quantified outside the method.

\begin{note}
  Indeed, it is possible to show that with such dynamic
  specification of methods, it is not necessary to have any
  user-defined predicate letters except for one. Essentially,
  e.q.v. can play the role of constant names for new procedures.
  However, we shall not take that extra step here in order to
  remain close to the traditional syntactic presentation of
  concurrent logic programming languages.
\end{note}

\begin{note}
  Henceforth, when presenting programs in concrete syntax, we
  shall adopt the following conventions. We shall assume that
  whether an atom is a constraint atom or a user-definable atom
  is obvious by inspection. Constraints are assumed to be
  ``!''ed, hence they need not be explicitly ``!''ed (both in the
  left and right hand sides of implications).  Also, we adopt the
  lexical convention that universally quantified variables
  (u.q.v) begin with a capital letter, and existentially
  quantified variables (e.q.v) with an underscore. A u.q.v.  is
  considered scoped at the outermost arrow in whose left
  hand-side it occurs, and an e.q.v.  is considered scoped at the
  largest conjunct containing all occurrences of that variable.
  This will allow us to almost always dispense with explicit
  quantifiers in programs.
\end{note}

The translation from the conventional syntax for concurrent logic
programs (e.g., FGHC) to this syntax is straightforward, as the
following example shows.
\begin{example}
The {\tt append/3} FGHC program:
\begin{ccprogram}
\agent append([],Y, Z) :- true | Y=Z..
\agent append([A|B], Y, Z) :- true | Z=[A|C], append(B, Y, C)..
\end{ccprogram}
\noindent is translated to
\begin{ccprogram}
\agent append(X,Y,Z) -o
        X = [] -> Y=Z,
        X=[A|B] -> (Z=[A| \_C], append(B, Y,\_C))..
\end{ccprogram}
\end{example}


\paragraph{Operational semantics.}
To present the operational semantics, it is most convenient to
take a configuration to be merely a multiset of agents. (Multiset
union is to be treated as conjunction.)\footnote{The approach we
use here is a departure from the traditional SOS-style of giving
operational semantics rule. However, it has the advantage of
being extremely simple and drawing on the presentation of
inference rules in a Gentzen-style sequent presentation of the
underlying logic.} Intuitively, a configuration represents the
parallel conjunction of agents that are currently present in the
system. The collection of constraints in a configuration
contitute the store. Below, we shall let $\Gamma$ range over such
a multiset, and $C$ over a multiset of constraints. We shall
assume given a relation $\vdash$ between finite multisets of
constraints and a constraint, specifying the semantics of the
underlying constraint system, in this case \Herbrand. 

There are four transition rules, corresponding to the four
composite combinators in the language. An agent $(A,B)$ 
decomposes instantly into its constituents:
\begin{equation}\label{tensor-trans}
(\Gamma, (A_1,A_2))\derives (\Gamma, A_1, A_2)
\end{equation}

We shall identify formulas that are the same upto renaming of
bound variables. To execute an existentially quantified agent,
then, we simply find a renamed version in which the bound
variable is not free in the configuration:
\begin{equation}\label{exists-trans}
\begin{array}{ll}
(\Gamma, \exists X. A)\derives{(\Gamma, A)} &
\mbox{($X$ not free in $\Gamma$)}
\end{array}
\end{equation}

Computation occurs when a message meets a method. There are two
kinds of reductions that can happen, depending on whether the
formulas involved are reusable or not.
\begin{equation}\label{cons-trans}
\begin{array}{ll}
(\Gamma, C, !d \linarrow A)\derives(\Gamma, C, A) & \mbox{(if $C\vdash
!d$)}
\end{array}
\end{equation}

A message can be consumed, if there is a method that matches,
given the constraints in the store:
\begin{equation}\label{proc-trans}
\from{\begin{array}{ll}
C, G \vdash H[\bar{t}/\{\bar{X}] & 
\mbox{$M=(\forall \bar{X}. H \linarrow A)$}
\end{array}
}
\infer{(\Gamma, C, !M, G)\derives(\Gamma, !M, A[\bar{t}/\bar{X}])} 
\end{equation}

As usual, a configuration $\Gamma$ is said to be {\em stuck}  if
there is no configuration $\Gamma'$ such that $\Gamma \derives
\Gamma'$. A stuck configuration is said to have {\em terminated} if 
it consists of only reusable agents (methods and constraints).
The result of a terminated computation is its store,
existentially quantified with all variables other than those
present in the initial configuration. 

The fundamental property of the operational semantics is Church
Rosser:

\begin{theorem}[$\derives$ is Church Rosser.] Let $\Gamma,
  \Gamma_1, \Gamma_2$ be configurations such that $\Gamma
  \derives \Gamma_i$, $i=1,2$.  Then there are configurations
  $\Gamma_1', \Gamma_2'$ such that $\Gamma_i \derives \Gamma_i'$
  and $\Gamma_1'$ is identical to $\Gamma_2'$ upto renaming of
  e.q.v.'s.
\end{theorem}

The basic reason, as always, is that no two reduction steps can
be in conflict. If a step can be taken at any given stage of the
computation, then the same step can be taken at any subsequent
stage of the computation. Clearly, this is the case for steps
involving Rules~(\ref{tensor-trans}) and (\ref{exists-trans})
(upto e.q.v. renaming). A reduction by Rule~(\ref{cons-trans})
once enabled is always enabled, since constraints added to the
configuration are never deleted. There can be only one reduction
involving a user-defined atom, by
Convention~\ref{defin-convention}, and that reduction once
enabled will stay enabled. Hence any two basic reduction steps
will commute.

The theorem immediately gives us that if an agent has a
terminating reduction sequence then all its reduction sequences
terminate, and in final stores that are equivalent when projected
onto the variables in the initial configuration. Thus,
determinate \cc{} programs are, indeed, determinate. 
 
\subsection{Logical Semantics}\label{det-log}

The fragment of logic specified above is in some sense ``dual''
to the usual definite clause logic underlying logic programming:
a clause is of the form $H \then A$, {\em not} $H \leftarrow A$;
one describes {\em necessary} rather than sufficient conditions.
Intuitively, one thinks of specifying for each predicate some
pieces of information that must follow (for some of the arguments
to the predicate) when some other pieces of information are know
to be true about some arguments to the predicate. Each predicate
is thus thought of as a transducer of (partial) information.  One
thinks of the underlying engine as making explicit pieces of
information that are implicit in the original presentation of an
agent --- as explicit as is needed to answer a (primitive) test.

\begin{lemma} Let $A$ be a linear formula, with free variables 
  contained in the free variables of $\Gamma$. If $\Gamma
  \derives \Gamma'$, then $\Gamma \vdash A$ whenever $\Gamma'
  \vdash A$. 
\end{lemma}
This implies the {\em soundness} of CCP computations for this
interpretation in logic:
\begin{theorem}[Soundness of logical interpration] Let $d$ be
  a constraint with its free variables contained in those of
  $\Gamma$.  If $\Gamma \derives^{\star} (\Gamma',C)$, and $C
  \vdash d$, then $\Gamma \vdash d \otimes \top$.
\end{theorem}

In the following $\top$ is a linear logic constant with the
following proof rule in a Gentzen style presentation of the
logic:
$$ \Gamma \vdash \top$$
In other words, any set of formulas can be used to establish
$\top$. 

\begin{theorem}[Completeness of logical interpretation] 
  If $\Gamma \vdash d \otimes \top$, (with free variables of
  $d$ contained in those of $\Gamma$, then there is a
  configuration $\Gamma'$ such that $\Gamma \derives \Gamma'$ and
  $\Gamma' \vdash d \otimes \top$.
\end{theorem}
The basic idea behind the proof is to examine the proof of
$\Gamma \vdash d \otimes \top$ and from that proof construct an
execution sequence that satisfies the given property.

The soundness and completeness results cited above could have
been cast in intuitionistic and classical logic as well as linear
logic. No essential features of linear logic are being used in
the determinate CCP set-up. However, we are setting the stage for
similar results for the indeterminate CCP languages --- there we
can obtain completeness only with respect to linear derivability.

\subsection{Denotational Semantics}
Above, we have seen that (terminating) determinate computations
give rise to a unique store of constraints. One could then ask
about the nature of the mappings from constraints to constraints
established by the execution of determinate \cc{} programs.
\cite{popl91} established that the right mathematical objects
corresponding to determinate \cc{} programs are {\em closure
operators}.  Closure operators are functions $f$ over a partial
order $L$ that are extensive $f(x) \geq x$, monotone ($f(x) \geq
f(y)$ whenever $x \geq y$) and idempotent ($f(f(x))=x$).
Idempotence says that the range of $f$ is equal to the set of its
fixed-points; extensiveness says that each point in the domain is
mapped to a fixed-point above it, and monotonicity ensures that
this must be the {\em least} such fixed point.  Hence, a closure
operator can be uniquely represented by the set of its
fixed-points (its range).

Closure operators capture precisely the idea of monotonic
accumulation of information inherent in the \cc{} conception.
On the one hand this discovery showed us that the mathematical
objects underlying (at least the determinate) \cc{} languages
were very simple indeed, and had many interesting properties.
(For example, the set of resting points of the process $f \wedge
g$ is the intersection of the resting points of $f$ and $g$ ---
from this it is immediate that parallel composition is
commutative, associative, idempotent etc.) On the other, by
thinking of the \cc{} languages as {\em the} language framework
for computing with closure operators we were lead to a number of
simple extensions with interesting computational content.
(See Section~\ref{other-work}.)
\subsection{Programming idioms}
To make it simpler to write programs over \Herbrand{}, we shall
adopt one more convention. 
Often, we shall want to specify that a parametrized constraint be
checked, and allow the check to return values for the paramter. 
For instance, we may want to check whether {\tt X} has been bound
to a cons, and if so to obtain in {\tt Y} the car and in {\tt Z}
the cdr. This can be done by checking the constraint {\tt X = [Y
| Z]}, where  {\tt Y} and {\tt Z} are parameters.
To allow for this we introduce: 
\begin{equation}
\begin{array}{llll}
A & ::= &  \forall \bar{X}. c \linarrow A & 
\mbox{-- Ask Method}
\end{array}
\end{equation}
If $c$ is a \herbrand{} constraint then $ \forall \bar{X}. c
\linarrow A$ is equivelant to $ (\exists\bar{X}.c) \linarrow
\exists\bar{X}.(c, A)$. So the new form can be regarded merely as
a convenient shorthand.

\subsubsection{Simple data-flow programming.}
It should be clear how Kahn data-flow networks can be programmed
in \cc{} languages. Essentially, these languages directly provide
the ability to compose (recursive) networks concurrently, to
establish connections between communication channels, and to hide
local communication channels.

\begin{ccprogram}
\agent fib(Seq, Count) -o fibb([0,1 |Seq], Count)..
\agent fib_1([X1, X2 |  Rest], Count) -o 
 Count \gt 0 -> (Rest= [X1+X2 |  \_Rest1], fib_1([X2 | Rest], Count-1)),
 Count = 0 -> Rest = []..
\end{ccprogram}

Shapiro's paper \cite{shapiro-systolic} examines many systolic
programming idioms from the concurrent logic programming
viewpoint. 

\subsubsection{Producer consumer interactions.}
We show how to set up a bounded buffer producer-consumer interaction between two
agents. When the two agents are created, a new channel is
generated, and the producer is given the teller for the channel,
and the consumer, the asker. Subsequently, the producer chooses
to generate messages on the channel through the teller in the
usual stream-oriented way, and the consumer ``receives'' these
messages through the asker. 

\begin{ccprogram}
\agent goal -o producer(B, [demand, demand, \ldots, demand | \_D]), consumer(B,  \_D)..
\agent producer(B, X) -o
    X = [demand | D] -> (B = [produced | \_B], producer(\_B, D)),
    X = [] -> B = []..
\agent consumer(In, D) -o
    In = [produced | B] -> (D = [demand | \_N], consumer(B, \_N)),
    In = [] -> D= []..
\end{ccprogram}

\subsubsection{Short circuits}
The standard short-circuit programming technique can be expressed
thus:
\begin{ccprogram}
\agent p(\ldots,L,R) -o 
    \ldots 
    c_1 -> p(\ldots,L, \_M), p(\ldots, \_M, R),
    c_2 -> R = L..
\end{ccprogram}
Here, $c_1$ is the condition under which a new switch must be
spliced into the circuit, and $c_2$ the condition under which 
the circuit must be closed.  An initial query may be {\tt p(done,
D),q(D)}.  ``{\tt q}'' can wait until {\tt D=done} succeeds,
before proceeding.  

In \cite{dist-term}, streams of switches, a programming technique for
repeated use of a short-circuit, was presented in the context of
\herbrand\ programming. 
\begin{ccprogram}
\agent p(\ldots,L,R) -o
   \ldots, 
   c_1 ->  p(\ldots,L, \_M), p(\ldots,\_M,R),
   c_2 -> (L=[\_M | \_L'], R = [\_M|\_R'], p(L',R')),
   c_3 -> R=L..
\end{ccprogram}
When an agent desires to close a switch {\em for the current
phase}, it merely equates its left channel to a cons and its
right channel to a cons, and equates their cars, recurring with
the cdrs. This appears to be an instance where run-time
unification is required, though it is guaranteed that the
unification will succeed.  Unification in this case provides a
degree of decoupling between successive {\tt p} agents: a {\tt p}
agent can decide to close a switch in the current phase without
having to synchronize with any other agent. It records its
decision in the store through the constraints it has imposed on
its left and right channels. When all agents have imposed similar
constraints, the end of the current phase can be detected by the
agent holding both ends of the chain.  Thus the bidirectionality
of unification is put to good use.  Note that the third rule in
one step closes a switch for all future rounds.

\subsubsection{Incomplete messages}
A major source of simplicity and elegance in \cc{} languages is
that servers can respond directly to service requests by
communicating on ports embedded in the service request. This
technique is illustrated by the following ``queue-server''
program:

\begin{ccprogram}
\agent queue(Input)-o (split(Input,  \_Enq,  \_Deq), match(\_Enq, \_Deq))..
\agent split([Message | Rest], En, De) -o
         Message = enqueue(X) ->(En = [X | \_En1],  split(Rest, \_En1, De)),
         Message = dequeue(X) ->(De = [X | \_De1],  split(Rest,  En, \_De1))..
\agent match([X | Enq], [ Y | Deq]) -o Y=X, match(Enq, Deq)..
\end{ccprogram}

These rules define a predicate {\tt queue/1}, which will receive
a sequence of {\tt enqueue} and {\tt dequeue} messages over time
and will be able to respond as if these operations were being
performed on a queue.\footnote{Typically the input stream is
obtained by merging one stream from each client---the handling of
many-to-one communication schemes is discussed in more detail
below.} Operationally, a {\tt queue} agent splits into two
agents. One monitors the input queue and distributes it into two
queues, one consisting of the enqueue requests received
(preserving order of arrival) and the other consisting of dequeue
requests received. Each enqueue request contains the value to be
enqueued.  Each dequeue request contains a {\em teller} for a
variable, the asker for which is retained by the agent sending
the request.  When an enqueue request arrives, it is communicated
via the tell capability directly to the agent which requested
that the dequeue operation be performed.  (Note that the length
of the dequeue list can be greater than the length of the enqueue
list, so that the queue may be ``negative'' for a while.)  If we
wish to give clients separate ``enqueue'' and ``dequeue''
capabilities then only the {\tt match/2} program is needed.

\subsubsection{Recursive doubling}
The parallel  prefix problem is of substantial interest in the area of 
concurrent algorithms (see {e.g.} \cite{kruskal-par-prefix}). We
use it here to  demonstrate how to achieve recursive
doubling by sending askers and tellers as messages, thus
achieving ``transparent forwarding''. 

Given $n$ numbers $a_1, \ldots, a_n$, the problem is to compute the $n$
numbers $b_1, \ldots, b_n$ such that $b_j = a_1 \oplus \ldots
\oplus a_j$ where $\oplus$ is a binary, associative operation.
To be concrete, we shall assume that ``$\oplus$'' is ``$+$'' in
the following. 

To solve the problem, a  chain of $n$ agents is generated at
run-time. That is, each agent ${\tt pp}_i$ shares a variable with
the $(i-1)$st agent (the {\em left}
link for the $i$th agent and the {\em right} link for the $(i-1)$st
agent) and with the $(i+1)$st agent. Each agent ${\tt pp}_i$ also
has access to $a_i$. 

In what follows it is useful to think of computation as proceeding in
cycles---in each ``cycle'' an agent receives one message and sends
another---even though, in reality, there is no central clock and all
agents are executing asynchronously.  Each message is of one of two
types.  First, it may be the constant {\tt []}.  Alternately, it may be
a three-element array: the {\em value}, the {\em left} connection and
the {\em left2} connection. 

An agent ${\tt pp}_i$ stores locally the following pieces of
information: it stores $a_i$, its  left and 
right connections on the chain, a variable that will be equated to the 
{\em final} value of the number at this agent, and another variable,
which will represent a connection to the agent ``two steps to its right''.
The role of the last variable---which is critical to the
algorithm---is explained below.  

When an agent is enabled initially, it sends a message to the agent
on its right.  In this message, the value component field contains the
number at the agent, and the  left component is a variable that may be
used to receive the messages the agent will send subsequently.
The  left2 component contains a variable which will be constrained
{\em in the next cycle} to a variable that may be used to receive the
messages that the agent {\it to the left} of the agent sending
this message, will send in future.

In each cycle, every agent reads the variable that connects it to
the agent to its left.  Each such variable will be instantiated to a
message of the above type.  The agent computes its new local value
by performing the given operation ($\oplus$) on the value it receives
from the left, and its local value.  It also equates the variable in
the  left field in the message with the variable in the 
left2 field {\em in the message it sent out in the previous cycle}.
It recurs with its new left connection being the variable in the 
left2 component field in the message it just read.  

If the agent reads a {\tt []} message then it sends {\tt []} 
on the other two connections that it has locally, declares that
the value it was  supposed to compute is its current value and terminates.  

\begin{ccprogram}
\agent pp(In, Out)-o Right\Hat spawn(In, [0,[],[]],  Right, Out)..

\agent spawn(In, Left, Right, Out) -o
    In = [] -> (Right = Left, Out = []),
    In = [N | Rest], number(N) 
      -> \((Right_1,Right_2,Out_1)\Hat 
          pp(N, Left,  Right_1,  Right_2,  Final), 
          Out = [Final | Out_1],
          spawn(Rest, [N, Right_1, Right_2], Right,  Out_1)\)..

\agent pp(N, Left, Right, Right_2, Final) -o
    Left = [] -> (Final = N, Right = [], Right_2 = []),
    Left = [M, Left_2, Left] 
       -> \(pp(M + N, Left,  Right_1,  Right_{21}, Final), 
            Right = [M + N, Right_1, Right_{21}], Right_2 = Left_2\)..
\end{ccprogram}

Note that an agent does not {\em use} the value of the variables {\tt
Left} and {\tt Right2}: it merely serves to equate them so that {\em
some other agents} can directly communicate in the next round.
This simple connection pattern achieves recursive doubling: in the
$i$th cycle an agent receives messages from another agent $2^i$
steps away in the initial chain.  Transparent forwarding is used
to obtain the desired effect. 

\subsubsection{Translating the $\lambda$-calculus}

Perhaps the most remarkable illustration of the power of
constraint-based communication is the translation of the untyped
$\lambda$-calculus into determinate CCP. There are a host of
related translations; we treat here a translation of the lazy
version of the lambda-calculus \cite{abramsky-ong}.  

Intuitively, an application is treated as a parallel composition,
and the operand and the argument are treated as separate
processes communicating on a channel hidden from the outside
world. The agent corresponding to the argument is sent a message
with the identity of the operand, and the result channel. The
operand is represented by a process that waits for an evaluation
message (that will contain the result channel), and instructs its
body to send its result on that channel. A $\lambda$-abstraction
is seen as a server waiting for a message to come down its
specific communication channel (such messages are generated by
application).  When this message arrives, the formal argument of
the abstraction is bound to the (channel representing) the
operand, and the body of the abstraction is asked to send its
output to the result channel. The term $x$ with result $Z$ is
represented by a message on the channel $X$ asking for an
evaluation, with the result to be sent to $Z$. 

Let $M$ be an arbitrary lambda-term, and $Z$ an arbitrary
variable. We define an  agent $\tuple{M}_Z$ by structural induction
on $M$: 
$$
\begin{array}{ll}
\tuple{X}_Z & eval(X,Z)\\
\tuple{\lambda X\ M}_Z & \rep (\forall X \forall Y\ apply(Z,X,Y) \linarrow \tuple{M}_Y) \\
\tuple{MN}_Z &  \exists X \exists Y. (\tuple{M}_X, apply(X,Y,Z), 
! (\forall S. eval(Y, S) \linarrow \tuple{N}_S)
\end{array}
$$

For example, the agent $\tuple{\lambda x.x}_Y$ is just:
$$ \rep \forall B\ (apply(Y,X,B) \linarrow send(X,B)$$
That is, $\lambda X.X$ is viewed as a server that waits for an
argument ($X$), and return channel ($B$), and then sends $B$ to
$X$. 

\subsection{Other topics}\label{other-work}

\subsubsection{Determinate Disjunction.} \label{glb-defn} 

The simplest example of a new operator introduced by the
understanding of determinate \cc{} programs as closure operators 
is  the {\em glb} operation. The set of
closure operators forms a lattice, with the least upper bound
given by parallel composition. What does the greatest lower bound
of $f$ and $g$ give us? On any input $c$, it gives us the
greatest lower bound (in the space of constraints) of $f(c)$ and
$g(c)$ --- one can think of it is a {\em determinate or}: it
extracts the most {\em common} information between the two
branches. \cite{cc-FD} shows how this can be used to obtain
considerable improvement in some constraint propagation
algorithms.

Other examples of operations on constraints (e.g., indexical
constraints, cardinality operators) that can readily be
understood in terms of closure operators may be found in
\cite{cc-FD}. 

\subsubsection{Nondeterminate \cc{} language}
%% What about axiomatization? 

It is natural to ask what happens when disjunction is added to
the picture. This thesis shows one way of doing this
operationally --- one adds a notion of non-determinate prefixing
($\Rightarrow$), and takes as configurations multisets of pairs
of agents and constraints.

However, it is not clear how to connect non-deterministic
prefixing to logical notions. Instead \cite{angelic-nondet}
explores another route. Let $(A_1 \vee A_2)$ be an agent if $A_1$
and $A_2$ are.  Operationally, as with nondeterminate prefixing,
an agent $(A_1 \vee A_2)$ in store $c$ splits into {\em two}
configurations, \tuple{A_1,c} and \tuple{A_2,c}. (Conjunctively
running agents of $(A_1 \vee A_2)$ are copied into both.) When
the store of one branch becomes inconsistent, the branch
vanishes. The crucial question now is: what does one regard as
{\em observable} for such a system of disjunctive agents?

One answer is that the observables of such an agent should be the
multiset of constraints obtained on each branch. However, this
conflicts with the operational requirement that ``false''
branches are unobservable: an agent which produces false on one
branch and {\tt X=1} on another should not be distinguished from
the agent that just produces {\tt X=1}. 

This leads to the notion that two agents are considered
operationally indistinguishable if any contraint $c$ entailed by
{\em each} branch of one is also entailed by each branch of the
other, and vice versa. Indeed, this corresponds to the reading 
of $\vee$ as disjunction: $A \vee B$ entails $c$ just in case $A$
entails $c$ and $B$ entails $c$. 

\cite{angelic-nondet} develops the mathematical foundation of
such languages. As in the determinate case, it turns out to be
adequate to record for a process its {\em resting points}: those
stores in which the process is unable to add any more information
(on at least {\em one} of its branches). Again, the underlying
mathematical reason is that processes can be taken to denote
certain kinds of closure operators, namely {\em
non-deterministic}, {\em linear} closure operators.\footnote{{\em
Non-deterministic} because the operators are functions over a
powerdomain (the Smyth powerdomain), since a process is capable
of producing more than one result from a single store. {\em
Linear} because the operators do not share any information
between stores: the set of (minimal) results that a process can
produce from the set of stores $s \cup t$ is the union of the
results it can produce on $s$ and $t$.} Such operators can also
be represented by the set of their fixed-points; but now one is
not guaranteed that there is a {\em minimal} fixed point above
every input. Rather the result of the computation is taken to be
the set of minimal elements above the input.

This line of development is still unfinished however. Clearly,
other treatments of disjunction are possible, and should be
investigated.


\newpage\section{Indeterminate concurrent constraint programming}
\subsection{Syntax and Operational semantics}
There are several distinct ways of expanding the syntax of agents
to introduce indeterminacy in linear concurrent constraint
programming.  Since our primary task in this paper is to review
the state of concurrent logic programming, we shall adopt an
extension that most faithfully reflects the practice of concurrent
logic programming. Essentially we introduce the notion of
guarded choice:
\begin{equation}
\begin{array}{lllll}
(Agents) & A & ::= & \with_{i \in I} c_i \linarrow A_i & \mbox{--
Guarded choice}
\end{array}
\end{equation}
In concrete syntax, we will prefer to use the guarded choice box
$\OrSep$ rather than $\with$. 

This gives rise to the reduction rule:
\begin{equation}
\begin{array}{ll}
(\Gamma, C, \with_{i \in I} c_i \linarrow A_i) \derives
(\Gamma, C, A_j) & C \vdash c_j, j \in I
\end{array}
\end{equation}

The notions of reduction sequences, and termination remains the
same as before.

\subsection{Logical semantics}
We can now state the basic results connecting the logical
interpretation to the operational execution of indeterminate
programs. They are identical to the results in
Section~\ref{det-log}.

\begin{lemma} Let $A$ be a linear formula, with free variables 
  contained in the free variables of $\Gamma$. If $\Gamma
  \derives \Gamma'$, then $\Gamma \vdash A$ whenever $\Gamma'
  \vdash A$. 
\end{lemma}

\begin{theorem}[Soundness of logical interpration] Let $d$ be
  a constraint with its free variables contained in those of
  $\Gamma$.  If $\Gamma \derives^{\star} (\Gamma',C)$, and $C
  \vdash d$, then $\Gamma \vdash d \otimes \top$.
\end{theorem}
\begin{theorem}[Completeness of logical interpretation] 
  If $\Gamma \vdash d \otimes \top$, (with free variables of
  $d$ contained in those of $\Gamma$, then there is a
  configuration $\Gamma'$ such that $\Gamma \derives \Gamma'$ and
  $\Gamma' \vdash d \otimes \top$.
\end{theorem}


\subsection{Denotational Semantics}
%% In the introduction discuss trace semantics approach.
%% Background work.
% For indeterminacy, several  issues, not all of which have yet
% been investigated. How will the recursion and indeterminism be
% reconciled.
% 
% Construct models.

The denotation of an indeterminate process is only somewhat more
complex than in the determinate case: rather than storing just
the resting points of a process, we must also associate with each
resting point the ``path'' followed by the process in reaching
that point.  Such a path is also called a {\em failure}.  One of
the primary contributions of \cite{popl91} was to develop a
simple treatment of failures via the notion of trace operators
(see below). 

However, in a fashion well-known from work on the semantics of
other indeterminate formalisms, there are still a variety of
choices for the appropriate mathematical domain of processes,
depending on what one wants to consider observable. 

The simplest choice, perhaps, is to observe for a process just
the set of all stores that it can lead to, that is, stores it
``deadlocks'' in (that is, is unable to progress in, unless the
environment intervenes to upgrade the store).  Such a semantics
would be useful, for example, in proving safety (or ``partial
correctness'') properties for the programs: one would be able to
show that regardless of the store the process ends up in, some
given property holds. One can obtain a fully abstract semantics
for this notion of observation by associating with a process just
the set of all failures (resting points, together with the paths
taken to reach that point) generable by the process. Of
importance is the fact that the divergent or undefined process
(needed for the base case of the semantic recursion defining the
semantics of recursive processes) is just the empty set. This is
because there is no store that one can observe such a process
quiescing in.  Indeed, a consequence of this notion of
observation is that a process $\tt c \then A$ is identified with
the process $\tt (c \then \div) \OrSep (c
\then A)$, which can non-deterministically choose to diverge.
(Here \div{} is just some process that can engage in an infinite
execution sequence without producing any constraint or demanding
any constraint from its environment.)

A minor modification of this treatment is to observe for a
process the set of stores it quiesces in, together with a bit
indicating whether this store was {\em terminal}, that is, the
process had reached a stage where it would be unable to add any
more information even if the environment were to upgrade the
information content of the store.  For example, {\tt X=1} is a
terminal resting point of the agent {\tt X=1}, but not of the
agent $\tt (X=1, Y=2) \then Z=3$.  With such a notion of
observation we can distinguish between ``deadlocked'' programs
(programs that have quiesced but can continue if the environment
were to offer more information), and ``terminated'' programs.  It
is straightforward to develop a compositional semantics which is
fully-abstract for this notion of observation.

The biggest drawback of the above treatment, however, is that it
is not useful for reasoning about {\em liveness} properties of a
program, that is for establishing that a process {\em must} be
able to produce some constraints, regardless of the indeterminate
choices  that the underlying scheduler may make. For example, we
know that the process
\begin{ccprogram}
\agent \tt (X=1 \then (Y=1,Z=2)) ++ (R=1 \then (Y=1,S=2))..
\end{ccprogram}
{\em must} produce the constraint {\tt Y=1} on any store stronger than
either {\tt X=1} or {\tt R=1}. 

The main difficulty is treating non-termination or {\em
divergence}.  This is handled in \cite{popl91} by regarding a
divergent process as ``catastrophic'', in the style of
\cite{spec-csp,hoare-models}.  Such a process is modelled as
capable of exhibiting any failure whatsoever.  Thus, the process
$\tt (c \then \div) \OrSep (c \then A)$ is identified with $\tt c
\then \div$: a process that in store $\tt c$ can behave
completely unpredictably (and hence of which no total correctness
assertions can be made). 

The set of all processes is identified by presenting some
naturally motivated closure conditions on sets of failures,
following \cite{joseph-theory}.  (For example, one of the major
conditions ensures that processes are finitely nondeterminate.)
The resulting notion of a process is intuitive and easy to
understand.  Combinators corresponding to ask, tell,
nondeterminate (dependent) choice, parallel composition and
hiding are defined.  A simple operational semantics is given, and
a full correspondence with the denotational semantics is
established.

A very simple relationship was also established between the
indeterminate and determinate semantics, showing that the two
semantics are essentially identical for determinate programs over
finitary constraint systems.

In summary, \cite{popl91} presents a simple model for the \cc\
languages that is fully able to account for nondeterminism and
divergence, thus permitting the use of this model in reasoning
about liveness properties of programs.  The model is also shown
to be fully consistent with the intuitive operational semantics
of \cc\ programs.

\paragraph{Trace operators.}
A central issue in our semantic treatment is the representation
of a failure.  A failure could be represented as a sequence of
ask/tell annotated constraints.  Such a strategy is followed in
earlier work on related languages, for example in
\cite{partial-correctness,levi-invited,levi-tr},
\cite{CP-d-sem}, \cite{gaifman-react} etc.  However, ask-tell
sequences are far too concrete in this setting.  They store too
much information which must then be abstracted from (usually via
complex closure conditions) since they encode the precise path
followed by a process to arrive at a resting point.  Furthermore,
the definition of various combinators becomes rather cumbersome
because of the need to ``propagate'' information down the trace.
Instead, we chose to represent observations via a certain kind of
closure operators.  To capture various operational notions of
interest, we introduce the concept of {\em trace operators}, and
{\em bounded trace operators}. 

Intuitively, a trace operator corresponds to the closure
operators generated by ``straight line'' (non-branching)
ask-and-tell \cc{} programs. Mathematically, their characterizing
property is that they are {\em invertible}: to every trace
operator $f$ there corresponds another trace operator $f^{-1}$
such that the only resting point of $f \wedge f^{-1}$ is {\tt
false} (the top of the lattice). Intuitively, $f^{-1}$ is
obtained from $f$ by taking the characteristic ask-and-tell
sequence of $f$ and ``inverting'' it, making every ask a tell,
and every tell an ask. Thus, on any input, the conjunction of the
two processes is able to tell exactly what the other asks, and
thus proceed hand-over-hand, so to speak, all the way to {\tt
false}. Hence {\tt false} is the only resting point for their
conjunction. For instance, the inverse for the trace operator
$$\tt (W=1) \then (X=2 \wedge (Y=3 \then Z=4))$$ is $$\tt (W=1)
\wedge (X=2 \then (Y=3 \wedge (Z=4 \then \false)))$$

A {\em bounded} trace operator is a trace operator that can stop
short of \false; that is, it is a trace operator on a sublattice
obtained by considering all the elements below a certain element
in the lattice. 

\paragraph{``Assume/Tell'' sequences.}
A related line of attack has been independently pursued by Frank
De Boer, Catuscia Palamidessi and their colleagues
\cite{deboer-ccp,deBoer-asynch}. In \cite{deBoer-1,deBoer-2}, the
authors introduced the idea that it was possible to give a
semantics to the asynchronous concurrent logic programming
languages in a style familiar from the semantics of indeterminate
data-flow languages. With a process were associated the set of
its {\em assume-tell} sequences: a sequence $\tt
ask(c_1).tell(c_2).ask(c_3)\ldots tell(c_n)$ is associated with a
process if when initiated in store $\tt c_1$, it can upgrade it
to store $\tt c_2$, and then when initiated in $\tt c_3$ it can
upgrade it to $c_4$ and so on, until when initiated in $\tt
c_{n-1}$ it can upgrade it to $c_n$ and then quiesce. The
denotation of a process is then taken to a set of such sequences
closed under some conditions which reflect the asynchronous
nature of communication, and the lack of granularity of the
Eventual Tell operation.\footnote{In a language without Atomic
Tell operations, it is not possible to distinguish between the
agent $\tell(c_1), \tell(c_2)$ and the agent $\tell(c)$, where
$c$ is logically equivalent to the conjunction of $c_1$ and
$c_2$. Alternatively, each communication $\tell(c)$ can be
thought of as happening in arbitrarily many, finite ``pieces''.}

The resulting model are very similar in nature to the
trace-operator based models discussed above --- indeed trace
operators are just an abstract counterpart of such assume/tell
sequences. (A significant technical difference is that the
authors do not handle the interaction between divergence and
non-determinism, so that their semantics is not adequate for
dealing with liveness properties of processes.)  The basic point
made by the authors --- that for asychronous languages the
``branching structure'' of the refusal set semantics of
\cite{csp-theory} is not needed --- is indeed valid, and is
further corroborated by the work of \cite{joseph-theory}. Indeed,
deBoer, Palamidessi and their colleagues have gone on to
demonstrate the generality of their approach by applying it to
other asynchronous languages in which communication is not
``extensive'' \cite{deBoer-asynch}.

\cite{deBoer-ccp-pa} discusses an equational axiomatization of the
congruence on \cc{} processes induced by the above semantics.
This is very valuable for it serves as the foundation on which to
build a ``work-bench'' that can be used to construct and reason
about programs. A similar axiomatization for the determinate
language is provided in \cite{popl91}. 

%% What is their work on a process-algebra for cc programming?

\paragraph{Bisimulation semantics.}
\cite{popl90} develops a different approach to the semantics of
CCP. Following the operationally motivated work of Milner (and
Park) \cite{ccs} (see \cite{Milner-handbook} for an uptodate
account), we introduced the notion of {\em reactive equivalence}
of \cc{} programs.  

Let the ``environment'' in which an agent is executed be an
abstraction of the rest of the computational system---including
concurrently executing agents---with which the agent must
interact to complete its computation.  In order to interact, an
agent must have certain variables---the {\em visible
variables}---in common with its environment.  A basic action is
said to be a {\em visible action} if the agent's capability to
engage in that action may be influenced by the environment. Thus
a visible tell action is an action that imposes ``new''
constraints (i.e., constraints which are not yet known to have
been imposed) on the visible variables, and a visible ask action
is one which checks whether some new constraints have been
imposed on the visible variables. Note that a visible action is
not something that is ``directly seen'' by another agent---rather
it is a step in the agent's own computation which could have been
influenced by the action of some other agent. Such circumspection
is necessary in understanding this concept because communication
in \cc\ languages is very indirect, and hence the concepts
developed in the \ccs/\csp\ framework---in which an event is
observable if the environment can synchronously engage in
it--are not directly relevant.

Unrolling the possible visible actions a process engages yields
the {\em c-tree} (analogous to the ``synchronization tree'' for
CCS) for the process. Two processes are considered {\em
reactively equivalent} if, roughly, any visible action that one
can do can be matched by the other in such a way that both
processes end up in a state in which they remain reactively
equal. 

Note that the notion of visible action takes into account that
telling (or asking) a constraint $c$ is a silent action if the
store is already strong enough to entail $c$: thus the agent
$(X=1) \then (X=1 \wedge Y=2)$ is reactively equivalent to the
agent $(X=1) \then Y=2$. This highlights that the notion of 
communication in the \cc{} languages is very different from that
in \ccs{} or \csp{}: in those systems each action is taken as
atomic and does not interfere with any other actions the system
may have taken in the past. 
 
I once believed (in line with traditional dogma in concurrency
theory) that reactive equivalence was the ``finest'' congruence
that one could define on indeterminate \cc{} agents. However, the
emergence of linear logic has made it possible to treat
indeterminacy on a logical footing, and it now seems plausible
that there is an even finer congruence than bisimulation, namely
linear logic equivalence. 


\paragraph{Graph Grammars and True concurrency semantics}
\cite{fran-true} develops a ``true concurrency'' semantics for
CCP based on the notion of {\em graph grammars}. The state of the
computation is described by a graph, with variables as nodes, and
constraint tokens and agents as hyperarcs.  {\em Both} the
inference rules in the underlying constraint system, and (the
user-specified) agents are expressed as {\em graph productions}
which match subgraphs to other subgraphs, possibly introducing
new nodes in the process, and deleting portions of the matched
subgraph. A computation is then just a finite or infinite
sequence of such graph rewriting steps. From such a computation,
it is shown how to construct an {\em occurrence net}, a
particularly simple kind of Petri-net which can express
``causality'' information. The move from occurrence-nets to
partial orders is standard, though it depends on what is desired
to be observed of the computation.

An attractive feature of this approach is that the
``productions'' implementing the underlying constraint system are
treated on an equal footing with user-defined agents. Indeed, the
constraint system itself can be seen as a closure operator on
sets of tokens, ordered by the is-a-subset-of relation.

What remains unclear is the connection between the closure
operator semantics (which abstracts from execution sequences),
and the partial ordering semantics generated from an occurrence
net. More work is needed to resolve this issue. Also unclear is
the precise advantage gained by the more general true concurrency
approach over the interleaving approach. Perhaps this will become
clearer when one considers ``refinable'' basic actions (e.g., ask
operations that themselves involve some computation, i.e., the
so-called ``deep guards'', see \cite[Chapter 8]{vj-thesis-book}).

Based in part on this work, the authors and their colleagues have
gone on to describe in \cite{charm} an abstract machine, CHARM
(for Concurrency and Hiding in an Abstract Rewriting Machine).
Its distinguishing feature is the notion of partitioning the
state of a process into a {\em global} (externally visible) and a
local (hidden) part. Transitions preserve the global parts they
match on; intuitively the global parts matched represent the
structure of the underlying graph that is ``asked for'' (tested),
but not consumed.  CHARM is intended as a ``unifying framework''
for concurrency, since it is able to formalise in the same
conceptual framework, the classical approach to graph grammars as
well as the CCP paradigm; and its authors expect it will also be
useful in formalizing the basic concepts in other concurrent
paradigms such as CCS and CSP. 

As we shall see below, the connection with graph grammars is
particularly appropriate when one considers a visual syntax for
(at least) the \cc/\herbrand{} programming languages. 

\paragraph{Future work.}
%% What about other congruences derived from process algebras?
Among various aspects of concurrent constraint programming,
semantic foundations seem to be the most well-developed. Perhaps
the most glaring gap is that a comprehensive, systematic, theory
of testing, with more powerful tests tied to finer congruences,
has not yet been developed. Also, I believe that with the advent
of linear logic, it may be possible to tie the notion of testing
to the notion of recording as the denotation of a process a
well-chosen {\em subset} of the logical formulas entailed by the
process (viewed as a logical formula).  If this works out, it
will be a very beautiful way of exposing the logical structure of
the programming language.

Also, I would hope that the work can serve as the foundation for
the development of practical tools to help in partial evaluation, 
program transformation, and static analysis of programs. 


\subsubsection{Related work}
%% Discuss the set of traces work in my FSTTCS paper.
%% Discuss the Gerth et al paper.

The basic concepts of concurrent constraint programming were
introduced in \cite{clean-sem,my-thesis-book}.  Subseqently, we
developed an operational semantics and a bisimulation semantics
in \cite{popl90}.  This line of research extends and subsumes
(for all intents and purposes) earlier work on the semantics of
concurrent logic programming languages.\footnote{The semantics
presented in this paper does not account for the language
Concurrent Prolog studied in \cite{CP-d-sem}; we do not, however,
view this as a defect in our approach. Indeed, researchers
working with Concurrent Prolog have moved to the Ask-and-Tell
\cc\ framework \cite{shap-fgcs,gaifman-react}.}

The power of the ``logical variable'' has been amply recognized
in recent years, and numerous authors---too numerous for us to
survey their work in this extended abstract--- have investigated
the combination of functional and logic programming languages
\cite{fg+lv,lp-book-1,i-structures,Jagadeesan89}.  However, no
account of the basic paradigm comparable with Kahn's original
account in simplicity and generality has been forthcoming.

Perhaps the most noteworthy in this context is the work of
\cite{Jagadeesan89}, which discusses a subset of the functional
programming language {\em Id Noveau} with ``logical'' arrays
\cite{i-structures}.  Their semantic treatment also identifies a
program as a closure operator.  However, their treatment is
specialized to the particular kinds of arrays they were dealing
with; our work is parametrized by a very general notion of
constraint system.  Further, they do not explicitly discuss the
hiding and the Ask operations; though their ``Tell'' operation
does implicit asks.  We also present an axiomatization for the
model.  Their operational model includes an explicit discussion
of how the constraints are resolved and represented, we achieve
considerable simplicity by abstracting away the details of the
constraint resolution process.

The characterization of concurrency via intersection of sets of fixed
points has also been elaborated by Hoare in a recent paper
\cite{hoare-conj}, in a setting related to Unity~\cite{unity}.  Our paper
develops that viewpoint, and presents a characterization of other
combinators as well, in the same style.  We believe that the concurrent
constraint programming languages are the natural setting for the
development of a theory of computing with closure operators.

The semantics of nondeterminate concurrent constraint programming
languages is becoming an active area
\cite{popl90,gaifman-react,levi-tr,deboer}.  However none of this work
explores the very simple characterizations possible for the determinate
languages, dealing instead with the representation of processes as various
sorts of trees or sets of i/o annotated sequences of constraints.  The
notion of determinacy was studied in the set-up of logic programming
languages in \cite{alps}, but no characterizations of the kind we provide
here were given.

\cite{gaifman-react} does not consider most of the combinators we
treat here, besides having a complicated treatment of ask/tell
sequences.  Neither does it treat recursion.

In work to appear, deBoer and Palamidessi \cite{deboer,deboer-tr}
propose a model which is similar to ours in many respects.  In
particular, they have also recognized that it is sufficient to
take sequences of ask/tell actions in order to get a model for
the \cc\ languages.  However, their treatment ignores recursion.
Much of the sophistication of the present model lies in the way
finite nondeterminate axioms need to be introduced in order to
correctly model divergence.  Their model in \cite{deboer-tr} is
not compositional with respect to the choice operator.

The treatment described above follows closely Mark Josephs'
treatment of receptive processes in \cite{joseph-theory}.  A
receptive process can always accept any input from the
environment, but may produce different (or no) output depending
on the input.  Josephs gives a set of axioms for reactive
processes that turn out to be more or less what is needed in
order to develop a model for the \cc\ languages as well.  The
primary differences lie in the nature of communication, and in
the combinators treated.  Josephs' theory treats communication in
the usual CCS/CSP style as the exchange of uninterpreted tokens
with the environment.  The constraint-based nature of the \cc\
languages imposes additional structure which must be considered.
However, as this paper demonstrates, it is possible to adapt his
basic model to the \cc\ setup without major reworking, which
should confirm the essential robustness of his conceptualization
of asynchronous systems.

\subsection{Logical Semantics}
\subsection{Programming idioms}
\paragraph{Many to one communication using bags}

The communication mechanism discussed in \janus\ hitherto has focussed
on point-to-point directional communication channels.  An agent may have
a fixed number of such channels.  Such communication schemes are too
simplistic, however, to deal elegantly with many to one communication. 
Consider for example a situation in which an a priori unbounded number
of clients may seek service from a server. 

The server will have read access to a bag.  It will wait for requests by
asking if its bag is {\tt element(Request,Bag')}.  It then responds to
the {\tt Request} and recurs with the {\tt Bag'}.  A client has a teller
to an element of the bag, say {\tt X}.  It sends a request by telling
{\tt X=bag(Request,X')} and using {\tt  X'} for subsequent requests. 
It can split its access {\tt X} to the server by telling {\tt
X=bag(X1,X2)} thereby creating the two accesses {\tt X1} and {\tt X2}. 
A client drops bag reference by posting constraint {\tt X=\{\}}. 

Consider the simple bank account as an example of how to instantiate
this communication scheme:
\begin{ccprogram}
\agent	account\_receptionist(\{\}, State)..
\agent	account\_receptionist(element(Req, Rest), State)-o 
	    account(Req, State,  State1), 
	    account\_receptionist(Rest, State1)..

\agent	account(balance( Value), Bal,  NewBal), integer(Bal) -o  
	(NewBal = Bal, Value = Bal)..
\agent	account(withdraw(Amt,  Ok), Bal,  NewBal) -o
       Amt \leq Bal -> (Ok=ok, NewBal=Bal-Amt)..
       Amt > Bal -> (Ok=nope, NewBal=Bal)..
\agent	account(deposit(Amt), Bal,  NewBal)-o NewBal=Bal+Amt..
\end{ccprogram}

The guard {\tt integer(Bal)} was added to the first rule of {\tt
account} to indicate that {\tt Bal} is forced and can therefore occur
multiple times.  We expect a good compiler based upon abstract
interpretation can compile away this test. 


\newpage\section{Programming languages}
The basic scheme described above captures the semantics of Flat
GHC. Flat Parlog is also essentially identical.

%% What was the idea in Ringwood's GDCC? 

\subsection{Atomic publication}

\subsection{Read-only annotations}
New definition in Flat Concurrent Prolog. Based on an attractive
but flawed idea. 

\subsection{Deep guards}
\subsection{Non-determinism}
\subsection{Failure-free langauges}

%% Constraint system.
%% Asker teller notion. closely related to the proof as program
%% interpretation of linear logic. augmented with the notion of listeners.

In the context of distributed computation, a fundamental problem
with \cc{}  languages (the {\em multiple teller} problem) is that
a computation may have multiple producers of partial information
for the same set of variables.  Unless these producers are
coordinated in some way with respect to each other, they may
produce conflicting constraints on the same variable, thus
causing the system to crash.  For example, one agent may
constrain a variable {\tt X} to be equated to {\tt 3} and another
may constrain it to be {\tt 4}.

Languages such as \herbrand\ used the notion of atomic
publication to circumvent the problem: An agent must augment the
constraints in the store atomically, failing if the store is
incompatible with the new information.  However, this notion is
far too heavy-weight to serve as a foundational communication
mechanism in a distributed setting Other \cc/\herbrand\ languages
such as \parlog, \fghc\ and \strand\ deal with the multiple
teller problem by ignoring it!  They trust that the programmer
will organize computations in such a way that the computations
will never generate conflicting constraints at run-time.  If this
trust is violated then the entire computation will abort with a
logical error.

These motivations led to the design of \janus{} \cite{janus}.
The fundamental intuition is that mechanisms should be devised
such that computation is not forced to abort at run-time because
the store is rendered inconsistent.  Second, the
communication mechanism should naturally reflect the notion of
asynchronous message passing, to be compatible with basic
distributed computation concerns.  That is, it should be possible
for agents at a site to progress by publishing information
without having to coordinate with agents at other sites.  (This
desideratum is even more important if it is desired that
computations in other languages, built using different
abstractions, should be able to {\em inter-operate} smoothly with
computations in these languages.)

Finally, if the languages are to be widely used as procedural
languages there should at least be the potential for a
performance-efficient implementation competitive with that of
more conventional distributed assignment-based languages.  Even
more strongly, we believe that uni-processor implementations of
such languages should be comparable in performance with the
implementations of sequential, imperative languages.

\janus{} attacks this problem on two fronts.  First, we
identified a constraint system, suitably rich for procedural
programming, which possesses certain ``independence'' properties
and naturally supports many-to-one communication.  Suppose {\tt
X} and {\tt Y} are two variables unconstrained in the current
store, and an agent decides to post the (consistent) constraint
{\tt X=s}, for some term {\tt s} possibly containing {\tt Y}.
Then, intuitively, for the constraint system to be implementable
well in a distributed setting it should not be possible to
construe this imposition as a constraint on {\tt Y}.  That is,
more generally, if $c_1$ is a constraint of the form
$\wedge_{i=1}^n X_i=s_i$ and $c_2$ of the form $\wedge_{j=1}^m
Y_j=t_j$, where the $X_i$ and $Y_j$ are distinct, then $c_1
\wedge c_2$ should be consistent iff $c_1$ and $c_2$ are
consistent separately. The \herbrand\ constraint system does not
satisfy this property: if the constraint {\tt X=f(Y)} is imposed,
then this restricts {\tt Y}; for example, the constraint {\tt
Y=X} cannot be imposed.  Part of our task was to design a
constraint system, $\cal J$, closely related to \herbrand, which
satisfies such properties.  Essentially, the constraint system is
defined over infinite trees \cite{prologiia}; but matters are
complicated due to the necessity of introducing
``hyper-multisets'' in order to cater for many-to-one
communication.  In designing $\cal J$ our paramount concern had
been that it should be efficiently implementable in a distributed
setting, expressive enough to allow straight-forward and elegant
formulation of standard algorithms over standard data-structures
such as arrays and lists, and should be such that with an
understanding of some basic representation decisions made by an
implementation, a programmer can fairly accurately predict the
cost of executiion of various programs.

Second, we have exploited the idea of {\em syntactic control of
interference}.  Compile-time checkable restrictions are
introduced which ensure that if the current store is consistent
and a goal reduces via a program clause, then the resultant store
will be guaranteed to be consistent, {\em without having to check
for consistency}.  This is achieved by exploiting a remarkably
simple idea that I first stumbled across in 1987 when studying
the notion of broadcast-free communication.

Say a clause is binary if any variable occurs free at most twice
in the clause. In such a clause, a variable may be considered as
a ``point-to-point'' communication channel because messages
generated on it from one occurrence only need to travel to one
other occurrence. Now, observe that the resolvent of two binary
clauses is also binary! (``Two'' is a magic number here --- from
three one may generate as many occurrences as one wants.) This
means that if the source program clauses satisfy this condition,
then any configuration generated at run-time will also satisfy
this condition. All that remains then is to (a)~devise a scheme
which ensures that exactly one of these occurrences (the {\em
teller}) can be used to impose a constraint on the variable ---
the other occurrence (the {\em asker}) is then used to
``receive'' information (b)~ensure that in this scheme
``tell-rights'' can be communicated on channels, and
(c)~determine if this scheme is powerful enough to express the
characteristic idioms of concurrent logic programming.  All three
goals were achieved in \cite{janus}, adapting ideas from
\cite{doc}.


A major opportunity offered by \janus{} is that of compile-time
garabage-collection. Because any variable has at most two
occurrences, once one occurrence of a variable receives a term,
it ``knows'' that there are no other references outstanding to
that term (the other occurrence of the variable has been ``used
up'' in binding the variable to the term). Therefore, a compiler
can generate code knowing that whenever a data-structure in an
atom has been successfully matched, the data-structure is garbage
and can be recycled. This is discussed in more detail in the
implementation section below.


\newpage\section{Implementation}
\label{intro}

Efficient implementation of \cc\ languages, as that of more
traditional languages, hinges on low memory usage, compile-time code
optimization, and low-overhead runtime management of concurrency.  By
keeping the program's working set small, locality of the machine's
memory hierarchy can be best exploited, reducing expensive faults
farther from the CPU.  The storage model selected, e.g., stack or
heap, is a critical design decision here.  Compile-time code
optimization also centers around the memory hierarchy: efficient
utilization of the available machine registers and cache.  This
involves avoiding redundant computation (e.g., by strength reduction),
which also saves CPU cycles.  Finally, the runtime overheads of
concurrent task management must be significantly lower than
computation within tasks.  This is often noted in terms of the
communication to computation ratio, assuming that the primary action
of task management is transmission of messages between processors.

Implementations to date of \cc\ languages were targeted to the broad
categories of uniprocessor, shared-memory multiprocessor, and
distributed-memory multiprocessor hosts.  Almost all of the
implementations use a heap-based storage model wherein tasks are
allocated in an {\em ad hoc} fashion from either global or local
storage pools.  Stack-based implementations are more compatible with
uniprocessor hosts; however, task suspension must still be
implemented, perhaps in a manner similar to implementations of the
{\em freeze} primative in certain Prologs.  In the long-run, although
not seen yet, implementations for multiprocessor hosts must also move
to stack-based models, perhaps adopting ideas from partitioning of
threads in dataflow languages \cite{TAM,traub92}.  We discuss these
issues further in Section \ref{} on future work.  In this section we
review the main themes and efforts in the implementation of \cc\
languages over the past decade.  We begin with a brief historical
overview.

\paragraph{Historical review.}
Some interesting comparative work done at the University of Edinburgh
by R. Trehan \cite{trehan-ms} and H. Pinto \cite{pinto-ms} summarized
the experiences of programming and interpreting these languages in the
``early days.''  These concurrent languages could be differentiated
primarily by their synchronization mechanisms and how they managed
multiple local environments.  There are various other attributes, such
as granularity control and goal scheduling, unification, etc., that
affect implementation complexity and efficiency.  We discuss these in
depth in the following sections.

Trehan and Pinto's studies focused on interpretation, whereas
further evolution of implementation efforts led to compilation
and hardware support.  The first abstract machine designs for
this family of languages were the WAM-like FCP machine (Emu) by
A. Houri \cite{houri-cp-book,silverman-cp-book}, Sequential
Parlog machine (SPM) by S.  Gregory \cite{gregory-book}, and the
KL1 machine by Y. Kimura \cite{KL1}.  These systems represent the
first-generation compiled implementations of concurrent logic
languages, evolving into more sophisticated systems.  The
sequential FCP machine was refined first by S. Taylor into a
hypercube system \cite{taylor-book} and by S.  Kliger into a
RISC-based abstract machine and optimizing compiler
\cite{kliger-phd}.  The SPM led to J.  Crammond's Abstract
Machine (JAM), the first parallel Parlog implementation
\cite{jam-paper}, and the PPM \cite{cheese-phd}.  The KL1 machine
was implemented on shared-memory machines as Panda
\cite{sato-pisa} and evolved into the abstract machine shared
among the ``parallel inference machines'' (PIMs).  A
hybridization of a few of these projects was I.  Foster and S.
Taylor's Flat Parlog machine \cite{flat-parlog} leading to the
Strand Abstract Machine (SAM) \cite{strand}, ported to several
types of multiprocessors.  These systems represent the
second-generation parallel implementations, the first comparative
study of which was by Foster and Taylor \cite{flat-parlog}.  The
community is now beginning the construction of third-generation
{\em optimized} compiler-based, portable systems, e.g., the {\tt
jc} Janus system \cite{gudeman-jicslp92}, Monaco
\cite{tick-iclp93}, and a portable KL1 system
\cite{chikayama-fast}.\footnote{ Interesting comparisons of the
execution performance of many of these first, second and third
generation systems can be found in Taylor \cite{taylor-book} and
Tick \cite{tick-iclp93,tick-crammond-icpp}.}

Specialized hardware efforts were concentrated mainly at ICOT with the
decade-long FGCS project and their aim of building PIMs.  The personal
inference machines (PSI-I,II,III) \cite{psi4,PSI-II} were followed by
mockup PIMs (Multi-PSI-V1/2, built of PSI-IIs), and finally
PIM/\{c,i,k,m,p\} \cite{pim/c,sato91,pim/k,pim/m,pim/p}.  The main
efforts, PIM/\{m,p\}, are large multiprocessors ($2^8$--$2^9$
processors) based on specialized hardware for direct execution of KL1.
Other notable hardware implementation efforts include the Carmel
microprocessors \cite{CARMEL-4} and a related microprocessor proposed
by Alkalaj \cite{alkalaj-isca}.  A full analysis of hardware issues in
concurrent logic language implementations is beyond the scope of this
article (see Tick \cite{tick-fgcs93} for instance), although we do
correlate the instruction set designs of the software and hardware
oriented implementations in Section \ref{arch}.

Originating around 1988, a related family of implementations has been
developing around the Andorra model of computation, specifically
Andorra I \cite{costa-ppopp,costa-iclp}, Andorra Kernal Language (AKL)
\cite{akl-paradigms}, Pandora \cite{bahgat-phd,pandora}, and ANDOR-II
\cite{takeuchi-phd}.  We cannot do justice to these implementations in
this paper, as they are rapidly evolving.

\subsection{Principles and Trends}

Efficient implementation of concurrent logic programs requires strong
foundations in several areas.  As in any parallel system,
task\footnote{ We use the words {\em task}, {\em process}, and {\em
goal} interchangeably.} switching and task creation are the
primitive operations that must be made fast.  Furthermore, as in any
computational system, task invocation, variable binding, and memory
reclamation must also be made fast.  For concurrent logic programs,
task switching means suspending one task and substituting (resuming)
another; task creation means building a body goal task from its
parent's arguments and perhaps spawning it on a remote processor.
Task invocation is extended here to include the action of executing a
goal to its commit point, i.e., performing the clause tries needed to
commit, suspend or fail.\footnote{ A concurrent logic program task is
like a {\em thread} in threaded architectures.  The task invocation
creates a main thread which may split into multiple threads during
guard evaluation, all synchronizing at commit, leading to a single
clause body thread.  The body calls spawn new threads, and so
on.} Variable binding incurs added overheads to guarantee atomicity
(i.e., locking around the update to avoid races among competing
writers).  Not only is fast memory reclamation critical, but moreover
so is efficient use of memory in the first place, since the
single-assignment nature of the languages can be quite profligate in
touching memory.

By far the most complex implementation aspect of these basic
operations is task switching and task invocation because of language
synchronization semantics that require implicit synchronization on
potentially incoming procedure arguments.  This places a burden on the
compiler and generally bloats procedure invocations with respect
to sequential languages and implementations.  The various
nuances of language semantics, e.g., deep or flat guards, atomic or
nonatomic tell unification, impact implementation efficiency.

Orthogonal to these primitive operations are intelligent task
management policies that are desirable: balanced load, balanced
granularity, and fair scheduling.  These concepts are not unique to
concurrent logic programs, and are required independent of how fast
the primitive operations can be made.  Looking at underlying
multiprocessor hosts, we have an additional requirement to achieve
full efficiency: latencies must be hidden.  Memory latency in
distributed multiprocessors is the major problem to dealt with.  As
Arvind showed \cite{arvind87}, hiding latency effectively is directly
traded-off against switching tasks quickly.  We shall see (Section
\ref{read}) that current concurrent logic programming systems can hide
latency, but only within limits, and certainly overly-complex
languages features cannot be effectively hidden.

The past ten years have seen a trend towards {\em devolution} of logic
programming languages driven by the practical need to build fast
implementations.  The most drastic step was the definition of
committed-choice languages that did not backtrack, enabling the first
pseudo-parallel interpreters to be built.  The next devolutionary step
was from deep to flat guards, and moving from synchronizing on dynamic
read-only variables to synchronizing on statically-declared arguments,
enabling the first efficient implementations to be built.  Next were
restrictions placed on how variables could be bound: Strand
\cite{strand} abolished output unification in favor of assignment,
similar to moded FGHC \cite{ueda-93} which constrains a logical
variable to have a single producer.  More strict, Doc \cite{doc}, \aum
\cite{yoshida-fgcs88}, and Janus \cite{janus} constrain a logical
variable to have at most a single producer and single consumer.  These
simplifications facilitate compile-time analysis and optimization of
memory usage.

The progressions are further described in subsequent sections.  The key
point is that languages are refined by reaching an equilibrium between
what application writers {\em demand} and what implementors {\em
supply}.  There is not yet full agreement as to where this equilibrium
point is for concurrent logic programs, and we think it will be most
strongly influenced by fast, portable, and parallel implementations.

\subsection{Synchronization}
\label{synch}

Concurrent logic programs synchronize on logical variables, similar to
how non-strict dataflow languages use I-structures 	
\cite{arvind89,lindstrom85}.
If a required input variable is unbound upon procedure invocation, the
corresponding clause cannot commit.  Furthermore, if no clause
defining the procedure can commit and not all clauses fail, it implies
some required input variable(s) have not been delivered, and the task
must be suspended.  Concurrently, if any of these required input
variables are bound, the task must be resumed.

Input matching (passive unification) is transformed by compilation
into instruction sequences that make matching efficient in general.
The ability to synchronize on variables requires temporarily binding
certain unbound logical variables to a suspended task to enable
subsequent resumption.  The efficiency of this infrastructure is the
main factor in synchronization performance.\footnote{ Early systems
did not attempt to statically analyze logical variables, e.g., to
determine if a variable can possibly be hooked, and if not, how to
generate more efficient code for the ask tests.  Recent compilers,
e.g., \cite{kliger-phd,yanoo}, claim to do global static analysis to
determine this and other information.}

%
% IMAI AND TICK MEASUREMENT WAS FOR ACTIVE CELLS DURING GC: THUS
% A VARIABLE CELL COULD BE RECOUNTED AS IT BOUNCED BETWEEN OLD AND NEW
% SPACES.  THIS INDICATED HOWEVER, THAT THESE VARIABLES LIVED A LONG
% TIME?  OR WAS IT BECAUSE OF TAILS OF MANY STREAMS?
%

The FCP, JAM and KL1 machine architectures all use similar methods of
``hooked'' variables, i.e., assigning indirect pointers from suspended
variables to process structures \cite{houri-cp-book}.  Indirection is
required to allow both multiple variables to synchronize the same
task, and multiple tasks to be synchronized by the same variable.
Variables are infrequent data types: Imai and Tick \cite{imai-spdp}
measured 1--15\% of dynamic objects are variables across a KL1
benchmark suite.  To our knowledge, no one has measured the prevalence
of hooked variables, and the characteristics of those hooks.  It is a
widely-held belief that hooks are quite simple in structure and rare
in frequency. \footnote{The former assertion is more strongly
supported than the latter --- ``object-oriented'' programs can create
many suspensions, as discussed in the remainder of this section.  The
two such programs measured by Imai and Tick produced far more
variables than the other benchmarks.} Thus JAM Parlog \cite{jam-paper}
and Strand \cite{strand} allow goals to be hooked to only one
variable, thereby obviating the complex bookkeeping structures needed
for the general case.  JAM exploits shared memory to implement a
``hybrid'' suspension list to gain this efficiency.  Strand initiates
all suspensions as if they are single-variable type, and if this most
frequent case is violated, a queue of exceptionally suspended goals is
used.

An orthogonal issue is how to specify the input variables upon which
to synchronize.  The most common method is ``procedure level''
representation wherein synchronizing variables are syntactically
specified (explicitly as in Parlog or implicitly as in GHC) on a per
clause basis.  Alternatively synchronization at a ``data level''
representation specifies synchronizing variables, e.g., ``read-only''
variables in CP.  The latter method has gone out of favor because,
although it facilitates certain sophisticated systems programming
techniques, it complicates dereferencing and unification, and
frustrates static analysis.  
%
% VIJAY: FOR PREVIOUS SENTENCE, YOU MAY WANT TO ADD A FOOTNOTE, OR
% REFERENCE TO PREVIOUS SECTION, THAT DISCUSSES THE SEMANTIC FLAWS
% INHERENT IN READ-ONLY VARIABLE UNIFICATION, E.G., THAT HEAD
% UNIFICATION ORDER CAN AFFECT PROGRAM BEHAVIOUR.  I LEFT THIS OUT
% OF THIS SECTION SINCE IT APPEARS TO BE LANGUAGE ISSUE TO ME.
%

The elegant programming techniques it enables are rarely used in
applications programming \cite{shapiro-survey}, yet the cost of
implementation is felt throughout the design \cite{flat-parlog}.
Foster and Taylor \cite{flat-parlog} measured (for small
benchmarks executing on a sequential workstation) that trailing
needed to support atomic unification (discussed further in
Section \ref{unify}) in FCP caused a 5\% degradation in
performance compared to flat Parlog (without atomic unification).
For programs with suspension ratios (\# suspensions/\#
reductions) of 19--56\%, additional degradation of 4--8\% was
observed, hypothesized as other overheads associated with
read-only variables (since flat Parlog and FCP were calibrated
except for that).

Another implementation issue is the actual control flow of checking
the synchronizing variables (discussed at length in Section
\ref{guard}).  It was originally believed that parallel execution of
the clause tries was beneficial because it implied faster invocation.
However, if deep guards are permitted, then parallel clause tries
require the ability to sustain multiple environments and incur most of
the problems associated with OR-parallel management of bindings under
search for a single solution.  Furthermore, with flat guards, the
little amount of work within the clause tries may not justify the
overhead of executing them in parallel.  Crammond showed consistently
negative speedups (3.8\% to $-$32\%) for small, flat-guarded
benchmarks on JAM Parlog on a shared-memory multiprocessor
\cite{crammond-phd}.  In fact, compilation techniques such as decision
graphs \cite{kliger-phd} remove redundant computations among the
clause tries, furthering the argument that parallel tries do not pay
for themselves.

Sato and Goto \cite{sato-pisa} showed, for the shared-memory Panda
system, that suspension induces execution overhead of 1--5\% for small
benchmarks, because of the necessity to redo the clause tries on
resumption.  Especially with decision-graph compilation techniques, it
is not easy to avoid recomputation since there is more sharing among
the code generated.  For Panda benchmarks with low suspension ratios
of 1--8\%, depth-first scheduling mechanism effectively suppressed
suspensions (with respect to breadth-first scheduling), but the
benchmarks were quite simple.  The one Panda benchmark with a high
suspension ratio of 42\% was not suppressed by depth-first scheduling.
Taylor \cite{taylor-book} measured suspension ratios of 0--56\% on the
hypercube for small programs, including an assembler.  The higher
ratios are due to static pragma-driven scheduling on the hypercube,
compared to the dynamic scheduling on Panda.  Imai and Tick
\cite{imai-spdp} measured 14 medium-sized benchmarks ranging from
0.3--67\% suspension ratios, with a geometric mean of 3.7\%.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%								%
% Imai + Tick suspension:reduction ratio =			%
% (14.5 * 3.40 * 2.06 * 66.86 * 0.328 * 43.75 * 0.351 *		%
% 4.79 * 11.56 * 1.64 * 8.33 * 5.26 * 1.57 * 0.494)^(1/14)	%
%								%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

All these statistics taken together indicate, among other things, that
a significant number of programs are {\em active}, i.e., process
groups are spawned to implement active objects that compute and
communicate until the termination of the algorithm.  This
object-oriented programming style causes frequent suspensions because
the processes composing the active objects are normally suspended,
awaking only upon receiving a message upon a stream.  Yet all the
parallel systems previously mentioned implement {\em process-oriented
scheduling} wherein a goal reduction leads to the enqueuing of its
body goals onto runtime work queues, with one of the goals selected
for local execution (analogous to tail recursion optimization).  Such
a model executes active programs inefficiently.  

Ueda and Morita proposed an alternative called {\em message-oriented
scheduling} \cite{ueda-fgcs92,ueda-93} for more efficient active
program execution.  The main idea is to transfer control to a stream
consumer at the point when the producer sends the message.  In the
case where buffering can be avoided, this method of task switching to
an active process has less overhead than the standard execution
mechanism.  Ueda and Morita implemented this method for shared-memory
multiprocessors by scheduling from a global work pool.  Since control
is transferred immediately upon message sending, effectively
independent chains of message sends are executed by the processors.
Their initial performance results are extraordinarily good: naive
reverse executes on a single processor at 3.3 seconds (cf., {\tt jc}
(Janus) runs at 3.9 seconds \cite{gudeman-jicslp92} and Monaco (FGHC)
runs at 19.2 seconds \cite{tick-iclp93}).  Furthermore almost linear
speedups are achieved, as well as comparable performance to optimized
`C' (the authors however do not compare the technique with
process-oriented scheduling for an equivalent technology base, to
determine the utility of the concept).


For example,
Kliger \cite{kliger-phd} reports that a set of 27 small-to-medium size
FCP(:,$\commit$) benchmarks achieved 21\% geometric mean speedup due
to a set of optimizations based on global analysis, including
reduction of unification into assignment, in-lining arithmetic,
manipulating unboxed objects, and excluding procedure clauses that can
never commit for any invocation.


\subsection{Guards and the Process Structure}
\label{guard}

Guards, similar in purpose to Dijkstra's {\em guarded commands}, were
introduced to logic programming in the Relational Language
\cite{ClaGre81}.  They extend, from simple input matching, the
expressivity of how to commit to a clause.  In their most general
form, guards among clauses defining the same procedure represent
disjunctive processes racing to commit.  Implementation difficulties
occur 1) if these processes are allowed to bind variables, and 2) even
if binding is outlawed, if processes are permitted to make nested
calls.  The former problem is indicative of ``unsafe'' languages, and
the latter problem is indicative of languages with ``deep'' guards.
Considering the range of complex to simple implementations, the
languages fall into three basic categories: unsafe and deep (e.g.,
CP), safe and deep (e.g., Parlog), and safe and flat (e.g., FGHC).

Unsafe clauses may compete with one another in the sense that each may
wish to make conflicting bindings to the same variables.  This is
implemented by restricting bindings to a local environment, for
eventual exportation upon commit.  Exportation can, however, conflict
with concurrent bindings made nonlocally.  If this happens, the clause
try fails.  Detecting inconsistencies is a major implementation problem 
in these languages --- there is a choice among detection {\em before}
commit or {\em after} commit.  The former presents a clearer semantic
model to the programmer, but is far more difficult to implement (see
Section \ref{unify}).

Deep guards effectively form a process hierarchy or tree, with local
environments at each level.  Local environments are needed, even if
the language is safe, because incoming bindings (to local variables)
must be saved across deep guard evaluation.  One severe implementation
problem is the management of multiple environments (one per deep-guard
clause in the same procedure) if guards are evaluated concurrently.
This {\em value access control problem} is similar to that of
OR-parallel implementations of Prolog: how to efficiently ensure that
only ancestor environments on a path to the root are accessible, and
that all other environments are hidden.

Another implementation problem is supporting {\em fair} execution
while descending the hierarchy, while retaining low complexity and
cost.  If fair execution is not guaranteed then eagerly executed
guards may loop, preventing later guards from failing and freeing up
the computation.  Shapiro \cite{shapiro-survey} states that the
inability to achieve fairness at low cost motivated flat
languages.\footnote{This problem still exists, in a less troublesome
form, for unfair flat languages when early builtin guards suspend,
preventing or delaying later guard failure.  This can only reduce
the failure set, which is not important because failure is a 
catastrophic event in flat languages.} He cites early CP
implementations (e.g., \cite{miyazaki-cp-book})
as either unfair or of ``unacceptable'' complexity.

JAM Parlog \cite{jam-paper} constrains deep guards to be used only in
clauses bracketed by {\em sequentialized clause separators} (in some
languages called {\em otherwise} guards).  Such separators prevent
subsequent clauses from being tried until all previous clause tries
fail.  This restriction obviates concurrent evaluation of deep guards,
simplifying management to that of a single local environment per
procedure.  Crammond \cite{jam-paper} states that this restriction
allows most of programmers' intended uses of deep guards, e.g., as
if-then-else conditionals.

Parlog offers the programmer a sequentialization operator {\tt g \& b}
that guarantees goal {\tt g} executes to completion before goal {\tt
b} is executed.  In JAM, the implementation views a clause as compiled
above with the guard {\tt g} and body {\tt b}.  In other words, the
same mechanism used to implement sequential goal execution does double
duty for deep guard execution.  Deep guards need to be evaluated
concurrently to avoid deadlock; however, given mode information, flat
guards can often be executed in-line for efficiency.  The environment
necessary for carrying local bindings over a sequentialization
operator are not unlike a Prolog environment.

Restricting the language to only flat (safe) guards engendered
decision-graph compilation \cite{kliger-phd} because clause tries can
be compiled in line without transfer of control nonlocally to other
goals.  A decision graph is composed of if-then-else and switch nodes
which transfer local control conditionally upon a test.  A graph is
formed, rather than a tree, to guarantee space proportional to the
number of clauses in the procedure.  To ensure space linearity, a
clause is propagated down one and only one branch of the graph as code
is being generated.  Thus clauses ambiguous to a test are
conservatively placed in a {\em continuation} branch, and sibling
branches jump to the continuation upon failure.  For a suite of 27
medium-size benchmarks, decision graphs executed 3.2 times faster on
average than WAM-like compilation \cite{kliger-phd}.  The code size
expanded by 30\% on average, with a particularly degenerate program
(Salt \& Mustard \cite{tick-book}) doubling in size.  An interesting
problem is how to order the graph nodes, and how to generate optimal
code for the tests, conditional branches and switches, to minimize
execution time \cite{debray-jicslp92}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%							%
% Appears to be no data on the following:		%
% Empirical characteristics of deep and flat 		%
% guards and their overheads.				%
%							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

A main purpose of deep guards is to perform speculative computations
that can fail allowing alternative solutions to succeed.  Unsafe
languages enriched this paradigm allowing bindings to be made along
the speculative path.  Experience has shown that support of both of
these operations is too expensive for the low frequency with which
they are used.  The devolution to flat languages is complete in the
sense that almost all research groups opted to reduce language
expressibility in favor of easily-implementable flat guards.  In a
further extreme, Fleng \cite{fleng1} abolished guards in an effort to
streamline execution.  In general, guard tests must be pulled up and
evaluated at each call site.  This allows the optimization wherein
certain guard tests need only be evaluated at certain call sites.
Global analysis is needed to produce the information required for this
optimization.  Although FCP has guards, a similar optimization is
enabled by Kliger's method of customizing decision-graph clause tries
for different call sites \cite{kliger-phd}.\footnote{ Practical
definitions of Fleng allow the programmer to specify guards, which are
then pulled up to the call sites.  Without global analysis, of the
complexity required by Kliger, {\em all} guard tests must be pulled up
to {\em each} site.}

\subsection{Reading and Writing Logical Variables}
\label{read}

The costs of reading and writing logical variables can be calculated as
the {\em frequency} of operations required, multiplied by the {\em
cost} of the operations.  For example, reading a logical variable
incurs the incremental cost of suspending the variable at the rate of
suspension.  In shared-memory multiprocessors, all accesses are
``local,'' so that the relevant overheads are lock traffic on the
shared bus and lock contention.  Contention, i.e., multiple concurrent
requests of the same lock, can be exacerbated when the host does not
supply enough physical locks for all objects needing
locks.\footnote{Because this hardware attribute cannot be easily
modified for a given host, studies of lock contention {\em vs.} lock
granularity have not been performed for concurrent logic languages.}
For small benchmarks, Sato and Goto \cite{sato-pisa} reported that
locking accounted for only 1--5\% performance degradation on the
Sequent Balance.  This conveys both the relative efficiency with which
locks can be implemented with shared memory, as well as the retained
significance of lock overhead.  Interestingly, although most of the
lock traffic they measured was for protecting bindings, most of the
observed lock contention was for bookkeeping locks for scheduling and
termination.

Distributed-memory multiprocessors are significantly more problematic
because of the overheads incurred in reading and writing nonlocal
variables.  Nonlocal reading requires sending a message requesting the
variable's value, and receiving a reply.  Nonlocal writing requires
issuing the binding --- the receiver can update the variable locally
(without explicit locking) and send either a success or failure
acknowledgement.  The incremental cost of resuming tasks hooked to the
bound variable must be accounted for in a macro view of execution.

There are of course variations on both of these protocols.  Taylor
\cite{taylor-book} discusses a protocol on a hypercube where nonlocal
writes first request a remote lock, and upon receiving the lock, issue
a remote write.  He measured 61\%--100\% of all messages sent, for six
FCP benchmarks, are nonlocal reads.  The four smallest benchmarks
required an arithmetic average of 99\% reads.  Although write
frequency is seen to be very low, its amplified cost can be felt.  For
example, Taylor demonstrated that for the incomplete message paradigm
(where nonlocal reading and writing occur with equal frequency), the
main execution overhead on a multiprocessor was not the sender reading
the return value, but rather the receiver locking and writing the
return value.  In this simple example, locking proved to be extremely
expensive (performance degradation of two times on two hypercube
nodes) because the latency could not be hidden.

Reducing the cost of reading is critical in distributed
implementations.  If a complex term is to be read nonlocally, an
important design consideration is how much of the term should be
eagerly transferred.  Taylor also examined the affect of this {\em
copy depth} parameter on performance.  For standard paradigms such as
producer-consumer and incomplete messages, performance improved
significantly for initial increases in copy depth, after which no
improvement was seen.  The interpretation of these results is that the
consumer is ``brought up to speed'' by increasing transfer size until
the point at which it outruns the producer, after which no further
improvement can be achieved.  Because these are such pervasive
programming techniques in concurrent logic programs, it is imperative
to find ways to speed them up.  Hardware support for message
management (packing and unpacking, merging active messages straight
into the execution pipeline) to more effectively hide latencies is one
approach, similar to the goals of threaded architectures (e.g.,
\cite{nikhil92}).  Another idea is to reduce the number of 
messages sent, either by introducing new programming paradigms, or by
dynamically migrating tasks and streams so that communication is
local.  Yoshida \cite{yoshida-fgcs88} took the latter approach in the
design of \aumc, discussed in Section \ref{streams}.

\subsection{Unification}
\label{unify}

Unification is somewhat controversial because it stands out as one of
the few unbounded-time operations required by logic programs compared
to conventional languages.  In many cases unification can be compiled
into simple instructions, as was elegantly shown in the WAM
\cite{wam-book,warren-tn}.  Unification in committed-choice languages
can be categorized as either {\em input} (also: {\em passive} and {\em
ask}) or {\em output} (also: {\em active} and {\em tell}), reflecting
the exportation of bindings.  Ask unifications implement head matching
either as explicitly compiled match instructions or as invocations of a
fully general passive unify routine.  Luckily, full passive
unification is rarely executed: it occurs only when checking the
equality of two incoming arguments.  For example, Foster and Taylor
\cite{flat-parlog} measured the execution of 153,800 matching
operations and 15,300 general passive unifies (9\% of total) in an
Assembler benchmark written in Flat Parlog.  Furthermore, if
sufficient type information is inferred, general passive unifies can
be reduced to simpler tests.

At the leaves of unification's recursive descent, rules for unifying
primitive data types come into play.  Read-only synchronization
requires an extended set of rules \cite{miyazaki-cp-book,taylor-book}
compared to procedure-level synchronization, potentially reducing
performance.

Whereas ask unification occurs before commit, the location of tell
unification varies among the languages.  Unsafe languages require {\em
atomic} tell unification wherein no output bindings are seen until
commit.  This means that bindings must be locally trailed, and perhaps
undone upon failure.  Furthermore, atomic unification can bind two
variables which are inputs of the same procedure invocation.  This
raises the issue of whether such bindings should be eagerly
acknowledged, although such an implementation causes a performance
degradation of 1--29\% \cite{flat-parlog}.  This and trailing
overheads led to the abandonment of atomic unification for
implementation reasons alone.  Safe languages place tell unifications
after commit (called {\em body unification}).  The implementations are
thus free to perform body unifications on the fly, with failure
causing procedure invocation failure.

Even body unification is complex when considering multiprocessor
implementations.  The main problem is to avoid potential race
conditions among concurrently executing tasks.  Thus any logical
variable that needs to be bound must be locked first.  Furthermore, if
both operands are unbound, they cannot be locked in an arbitrary order
under threat of deadlock with a competing task trying to bind the pair
in the opposite order (very unlikely, but possible, unless mode
restrictions are known, as discussed below).  Some ordering must be
made, e.g., exploiting the nature of a shared-memory name space.

By making the most frequent case fast, general unification on
multiprocessors can be implemented efficiently.  The most common case
by far is binding a constant to a logical variable that does not
require dereferencing.  A fast stub can be constructed that tests if
one operand is constant, and one is unbound and not hooked.  On a
shared-memory machine, the variable is then locked, bound, and
unlocked.  Otherwise, if the initial condition is not met, full
unification is required.  The two operands to be unified must be
locked, dereferenced, and compared.  On distributed-memory
multiprocessors, the same algorithms can be naively used, with
nonlocal accesses potentially at the leaves.

Because of the implementation complexity and potential execution
overheads of output unification on distributed-memory multiprocessors,
and the evidence that it is infrequently used in its full generality,
output unification further devolved in Strand to {\em assignment}.
Thus recursive descent is obviated, and the left-hand side of the
assignment is required to be a variable.  However, this does not rule
out the need to test for exceptions, or hooked variables, which need
to be resumed.  Thus binding is still expensive compared to imperative
assignment.\footnote{With compile-time freeness analysis, e.g.,
\cite{ueda-93}, and hookedness analysis, e.g., \cite{yanoo}, these
tests can be safely removed.} Program Composition Notation (PCN)
\cite{pcn-book} is a multi-lingual language built from Strand and `C'
that offers both {\em definition} (logical) and {\em mutable}
variables.  Thus safe and efficient assignment to mutable variables
can be guaranteed by the programmer.  Furthermore, memory usage can be
reduced by destructive update of mutable variables.  In a sense, PCN
is the farthest devolution has progressed in concurrent logic
languages.

\subsection{Task Scheduling and Priority}
\label{priority}

There are various philosophies for automatic scheduling of parallel
tasks.  Compile-time analysis can be attempted to determine a fixed
schedule mapping tasks to processors.  Runtime profiling information
can aid the static analysis.  A radical departure is to perform all
scheduling dynamically without any static aid, or a hybrid combination
of static and dynamic.  Another approach is to avoid automation and
require the programmer to explicitly distribute tasks.

Automatic scheduling in concurrent logic programming systems is
usually dynamic (e.g., JAM, Panda, Monaco) because tasks are too small
and numerous to allow practical static analysis.  For shared-memory
multiprocessors, the main implementation issue is how to efficiently
manage the goal queues.  A single shared queue would eliminate the
need for load balancing, but contention for this scarce resource is
too costly.  Splitting the queues up, one per processor, removes
contention but leads to potential unbalancing.  Once queues have been
split, there emerges the implementation paradigm, on large-grain
process systems such as UNIX, of {\em task farming}.  Here a single
UNIX process, often called a ``worker,'' is responsible for
coroutining between the execution and scheduling of goals.  In {\em
on-demand scheduling}, goals are not eagerly distributed among workers
and only an idle worker searches for work, thereby minimally
disturbing busy processors.  Sato and Goto \cite{sato-pisa} and
Crammond \cite{crammond-naclp90} examined variations of on-demand
scheduling involving further splitting the local queue into {\em
private} and {\em public} queues and allowing idle workers to steal
only public work.  Crammond reported that for eight medium-sized
Parlog benchmarks, private/public queues offered slightly better and
more consistent speedups than public-only scheduling, on the Symmetry
and Butterfly II (on 16 Symmetry PEs, geometric mean efficiencies of
86\% and 83\%, respectively).  These early studies measured
multiprocessors with far slower processing elements than are available
today.

There has also been much work within the ICOT FGCS Project exploring
automatic load balancing methods, e.g.,
\cite{furuichi90,ichiyoshi-fgcs92,pim/c,sato-pisa,sugie}.  The most
successful experiment has been the {\em multi-level load balancing}
(MLLB) scheme for balancing OR-parallel search programs on a
distributed-memory multiprocessor \cite{furuichi90}.  The idea is to
partition the available processors into groups, and allocate one
distribution master per group.  Slave processors within these groups
request work from the master.  There is a method of merging groups,
and given the regular nature of OR-parallel search, this method has
been shown to be quite effective, e.g., speedup of 50 on 64 processor
Multi-PSI for the pentomino benchmark \cite{furuichi90}.

The drawback of MLLB is its limited application domain.  Thus even
ICOT resorted to explicit user-defined ``pragma'' in the KL1 language
for remote task scheduling on distributed multiprocessors.  Strand and
PCN also require pragma.  In these latter languages, the user is
encouraged to design load-distribution management networks, called
{\em motifs}, e.g., MLLB could be specified as such
\cite{motif2,motif1}.  In PCN, motifs consist of several programming
constructs implemented in the source language with libraries providing
support.  Simple pragma are enriched by allowing the definition of
{\em virtual topologies}, which can be embedded within physical
topologies.  Topologies are collections of nodes, such as a hypercube
network, implemented by process structures.  User-defined tasks are
mapped onto the nodes by passing the tasks as messages for
meta-execution.  A user program can be written to interface to a
single virtual topology, which can then be automatically mapped onto
whatever physical topology is offered by the hardware organization.
There are several other constructs, such as {\em templates} and {\em
ports}, which facilitate program creation, but do not present major
implementation difficulties.

An issue related to task scheduling is task priority.  Early
concurrent logic languages specified that goals were required to be
executed in a {\em fair manner}.  Fairness is difficult to define in a
manner that can be easily implemented.  One weak definition is that
all tasks which {\em can} execute {\em are} attempted at some time.
This guarantees avoidance of spurious deadlock.  Normally,
tail-recursion optimization (TRO) is implemented wherein a selected
body goal is directly executed and all others are wrapped up as goal
records and enqueued.  By extending the life of a thread through a
selected child in this manner, efficient use of registers for argument
passing can be achieved.  Fair execution is emulated in a number of
systems by a time-slicing technique, wherein every $k$ reductions, TRO
is replaced by enqueueing {\em all} body goals at the back of the
queue, and switching in a goal from the front of the queue.
Implementation incurs the overhead of updating a counter comparing it
to $k$ for each reduction, as well as enabling queue access from both
the front and back.

The KL1 PIM systems took a different approach, discarding the notion
of task execution fairness altogether.  It is replaced by a goal
priority scheme, wherein the scheduler makes its best effort to abide
by priorities.  This allows programming techniques such as speculative
exploration of alternative solutions.  KL1 allows goal pragma that set
priorities relative to a parent goal or a collection of goals called a
{\em shoen}.  These {\em logical} priorities, potentially ranging from
0 to $2^{32}$, are retained in goal records, but also mapped into a
smaller {\em physical} range for purpose of sorting.  For example, if
the physical range is 0 to $2^{14}$, then the PIMs use an array of
$2^{14}$ queues.  The non-empty queues are linked to allow efficient
dequeueing across priorities.  Insertion of a goal into an empty queue
requires a linear search up to the nearest non-empty neighbors to
update the links.  This algorithm is sufficiently simple for its
microcoded implementation, although software-based implementations 
might be better served by balanced, priority trees.

\subsection{Granularity Control}

Concurrent logic programs are fine grained: Alkalaj
\cite{alkalaj-isca} measured from ``20 to several hundred single-cycle
instructions'' per average goal reduction.  Taylor \cite{taylor-book}
measured FCP granularity on a hypercube as a ratio of
reductions/messages-passed,\footnote{ In a distributed memory
multiprocessor implementation, messages would be used for communicating
values down a stream from producer to consumer, for example.  }
ranging from 3.5--220.  It is clear that granularity is very much
dependent on application and programming style, but even in the best
case, granularity is still low compared to conventional approaches to
parallel programming in imperative languages.

The advantage of fine-grained concurrent languages is the abundance of
potential parallelism.  However, the main disadvantage is that
too-fine granularity can lead to excess overheads in task management.
Alkalaj \cite{alkalaj-isca} has shown that 50\% of the execution time
of large FCP applications is spent on goal management for a reasonable
machine execution model.  His recommendation was a specialized
hardware organization to support this efficiently.  Such directions
are promising, as echoed for instance in the hardware implementations
of threaded architectures, mainly predicated on dataflow languages
(e.g., \cite{monsoon}).   Special hardware or not, it is necessary for
to boost efficiency by ``collecting'' granularity at compile time
\cite{sarkar-phd,traub92}.  

Ideas along these lines were developed for logic programs by Debray
{\em et al.} \cite{debray:90}, King and Soper
\cite{king-gran-tr}, and Tick and Zhong \cite{tick+xzhong}.
Debray's design seeks to construct, as compile time, estimators of
input argument size, and formulate these estimates into granularity
estimations.  At runtime, a granularity estimate is evaluated for each
procedure invocation, and the weight value is used to make dynamic
scheduling decisions.  For example, if the weight is below a
threshold, a task will not be spawned because of excessive overhead.

King \cite{king-gran-tr} discusses an analysis technique with no
runtime component.  Similar to Debray's method, granularity is modeled
as a function of argument size; however, these sizes are estimated by
abstract interpretation, associating argument types with a finite
lattice of sizes.  The result is a purely static determination of
granularity.  King uses this analysis, as an alternative to
profiling for example, to drive task sequentialization (see Section
\ref{arch}).

Zhong's approach \cite{tick+xzhong} attempts to remove the complexity
of argument-size estimation (both at compile time and subsequent
runtime evaluation costs) by introducing an abstract ``iteration
parameter'' which is a proxy for relevant granularity information.
The remainder of the scheme is similar to Debray's, with the major
distinction that the estimators are easier to formulate, cheaper to
evaluate, but far less accurate.  Furthermore, the weights computed
are relative, e.g., it can be estimated that one task is half the
weight of another task, but it cannot be determined if either are
below some absolute threshold weight.

The verdict is not yet in on the utility of these granularity
analyses, because empirical data is sparse.  Robust analyzers and
larger benchmarks are needed.

\subsection{Abstract Instruction Set Architectures}
\label{arch}

The Warren Abstract Machine (WAM) \cite{wam-book,warren-tn} had a
great influence on the various concurrent logic language
implementations discussed in previous sections.  The important
differences among the abstract machines developed for committed-choice
languages are in their storage models.  The primary distinction is
whether the heap is based in shared memory
\cite{jam-paper,gudeman-jicslp92,houri-cp-book,KL1,kliger-phd,tick-iclp93}, 
or distributed memory \cite{pim/p,pim/m,sato91,taylor-book}.  In
general, all variables and terms are stored in a heap, and memory is
reclaimed by explicit, periodic garbage collection.  Goals are usually
represented by heap terms that can be linked into work queues for
scheduling.  A goal that is suspended can ``float'' on the heap, to be
relinked to a work queue upon binding its hooked variable.

There are several variations to this basic model to gain efficiency.
Goal records can be constrained to be fixed size and queued in free
lists facilitating memory reuse.  Furthermore, all data structures can
be partitioned onto heaps corresponding to size, each with its own
free list for ease of (de)allocation.  Crammond \cite{jam-paper} split
arguments away from goal records, allocating them on their own stacks
to improve locality and reuse.  With arguments allocated separately
from goal records, goal-record locality improves, and arguments no
longer need to be of fixed size if allocated in a stack-based fashion.
However, this results in the creation of ``holes,'' i.e., deallocated
frames trapped below the top of stack, which can require general
garbage collection if they grow too large.  Crammond \cite{crammond-phd} 
illustrates the extent of this problem for some small benchmarks. 

In addition, bookkeeping structures for evaluating deep guards and
suspension management are necessary.  Deep guards and sequential
conjunctions require the use of environments which hold values of
variables active throughout the clause try or sequential body
evaluation.  Practical implementations require at most one environment
per invocation \cite{jam-paper}, which is deallocated upon body
completion or guard failure (to be reallocated for the next clause
try).  In addition, a {\em trail} is needed for atomic tell
unification wherein failure and suspension during unification must
``back out'' all bindings generated.  Suspension management requires a
suspension stack holding pointers to input arguments that are needed
but are as yet unbound.

A less important distinction among the systems are their abstract
instruction sets.  The instruction sets of the various machines follow
the general WAM model, passing arguments through dedicated registers,
and having a set of additional state registers for control and storage
management.  The instruction sets can be broken down into similar
groups.  Older models use WAM-like indexing control instructions,
whereas decision-graph compilers for flat languages avoid shallow
backtracking and much of the required control instructions.  Head
matching (ask unification) is compiled with {\bf wait} instructions
that will push their corresponding argument onto the suspension stack
if it is not instantiated.  Tell unification is compiled into {\bf
get} instructions that will make assignments or invoke a general
unifier.  Finally, body goals are generated with {\bf put}
instructions for loading arguments and enqueueing goal records.  As
mentioned in Section \ref{priority}, usually a form of tail recursion
optimization (TRO) can be implemented by loading the arguments of one
of the body goals directly into the argument registers and jumping to
the goal code.

Additional instructions are needed for the (de)allocation of local
environments (for non-flat languages and/or sequential
conjunctions) and heap storage.  Goal-management instructions are
responsible for terminating a thread (in unit clauses), enqueueing a
goal (creating threads), directly executing a goal (TRO, called {\em
promoting} a thread in JAM), and initiating deep guards.  Note that
flat languages have threads that live very short lives, not counting
promotions.  Sequential conjunctions, introduced in Parlog, do not
lengthen threads in practice because they are necessarily implemented
with trees of local environments (see Section \ref{guard}).  This
stems from the fact that within a sequential conjunction, concurrent
goals may execute.  If total sequentialization of a goal and all its
children can be specified or derived, then these local environments
can be stacked, resulting in superior space and execution time
efficiency.  This would be a true elongation of threads, resulting in
increased performance \cite{king92,lpar93}.  Such an implementation
requires a sequential call as well as stack (de)allocation
instructions.

Arithmetic instructions and builtin predicates must be able to suspend
if executed before commit, or be enqueued as bonefide goals if
executed after commit.  JAM optimizes this by checking arithmetic
operator inputs in the body, and if available, executing the
arithmetic in place.  Otherwise, a goal is created.  In general, a
lazy strategy of dynamically checking for data dependencies is taken
by all concurrent logic programming systems.  Threaded architectures
for dataflow are slightly different in that they do static analysis to
generate arithmetic threads that cannot internally suspend.  However,
both research communities treat array accesses in a similar manner: a
builtin access goal is spawned, avoiding long latencies for nonlocal
access.  After all, statically analyzing array indicies is very
difficult \cite{banerjee}.

A trend towards reduced abstract machine design, following the
principles of RISC design, has led to instruction sets such as Carmel
\cite{CARMEL-4}, SAM \cite{strand}, the {\tt jc} machine 
\cite{gudeman-jicslp92}, Kliger's machine \cite{kliger-phd}, 
and the various PIM architectures \cite{pim/p,pim/m,sato91}.  For
example, Strand, FGHC, Fleng, Janus, and FCP(:,$\commit$) sufficiently
simplify the execution model, obviating trailing, environments, atomic
tell unification, and a process hierarchy for deep guards and
sequential conjuncts.  This allows these compilers to concentrate on
optimizations, such as decision-graph generation, in-line arithmetic,
and global register allocation.  Sequential implementations offer
further performance gains, obviating locking and allowing the leverage
of compilation into `C'.   Debray and Tick measured a mean speedup of 
2.4 comparing {\tt jc} with Monaco for six small benchmarks 
\cite{tick-iclp93}, illustrating the potential advantages of
sophisticated register allocation, and streamlined binding mechanisms.

Readers interested in concurrent logic language
instruction-set design are referred to Crammond
\cite{jam-paper}, Foster and Taylor \cite{flat-parlog}, and Kliger
\cite{kliger-phd} for the most complete expositions. 

\subsection{Stream Communication, Arrays and Garbage Collection}
\label{streams}

A major defect in concurrent logic languages and their implementations
is inefficient use of memory.  This problem is prevalent in the
treatment of communication streams and data arrays, and is exacerbated
in distributed-memory multiprocessors.  A general, after-the-fact,
solution to the problem is the construction of ever more efficient
garbage collectors, about which we comment at the end of this section.
We first discuss pre-emptive solutions, such as making stream
communication efficient with buffers and migration.

Streams are second-class citizens in most logic programming languages.
Stream communication is programmed by having a producer write messages
into a difference list, the head of which is read by a consumer.  To
nondeterminately merge multiple streams, a chain of active {\em merge}
processes is needed.  This methodology was stressed in the original
literature because it is elegant and all that is offered at the {\em
language} level.  However, defining and implementing streams in this
manner is highly inefficient.  First, merged streams incur extra
process reductions, lengthening transmission {\em delay}.  Second,
naive stream merging can result in {\em unfair} data transmission.
Third, if the memory cells comprising a stream and the reader of that
stream are located on different processors in a distributed-memory
system, then reading a value requires the overhead of sending a
request message.\footnote{Analogous to driving all over town to pick
up mail at different post offices, instead of having all mail
delivered directly to your house.}

The fairness problem can be solved with more sophisticated,
dynamically-balanced merge trees \cite{mierowsky-cp-book}, although
this is expensive in time.  The delay problem has been solved both in
software and hardware.  In software, a data type, called a {\em mutual
reference}, interfaces multiple writers to a single reader
\cite{safra-cp-book}.  Writing to one of the streams will atomically
write the merged output stream and update the mutual reference to
point to the new output tail.  This scheme, originally designed for
FCP, is the essence of implementing streams as buffers in other
languages also.  For example, the PIMs implement mergers, in
microcode, in a similar manner \cite{inamura-naclp89,ueda-84}.  The
critical difference is that the new data structure is hidden from the
language definition.  Furthermore, with MRB (discussed below), memory
can potentially be reused when merging a new writer into the stream.

The indirection problem could be corrected by locating the buffer with
the consumer (similar in intent to message-oriented scheduling, see
Section \ref{synch}); however, in most concurrent logic languages,
multiple consumers are permitted, and single consumers are not
recognized as such by the compiler.  Global analysis might be used to
determine single consumer streams, or the languages can be restricted.
The latter solution is another devolutionary step (notably Janus
\cite{janus} and \aum \cite{yoshida-fgcs88}) with a ``single
writer/single reader'' restriction and abstract stream semantics that
do not constrain implementations.

Janus defines a {\em bag} data type that can be used as a
multiple-writer stream, with no constraint on write order.  Janus also
defines standard arrays; however, the restriction permits an
implementation to automatically reuse array locations.  Neither bags,
nor reusable arrays, have yet been implemented for Janus.  \aumc-90 is
called a stream-based concurrent object-oriented programming language
(SCOOL) by its authors \cite{konishi-fgcs92}, emphasizing the
first-class citizenship of streams.  Streams are implemented as buffer
objects that can migrate.  The migration policy moves buffers to their
(unique) consumers, thereby obviating the overhead of sending read
requests.  This is implemented by having the producer and consumer
initially communicate with {\bf where} and {\bf here} messages,
allowing them to locate each other and begin message copying.  Future
messages are forwarded automatically to the new location.  For a
generate-\&-test prime-number generator, migration achieved a speedup
of 12\% (on 10 Symmetry processors) compared to no migration
\cite{konishi-fgcs92}.  This increased to 70\% speedup on two 
Sparcstations connected over an Ethernet.

The general topic of garbage collection is too large to cover here,
but it is important nonetheless.  There are fundamentally two
approaches to garbage collection: static and dynamic.  Static
collection requires compiler analysis to determine the guaranteed
reusability of a data structure.  Code can then be generated directly
for memory reuse.  Incremental dynamic collection involves runtime
checking to determine reuseability of structures.  Furthermore,
dynamic garbage collections across an entire memory (local or global)
are required when the previous incremental methods fail.  Examples of
these collection types within concurrent logic programming are
abstract interpretation for local reuse \cite{renga92}, binary
reference counting with multiple reference bits (MRB) 
\cite{MRB,MRB-perf}, and several stop \& copy schemes (e.g., 
\cite{lazy-ref-count,imai-spdp,inamura-naclp89,piling-gc}).

The MRB scheme \cite{MRB} is a one-bit approximative reference count
per data cell.  If the flag is off, the cell can be reused because it
is guaranteed to have a single reader.  However, once the flag is set,
it becomes stuck and no reuse is possible.  Setting the flag requires
nontrivial rules for many of the KL1 abstract instructions that
manipulate memory.  The advantages, however, can be significant, for
example, array copying can be dynamically converted to destructive
update by exploiting the MRB method.  Nishida {\em et al.}
\cite{MRB-perf} demonstrated the effectiveness of the MRB scheme for a
shared-memory multiprocessor model.  Depending on data
cache configuration, two small benchmarks displayed bus traffic
reduction from 20\%--57\% on 16 processors.  The least beneficial
result of 20\% reduction clearly showed a drastic increase in
cache-to-cache traffic that was indicative of reused cells being
transferred between processors.  Overall the method achieved
significant reduction in memory-to-cache (``swap in'') traffic,
indicating success at improving locality by reuse.  Duvvuru {\em et
al.} \cite{duvvuru-iwmm92} showed that MRB running on SUN Sparcstation
had associated management overhead of 5--7\% for five small
benchmarks.  With static analysis \cite{renga92} to determine
guaranteed single-referenced cells, this overhead was removed.
It is clear however that array indexing can baffle such analysis,
thus dynamic schemes still have utility.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%							%
% Discuss issue of *where* to reuse optimally,		%
% as researched by Debray and Winsborough.		%
%							%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The challenges of implementing efficient garbage collection schemes
for concurrent logic and functional programs are similar.  There are
several garbage collection schemes proposed (and some prototyped) for
concurrent logic languages.  These efforts have concentrated on stop
\& copy schemes for shared-memory multiprocessors, mirroring the
general sophistication of the corresponding runtime systems.  Research
has recently focused on 1) distributed memory garbage collection
schemes that do not require barrier synchronization of all processors
within the collector \cite{koike}; 2) efficient, parallel stop \& copy
garbage collectors for shared memory \cite{crammond-phd,imai-spdp},
and 3) generation-scavenging garbage collectors for reducing
collection latency by interning long-lived objects
\cite{ozawa-naclp90}.  For generation-scavenging schemes, a problem
arises when the ``old'' space (the interned space) points into the
``new'' space.  This can occur for interned logical variables and
reused objects (e.g., via MRB).  Methods of trailing these unsafe
cells and implementing indirection tables are costly \cite{piling-gc},
but seemingly unavoidable.


\newpage\section{Future Work and Conclusions} 

Issues that are raised: is the pure model of computation really
practical? Is it really worthwhile to have to pay all that extra
cost for de-referncing every variable when in reality only
occassionaly will there be need to suspend. Can clever techniques
be developed to hide this under the hood --- or is there a basic
distinction between synchronization (active) variables, and local
(passive) variables (those on which you are guranteed not to
suspend). Can some mecahnisms be described to allow the user to
provide extra information for the implementation, without letting
this information change the semantics of the program, such that
the information can be used by the compiler to improve performance?

\paragraph{Relationship to other theories of concurrency.} Recently there
have been radical new ideas about concurrency.  Two of particular
note are the so called Chemical Abstract Machine~\cite{Boudol90},
due to Boudol and Berry, and mobile processes, due to Milner and
his co-workers~\cite{Milner89}.  In both of these approaches the
key new ingredient is that processes can alter their interactions
and are, in efffect, mobile.  In our approach the interactions
between processes is dynamic too in the sense that there is no
predetermined set of agents that a given agent is limited to
interact with.  The relationships need, however, to be understood
carefully.  It would be particularly interesting to understand
how a lambda abstraction mechanism could be incorporated into the
concurrent constraint paradigm.  Understanding the relationships
with the work on mobile processes or Boudol's gamma calculus
would be very helpful, as it is known that the latter can encode
the lazy lambda calculus~\cite{Milner90,Jagadeesan90}.


\subsection{Real-time languages}
\subsection{Logic and state-change}

\subsection{Higher-order languages}
Be able to pass processes as parameters.  
Even though this may be simulated at the level of the language,
isn't particularly a nice way of doing it. 

\subsection{Some broad themes from the viewpoint of programming
languages}
Why bother to develop stand-alone programming languages.

Software architectures are more important than programming
languages. The difference is one of syntax! 

Architectures can be embedded in any kind of host computing
framework, e.g. C++.

Model-based computing.

\section*{Acknowledgements}

E. Tick was supported by an NSF Presidential Young Investigator
award, with matching funds from Sequent Computer Systems Inc.  We
thank T.  Chikayama, J. Crammond, and S. Debray for sharing their
expert knowledge, greatly assisting in the completion of the
article.


{\footnotesize
\newpage
\bibliography{/tilde/saraswat/bib/master-0}
%\bibliography{/herbrand/saraswat/bib/master-0}
%%\bibliography{/herbrand/saraswat/bib/master-0,/herbrand/saraswat/bib/master-2}
\bibliographystyle{plain}

%% \bibliographystyle{plain}
%% \bibliography{a_k,l_z}

}

\end{document}