%% Where should we discuss where this style of computation is %% useful and interesting? %% Theme: connection with object-oriented programming. %% original techniques, Lucy, Janus. %% Use for distributed computation. %% Janus %% The use of constraint languages for problem-solving. %% Stefik, MOLGEN planning \documentstyle{article} \input{vijay-macros} \def\tell{{tell}} \long\def\MetaNote#1{\relax} \def\strand{{Strand}} \def\div{{\tt div}} \def\nil{{\bf 1}} \def\rname#1\from#2\infer#3{{{\textstyle #2}\over{\textstyle #3}}{\ \textstyle(#1)}} \def\axname#1\axiom#2{{\textstyle #2}\ \ {\textstyle(#1)}} \def\ccfd{{\sf cc(FD)}} \def\true{{\tt true}} \def\imp{\hbox{${\tt \ :\!-\ }$}} \def\rep{!\,} \def\bs{\backslash} \def\lcc{{\sf lcc}} \def\cc{{\sf cc}} \input{lmacros} \def\lcc{{\sf Lcc}} \def\lj{{\sf Linear Janus}} \def\janus{{\ccfont Janus}} \def\bisim{\simeq} \def\fix{{\bf fix}} \def\defeq{\stackrel{d}{=}} \def\rep{!\,} \newcommand{\linarrow}{\;\mbox{$-\!\circ$}\;} \newcommand{\linbiarrow}{\;\mbox{$\circ\!\!-\!\!\circ$}\;} \def\hlcc{{\sf HLcc}} \newcommand{\implies}{\mbox{ :-- }} \newcommand{\commit}{\;|\;} \newcommand{\aum}{${\cal A}'\,{\cal U}\!{\cal M}\ $} \newcommand{\aumc}{${\cal A}'\,{\cal U}\!{\cal M}$} \begin{document} \title{A retrospective look at concurrent logic programming} \author{Vijay Saraswat \\ (saraswat@parc.xerox.com)\\ Xerox Palo Alto Research Center \\ 3333 Coyote Hill Road\\ Palo Alto, Ca 94304 \and Evan Tick \\ (tick@cs.uoregon.edu)\\ Department of Computer Science\\ University of Oregon\\ Eugene, Or 97403 } \maketitle \begin{abstract} We survey the development of the field of concurrent logic programming from the modernist stance of constraint programming. We discuss the basic conceptions of the field, the language designs it has engendered, areas of applicability of these ideas, the underlying semantic notions, and the implementation issues that arise. 
The paper assumes the reader is familiar with the basic ideas of (constraint) logic programming. \end{abstract} \setcounter{tocdepth}{2} \newpage\tableofcontents \newpage\section{Introduction} Concurrent logic programming arises in a very natural way from considerations of parallel execution of (backward-chaining) logic programs. Unbridled parallel execution very quickly leads to wasteful nondeterministic computation. Goals that are wildly (infinitely) non-deterministic (e.g., {$\tt S > 0$} with {\tt S} unknown) may be scheduled instead of others that may instantly prune search (e.g., {\tt S = 1}). It is not hard to see that substantial gains in performance could be obtained if some operational mechanism were provided to {\em suspend} the execution of certain goals (e.g., $\tt S > 0$) until certain bindings were obtained --- the better to quickly fail some (possibly infinitely many) non-deterministic branches of that goal. Concurrent logic programming languages arose from a desire to explicitly introduce operations to provide this suspension, and to indeterministically select one of many competing clauses (commit, i.e., generalized ``cut'', operations). From the time in the early 1980's when these ideas were initially recognized by Keith Clark and Steve Gregory \cite{rel-lang} to 1987, the group of researchers developing these ideas was primarily concerned with identifying various plausible operational mechanisms for specifying these suspension conditions. The work in this area at that time consisted of examining in detail the most general unifier (mgu) algorithm as it was used in executing logic programs, and experimenting with ``control annotations'' to harness the ``eager'' execution it engendered.
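The suspension mechanism just described can be illustrated operationally. The following is a minimal Python sketch, entirely our own hypothetical rendering and not drawn from any of the languages surveyed: a goal posed over an unbound variable is parked until a binding for that variable is told, and then re-tried.

```python
# A toy illustration (assumed names throughout) of goal suspension:
# a goal over an unbound variable is parked until the variable is bound.

class Store:
    """A toy binding store: variable name -> value, plus parked goals."""
    def __init__(self):
        self.bindings = {}
        self.suspended = []          # (variable, goal) pairs awaiting a binding

    def ask(self, var, goal):
        """Run `goal` now if `var` is bound, else suspend it."""
        if var in self.bindings:
            return goal(self.bindings[var])
        self.suspended.append((var, goal))
        return None                   # suspended: no answer yet

    def tell(self, var, value):
        """Bind `var` and wake every goal waiting on it."""
        self.bindings[var] = value
        woken, self.suspended = self.suspended, []
        results = []
        for v, goal in woken:
            r = self.ask(v, goal)
            if r is not None:
                results.append(r)
        return results

store = Store()
# The goal `S > 0` suspends rather than enumerating bindings for S:
store.ask('S', lambda s: s > 0)
# Another agent tells S = 1, waking the suspended goal:
print(store.tell('S', 1))   # -> [True]
```

Rather than blindly generating candidate bindings for {\tt S}, the goal simply waits; this is the performance intuition behind the control annotations discussed above.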
While some derided this activity as ``hackers doing logic'', others were fascinated by the obviously attractive paradigm for concurrent computation that was thus opened up (particularly by the work of \cite{CP}).\footnote{See the recent personal accounts by some of the original participants in the field reported in \cite{fgcs-cacm}.} The introduction of constraints into logic programming in 1987 \cite{lassez-const} provided the first opportunity to establish a more substantial basis for concurrent logic programming.
% something more than a serendipitous hack.
Michael Maher introduced in \cite{alps} a suspension condition in the language ALPS that was based on checking that the constraints in a clause were {\em entailed} (``validated'') by the constraints that had been accumulated in the refutation. This opened the door to a {\em logical} treatment of the suspension condition, in stark contrast to the operational hacks that had hitherto occupied the field. As sometimes happens, by stepping back to a more general framework (arbitrary constraint solving), the details of the particular situation (unification --- constraint solving over finite trees) fell away and the basic underlying idea emerged in its essential simplicity and generality. This basic idea, simply put, is {\em constraint-based computation}. Concurrent logic programming provided the first clear manifestation of the idea that computation can be conceptualized as occurring through the interaction of simultaneously executing agents posting and checking constraints (on a set of variables) in a shared store \cite{vj-thesis-book}.
This style of communication and computation can then be isolated from the context in which it was first discovered, and a framework for programming languages built around it from ``first principles'', as in the work on ``process algebras'' developed along similar lines by Hoare, Milner and their colleagues (\cite{csp-book,ccs} --- see \cite{Milner-handbook} for an overview). This was the origin of the conception of concurrent constraint programming. The move enables the idea to be treated in its own terms and allows all the standard apparatus of modern-day programming languages (recursion, block-structure, scoping, modules, types, etc.) to be used for developing the paradigm, without its being encumbered by a somewhat ill-fitting parent paradigm. In retrospect, though, it is rather surprising that many of the natural operations one would use to obtain a programming language organized around the ideas of constraint-based communication are already available in logic programming. In this paper we shall review the field of concurrent logic programming from this modernist stance of constraint programming: concurrent logic programming is essentially concurrent constraint programming instantiated over the \Herbrand{} constraint system over finite trees that underlies deduction in first-order logic (see Section~\ref{herbrand}). We shall emphasize the connection with the work in theoretical computer science on process algebras, rather than the view of concurrent logic programming as a deviant version of (backward-chaining, Horn clause) logic programming. (The latter view is adopted, for example, in another survey of this area, \cite{shapiro-survey}.) At the same time, we shall develop and emphasize (see below) a major attraction of the CCP framework --- a sound and {\em complete} view of computation as deduction, a view that unfortunately is not possible if CCP is viewed as a special case of logic programming.
In particular, we shall emphasize the three-way connection between the logical interpretation, operational semantics and denotational semantics of these languages. The work on algebraic axiomatizations of various congruences of interest is not sufficiently mature for us to do a proper survey; we shall, however, point out such results where they exist. A theme that runs through this paper is the separation of the {\em determinate} CCP languages from the {\em in}determinate --- the determinate languages already offer the power of constraint-based communication and expose the basic new conceptual and semantic ideas in this approach. In some sense, the extension of these semantic ideas to the indeterminate languages, while technically complicated, is not particularly specific to constraint programming. Essentially the same ideas used here have been shown to be applicable to other, quite different models of computation \cite{kok-and-others}. We also discuss in some detail the implementation issues that arise in realizing this rather high-level computation model. Over the last decade considerable practical experience has been gained in developing implementations of concurrent logic programming languages on uni-processors, shared-memory multiprocessors and distributed networks. The basic concepts are reviewed and the major results discussed. A brief word about some themes and issues that are consciously not developed in this survey. First, we have not adopted the point of view that concurrent logic programming languages are reasonable ``machine'' or ``kernel'' languages for advanced parallel and distributed computers. This is the view that underlay work at ICOT and at the Weizmann Institute in this area in the past decade, but its validity is still open to considerable question. For a more sympathetic viewpoint, see \cite{shapiro-survey} and \cite{fgcs-cacm}.
We have also chosen not to treat here the search-non-deterministic concurrent constraint programming languages, such as \herbrand{} \cite{vj-thesis-book}, and languages based on the Andorra principle \cite{dhd-andorra} (e.g., AKL \cite{sverker-AKL}). There are three primary reasons. Space. Time. Substance --- the state of our understanding of the possibilities, languages, and implementation tradeoffs for these languages still seems immature. \paragraph{Logical account of CCP languages.} We now describe briefly the basic intuitions behind the logical account of concurrent logic programming underlying this paper. The interpretation inherited from the view of these languages as variants of backward-chaining Horn clauses is not particularly satisfying, primarily because there is no good characterization of deadlocked computations, or of indeterminate computations. We shall review an approach that has recently been developed that provides concurrent constraint programs with an interpretation that is ``dual'' to that of backward-chaining Horn clause logic programs. Programs are viewed as providing {\em necessary} rather than {\em sufficient} conditions for a predicate to hold: definitions of relations (such as {\tt append}) specify what must necessarily follow {\em if} the relation is known to hold between some objects, about which some additional pieces of information may be known. For example, we may state that if {\tt X}, {\tt Y} and {\tt Z} are in the {\tt append} relation, and {\tt X=[]}, then it must be the case that {\tt Y=Z}. Inference progresses forward: from the knowledge that certain pieces of information are true of the objects of interest, some other pieces of information are deduced. Thus the operational interpretation of predicates is that of information transducers: given certain pieces of information, they produce (potentially) more pieces of information.
It turns out that this characterization is sound and {\em complete} for determinate concurrent constraint programming computations in a precise and important sense. Obtaining a logical characterization of indeterminate computations is, however, a trickier matter. Only a partial solution --- a characterization of the ``may'' properties of indeterminate computations --- seems possible. Even for that, we have to move to a more powerful notion of logic, {\em linear logic}, recently introduced by Girard \cite{girard-ll}. The underlying issue is this: if the programming language is indeterminate, it must be possible to obtain two different answers, say $b$ and $c$, from the same initial computational state $a$. But then it must also be possible to obtain the answer $b$ {\em and} $c$ from $a$, and this may not be acceptable. For example, suppose that the initial computational state describes a situation in which Alice's account contains \$100, and she has given a cheque to Bob for \$100, and a cheque to Charlie for \$100 (and she has no other cheques outstanding). There are basically two evolutions of such a system: one in which Bob's cheque is cashed and the other in which Charlie's cheque is cashed. However, if this indeterminate computation is expressible in logic, then there should be yet a third computation in which the {\em conjunction} of the two results is possible: Bob's cheque is cashed, and Charlie's cheque is cashed. This situation is clearly incoherent.\footnote{The first author is indebted to Carl Hewitt for making this argument.} The root of the problem lies in the fact that (classical) logic is not able to distinguish the case in which a situation $A$ can give rise to a conclusion $B$ and it can give rise to a conclusion $C$ from the case in which $A$ can give rise to ($B$ and $C$). From an operational perspective, the problem arises because in logic one is freely allowed to {\em duplicate} formulas: $A \wedge A$ is semantically identical to $A$.
Therefore the system of atoms
\begin{ccprogram}
debit(Account, bob(100, AckB)),
debit(Account, charlie(100, AckC)),
account(Account, 100)
\end{ccprogram}
\noindent is semantically indistinguishable from
\begin{ccprogram}
debit(Account, bob(100, AckB)),
debit(Account, charlie(100, AckC)),
account(Account, 100),
account(Account, 100)
\end{ccprogram}
The latter system can, however, give rise to a computation in which {\em both} cheques are cashed. It is not surprising that indeterminism and concurrency thus bring to the surface the fundamental notions of {\em object identity} and duplication. It is precisely the commitment of classical logic to deal with ``static'' notions like truth and falsehood that lets us down. What is required here is a system of logic faithful to a dynamic notion of non-duplicable {\em resources}, a logic that can distinguish between the two situations above. This kind of flexibility has recently been placed on the table by the advent of the so-called sub-structural logics, the most prominent of which is linear logic. Linear logic may be seen as arising from classical logic by dropping certain structural rules that allow formulas to be arbitrarily duplicated and eliminated during the course of a derivation. This causes the familiar conjunction and disjunction operations to split into two: the so-called {\em additive} versions, which copy formulas, and the {\em multiplicative} versions, which do not. This gives formulas the nature of {\em resources} that must be accounted for carefully during proof (computation). In particular, multiplicative conjunction ($\otimes$) is {\em not} idempotent: $A \otimes A$ is not logically equivalent to $A$. The lost power of the structural rules can be recovered locally by means of the ``modalities'' $!$ and $?$.
In particular, the formula $!A$ is allowed to be duplicated any number of times, derelicted to obtain $A$, and dropped altogether.\footnote{See \cite{scedrov-ll,PDLsigact} for more motivation and tutorial introductions to linear logic.} Below we shall develop in some more detail the characterization of indeterminate CCP in linear logic. For the time being, we note that linear logic allows one to distinguish the case in which from $A$ we can get {\em both} $B$ and $C$ (this is represented as $A \linarrow B \otimes C$) from the situation in which from $A$ we can get {\em either} $B$ or $C$ (this is represented as $A \linarrow B \with C$, where $\with$ is the operator for additive conjunction). (And this distinction arises precisely because the identification $A \otimes A = A$ is no longer valid.) \paragraph{Rest of this paper.} We first review the basic ideas behind the CCP paradigm. The determinate CCP languages are then presented --- their syntax and operational semantics, the logical readings of programs, denotational semantics, and examples of computations in this style. A similar treatment is then carried out for the indeterminate CCP languages. For historical purposes, we then review the special features of some of the major concurrent logic programming languages that have been developed in the last decade. This also offers an opportunity to discuss some related sub-themes, such as atomic vs.\ eventual publication of constraints, failure-free computations, etc. We then survey the work on implementations of these languages, focusing on the different architectural issues that arise. Finally we point to areas of active current work and provide an assessment of the future of the field. \newpage\section{The Basic Paradigm} The fundamental move underlying concurrent constraint programming is to replace the notion of {\em store-as-valuation} behind imperative programming languages with the notion of {\em store-as-constraint}.
A constraint may conceptually be taken to be a (possibly infinite) subset of the space of all possible valuations in the variables of interest; e.g., {\tt X = f(a,Y)} denotes the set of all pairs $\tuple{x,y}$ of finite trees such that the first is a tree with root labelled {\tt f} and with two sons, the first of which is a tree with root labelled {\tt a} and with no sons, and the second of which is the tree $y$. Thus the store may describe only partial pieces of information about the possible values that variables can take. The situation in which the store determines the actual denotation of a variable (say, $\tt X=a$) is permissible, but it is the exception rather than the rule. This paradigm shift renders the usual notions of (imperative) ``write'' and ``read'' incoherent. There may be no single, finitely-describable value left to return as the result of a ``read'' operation on a variable. A ``write'' operation is only capable of taking a fully-formed concrete value for a variable, and updating the store to have this variable take on this new value. But the new store no longer has any connection with the old store --- constraints satisfied by the old store may not be preserved at all by the new store. The underlying principle of {\em information preservation} and {\em accumulation} we are trying to build on is irretrievably compromised. Instead, \cite{my-thesis-book} proposes the replacement of read with the notion of {\em ask} operations and write with the notion of {\em tell} operations. An ask operation takes a constraint (say, $c$) and uses it to probe the structure of the store. It succeeds if the store contains enough information to entail $c$. Tell takes a constraint and conjoins it to the constraints already in place in the store. That is, the set of valuations describing the resultant store is the intersection of the set of valuations describing the original store and those describing the additional constraint.
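The ask and tell operations just described can be made concrete. The following Python fragment is our own illustrative rendering, not part of any surveyed language; it assumes a tiny finite-domain system so that the ``set of valuations'' reading of a constraint can be enumerated directly, making tell an intersection and ask an entailment check.

```python
# An illustrative (assumed) rendering of store-as-constraint:
# a constraint is represented by its set of satisfying valuations.

from itertools import product

DOMAIN = range(4)
VARS = ('X', 'Y')

def valuations(constraints):
    """All valuations over VARS x DOMAIN satisfying every constraint."""
    return [v for v in (dict(zip(VARS, p)) for p in product(DOMAIN, repeat=len(VARS)))
            if all(c(v) for c in constraints)]

class Store:
    def __init__(self):
        self.constraints = []    # conjunction; the empty store is `true`

    def tell(self, c):
        """Conjoin c: the new valuation set is the intersection."""
        self.constraints.append(c)

    def ask(self, c):
        """Entailment: every valuation of the store satisfies c."""
        return all(c(v) for v in valuations(self.constraints))

s = Store()
print(s.ask(lambda v: v['X'] > 0))       # False: the store is too weak
s.tell(lambda v: v['X'] == v['Y'] + 1)   # partial information about X
s.tell(lambda v: v['Y'] == 2)
print(s.ask(lambda v: v['X'] > 0))       # True: X = 3 is now entailed
```

Note that neither tell determined {\tt X} outright; the ask succeeds only because the conjunction of the two partial pieces of information entails it.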
Thus, as computation progresses, more and more information is accumulated in the store---a basic step does not {\em change} the value of a variable but rules out certain values that were possible before. Thus, the store is {\em monotonically refined} --- all the constraints that were true of the store before the operation (that is, were entailed by the store) continue to be true of the store {\em after} the operation. Central to the notions described above is the idea of constraint systems as systems of partial information. Indeed, it turns out to be natural to adopt Scott's theory of information systems~\cite{scott-info-sys} for this purpose. For the time being, suffice it to say that all the theory demands is a specification of the pieces of information (in finitely many variables) admissible in the system, and a specification of the entailment relation that holds between them. The possibilities for concurrency are now quite apparent. Instead of a single agent interacting with the store via ask and tell operations, any number of agents can simultaneously interact with the store in such a fashion. Synchronization is achieved via a blocking ask---an agent blocks if the store is not strong enough to entail the constraint it wishes to check; it remains blocked until such time as (if ever) some other concurrently executing agents add enough information to the store for it to be strong enough to entail the query. Note, in particular, that even though the paradigm is based on the notion of a shared store, ideas such as ``read/write locks'' do not have to be introduced for synchronization. The basic reason is that only a benign form of ``change''---accumulation of information---is allowed in the system.\footnote{Which, of course, is not to say that systems with changeable state are not describable in the \cc\ paradigm. State change can be represented without compromising the basic paradigm by adapting standard techniques from logic and functional programming.
Namely, ``assignable variables'' are embedded in the local state of a recursive agent---the agent ``changes'' the value of the variable merely by recurring with a new value for one of its arguments. In some languages in the \cc\ framework (e.g., \janus\ \cite{janus}) there is considerable hope that such mechanisms for representing state change can be competitive in performance with traditional assignment-based techniques.} If desired, indeterminacy can be introduced by allowing an agent to block on multiple distinct constraints simultaneously, and to specify distinct actions to be invoked if the corresponding ask condition is satisfied by the store. Thus, in this view, an agent (``computing station'') is thought of as an {\em information transducer}. An agent can either add a constraint to the store (tell), suspend until there is enough information in the store to entail a given constraint, or decompose into a number of other agents running in parallel, possibly with hidden interconnections, and communicating and synchronizing via the store. Of course, as is standard, computations of unbounded length can be achieved by using recursion. \paragraph{Computational significance of the paradigm.} While extremely simple, these ideas yield a powerful programming paradigm which derives its strength from the versatility of its communication mechanism. Essentially the entire range of ``data structures'' in computer science and AI opens up to examination.
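The footnoted representation of state change by recursion can be sketched briefly. The following Python fragment is a hypothetical rendering of our own (the account agent and its message format are invented for illustration, not drawn from \janus{} or any other surveyed language): the ``assignable variable'' is an argument, and every update is a recursive call with a new value.

```python
# An illustrative (assumed) sketch of state change via recursion:
# the balance is never assigned in place; each step recurs with a new value.

def account(balance, messages):
    """Process a stream of (kind, amount) messages against a balance."""
    if not messages:
        return balance
    kind, amount = messages[0]
    if kind == 'deposit':
        return account(balance + amount, messages[1:])
    if kind == 'withdraw' and amount <= balance:
        return account(balance - amount, messages[1:])
    return account(balance, messages[1:])   # ignore infeasible requests

print(account(100, [('withdraw', 30), ('deposit', 5)]))   # -> 75
```

In the \cc{} setting the message stream would itself be an incrementally constrained variable, so the agent can run concurrently with its clients; the recursion pattern is the same.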
One can ask of graphs, binary trees, dictionaries, hash tables, association lists, finite sets, temporal intervals, real-valued temporal signals, etc.: what are coherent pieces of partial information about these objects and their interrelationships?\footnote{As indeed these questions have been asked over the centuries for such ``classical'' data-structures as continuous real-valued functions.} Indeed, a tremendous amount of work in computer science and AI is concerned precisely with elaborating such systems of partial information about objects of interest, and investigating their cost vs.\ expressiveness tradeoffs. Here, we shall just attempt to convey the basic idea, and refer the reader for more details to documents such as \cite{vj-thesis-book} and the review of constraint logic programming in this volume. The essential idea is that variables serve as communication channels between multiple concurrently executing agents. The computational framework does not insist that there be an {\em a priori} partitioning of the agents into ``producers'' and ``consumers'' of information on the variables (as happens, for example, in function application, where information flows from an argument to the function body, or in CSP/CCS-style or actor-style languages). In fact, the {\em same} agent may simultaneously add and check constraints on the same variables. Also, the computational framework allows the underlying ``network transport protocol'' to perform (potentially arbitrarily sophisticated) inferences on the ``messages'' entrusted to it---these, of course, are the deductions sanctioned by the entailment relation of the constraint system. This allows each agent to state the constraints as it sees them, and frees the programmer from having to put in agents for explicitly collecting these messages and drawing the appropriate inferences (``reformatting the data'').
This allows for very ``high-level'' languages in which essentially computation and communication can be expressed merely by posting and checking constraints. \paragraph{The \Herbrand{} constraint system.}\label{herbrand} For concreteness, we consider the \Herbrand{} constraint system that underlies logic programming languages. Assume given some underlying domain of denotations, a set $I$ of individuals. The set of interest to us is the set of labelled, finitely branching, finitely deep trees over $I$, ${\cal T}(I)$. (Individuals are identified with the tree with $0$ sons labelled with the name of the individual.) We shall be interested in two syntactic categories, Terms and Constraints, specified by the BNF grammar: \begin{equation} \begin{array}{llll} \mbox{(Terms)} & s,t & ::= & X \alt a \alt f(t_1, \ldots, t_n) \\ \mbox{(Constraints)} & c & ::= & (s = t) \alt \exists X. c \end{array} \end{equation} Here $X$ is a meta-variable ranging over some underlying domain of variables, $a$ ranges over names for the individuals in $I$, and $f$ over the permissible labels. A constraint $s=t$ is said to be satisfiable if there is an assignment of values to the variables in $s$ and $t$ such that the two terms denote the same element of ${\cal T}(I)$. Existential quantification is interpreted in the usual way. A collection of constraints $c_1, \ldots, c_n$ entails $c$ if every valuation that satisfies each of the $c_i$ also satisfies $c$. It is well known that the satisfiability and entailment problems for this constraint system can be solved essentially by the first-order most general unification algorithm. (See, e.g., \cite{lassez-unify} for a review of the basic ideas.)
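The unification algorithm alluded to above can be sketched compactly. The following Python fragment is our own illustrative rendering under a homegrown term representation (a variable is a string, a term with label {\tt f} is a tuple headed by {\tt 'f'}; none of these conventions come from the surveyed languages): it decides satisfiability of equations over finite trees, including the occurrence check.

```python
# An illustrative (assumed) first-order unification sketch.
# Terms: a variable is a str; a constant/compound is a tuple (label, sub, ...).

def walk(t, s):
    """Follow variable bindings in substitution s."""
    while isinstance(t, str) and t in s:
        t = s[t]
    return t

def occurs(x, t, s):
    """Does variable x occur in term t (under s)?  The occurrence check."""
    t = walk(t, s)
    if t == x:
        return True
    return isinstance(t, tuple) and any(occurs(x, a, s) for a in t[1:])

def unify(t1, t2, s):
    """Extend s so that t1 = t2, or return None if unsatisfiable."""
    t1, t2 = walk(t1, s), walk(t2, s)
    if t1 == t2:
        return s
    if isinstance(t1, str):
        return None if occurs(t1, t2, s) else {**s, t1: t2}
    if isinstance(t2, str):
        return unify(t2, t1, s)
    if t1[0] != t2[0] or len(t1) != len(t2):   # labels and arities must agree
        return None
    for a, b in zip(t1[1:], t2[1:]):
        s = unify(a, b, s)
        if s is None:
            return None
    return s

# X = f(a, Y) is satisfiable; X = f(X) fails the occurrence check:
print(unify('X', ('f', ('a',), 'Y'), {}))   # -> {'X': ('f', ('a',), 'Y')}
print(unify('X', ('f', 'X'), {}))           # -> None
```

Entailment queries can be answered with the same machinery: a constraint is entailed when unifying it against the store yields no new bindings on the store's variables.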
The fundamental properties of this constraint system are: \begin{itemize} \item two terms $f(t_1,\ldots, t_n)$ and $g(s_1,\ldots, s_k)$ denote the same tree iff $f=g$, $n=k$, and, recursively, $t_i$ denotes the same tree as $s_i$, for $i=1, \ldots,n$; \item a constraint $X=t$ is satisfiable iff $t$ is $X$ itself or does not contain $X$. \end{itemize} As a result, consistency and entailment for this system can be checked in time proportional to the size of the formulas involved. In reality, this theoretical result is not of much use --- rather, what one is interested in is an assurance that the commonly occurring unification problems are solvable in {\em constant} time. As is the sad customary practice in logic programming implementations, this is guaranteed by ignoring the ``occurrence check''. In the spirit of constraint programming, this can be justified by claiming that the implementations are faithful to a constraint solver that works over the domain of rational rather than finite trees\footnote{Rational trees are allowed to contain loops --- the requirement is that they must have a finite number of {\em distinct} sub-trees.}. For the purposes of this survey, however, we shall ignore the complications that this compromise brings. Even a constraint system as mundane as \Herbrand{} already offers powerful possibilities for communication, leading to interesting idioms such as incomplete messages, short-circuits, difference-lists, ``messages to the future'', etc.\ \cite{my-thesis-book}---the techniques that have colloquially been referred to as stemming from ``the power of the logical variable''. (These idioms will be illustrated later in this paper.) For instance, it is possible to embed a communication channel (variable) in a message as a ``first-class'' object, in exactly the same way as data. A process $p$ may post the constraint $\tt X = f(a, Producer)$. Simultaneously and separately, another process $q$ may post the constraint $\tt X = f(Item, Consumer)$.
And now $q$ can discover that in fact the store implies {\tt Item = a} (thus, the value {\tt a} has been communicated from $p$ to $q$ --- and in fact, to any other process that has access to (contains) the variable $X$). Further communication can occur on the channel accessed by $p$ as {\tt Producer} and by $q$ as {\tt Consumer}, since indeed these have been identified (as a result of the two constraints on $\tt X$). This kind of communication of a channel in a message is at the heart of the power of concurrent constraint programming. It allows for an elegant and simple form of dynamic reconfigurability of processes. This is difficult to achieve in a simple way within frameworks such as those of CCS and CSP. \newpage\section{Determinate concurrent constraint programming} Kahn's theory of determinate data-flow \cite{kahn} remains the dominant theory of determinate concurrent computation. While simple, elegant and widely applicable, the theory's emphasis on computing with directional point-to-point communication channels imposes some expressiveness restrictions: channels carry only fully formed values from the underlying data domain; networks are not dynamically reconfigurable, in that connections cannot themselves be passed as data on communication channels; and each channel has only one reader and one writer. Because of these restrictions, Kahn's theory cannot model such useful programming idioms as incomplete messages, difference lists, recursive doubling, multiple readers/writers of a communication channel, dynamic reconfiguration, etc. Although these techniques were originally developed in the context of indeterminate concurrent logic programming, they are all both inherently determinate and usefully expressible in determinate contexts. \subsection{Syntax and operational semantics} What should a programming language for the paradigm of store-as-constraints look like? We shall develop here a simple concrete syntax influenced by the syntax of linear logic.
For the determinate languages, we could use a syntax from classical logic, since it will turn out that the execution mechanism (operational semantics) for determinate \cc{} programs is complete for classical derivability (and hence also linear derivability). This, however, will not be the case when we move on to indeterminate languages: there, only completeness with respect to linear derivability will be obtained. So we prefer to use linear syntax up front. This will involve using $\otimes$ for parallel composition instead of $\wedge$, $\linarrow$ instead of $\rightarrow$, and explicitly using $!$ to indicate reusability of formulas. First, there has to be a mechanism for posting constraints to the store. Constraints are to be regarded as formulas that, once added to the store, are there forever: \begin{equation} \begin{array}{lllll} (Agents) & A & ::= & !c & \mbox{-- Tell} \end{array} \end{equation} Next, there has to be the possibility of checking that certain pieces of information are present in the store, and, if so, of reducing to some other agent: \begin{equation} \begin{array}{lllll} & A & ::= & !c \linarrow A & \mbox{-- Ask Method} \end{array} \end{equation} In the following we will sometimes use the concrete syntax $c \then A$ instead of $!c \linarrow A$ to emphasize that this usage of $\linarrow$ is consistent with its interpretation as classical implication. There should be a mechanism to execute more than one such agent in parallel: \begin{equation} \begin{array}{lllll} & A & ::= & A \otimes A & \mbox{-- Parallel Composition} \end{array} \end{equation} In concrete syntax we shall write $A \otimes B$ as $A,B$. It should be possible to introduce new communication channels private to a scope: \begin{equation} \begin{array}{lllll} & A & ::= & \exists X. A & \mbox{-- Local Channels} \end{array} \end{equation} Finally, there should be the possibility of {\em replication}, of invoking an agent an unbounded number of times.
For this, we allow atoms with user-defined predicate symbols (called user-defined atoms; let their set be ranged over by the meta-variable $H$), and replicated definitions: \begin{equation} \begin{array}{lllll} & A & ::= & H & \mbox{-- Linear Message}\\ &&& \alt ! \forall \bar{X}. H \linarrow A & \mbox{-- Linear Method} \end{array} \end{equation} We shall impose a syntactic restriction designed to avoid indeterminacy in process invocation. \begin{restriction}\label{defin-convention} We shall require: (1) For any given atom $G$ and linear method $! \forall \bar{X}. H \linarrow A$, there should be at most one substitution $\theta$ such that $G=\theta(H)$. (This is true if $G$ and $H$ are \Herbrand{} terms, but may not be true if they contain interpreted terms, e.g.\ $+,\cup$, introduced by an underlying constraint system.) (2) If an agent contains more than one linear method for a predicate letter, then the heads of any two such methods shall not be unifiable, once existential variables are replaced with distinct constants. \end{restriction} The above syntax is very flexible. It does not distinguish between {\em programs} and {\em agents}, as logic programming languages have traditionally done. This allows methods to be defined local to the scope of existentially quantified variables (e.q.v.'s), a power that we shall exploit subsequently. Furthermore, methods are allowed to have free variables that are quantified outside the method. \begin{note} Indeed, it is possible to show that with such dynamic specification of methods, it is not necessary to have more than one user-defined predicate letter. Essentially, e.q.v.'s can play the role of constant names for new procedures. However, we shall not take that extra step here, in order to remain close to the traditional syntactic presentation of concurrent logic programming languages. \end{note} \begin{note} Henceforth, when presenting programs in concrete syntax, we shall adopt the following conventions.
We shall assume that whether an atom is a constraint atom or a user-definable atom is obvious by inspection. Constraints are assumed to be ``!''ed, hence they need not be explicitly ``!''ed (both in the left and right hand sides of implications). Also, we adopt the lexical convention that universally quantified variables (u.q.v.'s) begin with a capital letter, and existentially quantified variables (e.q.v.'s) with an underscore. A u.q.v. is considered scoped at the outermost arrow in whose left-hand side it occurs, and an e.q.v. is considered scoped at the largest conjunct containing all occurrences of that variable. This will allow us to almost always dispense with explicit quantifiers in programs. \end{note} The translation from the conventional syntax for concurrent logic programs (e.g., FGHC) to this syntax is straightforward, as the following example shows. \begin{example} The {\tt append/3} FGHC program: \begin{ccprogram} \agent append([],Y, Z) :- true | Y=Z.. \agent append([A|B], Y, Z) :- true | Z=[A|C], append(B, Y, C).. \end{ccprogram} \noindent is translated to \begin{ccprogram} \agent append(X,Y,Z) -o X = [] -> Y=Z, X=[A|B] -> (Z=[A| \_C], append(B, Y,\_C)).. \end{ccprogram} \end{example} \paragraph{Operational semantics.} To present the operational semantics, it is most convenient to take a configuration to be merely a multiset of agents. (Multiset union is to be treated as conjunction.)\footnote{The approach we use here is a departure from the traditional SOS-style of giving operational semantics rules. However, it has the advantage of being extremely simple, and of drawing on the presentation of inference rules in a Gentzen-style sequent presentation of the underlying logic.} Intuitively, a configuration represents the parallel conjunction of the agents that are currently present in the system. The collection of constraints in a configuration constitutes the store. Below, we shall let $\Gamma$ range over such a multiset, and $C$ over a multiset of constraints.
We shall assume given a relation $\vdash$ between finite multisets of constraints and a constraint, specifying the semantics of the underlying constraint system, in this case \Herbrand. There are four transition rules, corresponding to the four composite combinators in the language. An agent $(A,B)$ decomposes instantly into its constituents: \begin{equation}\label{tensor-trans} (\Gamma, (A_1,A_2))\derives (\Gamma, A_1, A_2) \end{equation} We shall identify formulas that are the same up to renaming of bound variables. To execute an existentially quantified agent, then, we simply find a renamed version in which the bound variable is not free in the configuration: \begin{equation}\label{exists-trans} \begin{array}{ll} (\Gamma, \exists X. A)\derives{(\Gamma, A)} & \mbox{($X$ not free in $\Gamma$)} \end{array} \end{equation} Computation occurs when a message meets a method. There are two kinds of reductions that can happen, depending on whether the formulas involved are reusable or not. An ask agent reduces to its body once its condition is entailed by the store: \begin{equation}\label{cons-trans} \begin{array}{ll} (\Gamma, C, !d \linarrow A)\derives(\Gamma, C, A) & \mbox{(if $C\vdash !d$)} \end{array} \end{equation} A message can be consumed if there is a method that matches, given the constraints in the store: \begin{equation}\label{proc-trans} \from{\begin{array}{ll} C, G \vdash H[\bar{t}/\bar{X}] & \mbox{$M=(\forall \bar{X}. H \linarrow A)$} \end{array} } \infer{(\Gamma, C, !M, G)\derives(\Gamma, C, !M, A[\bar{t}/\bar{X}])} \end{equation} As usual, a configuration $\Gamma$ is said to be {\em stuck} if there is no configuration $\Gamma'$ such that $\Gamma \derives \Gamma'$. A stuck configuration is said to have {\em terminated} if it consists of only reusable agents (methods and constraints). The result of a terminated computation is its store, existentially quantified with all variables other than those present in the initial configuration.
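To make the reduction relation concrete, here is a toy interpreter (our own illustration, not drawn from any \cc{} implementation). Constraints are modelled as ground tokens, entailment as set membership, and agents as nested tuples; only the tell, parallel composition, and ask transitions are animated, with local channels and replicated methods elided.

```python
# A toy reducer for the determinate fragment: a configuration is a list
# of agents, and the store is the set of constraints told so far.
# Agents: ('tell', c), ('par', A, B), ('ask', c, A).
# Entailment is simple membership -- a stand-in for a real solver.

def run(agents):
    """Reduce a configuration until no rule applies.
    Returns (store, suspended_asks); suspended asks are the stuck agents."""
    store = set()
    pending = list(agents)
    progress = True
    while progress:
        progress = False
        still_blocked = []
        for a in pending:
            kind = a[0]
            if kind == 'tell':                 # !c : add c to the store
                store.add(a[1]); progress = True
            elif kind == 'par':                # A (x) B : decompose
                still_blocked += [a[1], a[2]]; progress = True
            elif kind == 'ask':                # !c -o A : fire once c entailed
                if a[1] in store:
                    still_blocked.append(a[2]); progress = True
                else:
                    still_blocked.append(a)    # remains suspended
        pending = still_blocked
    return store, pending
```

Running the configuration corresponding to $!c \otimes (!c \linarrow !d)$ yields the store $\{c,d\}$, mirroring the tell and ask transitions; an ask whose condition is never told stays suspended, i.e., the configuration is stuck but not terminated.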
The fundamental property of the operational semantics is Church-Rosser: \begin{theorem}[$\derives$ is Church-Rosser.] Let $\Gamma, \Gamma_1, \Gamma_2$ be configurations such that $\Gamma \derives \Gamma_i$, $i=1,2$. Then there are configurations $\Gamma_1', \Gamma_2'$ such that $\Gamma_i \derives \Gamma_i'$ and $\Gamma_1'$ is identical to $\Gamma_2'$ up to renaming of e.q.v.'s. \end{theorem} The basic reason, as always, is that no two reduction steps can be in conflict. If a step can be taken at any given stage of the computation, then the same step can be taken at any subsequent stage of the computation. Clearly, this is the case for steps involving Rules~(\ref{tensor-trans}) and (\ref{exists-trans}) (up to e.q.v. renaming). A reduction by Rule~(\ref{cons-trans}), once enabled, is always enabled, since constraints added to the configuration are never deleted. There can be only one reduction involving a user-defined atom, by Convention~\ref{defin-convention}, and that reduction once enabled will stay enabled. Hence any two basic reduction steps will commute. The theorem immediately gives us that if an agent has a terminating reduction sequence then all its reduction sequences terminate, with final stores that are equivalent when projected onto the variables in the initial configuration. Thus, determinate \cc{} programs are, indeed, determinate. \subsection{Logical Semantics}\label{det-log} The fragment of logic specified above is in some sense ``dual'' to the usual definite clause logic underlying logic programming: a clause is of the form $H \then A$, {\em not} $H \leftarrow A$; one describes {\em necessary} rather than sufficient conditions. Intuitively, one thinks of specifying for each predicate some pieces of information that must follow (for some of the arguments to the predicate) when some other pieces of information are known to be true about some arguments to the predicate. Each predicate is thus thought of as a transducer of (partial) information.
One thinks of the underlying engine as making explicit pieces of information that are implicit in the original presentation of an agent --- as explicit as is needed to answer a (primitive) test. \begin{lemma} Let $A$ be a linear formula, with free variables contained in the free variables of $\Gamma$. If $\Gamma \derives \Gamma'$, then $\Gamma \vdash A$ whenever $\Gamma' \vdash A$. \end{lemma} This implies the {\em soundness} of CCP computations for this interpretation in logic: \begin{theorem}[Soundness of logical interpretation] Let $d$ be a constraint with its free variables contained in those of $\Gamma$. If $\Gamma \derives^{\star} (\Gamma',C)$, and $C \vdash d$, then $\Gamma \vdash d \otimes \top$. \end{theorem} Here $\top$ is a linear logic constant with the following proof rule in a Gentzen-style presentation of the logic: $$ \Gamma \vdash \top$$ In other words, any set of formulas can be used to establish $\top$. \begin{theorem}[Completeness of logical interpretation] If $\Gamma \vdash d \otimes \top$ (with the free variables of $d$ contained in those of $\Gamma$), then there is a configuration $\Gamma'$ such that $\Gamma \derives^{\star} \Gamma'$ and $\Gamma' \vdash d \otimes \top$. \end{theorem} The basic idea behind the proof is to examine the proof of $\Gamma \vdash d \otimes \top$ and from that proof construct an execution sequence that satisfies the given property. The soundness and completeness results cited above could have been cast in intuitionistic and classical logic as well as linear logic. No essential features of linear logic are being used in the determinate CCP set-up. However, we are setting the stage for similar results for the indeterminate CCP languages --- there we can obtain completeness only with respect to linear derivability. \subsection{Denotational Semantics} Above, we have seen that (terminating) determinate computations give rise to a unique store of constraints.
One could then ask about the nature of the mappings from constraints to constraints established by the execution of determinate \cc{} programs. \cite{popl91} established that the right mathematical objects corresponding to determinate \cc{} programs are {\em closure operators}. Closure operators are functions $f$ over a partial order $L$ that are extensive ($f(x) \geq x$), monotone ($f(x) \geq f(y)$ whenever $x \geq y$) and idempotent ($f(f(x))=f(x)$). Idempotence says that the range of $f$ is equal to the set of its fixed-points; extensiveness says that each point in the domain is mapped to a fixed-point above it, and monotonicity ensures that this must be the {\em least} such fixed-point. Hence, a closure operator can be uniquely represented by the set of its fixed-points (its range). Closure operators capture precisely the idea of monotonic accumulation of information inherent in the \cc{} conception. On the one hand, this discovery showed us that the mathematical objects underlying (at least the determinate) \cc{} languages were very simple indeed, and had many interesting properties. (For example, the set of resting points of the process $f \wedge g$ is the intersection of the resting points of $f$ and $g$ --- from this it is immediate that parallel composition is commutative, associative, idempotent, etc.) On the other, by thinking of the \cc{} languages as {\em the} language framework for computing with closure operators, we were led to a number of simple extensions with interesting computational content. (See Section~\ref{other-work}.) \subsection{Programming idioms} To make it simpler to write programs over \Herbrand{}, we shall adopt one more convention. Often, we shall want to specify that a parametrized constraint be checked, and allow the check to return values for the parameter. For instance, we may want to check whether {\tt X} has been bound to a cons, and if so to obtain in {\tt Y} the car and in {\tt Z} the cdr.
This can be done by checking the constraint {\tt X = [Y | Z]}, where {\tt Y} and {\tt Z} are parameters. To allow for this we introduce: \begin{equation} \begin{array}{llll} A & ::= & \forall \bar{X}. c \linarrow A & \mbox{-- Ask Method} \end{array} \end{equation} If $c$ is a \herbrand{} constraint then $ \forall \bar{X}. c \linarrow A$ is equivalent to $ (\exists\bar{X}.c) \linarrow \exists\bar{X}.(c, A)$. So the new form can be regarded merely as a convenient shorthand. \subsubsection{Simple data-flow programming.} It should be clear how Kahn data-flow networks can be programmed in \cc{} languages. Essentially, these languages directly provide the ability to compose (recursive) networks concurrently, to establish connections between communication channels, and to hide local communication channels. \begin{ccprogram} \agent fib(Seq, Count) -o fib_1([0,1 |Seq], Count).. \agent fib_1([X1, X2 | Rest], Count) -o Count \gt 0 -> (Rest= [X1+X2 | \_Rest1], fib_1([X2 | Rest], Count-1)), Count = 0 -> Rest = [].. \end{ccprogram} Shapiro's paper \cite{shapiro-systolic} examines many systolic programming idioms from the concurrent logic programming viewpoint. \subsubsection{Producer consumer interactions.} We show how to set up a bounded-buffer producer-consumer interaction between two agents. When the two agents are created, a new channel is generated, and the producer is given the teller for the channel, and the consumer, the asker. Subsequently, the producer chooses to generate messages on the channel through the teller in the usual stream-oriented way, and the consumer ``receives'' these messages through the asker. \begin{ccprogram} \agent goal -o producer(B, [demand, demand, \ldots, demand | \_D]), consumer(B, \_D).. \agent producer(B, X) -o X = [demand | D] -> (B = [produced | \_B], producer(\_B, D)), X = [] -> B = [].. \agent consumer(In, D) -o In = [produced | B] -> (D = [demand | \_N], consumer(B, \_N)), In = [] -> D= []..
\end{ccprogram} \subsubsection{Short circuits} The standard short-circuit programming technique can be expressed thus: \begin{ccprogram} \agent p(\ldots,L,R) -o \ldots, c_1 -> p(\ldots,L, \_M), p(\ldots, \_M, R), c_2 -> R = L.. \end{ccprogram} Here, $c_1$ is the condition under which a new switch must be spliced into the circuit, and $c_2$ the condition under which the circuit must be closed. An initial query may be {\tt p(done, D),q(D)}. ``{\tt q}'' can wait until {\tt D=done} succeeds, before proceeding. In \cite{dist-term}, streams of switches, a programming technique for repeated use of a short circuit, were presented in the context of \herbrand\ programming. \begin{ccprogram} \agent p(\ldots,L,R) -o \ldots, c_1 -> p(\ldots,L, \_M), p(\ldots,\_M,R), c_2 -> (L=[\_M | \_L'], R = [\_M|\_R'], p(\ldots, \_L', \_R')), c_3 -> R=L.. \end{ccprogram} When an agent desires to close a switch {\em for the current phase}, it merely equates its left channel to a cons and its right channel to a cons, and equates their cars, recurring with the cdrs. This appears to be an instance where run-time unification is required, though it is guaranteed that the unification will succeed. Unification in this case provides a degree of decoupling between successive {\tt p} agents: a {\tt p} agent can decide to close a switch in the current phase without having to synchronize with any other agent. It records its decision in the store through the constraints it has imposed on its left and right channels. When all agents have imposed similar constraints, the end of the current phase can be detected by the agent holding both ends of the chain. Thus the bidirectionality of unification is put to good use. Note that the third rule closes, in one step, a switch for the current and all future rounds. \subsubsection{Incomplete messages} A major source of simplicity and elegance in \cc{} languages is that servers can respond directly to service requests by communicating on ports embedded in the service request.
This technique is illustrated by the following ``queue-server'' program: \begin{ccprogram} \agent queue(Input)-o (split(Input, \_Enq, \_Deq), match(\_Enq, \_Deq)).. \agent split([Message | Rest], En, De) -o Message = enqueue(X) ->(En = [X | \_En1], split(Rest, \_En1, De)), Message = dequeue(X) ->(De = [X | \_De1], split(Rest, En, \_De1)).. \agent match([X | Enq], [ Y | Deq]) -o Y=X, match(Enq, Deq).. \end{ccprogram} These rules define a predicate {\tt queue/1}, which will receive a sequence of {\tt enqueue} and {\tt dequeue} messages over time and will be able to respond as if these operations were being performed on a queue.\footnote{Typically the input stream is obtained by merging one stream from each client---the handling of many-to-one communication schemes is discussed in more detail below.} Operationally, a {\tt queue} agent splits into two agents. One monitors the input stream and distributes it into two streams, one consisting of the enqueue requests received (preserving order of arrival) and the other consisting of the dequeue requests received. Each enqueue request contains the value to be enqueued. Each dequeue request contains a {\em teller} for a variable, the asker for which is retained by the agent sending the request. When an enqueue request arrives, the value it carries is communicated via the tell capability directly to the agent that requested the corresponding dequeue operation. (Note that the length of the dequeue list can be greater than the length of the enqueue list, so that the queue may be ``negative'' for a while.) If we wish to give clients separate ``enqueue'' and ``dequeue'' capabilities then only the {\tt match/2} program is needed. \subsubsection{Recursive doubling} The parallel prefix problem is of substantial interest in the area of concurrent algorithms (see {e.g.} \cite{kruskal-par-prefix}). We use it here to demonstrate how to achieve recursive doubling by sending askers and tellers as messages, thus achieving ``transparent forwarding''.
Given $n$ numbers $a_1, \ldots, a_n$, the problem is to compute the $n$ numbers $b_1, \ldots, b_n$ such that $b_j = a_1 \oplus \ldots \oplus a_j$, where $\oplus$ is a binary, associative operation. To be concrete, we shall assume that ``$\oplus$'' is ``$+$'' in the following. To solve the problem, a chain of $n$ agents is generated at run-time: each agent ${\tt pp}_i$ shares a variable with the $(i-1)$st agent (the {\em left} link for the $i$th agent and the {\em right} link for the $(i-1)$st agent) and with the $(i+1)$st agent. Each agent ${\tt pp}_i$ also has access to $a_i$. In what follows it is useful to think of computation as proceeding in cycles---in each ``cycle'' an agent receives one message and sends another---even though, in reality, there is no central clock and all agents are executing asynchronously. Each message is of one of two types. First, it may be the constant {\tt []}. Alternatively, it may be a three-element array: the {\em value}, the {\em left} connection and the {\em left2} connection. An agent ${\tt pp}_i$ stores locally the following pieces of information: $a_i$; its left and right connections on the chain; a variable that will be equated to the {\em final} value of the number at this agent; and another variable, which will represent a connection to the agent ``two steps to its right''. The role of the last variable---which is critical to the algorithm---is explained below. When an agent is enabled initially, it sends a message to the agent on its right. In this message, the value field contains the number at the agent, and the left field is a variable that may be used to receive the messages the agent will send subsequently. The left2 field contains a variable which will be constrained {\em in the next cycle} to a variable that may be used to receive the messages that the agent {\it to the left} of the agent sending this message will send in future.
In each cycle, every agent reads the variable that connects it to the agent to its left. Each such variable will be instantiated to a message of the above type. The agent computes its new local value by performing the given operation ($\oplus$) on the value it receives from the left, and its local value. It also equates the variable in the left field in the message with the variable in the left2 field {\em in the message it sent out in the previous cycle}. It recurs with its new left connection being the variable in the left2 field in the message it just read. If the agent reads a {\tt []} message then it sends {\tt []} on the other two connections that it has locally, declares that the value it was supposed to compute is its current value, and terminates. \begin{ccprogram} \agent pp(In, Out)-o Right\Hat spawn(In, [0,[],[]], Right, Out).. \agent spawn(In, Left, Right, Out) -o In = [] -> (Right = Left, Out = []), In = [N | Rest], number(N) -> \((Right_1,Right_2,Out_1)\Hat pp(N, Left, Right_1, Right_2, Final), Out = [Final | Out_1], spawn(Rest, [N, Right_1, Right_2], Right, Out_1)\).. \agent pp(N, Left, Right, Right_2, Final) -o Left = [] -> (Final = N, Right = [], Right_2 = []), Left = [M, Left_1, Left_2] -> \(pp(M + N, Left_2, Right_1, Right_{21}, Final), Right = [M + N, Right_1, Right_{21}], Right_2 = Left_1\).. \end{ccprogram} Note that an agent does not {\em use} the values of the variables ${\tt Left}_1$ and ${\tt Right}_2$: it merely equates them so that {\em some other agents} can directly communicate in the next round. This simple connection pattern achieves recursive doubling: in the $i$th cycle an agent receives messages from an agent $2^i$ steps away in the initial chain. Transparent forwarding is used to obtain the desired effect. \subsubsection{Translating the $\lambda$-calculus} Perhaps the most remarkable illustration of the power of constraint-based communication is the translation of the untyped $\lambda$-calculus into determinate CCP.
There are a host of related translations; we treat here a translation of the lazy version of the lambda-calculus \cite{abramsky-ong}. Intuitively, an application is treated as a parallel composition, with the operator and the argument treated as separate processes communicating on a channel hidden from the outside world. The agent corresponding to the operator is sent a message carrying the identity of the argument and the result channel. The argument is represented by a process that waits for an evaluation message (one that will contain a result channel), and instructs its body to send its result on that channel. A $\lambda$-abstraction is seen as a server waiting for a message to come down its specific communication channel (such messages are generated by application). When this message arrives, the formal argument of the abstraction is bound to the (channel representing the) operand, and the body of the abstraction is asked to send its output to the result channel. The term $x$ with result $Z$ is represented by a message on the channel $X$ asking for an evaluation, with the result to be sent to $Z$. Let $M$ be an arbitrary lambda-term, and $Z$ an arbitrary variable. We define an agent $\tuple{M}_Z$ by structural induction on $M$: $$ \begin{array}{ll} \tuple{X}_Z & eval(X,Z)\\ \tuple{\lambda X\ M}_Z & \rep (\forall X \forall Y.\ apply(Z,X,Y) \linarrow \tuple{M}_Y) \\ \tuple{MN}_Z & \exists X \exists Y. (\tuple{M}_X, apply(X,Y,Z), \rep (\forall S.\ eval(Y, S) \linarrow \tuple{N}_S)) \end{array} $$ For example, the agent $\tuple{\lambda x.x}_Y$ is just: $$ \rep \forall X \forall B.\ (apply(Y,X,B) \linarrow eval(X,B))$$ That is, $\lambda x.x$ is viewed as a server that waits for an argument ($X$) and a return channel ($B$), and then sends an evaluation request to $X$, with the result to be delivered on $B$.
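As a worked example (our own elaboration; the reduction steps follow the operational rules given earlier), consider the term $(\lambda x.x)(\lambda y.y)$ at result channel $Z$:

```latex
% Translation of (\lambda x.x)(\lambda y.y) at result channel Z:
\exists X \exists Y.\; \big(
   \rep\forall x\forall B.\ (apply(X,x,B) \linarrow eval(x,B)),\;
   apply(X,Y,Z),\;
   \rep\forall S.\ (eval(Y,S) \linarrow \tuple{\lambda y.y}_S) \big)
% The message apply(X,Y,Z) is consumed by the first (replicated) method,
% with x := Y and B := Z, leaving the message eval(Y,Z):
\derives \quad \ldots,\; eval(Y,Z),\; \ldots
% which the second method consumes, with S := Z, unfolding the argument:
\derives \quad \ldots,\; \tuple{\lambda y.y}_Z
```

The terminated configuration thus consists of replicated methods serving $\lambda y.y$ at $Z$, which is the expected result of lazy evaluation.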
\subsection{Other topics}\label{other-work} \subsubsection{Determinate Disjunction.} \label{glb-defn} The simplest example of a new operator introduced by the understanding of determinate \cc{} programs as closure operators is the {\em glb} operation. The set of closure operators forms a lattice, with the least upper bound given by parallel composition. What does the greatest lower bound of $f$ and $g$ give us? On any input $c$, it gives us the greatest lower bound (in the space of constraints) of $f(c)$ and $g(c)$ --- one can think of it as a {\em determinate or}: it extracts the most {\em common} information between the two branches. \cite{cc-FD} shows how this can be used to obtain considerable improvement in some constraint propagation algorithms. Other examples of operations on constraints (e.g., indexical constraints, cardinality operators) that can readily be understood in terms of closure operators may be found in \cite{cc-FD}. \subsubsection{Nondeterminate \cc{} languages}
%% What about axiomatization?
It is natural to ask what happens when disjunction is added to the picture. One way of doing this operationally is to add a notion of non-determinate prefixing ($\Rightarrow$), and to take as configurations multisets of pairs of agents and constraints. However, it is not clear how to connect non-deterministic prefixing to logical notions. Instead, \cite{angelic-nondet} explores another route. Let $(A_1 \vee A_2)$ be an agent if $A_1$ and $A_2$ are. Operationally, as with nondeterminate prefixing, an agent $(A_1 \vee A_2)$ in store $c$ splits into {\em two} configurations, $\tuple{A_1,c}$ and $\tuple{A_2,c}$. (Agents running conjunctively with $(A_1 \vee A_2)$ are copied into both.) When the store of one branch becomes inconsistent, the branch vanishes. The crucial question now is: what does one regard as {\em observable} for such a system of disjunctive agents?
One answer is that the observables of such an agent should be the multiset of constraints obtained on each branch. However, this conflicts with the operational requirement that ``false'' branches are unobservable: an agent which produces false on one branch and {\tt X=1} on another should not be distinguished from the agent that just produces {\tt X=1}. This leads to the notion that two agents are considered operationally indistinguishable if any constraint $c$ entailed by {\em each} branch of one is also entailed by each branch of the other, and vice versa. Indeed, this corresponds to the reading of $\vee$ as disjunction: $A \vee B$ entails $c$ just in case $A$ entails $c$ and $B$ entails $c$. \cite{angelic-nondet} develops the mathematical foundation of such languages. As in the determinate case, it turns out to be adequate to record for a process its {\em resting points}: those stores in which the process is unable to add any more information (on at least {\em one} of its branches). Again, the underlying mathematical reason is that processes can be taken to denote certain kinds of closure operators, namely {\em non-deterministic}, {\em linear} closure operators.\footnote{{\em Non-deterministic} because the operators are functions over a powerdomain (the Smyth powerdomain), since a process is capable of producing more than one result from a single store. {\em Linear} because the operators do not share any information between stores: the set of (minimal) results that a process can produce from the set of stores $s \cup t$ is the union of the results it can produce on $s$ and $t$.} Such operators can also be represented by the set of their fixed-points; but now one is not guaranteed that there is a {\em least} fixed-point above every input. Rather, the result of the computation is taken to be the set of minimal elements above the input. This line of development is still unfinished, however.
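The fixed-point representation just described is easy to animate. The following sketch (our own, over an illustrative finite lattice of sets of primitive constraints ordered by inclusion) computes the observable results of a process as the set of minimal resting points above an input store.

```python
def results(fixpoints, x):
    """Minimal resting points of the process lying above input store x."""
    above = [p for p in fixpoints if x <= p]         # <= is subset order
    return {p for p in above if not any(q < p for q in above)}

# Resting points of the disjunctive agent (X=1 \/ X=2): stores in which
# at least one branch can add no more information. (Illustrative encoding:
# a store is a frozenset of primitive constraint tokens.)
fps = {frozenset({'x=1'}), frozenset({'x=2'}), frozenset({'x=1', 'x=2'})}
```

On the empty store this agent has the two minimal results $\{{\tt x=1}\}$ and $\{{\tt x=2}\}$, so there is no least fixed-point above the input; above the store $\{{\tt x=1}\}$ the result is unique.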
Clearly, other treatments of disjunction are possible, and should be investigated. \newpage\section{Indeterminate concurrent constraint programming} \subsection{Syntax and Operational semantics} There are several distinct ways of expanding the syntax of agents to introduce indeterminacy in linear concurrent constraint programming. Since our primary task in this paper is to review the state of concurrent logic programming, we shall adopt an extension that most faithfully reflects the practice of concurrent logic programming. Essentially, we introduce the notion of guarded choice: \begin{equation} \begin{array}{lllll} (Agents) & A & ::= & \with_{i \in I} c_i \linarrow A_i & \mbox{-- Guarded choice} \end{array} \end{equation} In concrete syntax, we will prefer to use the guarded choice box $\OrSep$ rather than $\with$. This gives rise to the reduction rule: \begin{equation} \begin{array}{ll} (\Gamma, C, \with_{i \in I} c_i \linarrow A_i) \derives (\Gamma, C, A_j) & \mbox{(if $C \vdash c_j$, $j \in I$)} \end{array} \end{equation} The notions of reduction sequence and termination remain the same as before. \subsection{Logical semantics} We can now state the basic results connecting the logical interpretation to the operational execution of indeterminate programs. They are identical to the results in Section~\ref{det-log}. \begin{lemma} Let $A$ be a linear formula, with free variables contained in the free variables of $\Gamma$. If $\Gamma \derives \Gamma'$, then $\Gamma \vdash A$ whenever $\Gamma' \vdash A$. \end{lemma} \begin{theorem}[Soundness of logical interpretation] Let $d$ be a constraint with its free variables contained in those of $\Gamma$. If $\Gamma \derives^{\star} (\Gamma',C)$, and $C \vdash d$, then $\Gamma \vdash d \otimes \top$.
\end{theorem} \begin{theorem}[Completeness of logical interpretation] If $\Gamma \vdash d \otimes \top$ (with the free variables of $d$ contained in those of $\Gamma$), then there is a configuration $\Gamma'$ such that $\Gamma \derives^{\star} \Gamma'$ and $\Gamma' \vdash d \otimes \top$. \end{theorem} \subsection{Denotational Semantics}
%% In the introduction discuss trace semantics approach.
%% Background work.
% For indeterminacy, several issues, not all of which have yet
% been investigated. How will the recursion and indeterminism be
% reconciled.
%
% Construct models.
The denotation of an indeterminate process is only somewhat more complex than in the determinate case: rather than storing just the resting points of a process, we must also associate with each resting point the ``path'' followed by the process in reaching that point. Such a path is also called a {\em failure}. One of the primary contributions of \cite{popl91} was to develop a simple treatment of failures via the notion of trace operators (see below). However, in a fashion well known from work on the semantics of other indeterminate formalisms, there are still a variety of choices for the appropriate mathematical domain of processes, depending on what one wants to consider observable. The simplest choice, perhaps, is to observe for a process just the set of all stores that it can lead to, that is, stores it ``deadlocks'' in (that is, is unable to progress in, unless the environment intervenes to upgrade the store). Such a semantics would be useful, for example, in proving safety (or ``partial correctness'') properties of programs: one would be able to show that regardless of the store the process ends up in, some given property holds. One can obtain a fully abstract semantics for this notion of observation by associating with a process just the set of all failures (resting points, together with the paths taken to reach those points) generable by the process.
Of importance is the fact that the divergent or undefined process (needed for the base case of the semantic recursion defining the semantics of recursive processes) is just the empty set. This is because there is no store that one can observe such a process quiescing in. Indeed, a consequence of this notion of observation is that a process $\tt c \then A$ is identified with the process $\tt (c \then \div) \OrSep (c \then A)$, which can non-deterministically choose to diverge. (Here \div{} is just some process that can engage in an infinite execution sequence without producing any constraint or demanding any constraint from its environment.) A minor modification of this treatment is to observe for a process the set of stores it quiesces in, together with a bit indicating whether this store was {\em terminal}, that is, whether the process had reached a stage where it would be unable to add any more information even if the environment were to upgrade the information content of the store. For example, {\tt X=1} is a terminal resting point of the agent {\tt X=1}, but not of the agent $\tt (X=1, Y=2) \then Z=3$. With such a notion of observation we can distinguish between ``deadlocked'' programs (programs that have quiesced but can continue if the environment were to offer more information) and ``terminated'' programs. It is straightforward to develop a compositional semantics which is fully abstract for this notion of observation. The biggest drawback of the above treatment, however, is that it is not useful for reasoning about {\em liveness} properties of a program, that is, for establishing that a process {\em must} be able to produce some constraints, regardless of the indeterminate choices that the underlying scheduler may make. For example, we know that the process \begin{ccprogram} \agent \tt (X=1 \then (Y=1,Z=2)) ++ (R=1 \then (Y=1,S=2)).. \end{ccprogram} {\em must} produce the constraint {\tt Y=1} on any store stronger than either {\tt X=1} or {\tt R=1}.
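This liveness claim can be checked mechanically. The sketch below (our own encoding, with guards and bodies represented as sets of primitive constraint tokens) fires whichever enabled branch an arbitrary scheduler picks; {\tt Y=1} appears in the final store either way.

```python
# The guarded-choice agent (X=1 => (Y=1,Z=2)) ++ (R=1 => (Y=1,S=2)),
# encoded as (guard, body) pairs over sets of primitive constraints.
CHOICE = [({'X=1'}, {'Y=1', 'Z=2'}),
          ({'R=1'}, {'Y=1', 'S=2'})]

def run_choice(store, pick):
    """Fire the branch chosen by the scheduler `pick` among enabled ones."""
    enabled = [body for guard, body in CHOICE if guard <= store]
    return store | pick(enabled) if enabled else store
```

Whatever scheduling function is supplied, every store entailing {\tt X=1} or {\tt R=1} ends up containing {\tt Y=1}, while a store entailing neither guard is left unchanged.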
The main difficulty is treating non-termination or {\em divergence}. This is handled in \cite{popl91} by regarding a divergent process as ``catastrophic'', in the style of \cite{spec-csp,hoare-models}. Such a process is modelled as capable of exhibiting any failure whatsoever. Thus, the process $\tt (c \then \div) \OrSep (c \then A)$ is identified with $\tt c \then \div$: a process that in store $\tt c$ can behave completely unpredictably (and hence of which no total correctness assertions can be made). The set of all processes is characterized by presenting some naturally motivated closure conditions on sets of failures, following \cite{joseph-theory}. (For example, one of the major conditions ensures that processes are finitely nondeterminate.) The resulting notion of a process is intuitive and easy to understand. Combinators corresponding to ask, tell, nondeterminate (dependent) choice, parallel composition and hiding are defined. A simple operational semantics is given, and a full correspondence with the denotational semantics is established. A very simple relationship was also established between the indeterminate and determinate semantics, showing that the two semantics are essentially identical for determinate programs over finitary constraint systems. In summary, \cite{popl91} presents a simple model for the \cc\ languages that is fully able to account for nondeterminism and divergence, thus permitting the use of this model in reasoning about liveness properties of programs. The model is also shown to be fully consistent with the intuitive operational semantics of \cc\ programs. \paragraph{Trace operators.} A central issue in our semantic treatment is the representation of a failure. A failure could be represented as a sequence of ask/tell annotated constraints. Such a strategy is followed in earlier work on related languages, for example in \cite{partial-correctness,levi-invited,levi-tr}, \cite{CP-d-sem}, \cite{gaifman-react} etc.
However, ask-tell sequences are far too concrete in this setting. They record too much information, which must then be abstracted away (usually via complex closure conditions), since they encode the precise path followed by a process to arrive at a resting point. Furthermore, the definition of various combinators becomes rather cumbersome because of the need to ``propagate'' information down the trace. Instead, we chose to represent observations via a certain kind of closure operator. To capture various operational notions of interest, we introduce the concepts of {\em trace operators} and {\em bounded trace operators}. Intuitively, trace operators correspond to the closure operators generated by ``straight-line'' (non-branching) ask-and-tell \cc{} programs. Mathematically, their characterizing property is that they are {\em invertible}: to every trace operator $f$ there corresponds another trace operator $f^{-1}$ such that the only resting point of $f \wedge f^{-1}$ is {\tt false} (the top of the lattice). Intuitively, $f^{-1}$ is obtained from $f$ by taking the characteristic ask-and-tell sequence of $f$ and ``inverting'' it, making every ask a tell, and every tell an ask. Thus, on any input, each of the two processes is able to tell exactly what the other asks, and their conjunction can proceed hand-over-hand, so to speak, all the way to {\tt false}. Hence {\tt false} is the only resting point for their conjunction. For instance, the inverse of the trace operator $$\tt (W=1) \then (X=2 \wedge (Y=3 \then Z=4))$$ is $$\tt (W=1) \wedge (X=2 \then (Y=3 \wedge (Z=4 \then \false)))$$ A {\em bounded} trace operator is a trace operator that can stop short of \false; that is, it is a trace operator on a sublattice obtained by considering all the elements below a certain element in the lattice.
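The inversion construction can be made concrete with a toy executable model in which a trace operator is represented by its characteristic ask-and-tell sequence over atomic string tokens. This is an illustrative simplification (real constraints form a lattice with entailment, and the inverse properly ends by telling \false):

```python
# Toy model of trace operators as straight-line ask/tell sequences.
# Constraints are atomic tokens; the store is a set of tokens.

def invert(trace):
    """Swap every ask for a tell and vice versa."""
    return [('tell' if op == 'ask' else 'ask', c) for op, c in trace]

def run_conjunction(f, g):
    """Interleave two straight-line processes to quiescence.
    Returns the final store and whether both ran to completion."""
    store, procs = set(), [list(f), list(g)]
    progress = True
    while progress:
        progress = False
        for p in procs:
            while p:
                op, c = p[0]
                if op == 'tell':
                    store.add(c); p.pop(0); progress = True
                elif c in store:            # ask suspends until c is entailed
                    p.pop(0); progress = True
                else:
                    break
    return store, all(not p for p in procs)

# f corresponds to (W=1) -> (X=2 /\ (Y=3 -> Z=4))
f = [('ask', 'W=1'), ('tell', 'X=2'), ('ask', 'Y=3'), ('tell', 'Z=4')]
finv = invert(f)   # (W=1) /\ (X=2 -> (Y=3 /\ (Z=4 -> ...)))
store, done = run_conjunction(f, finv)
```

Run in isolation, `f` suspends immediately on its first ask; conjoined with its inverse, each side tells exactly what the other asks and both sequences run to completion, the hand-over-hand behaviour described above.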
\paragraph{``Assume/Tell'' sequences.} A related line of attack has been independently pursued by Frank De Boer, Catuscia Palamidessi and their colleagues \cite{deboer-ccp,deBoer-asynch}. In \cite{deBoer-1,deBoer-2}, the authors introduced the idea that it was possible to give a semantics to the asynchronous concurrent logic programming languages in a style familiar from the semantics of indeterminate data-flow languages. With a process was associated the set of its {\em assume-tell} sequences: a sequence $\tt ask(c_1).tell(c_2).ask(c_3)\ldots tell(c_n)$ is associated with a process if, when initiated in store $\tt c_1$, it can upgrade it to store $\tt c_2$, then when initiated in $\tt c_3$ it can upgrade it to $\tt c_4$, and so on, until when initiated in $\tt c_{n-1}$ it can upgrade it to $\tt c_n$ and then quiesce. The denotation of a process is then taken to be a set of such sequences closed under some conditions which reflect the asynchronous nature of communication, and the lack of granularity of the Eventual Tell operation.\footnote{In a language without Atomic Tell operations, it is not possible to distinguish between the agent $\tell(c_1), \tell(c_2)$ and the agent $\tell(c)$, where $c$ is logically equivalent to the conjunction of $c_1$ and $c_2$. Alternatively, each communication $\tell(c)$ can be thought of as happening in arbitrarily many, finite ``pieces''.} The resulting models are very similar in nature to the trace-operator based models discussed above --- indeed, trace operators are just an abstract counterpart of such assume/tell sequences. (A significant technical difference is that the authors do not handle the interaction between divergence and non-determinism, so that their semantics is not adequate for dealing with liveness properties of processes.)
The basic point made by the authors --- that for asynchronous languages the ``branching structure'' of the refusal set semantics of \cite{csp-theory} is not needed --- is indeed valid, and is further corroborated by the work of \cite{joseph-theory}. Indeed, De Boer, Palamidessi and their colleagues have gone on to demonstrate the generality of their approach by applying it to other asynchronous languages in which communication is not ``extensive'' \cite{deBoer-asynch}. \cite{deBoer-ccp-pa} discusses an equational axiomatization of the congruence on \cc{} processes induced by the above semantics. This is very valuable, for it serves as the foundation on which to build a ``work-bench'' that can be used to construct and reason about programs. A similar axiomatization for the determinate language is provided in \cite{popl91}.
%% What is their work on a process-algebra for cc programming?
\paragraph{Bisimulation semantics.} \cite{popl90} develops a different approach to the semantics of CCP. Following the operationally motivated work of Milner (and Park) \cite{ccs} (see \cite{Milner-handbook} for an up-to-date account), we introduced the notion of {\em reactive equivalence} of \cc{} programs. Let the ``environment'' in which an agent is executed be an abstraction of the rest of the computational system---including concurrently executing agents---with which the agent must interact to complete its computation. In order to interact, an agent must have certain variables---the {\em visible variables}---in common with its environment. A basic action is said to be a {\em visible action} if the agent's capability to engage in that action may be influenced by the environment. Thus a visible tell action is an action that imposes ``new'' constraints (i.e., constraints which are not yet known to have been imposed) on the visible variables, and a visible ask action is one which checks whether some new constraints have been imposed on the visible variables.
Note that a visible action is not something that is ``directly seen'' by another agent---rather, it is a step in the agent's own computation which could have been influenced by the action of some other agent. Such circumspection is necessary in understanding this concept because communication in \cc\ languages is very indirect, and hence the concepts developed in the \ccs/\csp\ framework---in which an event is observable if the environment can synchronously engage in it---are not directly relevant. Unrolling the possible visible actions a process engages in yields the {\em c-tree} (analogous to the ``synchronization tree'' for CCS) for the process. Two processes are considered {\em reactively equivalent} if, roughly, any visible action that one can do can be matched by the other in such a way that both processes end up in a state in which they remain reactively equivalent. Note that the notion of visible action takes into account that telling (or asking) a constraint $c$ is a silent action if the store is already strong enough to entail $c$: thus the agent $(X=1) \then (X=1 \wedge Y=2)$ is reactively equivalent to the agent $(X=1) \then Y=2$. This highlights that the notion of communication in the \cc{} languages is very different from that in \ccs{} or \csp{}: in those systems each action is taken as atomic and does not interfere with any other actions the system may have taken in the past. I once believed (in line with traditional dogma in concurrency theory) that reactive equivalence was the ``finest'' congruence that one could define on indeterminate \cc{} agents. However, the emergence of linear logic has made it possible to treat indeterminacy on a logical footing, and it now seems plausible that there is an even finer congruence than bisimulation, namely linear logic equivalence.
\paragraph{Graph Grammars and True concurrency semantics}
\cite{fran-true} develops a ``true concurrency'' semantics for CCP based on the notion of {\em graph grammars}.
The state of the computation is described by a graph, with variables as nodes, and constraint tokens and agents as hyperarcs. {\em Both} the inference rules in the underlying constraint system, and (the user-specified) agents are expressed as {\em graph productions} which match subgraphs to other subgraphs, possibly introducing new nodes in the process, and deleting portions of the matched subgraph. A computation is then just a finite or infinite sequence of such graph rewriting steps. From such a computation, it is shown how to construct an {\em occurrence net}, a particularly simple kind of Petri-net which can express ``causality'' information. The move from occurrence-nets to partial orders is standard, though it depends on what is desired to be observed of the computation. An attractive feature of this approach is that the ``productions'' implementing the underlying constraint system are treated on an equal footing with user-defined agents. Indeed, the constraint system itself can be seen as a closure operator on sets of tokens, ordered by the is-a-subset-of relation. What remains unclear is the connection between the closure operator semantics (which abstracts from execution sequences), and the partial ordering semantics generated from an occurrence net. More work is needed to resolve this issue. Also unclear is the precise advantage gained by the more general true concurrency approach over the interleaving approach. Perhaps this will become clearer when one considers ``refinable'' basic actions (e.g., ask operations that themselves involve some computation, i.e., the so-called ``deep guards'', see \cite[Chapter 8]{vj-thesis-book}). Based in part on this work, the authors and their colleagues have gone on to describe in \cite{charm} an abstract machine, CHARM (for Concurrency and Hiding in an Abstract Rewriting Machine). 
Its distinguishing feature is the notion of partitioning the state of a process into a {\em global} (externally visible) and a local (hidden) part. Transitions preserve the global parts they match on; intuitively, the global parts matched represent the structure of the underlying graph that is ``asked for'' (tested), but not consumed. CHARM is intended as a ``unifying framework'' for concurrency, since it is able to formalize, in the same conceptual framework, the classical approach to graph grammars as well as the CCP paradigm; and its authors expect it will also be useful in formalizing the basic concepts in other concurrent paradigms such as CCS and CSP. As we shall see below, the connection with graph grammars is particularly appropriate when one considers a visual syntax for (at least) the \cc/\herbrand{} programming languages.
\paragraph{Future work.}
%% What about other congruences derived from process algebras?
Among various aspects of concurrent constraint programming, semantic foundations seem to be the most well-developed. Perhaps the most glaring gap is that a comprehensive, systematic theory of testing, with more powerful tests tied to finer congruences, has not yet been developed. Also, I believe that with the advent of linear logic, it may be possible to tie the notion of testing to the notion of recording as the denotation of a process a well-chosen {\em subset} of the logical formulas entailed by the process (viewed as a logical formula). If this works out, it will be a very beautiful way of exposing the logical structure of the programming language. Also, I would hope that the work can serve as the foundation for the development of practical tools to help in partial evaluation, program transformation, and static analysis of programs.
\subsubsection{Related work}
%% Discuss the set of traces work in my FSTTCS paper.
%% Discuss the Gerth et al paper.
The basic concepts of concurrent constraint programming were introduced in \cite{clean-sem,my-thesis-book}. Subsequently, we developed an operational semantics and a bisimulation semantics in \cite{popl90}. This line of research extends and subsumes (for all intents and purposes) earlier work on the semantics of concurrent logic programming languages.\footnote{The semantics presented in this paper does not account for the language Concurrent Prolog studied in \cite{CP-d-sem}; we do not, however, view this as a defect in our approach. Indeed, researchers working with Concurrent Prolog have moved to the Ask-and-Tell \cc\ framework \cite{shap-fgcs,gaifman-react}.} The power of the ``logical variable'' has been amply recognized in recent years, and numerous authors---too numerous for us to survey their work in this extended abstract---have investigated the combination of functional and logic programming languages \cite{fg+lv,lp-book-1,i-structures,Jagadeesan89}. However, no account of the basic paradigm comparable with Kahn's original account in simplicity and generality has been forthcoming. Perhaps the most noteworthy in this context is the work of \cite{Jagadeesan89}, which discusses a subset of the functional programming language {\em Id Nouveau} with ``logical'' arrays \cite{i-structures}. Their semantic treatment also identifies a program as a closure operator. However, their treatment is specialized to the particular kinds of arrays they were dealing with; our work is parametrized by a very general notion of constraint system. Further, they do not explicitly discuss the hiding and the Ask operations, though their ``Tell'' operation performs implicit asks. We also present an axiomatization for the model. Their operational model includes an explicit discussion of how constraints are resolved and represented; we achieve considerable simplicity by abstracting away the details of the constraint resolution process.
The characterization of concurrency via intersection of sets of fixed points has also been elaborated by Hoare in a recent paper \cite{hoare-conj}, in a setting related to Unity~\cite{unity}. Our paper develops that viewpoint, and presents a characterization of other combinators as well, in the same style. We believe that the concurrent constraint programming languages are the natural setting for the development of a theory of computing with closure operators. The semantics of nondeterminate concurrent constraint programming languages is becoming an active area \cite{popl90,gaifman-react,levi-tr,deboer}. However, none of this work explores the very simple characterizations possible for the determinate languages, dealing instead with the representation of processes as various sorts of trees or sets of i/o annotated sequences of constraints. The notion of determinacy was studied in the set-up of logic programming languages in \cite{alps}, but no characterizations of the kind we provide here were given. \cite{gaifman-react} does not consider most of the combinators we treat here, besides having a complicated treatment of ask/tell sequences. Neither does it treat recursion. In work to appear, De Boer and Palamidessi \cite{deboer,deboer-tr} propose a model which is similar to ours in many respects. In particular, they have also recognized that it is sufficient to take sequences of ask/tell actions in order to get a model for the \cc\ languages. However, their treatment ignores recursion. Much of the sophistication of the present model lies in the way finite nondeterminacy axioms need to be introduced in order to correctly model divergence. Their model in \cite{deboer-tr} is not compositional with respect to the choice operator. The treatment described above follows closely Mark Josephs' treatment of receptive processes in \cite{joseph-theory}.
A receptive process can always accept any input from the environment, but may produce different (or no) output depending on the input. Josephs gives a set of axioms for receptive processes that turn out to be more or less what is needed in order to develop a model for the \cc\ languages as well. The primary differences lie in the nature of communication, and in the combinators treated. Josephs' theory treats communication in the usual CCS/CSP style as the exchange of uninterpreted tokens with the environment. The constraint-based nature of the \cc\ languages imposes additional structure which must be considered. However, as this paper demonstrates, it is possible to adapt his basic model to the \cc\ setup without major reworking, which should confirm the essential robustness of his conceptualization of asynchronous systems.
\subsection{Logical Semantics}
\subsection{Programming idioms}
\paragraph{Many-to-one communication using bags}
The communication mechanism discussed in \janus\ hitherto has focussed on point-to-point directional communication channels. An agent may have a fixed number of such channels. Such communication schemes are too simplistic, however, to deal elegantly with many-to-one communication. Consider, for example, a situation in which an a priori unbounded number of clients may seek service from a server. The server will have read access to a bag. It will wait for requests by asking if its bag is {\tt element(Request,Bag')}. It then responds to the {\tt Request} and recurs with {\tt Bag'}. A client has a teller to an element of the bag, say {\tt X}. It sends a request by telling {\tt X=bag(Request,X')} and using {\tt X'} for subsequent requests. It can split its access {\tt X} to the server by telling {\tt X=bag(X1,X2)}, thereby creating the two accesses {\tt X1} and {\tt X2}. A client drops its bag reference by posting the constraint {\tt X=\{\}}.
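As an operational intuition only (not \janus{} itself), the bag scheme can be mimicked in a conventional language: splitting an access yields two handles on the same underlying multiset, and the server repeatedly asks for an element. All names and the state-update rule below are invented for the illustration:

```python
# Toy sketch of many-to-one communication through a shared bag.
# Clients hold "teller" ends; the server repeatedly asks for an
# element of the bag and serves it.
from collections import deque

class Bag:
    def __init__(self):
        self._items = deque()
    def tell(self, request):          # client: X = bag(Request, X')
        self._items.append(request)
    def split(self):                  # client: X = bag(X1, X2)
        return self, self             # both halves feed the same bag
    def ask_element(self):            # server: element(Request, Bag')
        return self._items.popleft() if self._items else None

def server(bag, state):
    """Drain the bag, threading State -> State1 through each request."""
    log = []
    while (req := bag.ask_element()) is not None:
        log.append((req, state))
        state = state + 1             # stand-in for account-style updates
    return state, log

bag = Bag()
b1, b2 = bag.split()
b1.tell('deposit(10)')
b2.tell('balance(V)')
final_state, log = server(bag, 0)
```

Note how `split` hands out two tellers for the same bag, so requests from independently held accesses interleave into one stream at the server, the essence of the many-to-one idiom.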
Consider the simple bank account as an example of how to instantiate this communication scheme:
\begin{ccprogram}
\agent account\_receptionist(\{\}, State)..
\agent account\_receptionist(element(Req, Rest), State)-o account(Req, State, State1), account\_receptionist(Rest, State1)..
\agent account(balance(Value), Bal, NewBal), integer(Bal) -o (NewBal = Bal, Value = Bal)..
\agent account(withdraw(Amt, Ok), Bal, NewBal) -o Amt \leq Bal -> (Ok=ok, NewBal=Bal-Amt).. Amt > Bal -> (Ok=nope, NewBal=Bal)..
\agent account(deposit(Amt), Bal, NewBal)-o NewBal=Bal+Amt..
\end{ccprogram}
The guard {\tt integer(Bal)} was added to the first rule of {\tt account} to indicate that {\tt Bal} is forced and can therefore occur multiple times. We expect that a good compiler based upon abstract interpretation can compile away this test.
\newpage\section{Programming languages}
The basic scheme described above captures the semantics of Flat GHC. Flat Parlog is also essentially identical.
%% What was the idea in Ringwood's GDCC?
\subsection{Atomic publication}
\subsection{Read-only annotations}
New definition in Flat Concurrent Prolog. Based on an attractive but flawed idea.
\subsection{Deep guards}
\subsection{Non-determinism}
\subsection{Failure-free languages}
%% Constraint system.
%% Asker teller notion. closely related to the proof as program
%% interpretation of linear logic. augmented with the notion of listeners.
In the context of distributed computation, a fundamental problem with \cc{} languages (the {\em multiple teller} problem) is that a computation may have multiple producers of partial information for the same set of variables. Unless these producers are coordinated in some way with respect to each other, they may produce conflicting constraints on the same variable, thus causing the system to crash. For example, one agent may constrain a variable {\tt X} to be equated to {\tt 3} and another may constrain it to be {\tt 4}.
Languages such as \herbrand\ used the notion of atomic publication to circumvent the problem: an agent must augment the constraints in the store atomically, failing if the store is incompatible with the new information. However, this notion is far too heavy-weight to serve as a foundational communication mechanism in a distributed setting. Other \cc/\herbrand\ languages such as \parlog, \fghc\ and \strand\ deal with the multiple teller problem by ignoring it! They trust that the programmer will organize computations in such a way that the computations will never generate conflicting constraints at run-time. If this trust is violated, then the entire computation will abort with a logical error. These motivations led to the design of \janus{} \cite{janus}. The fundamental intuition is that, first, mechanisms should be devised such that computation is not forced to abort at run-time because the store is rendered inconsistent. Second, the communication mechanism should naturally reflect the notion of asynchronous message passing, to be compatible with basic distributed computation concerns. That is, it should be possible for agents at a site to progress by publishing information without having to coordinate with agents at other sites. (This desideratum is even more important if it is desired that computations in other languages, built using different abstractions, should be able to {\em inter-operate} smoothly with computations in these languages.) Finally, if the languages are to be widely used as procedural languages, there should at least be the potential for a performance-efficient implementation competitive with that of more conventional distributed assignment-based languages. Even more strongly, we believe that uni-processor implementations of such languages should be comparable in performance with the implementations of sequential, imperative languages. \janus{} attacks this problem on two fronts.
First, we identified a constraint system, suitably rich for procedural programming, which possesses certain ``independence'' properties and naturally supports many-to-one communication. Suppose {\tt X} and {\tt Y} are two variables unconstrained in the current store, and an agent decides to post the (consistent) constraint {\tt X=s}, for some term {\tt s} possibly containing {\tt Y}. Then, intuitively, for the constraint system to be implementable well in a distributed setting, it should not be possible to construe this imposition as a constraint on {\tt Y}. More generally, if $c_1$ is a constraint of the form $\wedge_{i=1}^n X_i=s_i$ and $c_2$ of the form $\wedge_{j=1}^m Y_j=t_j$, where the $X_i$ and $Y_j$ are distinct, then $c_1 \wedge c_2$ should be consistent iff $c_1$ and $c_2$ are consistent separately. The \herbrand\ constraint system does not satisfy this property: if the constraint {\tt X=f(Y)} is imposed, then this restricts {\tt Y}; for example, the constraint {\tt Y=X} cannot be imposed. Part of our task was to design a constraint system, $\cal J$, closely related to \herbrand, which satisfies such properties. Essentially, the constraint system is defined over infinite trees \cite{prologiia}; but matters are complicated by the necessity of introducing ``hyper-multisets'' in order to cater for many-to-one communication. In designing $\cal J$ our paramount concern was that it should be efficiently implementable in a distributed setting, expressive enough to allow straightforward and elegant formulation of standard algorithms over standard data-structures such as arrays and lists, and such that, with an understanding of some basic representation decisions made by an implementation, a programmer can fairly accurately predict the cost of execution of various programs. Second, we have exploited the idea of {\em syntactic control of interference}.
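The \herbrand{} failure of independence noted above is precisely the occurs check at work: a minimal first-order unifier (an illustrative sketch, not the $\cal J$ or \janus{} implementation, which works over infinite trees and so avoids the problem) demonstrates it:

```python
# Minimal first-order unification with occurs check, illustrating why
# Herbrand lacks the independence property: imposing X=f(Y) rules out Y=X.
# Variables are capitalized strings; f(Y) is represented as ('f', 'Y').

def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def walk(t, s):
    while is_var(t) and t in s:
        t = s[t]
    return t

def occurs(v, t, s):
    t = walk(t, s)
    if t == v:
        return True
    if isinstance(t, tuple):
        return any(occurs(v, a, s) for a in t[1:])
    return False

def unify(a, b, s):
    """Extend substitution s with a=b, or return None on failure."""
    a, b = walk(a, s), walk(b, s)
    if a == b:
        return s
    if is_var(a):
        return None if occurs(a, b, s) else {**s, a: b}
    if is_var(b):
        return unify(b, a, s)
    if isinstance(a, tuple) and isinstance(b, tuple) \
            and len(a) == len(b) and a[0] == b[0]:
        for x, y in zip(a[1:], b[1:]):
            s = unify(x, y, s)
            if s is None:
                return None
        return s
    return None

s = unify('X', ('f', 'Y'), {})   # impose X = f(Y): succeeds on its own
result = unify('Y', 'X', s)      # then Y = X fails: Y would occur in itself
```

Over finite (Herbrand) trees the second imposition fails, so the consistent constraint on {\tt X} has restricted {\tt Y}; over infinite trees the pair would be satisfiable by $X = f(f(f(\ldots)))$.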
Compile-time checkable restrictions are introduced which ensure that if the current store is consistent and a goal reduces via a program clause, then the resultant store will be guaranteed to be consistent, {\em without having to check for consistency}. This is achieved by exploiting a remarkably simple idea that I first stumbled across in 1987 when studying the notion of broadcast-free communication. Say a clause is binary if every variable occurs free at most twice in the clause. In such a clause, a variable may be considered as a ``point-to-point'' communication channel, because messages generated at one occurrence only need to travel to one other occurrence. Now, observe that the resolvent of two binary clauses is also binary! (``Two'' is a magic number here --- from three one may generate as many occurrences as one wants.) This means that if the source program clauses satisfy this condition, then any configuration generated at run-time will also satisfy this condition. All that remains then is to (a)~devise a scheme which ensures that exactly one of these occurrences (the {\em teller}) can be used to impose a constraint on the variable --- the other occurrence (the {\em asker}) is then used to ``receive'' information, (b)~ensure that in this scheme ``tell-rights'' can be communicated on channels, and (c)~determine if this scheme is powerful enough to express the characteristic idioms of concurrent logic programming. All three goals were achieved in \cite{janus}, adapting ideas from \cite{doc}. A major opportunity offered by \janus{} is that of compile-time garbage collection. Because any variable has at most two occurrences, once one occurrence of a variable receives a term, it ``knows'' that there are no other references outstanding to that term (the other occurrence of the variable has been ``used up'' in binding the variable to the term).
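The binary-clause condition is easy to check mechanically. In the sketch below (an invented representation: clauses as lists of literals, literals as tuples, variables as capitalized strings), both source clauses and a resolvent formed from them satisfy the condition:

```python
# Illustrative check of the binary-clause condition: every variable
# occurs at most twice in a clause.
from collections import Counter

def variables(term):
    """Yield every variable occurrence in a term."""
    if isinstance(term, str):
        if term[:1].isupper():
            yield term
    else:
        for arg in term[1:]:
            yield from variables(arg)

def is_binary(clause):
    counts = Counter(v for lit in clause for v in variables(lit))
    return all(n <= 2 for n in counts.values())

# p(X, Y) :- q(X), r(Y)      -- X and Y each occur exactly twice
c1 = [('p', 'X', 'Y'), ('q', 'X'), ('r', 'Y')]
# q(Z) :- s(Z)
c2 = [('q', 'Z'), ('s', 'Z')]
# Resolving c1 and c2 on q (with Z := X) gives  p(X, Y) :- s(X), r(Y).
# Resolution consumed one occurrence of X in each clause, so the
# surviving occurrences still number at most two: the resolvent is binary.
resolvent = [('p', 'X', 'Y'), ('s', 'X'), ('r', 'Y')]
```

A clause such as {\tt p(X) :- q(X), r(X)} fails the check, illustrating why ``two'' is the magic number: a third occurrence lets run-time configurations multiply occurrences without bound.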
Therefore, a compiler can generate code knowing that whenever a data-structure in an atom has been successfully matched, the data-structure is garbage and can be recycled. This is discussed in more detail in the implementation section below.
\newpage\section{Implementation}
\label{intro}
Efficient implementation of \cc\ languages, like that of more traditional languages, hinges on low memory usage, compile-time code optimization, and low-overhead runtime management of concurrency. By keeping the program's working set small, the locality of the machine's memory hierarchy can be best exploited, reducing expensive faults farther from the CPU. The storage model selected, e.g., stack or heap, is a critical design decision here. Compile-time code optimization also centers around the memory hierarchy: efficient utilization of the available machine registers and cache. This involves avoiding redundant computation (e.g., by strength reduction), which also saves CPU cycles. Finally, the runtime overheads of concurrent task management must be significantly lower than computation within tasks. This is often noted in terms of the communication-to-computation ratio, assuming that the primary action of task management is the transmission of messages between processors. Implementations to date of \cc\ languages have been targeted to the broad categories of uniprocessor, shared-memory multiprocessor, and distributed-memory multiprocessor hosts. Almost all of the implementations use a heap-based storage model wherein tasks are allocated in an {\em ad hoc} fashion from either global or local storage pools. Stack-based implementations are more compatible with uniprocessor hosts; however, task suspension must still be implemented, perhaps in a manner similar to implementations of the {\em freeze} primitive in certain Prologs.
In the long run, although this has not yet been seen, implementations for multiprocessor hosts must also move to stack-based models, perhaps adopting ideas from the partitioning of threads in dataflow languages \cite{TAM,traub92}. We discuss these issues further in Section \ref{} on future work. In this section we review the main themes and efforts in the implementation of \cc\ languages over the past decade. We begin with a brief historical overview.
\paragraph{Historical review.}
Some interesting comparative work done at the University of Edinburgh by R. Trehan \cite{trehan-ms} and H. Pinto \cite{pinto-ms} summarized the experiences of programming and interpreting these languages in the ``early days.'' These concurrent languages could be differentiated primarily by their synchronization mechanisms and how they managed multiple local environments. There are various other attributes, such as granularity control, goal scheduling, unification, etc., that affect implementation complexity and efficiency. We discuss these in depth in the following sections. Trehan and Pinto's studies focused on interpretation, whereas the further evolution of implementation efforts led to compilation and hardware support. The first abstract machine designs for this family of languages were the WAM-like FCP machine (Emu) by A. Houri \cite{houri-cp-book,silverman-cp-book}, the Sequential Parlog Machine (SPM) by S. Gregory \cite{gregory-book}, and the KL1 machine by Y. Kimura \cite{KL1}. These systems represent the first-generation compiled implementations of concurrent logic languages, evolving into more sophisticated systems. The sequential FCP machine was refined first by S. Taylor into a hypercube system \cite{taylor-book} and then by S. Kliger into a RISC-based abstract machine and optimizing compiler \cite{kliger-phd}. The SPM led to J. Crammond's Abstract Machine (JAM), the first parallel Parlog implementation \cite{jam-paper}, and the PPM \cite{cheese-phd}.
The KL1 machine was implemented on shared-memory machines as Panda \cite{sato-pisa} and evolved into the abstract machine shared among the ``parallel inference machines'' (PIMs). A hybridization of a few of these projects was I. Foster and S. Taylor's Flat Parlog machine \cite{flat-parlog} leading to the Strand Abstract Machine (SAM) \cite{strand}, ported to several types of multiprocessors. These systems represent the second-generation parallel implementations, the first comparative study of which was by Foster and Taylor \cite{flat-parlog}. The community is now beginning the construction of third-generation {\em optimized} compiler-based, portable systems, e.g., the {\tt jc} Janus system \cite{gudeman-jicslp92}, Monaco \cite{tick-iclp93}, and a portable KL1 system \cite{chikayama-fast}.\footnote{ Interesting comparisons of the execution performance of many of these first, second and third generation systems can be found in Taylor \cite{taylor-book} and Tick \cite{tick-iclp93,tick-crammond-icpp}.} Specialized hardware efforts were concentrated mainly at ICOT with the decade-long FGCS project and their aim of building PIMs. The personal inference machines (PSI-I,II,III) \cite{psi4,PSI-II} were followed by mockup PIMs (Multi-PSI-V1/2, built of PSI-IIs), and finally PIM/\{c,i,k,m,p\} \cite{pim/c,sato91,pim/k,pim/m,pim/p}. The main efforts, PIM/\{m,p\}, are large multiprocessors ($2^8$--$2^9$ processors) based on specialized hardware for direct execution of KL1. Other notable hardware implementation efforts include the Carmel microprocessors \cite{CARMEL-4} and a related microprocessor proposed by Alkalaj \cite{alkalaj-isca}. A full analysis of hardware issues in concurrent logic language implementations is beyond the scope of this article (see Tick \cite{tick-fgcs93} for instance), although we do correlate the instruction set designs of the software and hardware oriented implementations in Section \ref{arch}. 
Originating around 1988, a related family of implementations has been developing around the Andorra model of computation, specifically Andorra I \cite{costa-ppopp,costa-iclp}, the Andorra Kernel Language (AKL) \cite{akl-paradigms}, Pandora \cite{bahgat-phd,pandora}, and ANDOR-II \cite{takeuchi-phd}. We cannot do justice to these implementations in this paper, as they are rapidly evolving.
\subsection{Principles and Trends}
Efficient implementation of concurrent logic programs requires strong foundations in several areas. As in any parallel system, task\footnote{ We use the words {\em task}, {\em process}, and {\em goal} interchangeably.} switching and task creation are the primitive operations that must be made fast. Furthermore, as in any computational system, task invocation, variable binding, and memory reclamation must also be made fast. For concurrent logic programs, task switching means suspending one task and substituting (resuming) another; task creation means building a body goal task from its parent's arguments and perhaps spawning it on a remote processor. Task invocation is extended here to include the action of executing a goal to its commit point, i.e., performing the clause tries needed to commit, suspend or fail.\footnote{ A concurrent logic program task is like a {\em thread} in threaded architectures. The task invocation creates a main thread which may split into multiple threads during guard evaluation, all synchronizing at commit, leading to a single clause body thread. The body calls spawn new threads, and so on.} Variable binding incurs added overheads to guarantee atomicity (i.e., locking around the update to avoid races among competing writers). Not only is fast memory reclamation critical, but so too is efficient use of memory in the first place, since the single-assignment nature of the languages can be quite profligate in touching memory.
By far the most complex implementation aspect of these basic operations is task switching and task invocation, because the languages' synchronization semantics require implicit synchronization on potentially incoming procedure arguments. This places a burden on the compiler and generally bloats procedure invocations with respect to sequential languages and implementations. The various nuances of language semantics, e.g., deep or flat guards, atomic or nonatomic tell unification, impact implementation efficiency. Orthogonal to these primitive operations are the intelligent task management policies that are desirable: balanced load, balanced granularity, and fair scheduling. These concepts are not unique to concurrent logic programs, and are required independent of how fast the primitive operations can be made. Looking at underlying multiprocessor hosts, we have an additional requirement to achieve full efficiency: latencies must be hidden. Memory latency in distributed multiprocessors is the major problem to be dealt with. As Arvind showed \cite{arvind87}, hiding latency effectively trades off directly against switching tasks quickly. We shall see (Section \ref{read}) that current concurrent logic programming systems can hide latency, but only within limits, and certainly overly complex language features cannot be effectively hidden. The past ten years have seen a trend towards {\em devolution} of logic programming languages, driven by the practical need to build fast implementations. The most drastic step was the definition of committed-choice languages that did not backtrack, enabling the first pseudo-parallel interpreters to be built. The next devolutionary step was from deep to flat guards, and from synchronizing on dynamic read-only variables to synchronizing on statically-declared arguments, enabling the first efficient implementations to be built.
Next were restrictions placed on how variables could be bound: Strand \cite{strand} abolished output unification in favor of assignment, similar to moded FGHC \cite{ueda-93}, which constrains a logical variable to have a single producer. Stricter still, Doc \cite{doc}, \aum \cite{yoshida-fgcs88}, and Janus \cite{janus} constrain a logical variable to have at most a single producer and a single consumer. These simplifications facilitate compile-time analysis and optimization of memory usage. The progressions are further described in subsequent sections. The key point is that languages are refined by reaching an equilibrium between what application writers {\em demand} and what implementors {\em supply}. There is not yet full agreement as to where this equilibrium point is for concurrent logic programs, and we think it will be most strongly influenced by fast, portable, and parallel implementations. \subsection{Synchronization} \label{synch} Concurrent logic programs synchronize on logical variables, similar to how non-strict dataflow languages use I-structures \cite{arvind89,lindstrom85}. If a required input variable is unbound upon procedure invocation, the corresponding clause cannot commit. Furthermore, if no clause defining the procedure can commit and not all clauses fail, then some required input variable(s) have not been delivered, and the task must be suspended. Conversely, when any of these required input variables is bound, the task must be resumed. Input matching (passive unification) is transformed by compilation into instruction sequences that make matching efficient in general. The ability to synchronize on variables requires temporarily binding certain unbound logical variables to a suspended task to enable subsequent resumption.
The efficiency of this infrastructure is the main factor in synchronization performance.\footnote{ Early systems did not attempt to statically analyze logical variables, e.g., to determine if a variable can possibly be hooked, and if not, how to generate more efficient code for the ask tests. Recent compilers, e.g., \cite{kliger-phd,yanoo}, claim to do global static analysis to determine this and other information.}
%
% IMAI AND TICK MEASUREMENT WAS FOR ACTIVE CELLS DURING GC: THUS
% A VARIABLE CELL COULD BE RECOUNTED AS IT BOUNCED BETWEEN OLD AND NEW
% SPACES. THIS INDICATED HOWEVER, THAT THESE VARIABLES LIVED A LONG
% TIME? OR WAS IT BECAUSE OF TAILS OF MANY STREAMS?
%
The FCP, JAM and KL1 machine architectures all use similar methods of ``hooked'' variables, i.e., assigning indirect pointers from suspended variables to process structures \cite{houri-cp-book}. Indirection is required to allow both multiple variables to synchronize the same task, and multiple tasks to be synchronized by the same variable. Variables are infrequent data types: Imai and Tick \cite{imai-spdp} measured that 1--15\% of dynamic objects are variables across a KL1 benchmark suite. To our knowledge, no one has measured the prevalence of hooked variables, and the characteristics of those hooks. It is a widely-held belief that hooks are quite simple in structure and rare in frequency.\footnote{The former assertion is more strongly supported than the latter --- ``object-oriented'' programs can create many suspensions, as discussed in the remainder of this section. The two such programs measured by Imai and Tick produced far more variables than the other benchmarks.} Thus JAM Parlog \cite{jam-paper} and Strand \cite{strand} allow goals to be hooked to only one variable, thereby obviating the complex bookkeeping structures needed for the general case. JAM exploits shared memory to implement a ``hybrid'' suspension list to gain this efficiency.
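The hooked-variable scheme can be sketched as follows. This is a minimal Python model with invented names, not the actual data structures of the FCP, JAM, or KL1 machines: a suspension record is shared, via indirect pointers, by every variable a goal waits on, so that binding any one of those variables resumes the goal exactly once.

```python
# Sketch of "hooked" variables: hypothetical Suspension/Var classes.
# Indirection through a shared Suspension record lets several variables
# synchronize the same goal, and several goals hang off one variable.

class Suspension:
    def __init__(self, goal):
        self.goal = goal
        self.resumed = False      # ensures a multi-variable hook fires once

class Var:
    def __init__(self):
        self.bound, self.value = False, None
        self.hooks = []           # indirect pointers to Suspension records

    def hook(self, susp):
        assert not self.bound
        self.hooks.append(susp)

    def bind(self, value, run_queue):
        # in a real multiprocessor this update would be locked
        self.bound, self.value = True, value
        for susp in self.hooks:
            if not susp.resumed:  # skip records already woken via another var
                susp.resumed = True
                run_queue.append(susp.goal)
        self.hooks = []

# a goal suspended on either of two variables
x, y = Var(), Var()
s = Suspension("p(X, Y)")
x.hook(s)
y.hook(s)

queue = []
x.bind(1, queue)   # wakes the goal
y.bind(2, queue)   # record already resumed; nothing enqueued
```

The JAM/Strand restriction mentioned above amounts to allowing only one entry in a goal's set of hooks, which removes the need for the shared, once-only `Suspension` record entirely.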
Strand initiates all suspensions as if they are of the single-variable type, and if this most frequent case is violated, a queue of exceptionally suspended goals is used. An orthogonal issue is how to specify the input variables upon which to synchronize. The most common method is a ``procedure level'' representation wherein synchronizing variables are syntactically specified (explicitly as in Parlog, or implicitly as in GHC) on a per-clause basis. Alternatively, a ``data level'' representation marks the synchronizing variables in the data itself, e.g., the ``read-only'' variables of CP. The latter method has gone out of favor because, although it facilitates certain sophisticated systems programming techniques, it complicates dereferencing and unification, and frustrates static analysis.
%
% VIJAY: FOR PREVIOUS SENTENCE, YOU MAY WANT TO ADD A FOOTNOTE, OR
% REFERENCE TO PREVIOUS SECTION, THAT DISCUSSES THE SEMANTIC FLAWS
% INHERENT IN READ-ONLY VARIABLE UNIFICATION, E.G., THAT HEAD
% UNIFICATION ORDER CAN AFFECT PROGRAM BEHAVIOUR. I LEFT THIS OUT
% OF THIS SECTION SINCE IT APPEARS TO BE LANGUAGE ISSUE TO ME.
%
The elegant programming techniques it enables are rarely used in applications programming \cite{shapiro-survey}, yet the cost of implementation is felt throughout the design \cite{flat-parlog}. Foster and Taylor \cite{flat-parlog} measured (for small benchmarks executing on a sequential workstation) that the trailing needed to support atomic unification (discussed further in Section \ref{unify}) in FCP caused a 5\% degradation in performance compared to flat Parlog (without atomic unification). For programs with suspension ratios (\# suspensions/\# reductions) of 19--56\%, an additional degradation of 4--8\% was observed, hypothesized to stem from other overheads associated with read-only variables (since flat Parlog and FCP were calibrated except for that).
Another implementation issue is the actual control flow of checking the synchronizing variables (discussed at length in Section \ref{guard}). It was originally believed that parallel execution of the clause tries was beneficial because it implied faster invocation. However, if deep guards are permitted, then parallel clause tries require the ability to sustain multiple environments, and incur most of the problems associated with OR-parallel management of bindings under search for a single solution. Furthermore, with flat guards, the small amount of work within the clause tries may not justify the overhead of executing them in parallel. Crammond showed consistently negative speedups (3.8\% to $-$32\%) for small, flat-guarded benchmarks on JAM Parlog on a shared-memory multiprocessor \cite{crammond-phd}. In fact, compilation techniques such as decision graphs \cite{kliger-phd} remove redundant computations among the clause tries, furthering the argument that parallel tries do not pay for themselves. Sato and Goto \cite{sato-pisa} showed, for the shared-memory Panda system, that suspension induces an execution overhead of 1--5\% for small benchmarks, because of the necessity to redo the clause tries on resumption. Especially with decision-graph compilation techniques, it is not easy to avoid this recomputation since there is more sharing among the generated code. For Panda benchmarks with low suspension ratios of 1--8\%, a depth-first scheduling mechanism effectively suppressed suspensions (with respect to breadth-first scheduling), but the benchmarks were quite simple. In the one Panda benchmark with a high suspension ratio of 42\%, suspensions were not suppressed by depth-first scheduling. Taylor \cite{taylor-book} measured suspension ratios of 0--56\% on the hypercube for small programs, including an assembler. The higher ratios are due to static pragma-driven scheduling on the hypercube, compared to the dynamic scheduling on Panda.
Imai and Tick \cite{imai-spdp} measured 14 medium-sized benchmarks ranging from 0.3--67\% suspension ratios, with a geometric mean of 3.7\%. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % % % Imai + Tick suspension:reduction ratio = % % (14.5 * 3.40 * 2.06 * 66.86 * 0.328 * 43.75 * 0.351 * % % 4.79 * 11.56 * 1.64 * 8.33 * 5.26 * 1.57 * 0.494)^(1/14) % % % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% All these statistics taken together indicate, among other things, that a significant number of programs are {\em active}, i.e., process groups are spawned to implement active objects that compute and communicate until the termination of the algorithm. This object-oriented programming style causes frequent suspensions because the processes composing the active objects are normally suspended, awaking only upon receiving a message upon a stream. Yet all the parallel systems previously mentioned implement {\em process-oriented scheduling} wherein a goal reduction leads to the enqueuing of its body goals onto runtime work queues, with one of the goals selected for local execution (analogous to tail recursion optimization). Such a model executes active programs inefficiently. Ueda and Morita proposed an alternative called {\em message-oriented scheduling} \cite{ueda-fgcs92,ueda-93} for more efficient active program execution. The main idea is to transfer control to a stream consumer at the point when the producer sends the message. In the case where buffering can be avoided, this method of task switching to an active process has less overhead than the standard execution mechanism. Ueda and Morita implemented this method for shared-memory multiprocessors by scheduling from a global work pool. Since control is transferred immediately upon message sending, effectively independent chains of message sends are executed by the processors. 
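The contrast between the two scheduling disciplines can be illustrated with a small Python sketch. The function names and structure are ours, not Ueda and Morita's code: process-oriented scheduling buffers messages and enqueues the consumer for later execution by the scheduler, while message-oriented scheduling transfers control to the consumer at the moment the producer sends, avoiding the buffered task switch.

```python
# Toy contrast of process-oriented vs. message-oriented scheduling.
# (Illustrative sketch; names are assumptions, not from any real system.)

def process_oriented(messages, consume, work_queue):
    buffered = []
    for m in messages:
        buffered.append(m)                  # producer only buffers the stream
    # the consumer becomes a goal record on the runtime work queue
    work_queue.append(lambda: [consume(m) for m in buffered])

def message_oriented(messages, consume):
    for m in messages:
        consume(m)                          # control passes to consumer now

log = []
message_oriented([1, 2, 3], lambda m: log.append(("now", m)))

q = []
process_oriented([1, 2, 3], lambda m: log.append(("later", m)), q)
q.pop()()                                   # scheduler runs the consumer later
```

In the message-oriented case the consumer's work interleaves with the send itself, which is why, when buffering can be avoided, the technique behaves like a cheap direct task switch into the active consumer process.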
Their initial performance results are extraordinarily good: naive reverse executes on a single processor in 3.3 seconds (cf., {\tt jc} (Janus) runs at 3.9 seconds \cite{gudeman-jicslp92} and Monaco (FGHC) runs at 19.2 seconds \cite{tick-iclp93}). Furthermore, almost linear speedups are achieved, as well as performance comparable to optimized `C' (the authors, however, do not compare the technique with process-oriented scheduling for an equivalent technology base, to determine the utility of the concept). For comparison, Kliger \cite{kliger-phd} reports that a set of 27 small-to-medium size FCP(:,$\commit$) benchmarks achieved a 21\% geometric mean speedup due to a set of optimizations based on global analysis, including reduction of unification into assignment, in-lining arithmetic, manipulating unboxed objects, and excluding procedure clauses that can never commit for any invocation. \subsection{Guards and the Process Structure} \label{guard} Guards, similar in purpose to Dijkstra's {\em guarded commands}, were introduced to logic programming in the Relational Language \cite{ClaGre81}. They extend the expressivity of clause commitment beyond simple input matching. In their most general form, guards among clauses defining the same procedure represent disjunctive processes racing to commit. Implementation difficulties occur 1) if these processes are allowed to bind variables, and 2) even if binding is outlawed, if processes are permitted to make nested calls. The former problem is indicative of ``unsafe'' languages, and the latter problem is indicative of languages with ``deep'' guards. Considering the range of complex to simple implementations, the languages fall into three basic categories: unsafe and deep (e.g., CP), safe and deep (e.g., Parlog), and safe and flat (e.g., FGHC). Unsafe clauses may compete with one another in the sense that each may wish to make conflicting bindings to the same variables.
This is implemented by restricting bindings to a local environment, for eventual exportation upon commit. Exportation can, however, conflict with concurrent bindings made nonlocally. If this happens, the clause try fails. Detecting inconsistencies is a major implementation problem in these languages --- there is a choice between detection {\em before} commit and detection {\em after} commit. The former presents a clearer semantic model to the programmer, but is far more difficult to implement (see Section \ref{unify}). Deep guards effectively form a process hierarchy or tree, with local environments at each level. Local environments are needed, even if the language is safe, because incoming bindings (to local variables) must be saved across deep guard evaluation. One severe implementation problem is the management of multiple environments (one per deep-guard clause in the same procedure) if guards are evaluated concurrently. This {\em value access control problem} is similar to that of OR-parallel implementations of Prolog: how to efficiently ensure that only ancestor environments on a path to the root are accessible, and that all other environments are hidden. Another implementation problem is supporting {\em fair} execution while descending the hierarchy, while retaining low complexity and cost. If fair execution is not guaranteed, then eagerly executed guards may loop, preventing later guards from failing and freeing up the computation. Shapiro \cite{shapiro-survey} states that the inability to achieve fairness at low cost motivated flat languages.\footnote{This problem still exists, in a less troublesome form, for unfair flat languages when early builtin guards suspend, preventing or delaying later guard failure. This can only reduce the failure set, which is not important because failure is a catastrophic event in flat languages.} He cites early CP implementations (e.g., \cite{miyazaki-cp-book}) as either unfair or of ``unacceptable'' complexity.
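The pre-commit discipline for unsafe clause tries can be sketched in a few lines of Python. This is a deliberately simplified model under our own naming: tell bindings accumulate in a local environment, and at commit they are exported atomically, failing the clause try if any binding conflicts with one made nonlocally in the meantime.

```python
# Sketch of an unsafe clause try with a local binding environment.
# (Hypothetical structure; real systems trail bindings and lock the store.)

def try_clause(local_tells, store):
    env = dict(local_tells)             # bindings stay local before commit
    # commit: check all exports against the global store before writing any,
    # so the exportation is all-or-nothing
    if any(v in store and store[v] != val for v, val in env.items()):
        return False                    # conflict with a nonlocal binding
    store.update(env)                   # atomic exportation upon commit
    return True

store = {"y": 2}                        # a binding made nonlocally
ok1 = try_clause([("x", 1)], store)     # commits: x = 1 is exported
ok2 = try_clause([("y", 3)], store)     # fails: conflicts with y = 2
```

Detection {\em before} commit corresponds to the conflict check shown here; detection {\em after} commit would export first and repair (or abort) later, which is cheaper per try but semantically murkier.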
JAM Parlog \cite{jam-paper} constrains deep guards to be used only in clauses bracketed by {\em sequentialized clause separators} (in some languages called {\em otherwise} guards). Such separators prevent subsequent clauses from being tried until all previous clause tries fail. This restriction obviates concurrent evaluation of deep guards, simplifying management to that of a single local environment per procedure. Crammond \cite{jam-paper} states that this restriction allows most of programmers' intended uses of deep guards, e.g., as if-then-else conditionals. Parlog offers the programmer a sequentialization operator {\tt g \& b} that guarantees goal {\tt g} executes to completion before goal {\tt b} is executed. In JAM, the implementation views a deep-guarded clause as if it were compiled in this form, with guard {\tt g} and body {\tt b}. In other words, the same mechanism used to implement sequential goal execution does double duty for deep guard execution. Deep guards need to be evaluated concurrently to avoid deadlock; however, given mode information, flat guards can often be executed in-line for efficiency. The environment necessary for carrying local bindings over a sequentialization operator is not unlike a Prolog environment. Restricting the language to only flat (safe) guards engendered decision-graph compilation \cite{kliger-phd}, because clause tries can be compiled in line without transfer of control nonlocally to other goals. A decision graph is composed of if-then-else and switch nodes which transfer local control conditionally upon a test. A graph is formed, rather than a tree, to guarantee space proportional to the number of clauses in the procedure. To ensure space linearity, a clause is propagated down one and only one branch of the graph as code is being generated. Thus clauses ambiguous to a test are conservatively placed in a {\em continuation} branch, and sibling branches jump to the continuation upon failure.
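A toy rendering of decision-graph dispatch may make this concrete. The example below is an assumed three-clause list-processing procedure, not taken from Kliger's compiler: a switch on the first argument's tag selects a clause body, an unbound argument routes to suspension, and clauses the test cannot discriminate fall through to a shared continuation branch, which is what makes the structure a graph rather than a tree.

```python
# Toy decision-graph dispatch for a hypothetical procedure p/1 with
# clauses for [], [H|T], and a catch-all. None models an unbound input.

def dispatch(arg):
    if arg is None:
        return "suspend"            # unbound input: no clause can commit yet
    if isinstance(arg, list):       # switch node on the argument's type tag
        if arg == []:
            return "clause1"        # p([]) :- ...
        return "clause2"            # p([H|T]) :- ...
    return "clause3"                # continuation branch: remaining clause

results = [dispatch(a) for a in (None, [], [1, 2], 7)]
```

Each test is performed once and shared among all the clause tries it discriminates, which is precisely the redundancy-removal argument made above against executing the tries in parallel.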
For a suite of 27 medium-size benchmarks, decision graphs executed 3.2 times faster on average than WAM-like compilation \cite{kliger-phd}. The code size expanded by 30\% on average, with a particularly degenerate program (Salt \& Mustard \cite{tick-book}) doubling in size. An interesting problem is how to order the graph nodes, and how to generate optimal code for the tests, conditional branches and switches, to minimize execution time \cite{debray-jicslp92}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%                                                       %
%       Appears to be no data on the following:         %
%       Empirical characteristics of deep and flat      %
%       guards and their overheads.                     %
%                                                       %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
A main purpose of deep guards is to perform speculative computations that can fail, allowing alternative solutions to succeed. Unsafe languages enriched this paradigm by allowing bindings to be made along the speculative path. Experience has shown that support of both of these operations is too expensive for the low frequency with which they are used. The devolution to flat languages is complete in the sense that almost all research groups opted to reduce language expressiveness in favor of easily-implementable flat guards. At a further extreme, Fleng \cite{fleng1} abolished guards in an effort to streamline execution. In Fleng, guard tests must in general be pulled up and evaluated at each call site. This allows the optimization wherein certain guard tests need only be evaluated at certain call sites. Global analysis is needed to produce the information required for this optimization. Although FCP has guards, a similar optimization is enabled by Kliger's method of customizing decision-graph clause tries for different call sites \cite{kliger-phd}.\footnote{ Practical definitions of Fleng allow the programmer to specify guards, which are then pulled up to the call sites.
Without global analysis of the complexity required by Kliger, {\em all} guard tests must be pulled up to {\em each} site.} \subsection{Reading and Writing Logical Variables} \label{read} The costs of reading and writing logical variables can be calculated as the {\em frequency} of the operations required, multiplied by the {\em cost} of those operations. For example, reading a logical variable incurs the incremental cost of suspending on the variable, weighted by the rate of suspension. In shared-memory multiprocessors, all accesses are ``local,'' so that the relevant overheads are lock traffic on the shared bus and lock contention. Contention, i.e., multiple concurrent requests for the same lock, can be exacerbated when the host does not supply enough physical locks for all objects needing locks.\footnote{Because this hardware attribute cannot be easily modified for a given host, studies of lock contention {\em vs.} lock granularity have not been performed for concurrent logic languages.} For small benchmarks, Sato and Goto \cite{sato-pisa} reported that locking accounted for only 1--5\% performance degradation on the Sequent Balance. This conveys both the relative efficiency with which locks can be implemented with shared memory, and the retained significance of lock overhead. Interestingly, although most of the lock traffic they measured was for protecting bindings, most of the observed lock contention was for bookkeeping locks for scheduling and termination. Distributed-memory multiprocessors are significantly more problematic because of the overheads incurred in reading and writing nonlocal variables. Nonlocal reading requires sending a message requesting the variable's value, and receiving a reply. Nonlocal writing requires issuing the binding --- the receiver can update the variable locally (without explicit locking) and send either a success or failure acknowledgement.
The incremental cost of resuming tasks hooked to the bound variable must be accounted for in a macro view of execution. There are of course variations on both of these protocols. Taylor \cite{taylor-book} discusses a protocol on a hypercube where nonlocal writes first request a remote lock, and upon receiving the lock, issue a remote write. He measured that, for six FCP benchmarks, 61--100\% of all messages sent were nonlocal reads. The four smallest benchmarks required an arithmetic average of 99\% reads. Although write frequency is seen to be very low, its amplified cost can be felt. For example, Taylor demonstrated that for the incomplete message paradigm (where nonlocal reading and writing occur with equal frequency), the main execution overhead on a multiprocessor was not the sender reading the return value, but rather the receiver locking and writing the return value. In this simple example, locking proved to be extremely expensive (a performance degradation of two times on two hypercube nodes) because the latency could not be hidden. Reducing the cost of reading is critical in distributed implementations. If a complex term is to be read nonlocally, an important design consideration is how much of the term should be eagerly transferred. Taylor also examined the effect of this {\em copy depth} parameter on performance. For standard paradigms such as producer-consumer and incomplete messages, performance improved significantly for initial increases in copy depth, after which no improvement was seen. The interpretation of these results is that the consumer is ``brought up to speed'' by increasing transfer size until the point at which it outruns the producer, after which no further improvement can be achieved. Because these are such pervasive programming techniques in concurrent logic programs, it is imperative to find ways to speed them up.
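The copy-depth trade-off can be sketched as follows. This is an illustrative Python model under assumed names, not Taylor's implementation: when a nonlocal read returns a structured term, only the first {\tt depth} levels are copied eagerly, and deeper subterms are replaced by remote references to be fetched on demand.

```python
# Sketch of bounded-depth eager term transfer ("copy depth").
# Terms are modeled as nested tuples; constants copy whole.

def copy_term(term, depth):
    if not isinstance(term, tuple):
        return term                       # atomic term: copy as-is
    if depth == 0:
        return ("@remote", id(term))      # placeholder: fetched on demand
    return tuple(copy_term(t, depth - 1) for t in term)

# a stream spine, as a producer would build it
stream = ("cons", 1, ("cons", 2, ("cons", 3, "nil")))

shallow = copy_term(stream, 2)            # two levels eager, tail remote
deep = copy_term(stream, 10)              # whole term copied eagerly
```

Increasing {\tt depth} trades larger messages for fewer round trips, which matches the measured behavior: the consumer benefits until the eager prefix outruns what the producer has built.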
Hardware support for message management (packing and unpacking, merging active messages straight into the execution pipeline) to more effectively hide latencies is one approach, similar to the goals of threaded architectures (e.g., \cite{nikhil92}). Another idea is to reduce the number of messages sent, either by introducing new programming paradigms, or by dynamically migrating tasks and streams so that communication is local. Yoshida \cite{yoshida-fgcs88} took the latter approach in the design of \aumc, discussed in Section \ref{streams}. \subsection{Unification} \label{unify} Unification is somewhat controversial because it stands out as one of the few unbounded-time operations required by logic programs compared to conventional languages. In many cases unification can be compiled into simple instructions, as was elegantly shown in the WAM \cite{wam-book,warren-tn}. Unification in committed-choice languages can be categorized as either {\em input} (also: {\em passive} and {\em ask}) or {\em output} (also: {\em active} and {\em tell}), reflecting the exportation of bindings. Ask unifications implement head matching either as explicitly compiled match instructions or as invocations of a fully general passive unify routine. Luckily, full passive unification is rarely executed: it occurs only when checking the equality of two incoming arguments. For example, Foster and Taylor \cite{flat-parlog} measured the execution of 153,800 matching operations and 15,300 general passive unifies (9\% of total) in an Assembler benchmark written in Flat Parlog. Furthermore, if sufficient type information is inferred, general passive unifies can be reduced to simpler tests. At the leaves of unification's recursive descent, rules for unifying primitive data types come into play. Read-only synchronization requires an extended set of rules \cite{miyazaki-cp-book,taylor-book} compared to procedure-level synchronization, potentially reducing performance. 
Whereas ask unification occurs before commit, the location of tell unification varies among the languages. Unsafe languages require {\em atomic} tell unification wherein no output bindings are seen until commit. This means that bindings must be locally trailed, and perhaps undone upon failure. Furthermore, atomic unification can bind two variables which are inputs of the same procedure invocation. This raises the issue of whether such bindings should be eagerly acknowledged, although such an implementation causes a performance degradation of 1--29\% \cite{flat-parlog}. This and trailing overheads led to the abandonment of atomic unification for implementation reasons alone. Safe languages place tell unifications after commit (called {\em body unification}). The implementations are thus free to perform body unifications on the fly, with failure causing procedure invocation failure. Even body unification is complex when considering multiprocessor implementations. The main problem is to avoid potential race conditions among concurrently executing tasks. Thus any logical variable that needs to be bound must be locked first. Furthermore, if both operands are unbound, they cannot be locked in an arbitrary order under threat of deadlock with a competing task trying to bind the pair in the opposite order (very unlikely, but possible, unless mode restrictions are known, as discussed below). Some ordering must be made, e.g., exploiting the nature of a shared-memory name space. By making the most frequent case fast, general unification on multiprocessors can be implemented efficiently. The most common case by far is binding a constant to a logical variable that does not require dereferencing. A fast stub can be constructed that tests if one operand is constant, and one is unbound and not hooked. On a shared-memory machine, the variable is then locked, bound, and unlocked. Otherwise, if the initial condition is not met, full unification is required. 
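The fast-path stub and the variable-variable lock ordering just described might be sketched as below. This is a minimal shared-memory model with invented names: the stub handles the common case of binding a constant to an unbound, unhooked variable under a single lock, and when both operands are variables, the locks are acquired in a canonical (address-like) order so that two tasks binding the same pair in opposite argument orders cannot deadlock.

```python
# Sketch of body-unification binding on shared memory (assumed structure).
import threading

class Cell:
    """An unbound logical variable cell with its own lock."""
    def __init__(self):
        self.lock = threading.Lock()
        self.bound, self.value = False, None
        self.hooks = []                       # suspended goals, if any

def bind(a, b):
    # fast path: constant into an unbound, unhooked variable
    if isinstance(a, Cell) and not isinstance(b, Cell) \
            and not a.bound and not a.hooks:
        with a.lock:                          # lock, bind, unlock
            a.bound, a.value = True, b
        return True
    # variable-variable case: acquire locks in a canonical order (here,
    # by id) so competing tasks cannot deadlock on the same pair
    if isinstance(a, Cell) and isinstance(b, Cell):
        first, second = sorted((a, b), key=id)
        with first.lock:
            with second.lock:
                if not b.bound:
                    b.bound, b.value = True, a   # alias b to a
                    return True
    return False                              # fall back to full unification

x = Cell()
bind(x, 42)            # fast path
u, v = Cell(), Cell()
bind(u, v)             # variable-variable path, ordered locking
```

A real implementation would also dereference chains and resume any hooked goals after binding; both steps are elided here.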
The two operands to be unified must be locked, dereferenced, and compared. On distributed-memory multiprocessors, the same algorithms can be naively used, with nonlocal accesses potentially at the leaves. Because of the implementation complexity and potential execution overheads of output unification on distributed-memory multiprocessors, and the evidence that it is infrequently used in its full generality, output unification further devolved in Strand to {\em assignment}. Thus recursive descent is obviated, and the left-hand side of the assignment is required to be a variable. However, this does not rule out the need to test for exceptions, or hooked variables, which need to be resumed. Thus binding is still expensive compared to imperative assignment.\footnote{With compile-time freeness analysis, e.g., \cite{ueda-93}, and hookedness analysis, e.g., \cite{yanoo}, these tests can be safely removed.} Program Composition Notation (PCN) \cite{pcn-book} is a multi-lingual language built from Strand and `C' that offers both {\em definition} (logical) and {\em mutable} variables. Thus safe and efficient assignment to mutable variables can be guaranteed by the programmer. Furthermore, memory usage can be reduced by destructive update of mutable variables. In a sense, PCN is the farthest devolution has progressed in concurrent logic languages. \subsection{Task Scheduling and Priority} \label{priority} There are various philosophies for automatic scheduling of parallel tasks. Compile-time analysis can be attempted to determine a fixed schedule mapping tasks to processors. Runtime profiling information can aid the static analysis. A radical departure is to perform all scheduling dynamically without any static aid, or a hybrid combination of static and dynamic. Another approach is to avoid automation and require the programmer to explicitly distribute tasks. 
Automatic scheduling in concurrent logic programming systems is usually dynamic (e.g., JAM, Panda, Monaco) because tasks are too small and numerous to allow practical static analysis. For shared-memory multiprocessors, the main implementation issue is how to efficiently manage the goal queues. A single shared queue would eliminate the need for load balancing, but contention for this scarce resource is too costly. Splitting the queues up, one per processor, removes contention but leads to potential unbalancing. Once queues have been split, the implementation paradigm of {\em task farming} emerges on large-grain process systems such as UNIX. Here a single UNIX process, often called a ``worker,'' is responsible for coroutining between the execution and scheduling of goals. In {\em on-demand scheduling}, goals are not eagerly distributed among workers; only an idle worker searches for work, thereby minimally disturbing busy processors. Sato and Goto \cite{sato-pisa} and Crammond \cite{crammond-naclp90} examined variations of on-demand scheduling involving further splitting the local queue into {\em private} and {\em public} queues and allowing idle workers to steal only public work. Crammond reported that for eight medium-sized Parlog benchmarks, private/public queues offered slightly better and more consistent speedups than public-only scheduling, on the Symmetry and Butterfly II (on 16 Symmetry PEs, geometric mean efficiencies of 86\% and 83\%, respectively). These early studies measured multiprocessors with far slower processing elements than are available today. There has also been much work within the ICOT FGCS Project exploring automatic load balancing methods, e.g., \cite{furuichi90,ichiyoshi-fgcs92,pim/c,sato-pisa,sugie}. The most successful experiment has been the {\em multi-level load balancing} (MLLB) scheme for balancing OR-parallel search programs on a distributed-memory multiprocessor \cite{furuichi90}.
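The private/public split can be sketched with a pair of deques. The worker structure and publication threshold below are our own illustrative assumptions, not the Panda or JAM schedulers: a worker spawns into its private queue, publishes excess work, and an idle worker steals only from public queues, so busy workers are minimally disturbed.

```python
# Sketch of on-demand scheduling with private/public goal queues.
from collections import deque

PUBLISH_THRESHOLD = 2      # assumed tuning constant

class Worker:
    def __init__(self):
        self.private = deque()     # touched only by this worker
        self.public = deque()      # visible to idle thieves

    def spawn(self, goal):
        self.private.append(goal)
        if len(self.private) > PUBLISH_THRESHOLD:
            self.public.append(self.private.popleft())  # publish excess

    def steal_from(self, victim):
        # an idle worker takes public work only, leaving private work alone
        if victim.public:
            self.private.append(victim.public.popleft())
            return True
        return False

busy, idle = Worker(), Worker()
for g in ["g1", "g2", "g3", "g4"]:
    busy.spawn(g)
stolen = idle.steal_from(busy)
```

Because the busy worker's private queue is never locked by thieves, the common-case spawn and dequeue operations need no synchronization, which is the source of the efficiency gains Crammond measured.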
The idea is to partition the available processors into groups, and allocate one distribution master per group. Slave processors within these groups request work from the master. There is a method of merging groups, and given the regular nature of OR-parallel search, this method has been shown to be quite effective, e.g., speedup of 50 on 64 processor Multi-PSI for the pentomino benchmark \cite{furuichi90}. The drawback of MLLB is its limited application domain. Thus even ICOT resorted to explicit user-defined ``pragma'' in the KL1 language for remote task scheduling on distributed multiprocessors. Strand and PCN also require pragma. In these latter languages, the user is encouraged to design load-distribution management networks, called {\em motifs}, e.g., MLLB could be specified as such \cite{motif2,motif1}. In PCN, motifs consist of several programming constructs implemented in the source language with libraries providing support. Simple pragma are enriched by allowing the definition of {\em virtual topologies}, which can be embedded within physical topologies. Topologies are collections of nodes, such as a hypercube network, implemented by process structures. User-defined tasks are mapped onto the nodes by passing the tasks as messages for meta-execution. A user program can be written to interface to a single virtual topology, which can then be automatically mapped onto whatever physical topology is offered by the hardware organization. There are several other constructs, such as {\em templates} and {\em ports}, which facilitate program creation, but do not present major implementation difficulties. An issue related to task scheduling is task priority. Early concurrent logic languages specified that goals were required to be executed in a {\em fair manner}. Fairness is difficult to define in a manner that can be easily implemented. One weak definition is that all tasks which {\em can} execute {\em are} attempted at some time. 
This guarantees avoidance of spurious deadlock. Normally, tail-recursion optimization (TRO) is implemented wherein a selected body goal is directly executed and all others are wrapped up as goal records and enqueued. By extending the life of a thread through a selected child in this manner, efficient use of registers for argument passing can be achieved. Fair execution is emulated in a number of systems by a time-slicing technique, wherein every $k$ reductions, TRO is replaced by enqueueing {\em all} body goals at the back of the queue, and switching in a goal from the front of the queue. This implementation incurs the overhead of updating a counter and comparing it to $k$ on each reduction, as well as enabling queue access from both the front and back. The KL1 PIM systems took a different approach, discarding the notion of task-execution fairness altogether. It is replaced by a goal priority scheme, wherein the scheduler makes its best effort to abide by priorities. This allows programming techniques such as speculative exploration of alternative solutions. KL1 allows goal pragma that set priorities relative to a parent goal or a collection of goals called a {\em shoen}. These {\em logical} priorities, potentially ranging from 0 to $2^{32}$, are retained in goal records, but are also mapped into a smaller {\em physical} range for the purpose of sorting. For example, if the physical range is 0 to $2^{14}$, then the PIMs use an array of $2^{14}$ queues. The non-empty queues are linked to allow efficient dequeueing across priorities. Insertion of a goal into an empty queue requires a linear search up to the nearest non-empty neighbors to update the links. This algorithm is sufficiently simple for its microcoded implementation, although software-based implementations might be better served by balanced priority trees.
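To make the priority machinery concrete, the following sketch (a scaled-down stand-in for the microcoded PIM scheduler; the constants and names are our own invention, not KL1's) maps logical priorities into a small physical range of queues and dequeues on a best-effort highest-priority-first basis, with a linear scan standing in for the linked non-empty queues:

```python
from collections import deque

PHYS_LEVELS = 1 << 6     # scaled-down physical range (the PIMs used 2**14)
LOGICAL_MAX = 1 << 32    # logical priorities as carried in goal records

class PriorityScheduler:
    def __init__(self):
        self.queues = [deque() for _ in range(PHYS_LEVELS)]

    def physical(self, logical):
        # Many-to-one mapping of the large logical range onto queue indices.
        return min(logical * PHYS_LEVELS // LOGICAL_MAX, PHYS_LEVELS - 1)

    def enqueue(self, goal, logical_priority):
        # The goal record retains its logical priority so it can be remapped.
        self.queues[self.physical(logical_priority)].append(
            (goal, logical_priority))

    def dequeue(self):
        # Best-effort: scan from the highest physical priority downwards.
        for queue in reversed(self.queues):
            if queue:
                return queue.popleft()[0]
        return None
```

A real implementation would link the non-empty queues to avoid the scan; the sketch only shows the logical-to-physical mapping and the best-effort ordering.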
\subsection{Granularity Control} Concurrent logic programs are fine grained: Alkalaj \cite{alkalaj-isca} measured from ``20 to several hundred single-cycle instructions'' per average goal reduction. Taylor \cite{taylor-book} measured FCP granularity on a hypercube as a ratio of reductions/messages-passed,\footnote{ In a distributed memory multiprocessor implementation, messages would be used for communicating values down a stream from producer to consumer, for example. } ranging from 3.5--220. It is clear that granularity is very much dependent on application and programming style, but even in the best case, granularity is still low compared to conventional approaches to parallel programming in imperative languages. The advantage of fine-grained concurrent languages is the abundance of potential parallelism. However, the main disadvantage is that too-fine granularity can lead to excess overheads in task management. Alkalaj \cite{alkalaj-isca} has shown that 50\% of the execution time of large FCP applications is spent on goal management for a reasonable machine execution model. His recommendation was a specialized hardware organization to support this efficiently. Such directions are promising, as echoed for instance in the hardware implementations of threaded architectures, mainly predicated on dataflow languages (e.g., \cite{monsoon}). Special hardware or not, it is necessary to boost efficiency by ``collecting'' granularity at compile time \cite{sarkar-phd,traub92}. Ideas along these lines were developed for logic programs by Debray {\em et al.} \cite{debray:90}, King and Soper \cite{king-gran-tr}, and Tick and Zhong \cite{tick+xzhong}. Debray's design seeks to construct, at compile time, estimators of input argument size, and formulate these estimates into granularity estimations. At runtime, a granularity estimate is evaluated for each procedure invocation, and the weight value is used to make dynamic scheduling decisions.
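The runtime half of such a scheme might be sketched as follows; the cost estimator ($2n + 3$ ``reductions'' for a list argument of length $n$) and the threshold are invented purely for illustration:

```python
# Hypothetical per-procedure cost estimator, of the kind constructed at
# compile time from input argument sizes.  The formula and threshold are
# invented; a real system would derive them from the procedure's recursion
# structure and from measured task-creation overheads.
SPAWN_THRESHOLD = 20   # assumed cost of creating and enqueueing a task

def estimated_weight(xs):
    # "Weight" of a call, as a function of its input argument size.
    return 2 * len(xs) + 3

def schedule(goal, xs, spawn, run_inline):
    # Runtime decision: below the threshold, spawning overhead would
    # exceed the useful work, so execute the goal in place instead.
    if estimated_weight(xs) < SPAWN_THRESHOLD:
        return run_inline(goal, xs)
    return spawn(goal, xs)
```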
For example, if the weight is below a threshold, a task will not be spawned because of excessive overhead. King \cite{king-gran-tr} discusses an analysis technique with no runtime component. Similar to Debray's method, granularity is modeled as a function of argument size; however, these sizes are estimated by abstract interpretation, associating argument types with a finite lattice of sizes. The result is a purely static determination of granularity. King uses this analysis, as an alternative to profiling for example, to drive task sequentialization (see Section \ref{arch}). Zhong's approach \cite{tick+xzhong} attempts to remove the complexity of argument-size estimation (both the compile-time analysis and the subsequent runtime evaluation costs) by introducing an abstract ``iteration parameter'' which is a proxy for relevant granularity information. The remainder of the scheme is similar to Debray's, with the major distinction that the estimators are easier to formulate and cheaper to evaluate, but far less accurate. Furthermore, the weights computed are relative, e.g., it can be estimated that one task is half the weight of another task, but it cannot be determined whether either is below some absolute threshold weight. The verdict is not yet in on the utility of these granularity analyses, because empirical data is sparse. Robust analyzers and larger benchmarks are needed. \subsection{Abstract Instruction Set Architectures} \label{arch} The Warren Abstract Machine (WAM) \cite{wam-book,warren-tn} had a great influence on the various concurrent logic language implementations discussed in previous sections. The important differences among the abstract machines developed for committed-choice languages are in their storage models. The primary distinction is whether the heap is based in shared memory \cite{jam-paper,gudeman-jicslp92,houri-cp-book,KL1,kliger-phd,tick-iclp93}, or distributed memory \cite{pim/p,pim/m,sato91,taylor-book}.
In general, all variables and terms are stored in a heap, and memory is reclaimed by explicit, periodic garbage collection. Goals are usually represented by heap terms that can be linked into work queues for scheduling. A goal that is suspended can ``float'' on the heap, to be relinked to a work queue upon binding its hooked variable. There are several variations to this basic model to gain efficiency. Goal records can be constrained to be fixed size and queued in free lists facilitating memory reuse. Furthermore, all data structures can be partitioned onto heaps corresponding to size, each with its own free list for ease of (de)allocation. Crammond \cite{jam-paper} split arguments away from goal records, allocating them on their own stacks to improve locality and reuse. With arguments allocated separately from goal records, goal-record locality improves, and arguments no longer need to be of fixed size if allocated in a stack-based fashion. However, this results in the creation of ``holes,'' i.e., deallocated frames trapped below the top of stack, which can require general garbage collection if they grow too large. Crammond \cite{crammond-phd} illustrates the extent of this problem for some small benchmarks. In addition, bookkeeping structures for evaluating deep guards and suspension management are necessary. Deep guards and sequential conjunctions require the use of environments which hold values of variables active throughout the clause try or sequential body evaluation. Practical implementations require at most one environment per invocation \cite{jam-paper}, which is deallocated upon body completion or guard failure (to be reallocated for the next clause try). In addition, a {\em trail} is needed for atomic tell unification wherein failure and suspension during unification must ``back out'' all bindings generated. Suspension management requires a suspension stack holding pointers to input arguments that are needed but are as yet unbound. 
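The suspension mechanism just described, in which a goal record is hooked on an unbound variable and ``floats'' outside any work queue until the variable is bound, can be sketched as follows (a minimal model; the names are ours, not those of any actual system):

```python
from collections import deque

run_queue = deque()    # goals ready to execute

class Var:
    """A heap-allocated logic variable.  Goal records suspended on it
    are hooked to the variable, floating outside any work queue."""
    def __init__(self):
        self.bound = False
        self.value = None
        self.suspensions = []   # hooked goal records

    def bind(self, value):
        self.value = value
        self.bound = True
        # Relink every hooked goal record back onto the work queue.
        run_queue.extend(self.suspensions)
        self.suspensions.clear()

def try_reduce(goal, arg):
    """Attempt a reduction that needs `arg`; suspend if it is unbound."""
    if not arg.bound:
        arg.suspensions.append(goal)   # suspend: the goal floats
        return None
    return goal(arg.value)
```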
A less important distinction among the systems is their abstract instruction sets. The instruction sets of the various machines follow the general WAM model, passing arguments through dedicated registers, and having a set of additional state registers for control and storage management. The instruction sets can be broken down into similar groups. Older models use WAM-like indexing control instructions, whereas decision-graph compilers for flat languages avoid shallow backtracking and many of the required control instructions. Head matching (ask unification) is compiled with {\bf wait} instructions that will push their corresponding argument onto the suspension stack if it is not instantiated. Tell unification is compiled into {\bf get} instructions that will make assignments or invoke a general unifier. Finally, body goals are generated with {\bf put} instructions for loading arguments and enqueueing goal records. As mentioned in Section \ref{priority}, usually a form of tail recursion optimization (TRO) can be implemented by loading the arguments of one of the body goals directly into the argument registers and jumping to the goal code. Additional instructions are needed for the (de)allocation of local environments (for non-flat languages and/or sequential conjunctions) and heap storage. Goal-management instructions are responsible for terminating a thread (in unit clauses), enqueueing a goal (creating threads), directly executing a goal (TRO, called {\em promoting} a thread in JAM), and initiating deep guards. Note that flat languages have threads that live very short lives, not counting promotions. Sequential conjunctions, introduced in Parlog, do not lengthen threads in practice because they are necessarily implemented with trees of local environments (see Section \ref{guard}). This stems from the fact that within a sequential conjunction, concurrent goals may execute.
If total sequentialization of a goal and all its children can be specified or derived, then these local environments can be stacked, resulting in superior space and execution time efficiency. This would be a true elongation of threads, resulting in increased performance \cite{king92,lpar93}. Such an implementation requires a sequential call as well as stack (de)allocation instructions. Arithmetic instructions and builtin predicates must be able to suspend if executed before commit, or be enqueued as bona fide goals if executed after commit. JAM optimizes this by checking arithmetic operator inputs in the body, and if available, executing the arithmetic in place. Otherwise, a goal is created. In general, a lazy strategy of dynamically checking for data dependencies is taken by all concurrent logic programming systems. Threaded architectures for dataflow are slightly different in that they do static analysis to generate arithmetic threads that cannot internally suspend. However, both research communities treat array accesses in a similar manner: a builtin access goal is spawned, avoiding long latencies for nonlocal access. After all, statically analyzing array indices is very difficult \cite{banerjee}. A trend towards reduced abstract machine design, following the principles of RISC design, has led to instruction sets such as Carmel \cite{CARMEL-4}, SAM \cite{strand}, the {\tt jc} machine \cite{gudeman-jicslp92}, Kliger's machine \cite{kliger-phd}, and the various PIM architectures \cite{pim/p,pim/m,sato91}. For example, Strand, FGHC, Fleng, Janus, and FCP(:,$\commit$) sufficiently simplify the execution model, obviating trailing, environments, atomic tell unification, and a process hierarchy for deep guards and sequential conjuncts. This allows these compilers to concentrate on optimizations, such as decision-graph generation, in-line arithmetic, and global register allocation.
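The JAM-style body-arithmetic optimization mentioned above can be sketched as follows (an illustrative model, with {\tt None} standing in for an unbound variable; the function and parameter names are our own):

```python
def body_arith(op, x, y, enqueue_goal):
    """Execute a body arithmetic operation in place if both inputs are
    already bound; otherwise create a bona fide goal record whose data
    dependencies will be checked dynamically when it is next scheduled."""
    if x is not None and y is not None:
        return op(x, y)            # in-line execution, no goal record
    enqueue_goal(op, x, y)         # defer: resolved (or suspended) later
    return None
```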
Sequential implementations offer further performance gains, obviating locking and allowing compilation into `C' to be exploited. Debray and Tick measured a mean speedup of 2.4 comparing {\tt jc} with Monaco for six small benchmarks \cite{tick-iclp93}, illustrating the potential advantages of sophisticated register allocation and streamlined binding mechanisms. Readers interested in concurrent logic language instruction-set design are referred to Crammond \cite{jam-paper}, Foster and Taylor \cite{flat-parlog}, and Kliger \cite{kliger-phd} for the most complete expositions. \subsection{Stream Communication, Arrays and Garbage Collection} \label{streams} A major defect in concurrent logic languages and their implementations is inefficient use of memory. This problem is prevalent in the treatment of communication streams and data arrays, and is exacerbated in distributed-memory multiprocessors. A general, after-the-fact, solution to the problem is the construction of ever more efficient garbage collectors, about which we comment at the end of this section. We first discuss pre-emptive solutions, such as making stream communication efficient with buffers and migration. Streams are second-class citizens in most logic programming languages. Stream communication is programmed by having a producer write messages into a difference list, the head of which is read by a consumer. To nondeterminately merge multiple streams, a chain of active {\em merge} processes is needed. This methodology was stressed in the original literature because it is elegant and is all that is offered at the {\em language} level. However, defining and implementing streams in this manner is highly inefficient. First, merged streams incur extra process reductions, lengthening transmission {\em delay}. Second, naive stream merging can result in {\em unfair} data transmission.
Third, if the memory cells comprising a stream and the reader of that stream are located on different processors in a distributed-memory system, then reading a value requires the overhead of sending a request message.\footnote{Analogous to driving all over town to pick up mail at different post offices, instead of having all mail delivered directly to your house.} The fairness problem can be solved with more sophisticated, dynamically-balanced merge trees \cite{mierowsky-cp-book}, although this is expensive in time. The delay problem has been solved both in software and hardware. In software, a data type, called a {\em mutual reference}, interfaces multiple writers to a single reader \cite{safra-cp-book}. Writing to one of the streams will atomically write the merged output stream and update the mutual reference to point to the new output tail. This scheme, originally designed for FCP, is the essence of implementing streams as buffers in other languages also. For example, the PIMs implement mergers, in microcode, in a similar manner \cite{inamura-naclp89,ueda-84}. The critical difference is that the new data structure is hidden from the language definition. Furthermore, with MRB (discussed below), memory can potentially be reused when merging a new writer into the stream. The indirection problem could be corrected by locating the buffer with the consumer (similar in intent to message-oriented scheduling, see Section \ref{synch}); however, in most concurrent logic languages, multiple consumers are permitted, and single consumers are not recognized as such by the compiler. Global analysis might be used to determine single consumer streams, or the languages can be restricted. The latter solution is another devolutionary step (notably Janus \cite{janus} and \aum \cite{yoshida-fgcs88}) with a ``single writer/single reader'' restriction and abstract stream semantics that do not constrain implementations. 
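The mutual-reference technique described above can be sketched as follows (an illustrative model of the FCP scheme of \cite{safra-cp-book}; the class and method names are our own, and a lock stands in for the atomic write of the real implementation):

```python
import threading

class Cons:
    """Cell of the merged output stream; the tail is bound by the next write."""
    def __init__(self, head):
        self.head, self.tail = head, None

class MutualReference:
    """All writers share one reference to the open tail of the single
    merged output stream; each write extends the stream and advances the
    reference atomically, so no chain of merge processes is needed."""
    def __init__(self):
        self._lock = threading.Lock()
        self.first = None     # the single reader's entry point
        self._last = None     # current open tail

    def write(self, msg):
        cell = Cons(msg)
        with self._lock:      # atomic merged write
            if self._last is None:
                self.first = cell
            else:
                self._last.tail = cell
            self._last = cell  # mutual reference now points at the new tail

def read_all(mref):
    # Consumer: walk the merged stream up to the open tail.
    out, cell = [], mref.first
    while cell is not None:
        out.append(cell.head)
        cell = cell.tail
    return out
```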
Janus defines a {\em bag} data type that can be used as a multiple-writer stream, with no constraint on write order. Janus also defines standard arrays; however, the restriction permits an implementation to automatically reuse array locations. Neither bags nor reusable arrays have yet been implemented for Janus. \aumc-90 is called a stream-based concurrent object-oriented programming language (SCOOL) by its authors \cite{konishi-fgcs92}, emphasizing the first-class citizenship of streams. Streams are implemented as buffer objects that can migrate. The migration policy moves buffers to their (unique) consumers, thereby obviating the overhead of sending read requests. This is implemented by having the producer and consumer initially communicate with {\bf where} and {\bf here} messages, allowing them to locate each other and begin message copying. Future messages are forwarded automatically to the new location. For a generate-\&-test prime-number generator, migration achieved a speedup of 12\% (on 10 Symmetry processors) compared to no migration \cite{konishi-fgcs92}. This increased to a 70\% speedup on two Sparcstations connected over an Ethernet. The general topic of garbage collection is too large to cover here, but it is important nonetheless. There are fundamentally two approaches to garbage collection: static and dynamic. Static collection requires compiler analysis to determine the guaranteed reusability of a data structure. Code can then be generated directly for memory reuse. Incremental dynamic collection involves runtime checking to determine reusability of structures. Furthermore, dynamic garbage collections across an entire memory (local or global) are required when the previous incremental methods fail.
Examples of these collection types within concurrent logic programming are abstract interpretation for local reuse \cite{renga92}, binary reference counting with multiple reference bits (MRB) \cite{MRB,MRB-perf}, and several stop \& copy schemes (e.g., \cite{lazy-ref-count,imai-spdp,inamura-naclp89,piling-gc}). The MRB scheme \cite{MRB} is a one-bit approximate reference count per data cell. If the flag is off, the cell can be reused because it is guaranteed to have a single reader. However, once the flag is set, it becomes stuck and no reuse is possible. Setting the flag requires nontrivial rules for many of the KL1 abstract instructions that manipulate memory. The advantages, however, can be significant: for example, array copying can be dynamically converted to destructive update by exploiting the MRB method. Nishida {\em et al.} \cite{MRB-perf} demonstrated the effectiveness of the MRB scheme for a shared-memory multiprocessor model. Depending on data cache configuration, two small benchmarks displayed bus-traffic reductions of 20\%--57\% on 16 processors. The least beneficial result of 20\% reduction clearly showed a drastic increase in cache-to-cache traffic that was indicative of reused cells being transferred between processors. Overall the method achieved significant reduction in memory-to-cache (``swap in'') traffic, indicating success at improving locality by reuse. Duvvuru {\em et al.} \cite{duvvuru-iwmm92} showed that MRB running on a SUN Sparcstation had an associated management overhead of 5--7\% for five small benchmarks. With static analysis \cite{renga92} to determine guaranteed single-referenced cells, this overhead was removed. It is clear, however, that array indexing can baffle such analysis, thus dynamic schemes still have utility.
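The essence of the MRB scheme, a one-bit reference count that is sticky once set, can be sketched as follows (a minimal model with invented names; the real scheme is enforced by rules attached to the KL1 abstract instructions):

```python
class Cell:
    """Heap cell carrying a one-bit approximate reference count (MRB)."""
    def __init__(self, value):
        self.value = value
        self.mrb = False    # off: guaranteed single reference

def copy_reference(cell):
    # Creating a second reference sets the bit; it can never be cleared,
    # so sharing may be overestimated but never underestimated.
    cell.mrb = True
    return cell

def update(cell, new_value):
    """Array/structure update: destructive reuse if the MRB is off,
    otherwise a fresh copy, since other readers may exist."""
    if not cell.mrb:
        cell.value = new_value   # single reference: reuse in place
        return cell
    return Cell(new_value)       # possibly shared: must copy
```

This is how array copying can be dynamically converted to destructive update: the conservative direction of the approximation (never clearing the bit) is what keeps the one-bit count safe.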
The challenges of implementing efficient garbage collection schemes for concurrent logic and functional programs are similar. Several garbage collection schemes have been proposed (and some prototyped) for concurrent logic languages. These efforts have concentrated on stop \& copy schemes for shared-memory multiprocessors, mirroring the general sophistication of the corresponding runtime systems. Research has recently focused on 1) distributed-memory garbage collection schemes that do not require barrier synchronization of all processors within the collector \cite{koike}; 2) efficient, parallel stop \& copy garbage collectors for shared memory \cite{crammond-phd,imai-spdp}; and 3) generation-scavenging garbage collectors for reducing collection latency by interning long-lived objects \cite{ozawa-naclp90}. For generation-scavenging schemes, a problem arises when the ``old'' (interned) space points into the ``new'' space. This can occur for interned logical variables and reused objects (e.g., via MRB). Methods of trailing these unsafe cells and implementing indirection tables are costly \cite{piling-gc}, but seemingly unavoidable. \newpage\section{Future Work and Conclusions} Several issues remain open. Is the pure model of computation really practical? Is it worthwhile to pay the extra cost of dereferencing every variable when in reality suspension is only occasionally needed? Can clever techniques be developed to hide this cost under the hood, or is there a basic distinction between synchronization (active) variables and local (passive) variables (those on which suspension is guaranteed not to occur)? Can mechanisms be devised that allow the user to provide extra information to the implementation, without letting this information change the semantics of the program, such that the information can be used by the compiler to improve performance?
\paragraph{Relationship to other theories of concurrency.} Recently there have been radical new ideas about concurrency. Two of particular note are the so-called Chemical Abstract Machine~\cite{Boudol90}, due to Boudol and Berry, and mobile processes, due to Milner and his co-workers~\cite{Milner89}. In both of these approaches the key new ingredient is that processes can alter their interactions and are, in effect, mobile. In our approach the interactions between processes are dynamic too, in the sense that there is no predetermined set of agents with which a given agent is limited to interact. The relationships need, however, to be understood carefully. It would be particularly interesting to understand how a lambda-abstraction mechanism could be incorporated into the concurrent constraint paradigm. Understanding the relationships with the work on mobile processes or Boudol's gamma calculus would be very helpful, as it is known that the latter can encode the lazy lambda calculus~\cite{Milner90,Jagadeesan90}. \subsection{Real-time languages} \subsection{Logic and state-change} \subsection{Higher-order languages} One would like to be able to pass processes as parameters. Even though this may be simulated at the level of the language, such simulation is not a particularly pleasant way of achieving it. \subsection{Some broad themes from the viewpoint of programming languages} Why bother to develop stand-alone programming languages at all? Software architectures are more important than programming languages; the difference is one of syntax! Architectures can be embedded in any kind of host computing framework, e.g., C++, a perspective taken up by work on model-based computing. \section*{Acknowledgements} E. Tick was supported by an NSF Presidential Young Investigator award, with matching funds from Sequent Computer Systems Inc. We thank T. Chikayama, J. Crammond, and S. Debray for sharing their expert knowledge, greatly assisting in the completion of the article.
{\footnotesize \newpage \bibliography{/tilde/saraswat/bib/master-0} \bibliographystyle{plain} } \end{document}