\documentstyle[12pt]{article}
\addtolength{\topmargin}{-.3in}
%\addtolength{\textwidth}{-.5in}
\addtolength{\textheight}{1in}
\setlength{\parindent}{.5in}
\setlength{\parskip}{.14in}
%\setcounter{totalnumber}{5}
%\setcounter{topnumber}{3}
%\setcounter{bottomnumber}{3}
\newcommand{\ohm}{$\Omega$ }
\newcommand{\ca}{$\approx$ }
\newcommand{\mnote}[1]{\marginpar{{\scriptsize #1}}}
\newcommand{\id}[1]{$<$#1$>$}
\newcommand{\prognam}[1]{{\bf #1}}


\input epsf

% One-column figure
% macro taken from ~bern/PAPERS/DUAL/draft1.tex

\newcommand{\OCfigure}[4]{\begin{figure}
\rule{5in}{1pt}
   \centerline{\vbox to #3 {\vfil
     \epsfysize=#3 \epsfbox{#1.ps}
     }}
  \caption{#2}
  \label{#4}
\rule{5in}{1pt}
  \end{figure} }


\title{Thesis Proposal --- \\Distributing Information for Collaborative 
Filtering on Usenet Net News}
\author{David Maltz}
\date{November 15, 1993}

\begin{document}
\maketitle

	As part of the ``Information Revolution,'' the amount of
information available to computer users has increased as never before.
Unfortunately, there has been a corresponding jump in the amount of
unrelated information users must search through in order find information of
interest.  Harnessing the power of multiple users to form a
collaborative filter may provide a robust way of helping direct users to
the information that will be most useful to them.  To test this idea, we
will create a large scale collaborative filtering system tuned to help
users extract information from Usenet Net News.

%This is introductory spew.  It will invite and captivate the reader to read
%further, once I figure out how to captivate you.  Please assume for the
%moment you are captivated and read on.

\section{Background}

	The Usenet is a loosely organized network of heterogeneous computers
stretching across all 7 continents.  These computers communicate via a number
of protocols such as \prognam{uucp}\cite{uucp} and TCP/IP data streams.  A
primary use of the Usenet network is the exchange of Usenet Net
News.\cite{netnews:sum} Net News functions like a huge distributed bulletin
board system (BBS) such that messages created at one site are eventually seen
at all sites on Usenet.

	Information on Net News takes the form of {\em articles} written by
individual users.  Users enter articles into the Net News system by {\em
posting} them.  Each site which participates in Usenet Net News runs a program
called a {\em news server} that exchanges articles with news servers at other
sites using the Network News Transfer Protocol~(NNTP).\cite{rfc977:NNTP} The
news server stores all the articles it receives for a number of days and
deletes older articles to make room for the new.  Until the articles are
deleted, the news server makes them available to be read by users at the site.
Users request articles from the news server using any of a number of {\em news
reader} programs which provide an user interface for reading and browsing the
articles.

	To help users find articles they are interested in reading, the
articles are arranged into a hierarchy of {\em newsgroups} which is organized
by topic.  To keep discussions on topic, some newsgroups are {\em moderated}.
Any article posted to a moderated newsgroups is first sent to a human moderator
who decides whether or not the article will be distributed across the Usenet.

	Usenet is currently growing to an enormous size.  Estimates show that
there are over 2.6 million users of Net News at some 87 thousand sites
throughout the world.  These users generate over 26 thousand new articles a
day, amounting to 57 Mbytes of data.\cite{reid:may93} In the past users
attempted to control the number of articles they had to read a day by only
subscribing to newsgroups on topics they are interested in.  However, the
continuing increase in the number of newsgroups and the number articles posted
daily has had the result that many users are presented with far more messages a
day than they can possibly read.  To cope with this overload of incoming data,
users have adopted reading strategies for dealing with the flood, but these
strategies often sacrifice any chance the user has of finding useful
information in the data.  In fact, one of the few things the users of Net News
seem to agree on is that they need a simple way of filtering messages to find
ones they will be interested in.\cite{jol:survey}

	Many filtering systems of different types have been created, but none
of them have all the characteristics that are needed for filtering Net News.
Most current filtering systems are designed for use by a single user.  These
systems attempt to extract patterns from the observed behavior of the user, and
then predict which other items from the information stream will be selected or
rejected based on the patterns.  An example of such systems is the LyricTime
project by Shoshana Loeb which compiles a profile of musical tastes for each
user of an on-line jukebox.\cite{loeb:lyrictime} LyricTime uses the user's
profile to filter songs from a collection of available music.  As the user
accepts and rejects songs suggested by the system, the system refines the
user's profile with the goal of delivering only songs which the user will
approve of.  Such systems suffer from a ``cold-start'' problem in that new
users start off with nothing in their profile and must train a profile from
scratch.  A better system would allow new users some type of access to the
experiences of current users to help create an initial profile.

	A more general problem with systems that rely on user profiles is that
the user can become circumscribed by their profile.  The profile only selects
articles similar to the ones the user has already read, and new areas which
might be of interest can be missed completely.  Further, if the user tries to
explore areas outside the knowledge domain of the profile, the profile can
provide little if any filtering ability, so the user will again face the
cold-start problem and be left facing a staggering amount of data to search
through manually.  To support the exploration of new areas what is needed is a
method of tapping the knowledge of users currently well versed in those areas.

	A collaborative filtering system supports such exploratory searching by
providing users with information derived from the experiences of previous
users.  Collaborative filtering is based on the observation that people are
good editors and filters for their friends.  Currently, when a user reads an
article that he thinks will be of interest to a friend, he copies the article
and sends it to the friend.  Often the friend does not even know that the
information contained in the article is available, and so could not have 
constructed a filter or query to extract the information directly.

%A system which provides such support could be called a
%collaborative filtering system.  People make good editors.  People can
%evaluate features important to people that might be AI-complete to detect
%otherwise.  More spew on collaborative filtering....

	To experiment with collaborative filtering, the Tapestry group at Xerox
PARC developed a system for retrieving documents from a growing corpus.  The
system selects documents to retrieve for a user by searching for keywords
present in the documents or annotations added to the documents by other
readers.\cite{tapestry} Tapestry's ability to answer queries which take into
account the opinions of other users is sufficient to enable collaborative
filtering.  Unfortunately, the system used by Tapestry suffers from two
distinct problems which prevent it from being widely applied to Usenet Net
News.  First, the system requires users to specify requests for information in
the form of queries in an SQL-like language.  Writing such a query requires the
user to have a firm sense of what types of articles he wants to read; this is a
hindrance to exploration of new areas.  Second, the architecture of the
Tapestry system incorporates a commercial database program and can not be
scaled to the size of the whole Usenet.  For this thesis we propose a
collaborative filtering system for Net News that does not need the full power
of a commercial database or SQL.

\section{Goal of the Thesis}

	Simply put, the goal of this thesis is to create and install a working
system for collaborative filtering on the Usenet.  At a minimum this system
will allow users of a selected set of news readers to vote for articles on a
numerical scale and thus provide future readers of the articles a way of using
the already recorded votes to filter their news.  The system will be scalable
so that it can grow from some small number of test sites to encompass the size
of the entire Usenet.

%  The system should be able to handle XX sites or a total of YY users
%without excessive load on CPUs or network bandwidth.\footnote{where I will
%figure out numbers for XX and YY based on PARC's voting patterns such that
%if YY people were available to vote, most groups would have votes in them.
%This is what I'd predict to be the minimum critical mass.}

	The construction of this system will pull together several elements
from projects such as the World Wide Web (gopher file transfer and
URL)\cite{rfc1436:gopher,URL}, NNTP (quick, robust transport)
\cite{rfc977:NNTP} and Tapestry (collaborative annotations; separate
storage of user annotations on document)\cite{tapestry}.  The evaluation of
the project will yield quantitative information about how Net News users
spend their time reading, and qualitative information about the users'
reading behaviors and strategies.  Most importantly, the project will
leave a working system to help the users of Net News filter their news.


%	The architecture of the system should provide for rapid transport
%of votes between the sites since if the votes are not available until after
%the users have already read the articles, the votes will be useless.  

	The architecture of the system should be robust against attempts to
``stuff the ballot box.''  Ideally, the system will allow each user to cast
only one vote per article.  At the very least, the system will make it
difficult to cast multiple votes for the same article.

	Under the current Net News system only a few people can be elected to
moderate a newsgroup.  Our collaborative system will provide a means whereby
any user who chooses to can act as a newsgroup moderator.  For example, if Joe
announces that he will moderate a newsgroup, other users who trust Joe's
opinions will be able to make requests of the form, ``Show me only the articles
in this newsgroup that Joe liked.''  Users who do not trust Joe's evaluations
will continue to see the articles in the newsgroup as usual.

	We assume that there is a large class of users who would be willing
to vote, but for a multitude of reasons might not want others to find out
what they have voted for.  For these users the collaborative system should
provide a level of anonymity at least as good as the current Net News
system.  To make the collaborative system comfortable for the most users,
this will be the default level of privacy.  

%At this level of privacy, only `trusted' systems at the local site will
%have access to information detailing which articles a user has voted for.
%These systems would then be covered by the site's own policies regarding 
%user privacy.

%	\mnote{should mention privacy issues in background} The system
%should provide at least two levels of privacy for the users.  There will be
%a a default level which provides at least as much anonymity as the current
%Net News system does, and a special level of privacy whereby people can
%publicize their opinions of articles.  The default will be privacy such
%that only trusted systems at the user's local site will have access to
%information about which groups or articles a user reads.  This is as the
%current NNTP system is now.  Outside the local site, only summary
%information will be available.  The special level of privacy allows people
%to disseminate their opinions in a form directly attributable to them.
%MAKE THIS WHOLE LOTS CLEARER!!  ABOUT THE PUBLIC .VS. ANON OPINIONS THAT IS

%-------------------------------- 

\section{Behaviors of Net News Users}

	Given that the intent of the thesis is to create a system that will be
useful for filtering Net News, it is crucial that the system be designed in
such a way that the system supports the users.  In order to support the users,
we first needed to find out how they currently go about reading Net News so our
system can be consistent with their demands.  Based on informal surveys, we
formed several hypothesis to begin the investigation.

	We claim that since Net News is being used primarily for a casual,
serendipitous search for information, users are only willing to spend a
certain amount of time using Net News.  The number of newsgroups a user
will read is tightly limited by the work she must do to process the
articles in those groups.  If a user can't find articles she is interested
in, or she feels overwhelmed by the amount of searching she must do to
find interesting articles, she will give up the search in that newsgroup
and move on to an other.

%\footnote{this is really just a substantiation
%that {\em filtering in general} is a good idea, not necessarily that collab
%filtering is a good idea - I suppose we won't know if collab filtering
%works or not till we're done?  Could survey Tapestry users to see if they
%have queries that look at likeit/hateit and find out how they like the
%stuff they get from those}

	To substantiate this claim, we instrumented the
\prognam{nn} and \prognam{xrn} news readers at PARC to record information 
about how users read news.  Analysis of the collected data highlighted the
reading behavior of the users as a group, and suggests some limits on the
length of time that users are willing to spend reading news.  Any system that
tries to make Net News more readable will have to operate within these time
limits, and bring the user more useful information during that time
span.

\subsection{Obtaining the data}
	In order to obtain data which would give an
accurate picture of the behavior of most users, we wanted to collect data
from all the Net News users at PARC.  We chose not to ask for a group of
volunteers to let us collect data on their reading habits for fear of biasing
our survey with a self-selected sample group.  Unfortunately,
this choice meant that since we did not have explicit permission from each
individual to record information about him or her, we had to record data in
such a way that no individual could be identified from collected data.

	To preserve the anonymity of the user community all the data was
recorded in a global append-only log file.  The log file consists of a series
of records; each record representing one {\em session} with a news reader --- a
session being defined as the time between when a user starts a news reader
program to when he exits it or has a four hour period of inactivity.  For every
newsgroup the user looked at during the session an entry was made in the
log file.  We considered an article to have been ``read'' if the text of the
article was displayed to the user.  An excerpt of raw data from the global log
file is shown in figure~\ref{ex-logfile}.  Because \prognam{nn} and \prognam{xrn}
have different user interfaces, they recorded similar, but not identical,
types of information for each newsgroup.

\begin{figure}[htb]
\rule{5in}{1pt}
\begin{verbatim}
-- xrn entry -- Mon Jul 19 15:49:06 1993
  GRP: AVAIL: 1   READ: 1   TIME: 125 *
  GRP: AVAIL: 2   READ: 1   TIME: 1 *
  GRP: AVAIL: 11   READ: 3   TIME: 69 *
  GRP: AVAIL: 7   READ: 3   TIME: 89 *
-- nn entry -- Mon Jul 19 18:16:21 1993
  GRP: AVAIL: 33   READ: 0   TIME_READ: 0   TIME_SCAN: 4665 
  GRP: AVAIL: 44   READ: 0   TIME_READ: 0   TIME_SCAN: 15 
  GRP: AVAIL: 3   READ: 0   TIME_READ: 0   TIME_SCAN: 6 
\end{verbatim}
\caption{Excerpt of raw log file data from \prognam{nn} and \prognam{xrn}.}
\label{ex-logfile}
\rule{5in}{1pt}
\end{figure}

	\prognam{xrn} begins the record for each session with a line
identifying the record as coming from \prognam{xrn} and the time at which
the record was entered into the log.  The record contains one line
beginning with {\tt GRP:} for each newsgroup the user was shown.  For each
newsgroup, \prognam{xrn} recorded: 1) How many articles were available to be
read in the group({\tt AVAIL:}).  2) How many articles were actually read
({\tt READ:}). 3) How much time, in seconds, was spent both reading and
selecting articles ({\tt TIME:}). 4) Whether one of the ``catch up''
buttons was used to ignore a set of articles with one action.
The use of a catch up button is shown by an asterisk.

	\prognam{nn} begins the record for each session with a line
identifying the record as coming from \prognam{nn} and the time at which
the record was entered.  The record contains one line beginning with {\tt
GRP:} for every newsgroup the user was shown.  For each newsgroup,
\prognam{nn} recorded: 1) How many articles were available to be read in
the group.  2) How many articles were actually read.  3) How much time was
spent reading the selected articles (in seconds).  4) How much time the
user spent scanning the subjects of available articles choosing which
articles to read (in seconds).


\subsection{Trends in the data}

	Once analyzed, the collected data show the following trends which
we believe are relevant to users' ability to find interesting articles
in the Net News system:

\begin{enumerate}

%- number of groups read = 12
\item Users read an average of 12 newsgroups each session.
	
%- back off in scanning based on num articles available
\item A plot of the length of time a user spends searching for articles to
read in a newsgroup versus the number of articles available to be read in
the newsgroup shows a telling breakpoint at around 200 articles (see
figure~\ref{scn-v-avail-gr}).  The graph also shows an elbow at around 50
articles.  We infer from this graph that users reading a group with fewer
than 50 some articles will take the time to actually read the subject lines
of the available articles.  In groups where there are 50 to 200 articles
available to be read, users scan or in some way process the subject lines
more rapidly.  In groups with over 200 articles available, users seem to
skip all the articles.

\OCfigure{/tilde/maltz/w/logs/scan-per-avail-real-gr}{Plot showing the
length of time users spent choosing articles to read versus 
the number of articles presented as available.}{3in}{scn-v-avail-gr}

%\begin{figure}[htb]
%\rule{5in}{1pt}
%\vspace{2in}
%\par
%\caption{Plot showing the length of time users spent choosing articles to read versus the
%number of articles presented as available.}
%\label{scn-v-avail-gr}
%\rule{5in}{1pt}
%\end{figure}

\item Far more people read Net News than post.  Based on
data gathered from 632 sites by the Network Measurement Project at the DEC
Network Systems Laboratory, it is clear that for all groups, there are more
``lurkers'' than ``posters.''\cite{reid:aug93sum, reid:aug93full} This is
true regardless of the number of people who read the group or the number
who post articles to it.
Table~\ref{grp-post-tab} shows some sample data for groups with the largest
readership, the smallest readership, the greatest traffic, and several
others.

\begin{table}
\rule{5in}{1pt}
\caption{{Posting volume and estimated worldwide readership for several news
groups. (Data from Reid, USENET Readership report for Aug 93)}}
\begin{center}
\begin{tabular}{|l|r|r|c|} \hline
newsgroup & est. readers & \# posts/month & comment\\ \hline \hline
misc.jobs.offered & 200,000 & 2,016 & most readers \\ \hline
rec.humor & 150,000 & 2,087 & very popular \\ \hline
comp.unix.questions & 130,000 & 1,062 & \\ \hline
talk.abortion & 50,000 & 1,590 & controversial \\ \hline
rec.arts.tv.soaps & 45,000 & 5,125 & most traffic \\ \hline
bit.listserv.hellas & 12,000 & 802 & \\ \hline
aus.sport & 2,500 & 238 & fewest readers \\ \hline
\end{tabular}
\end{center}
\label{grp-post-tab}
\rule{5in}{1pt}
\end{table}

\item The time a user spends on each article, even after he or she has
taken an action to display the full text of the article, is very short.  The
data taken so far show that users spend an average of 40 seconds reading an
article.  Data broken down by article length and group will be available after
the results of a second study are in.

\end{enumerate}


\subsection{Why people will vote}
	A common concern of those hearing about this project for the first time
is, ``Why would any user take the trouble to vote?  Especially since you just
showed how little time people are willing to spend on Net News.  Won't you end
up with no votes?''  To test this hypothesis we constructed a simple prototype
collaborative system at PARC by modifying the \prognam{nn} news reader so that
its users could cast votes.  Estimates from the log data described above show
that there are about 40 users of \prognam{nn} a day.  During a span of 54 days
the prototype system recorded 222 votes from 25 users demonstrating that over
half the population of \prognam{nn} users was willing to cast votes.  It is
possible that we are observing a tendency among PARC researchers to be willing
to try some new feature for the novelty effect, or a desire to help another
researcher's project.  However, the data from PARC shows that it is plausible
that the overall user community will be willing to take the effort to vote on
articles.


%----------------

\section{Design Criteria}

	If any collaborative filtering system for Usenet Net News
is ever to be successful, it will need to meet the requirements of two
major communities.  First is the community of users, who will have to use
the system and derive benefits from it.  Second is the community of net
news system administrators who will have to build, install and maintain the
system.  The experiments described above give us a feel for what the users'
requirements will be.  To determine the requirements of system
administrators, we have turned to the informal opinions of administrators
expressed on the news.admin.misc, news.admin.technical, and news.future
newsgroups on Usenet itself.

	  Based on the behavioral data gathered from users and informal
knowledge derived from reading the posts of various current system
administrators, we have chosen the following as general requirements:

\subsection{Architecture criteria}
\begin{enumerate}

\item Avoid centralized servers or clearinghouses.  Usenet is an anarchy
with very little central control.  There is no one site or entity with
responsibility for all of Usenet, and no one site or entity is able to control,
censor, or shut-down Usenet.  This decentralization is useful from a legal
standpoint as there is no organizational entity to sue, and important to
administrators who need and want control over their sites.
%In the end, it is the individual sites who own the machines and disk space.
 
\item Traffic and disk space requirements should not be unreasonable.  We
do not have a good estimate of what would be unreasonable.  We believe that
an extra ten percent overhead would be reasonable.  %\footnote{for some data
%on ftp/nntp/mail volume, see ~/Mail/p/65 or now in ~/th/nsf-9308.*}
 
\item Minimize the amount of software re-writing that needs to be done.
Unless collaborative filtering proves itself as a extremely useful, we
assume we will be the only ones writing software to support it.

\item Provide a simple, yet sufficient, application programming interface
(API) to the collaborative software.  To encourage the writers of news reading
programs to include our collaborative features, we must make it as easy as
possible for them.  We will provide the framework for the collaborative
extensions in self-contained modules.
 
\item It is probably easier to convince administrators to change their news
readers than their NNTP news server software.  While we think many designs for
a collaborative system could be simplified if they had some support from the
news servers, informal opinions show that most news administrators spent large
quantities of time and chewing gum sticking together and customizing their news
servers for their particular sites.  News readers, on the other hand, are
reasonably easy to build or recompile.  Unlike the news servers, news readers
also contain relatively little state so that if a new version fails it is easy
to go back to the old one.

\end{enumerate}

\subsection{User interface criteria}
\begin{enumerate}

\item Help the majority of Usenet users.  There is a huge number of silent 
readers --- all groups have far more lurkers than active posters.
These lurkers are the people we believe collaborative filtering can help the
most, and also the people who least want to wade through all the traffic on
Usenet.  These are also the people least willing to announce their presence
(ie: they do not post), and most likely the ones who will least tolerate
any extra overhead in their news reading.

\item Do not interrupt the existing flow.  On average, users spend so
little time reading most articles that any operation we expect a majority of
users to perform must be exceedingly quick and consistent with the flow of the
interface they already use.  Voting for articles must be seamlessly integrated
into the way the user processes Net News. %mentioned voting yet?

\item The filter must behave in a simple way that is easy to understand.
If users can not easily understand how the system is trying to help them,
the system will only be getting in the way of the user's tasks.

\item No articles may be suppressed.  If a user desires to see all the
articles posted to a group, the system must provide them.  Further, the
system may not alter in any way an existing article --- to do so violates
the rights of the article's author.

\end{enumerate}

\section{Architecture}

	
\OCfigure{arch-portrait}{Block diagram for the proposed architecture.}{8in}{sys-arch-ps}

	The proposed architecture for the system is shown in
figure~\ref{sys-arch-ps}.  The fundamental object on which the system works is
a vote.  A vote represents a single user's evaluation of a single article.  The
architecture is broken up into two main pieces.  The first piece is the
interface module.  This interface stands between the news reader programs and
the collaborative system.  The interface sends votes off into the collaborative
system, and answers requests from the news reader for information about the
popularity of an article.  The interface obtains information about the
popularity of an article by consulting various vote sources which hold
collections of previous users' votes.  The second piece of the architecture is
the vote server.  The vote server acts as a vote source and is responsible for
maintaining a vote database which contains the collected votes of all Usenet.
The vote server incorporates votes from its own site and other sites into its
database, and exchanges the votes cast by local users with the vote servers at
other sites.  We will now describe each of these parts in more detail.

\subsection{Votes}
	Users evaluate the articles they read to create objects we call
votes.
%users create a numerical
%evaluation for an article which is called a vote.  
A vote contains the name of the newsgroup in which the evaluated article resides,
the unique message-id assigned to the article, and the user's evaluation of the
article.  Evaluations are one of {\sl terrible, ok, good,} or {\sl great}.
Although a single article can appear in more than one newsgroup, the vote
only contains the name of the newsgroup in which the user read the article.  This
decision should make the accumulated votes more useful for filtering as
newsgroups are organized around topics.  An article cross posted to several
newsgroups may be extremely relevant and important in one newsgroup, but
totally irrelevant in the others.  This decision will also discourage users
from posting their articles to more than one newsgroup in hopes of garnering
more votes for the article.

\subsection{Vote sources}
	A collection of votes is termed a vote source.  We plan to support
two types of vote sources in order to support the two major classes of
queries we expect users to make of the collaborative system.  The first
type of vote source will be a specially formatted vote file.  The second
type of source will be a vote server program.  

	Vote files are the more secure type of vote source.  Much work has
already been done to create filesystems that allow tight control over who can
read or write a file.  Votes written into a file can be guaranteed to both
originate only from one of the people with write permission on the file, and to
be read only by those with read permission on the file.  The intention is that
if an individual creates a vote file, other users can consult that vote source
to determine how the individual evaluated an article.

	The presence of these individualized vote sources makes it possible
for users to create filters of the form, ``Show me the articles that
Jane Doe liked.''  The system can answer the query by consulting Jane Doe's
vote file provided that Jane Doe has provided the user making the
query with access to her vote file.  Jane Doe can control which people have
access to her opinions in the same way she currently controls who has
access to her other files.  % Say something here about how this enables 
%multiple moderators for a news group

	Vote servers as a source provide access to the evaluations of anonymous
users throughout the Usenet.  When an evaluation of an article is requested
from a vote server, the returned response represents a summary of all the votes
posted for that article by any user anywhere in Usenet. The information
contained in the vote server supports the exploratory searching described
earlier as presumably the best articles in a newsgroup will have received the
most votes.  New users exploring a newsgroup can quickly locate the articles
previous users of the newsgroup thought were most relevant.  Based on these
articles, the users can decide whether or not to continue searching the newsgroup
for information.  
%As discussed below in the section on security, the evaluations
%retrieved from the vote server may suffer from ballot box stuffing.

	In order for the collaborative filtering system to work with multiple
vote sources, there must be a naming convention to describe and identify each
vote source.  Uniform Resource Locators (URLs) \cite{URL} provide such a
mechanism so vote sources are referred to by their URL.  For example, a vote
file with a local name of /usr/maltz/votes available by anonymous ftp from
intrepid.xerox.com would have the URL {\sl
ftp://intrepid.xerox.com/usr/maltz/votes}.  A vote server that listened for TCP
connections at port 3636 of host news.xerox.com would be known as {\sl
telnet://news.xerox.com:3636}.

	While sites are assumed to have a local vote server whose URL is well
known at that site, some method must exist for informing users of the URL for a
vote file.  Any of several approaches could be used for informing other users
where a vote file lives.
%The location of a person's vote file is just another piece of
%localizing information similar to a telephone number.  
The simplest mechanism for finding a person's vote file would be to ask them
personally.  This mechanism is identical to the way phone numbers are currently
exchanged.  A more complicated mechanism would be for users to include the URL
for their vote file as an extra header line in their posts to Net News --- many
people already distribute their phone numbers and addresses in this fashion.
News reader programs could take advantage of the extra header line by
extracting the URL from articles the user liked.  These URLs might then be
added to the list of sources the user uses to filter Net News.
	
\subsection{Interface module}

	The interface module provides simple methods for a news reader program
to send votes into the collaborative system, or to request information from the
collaborative system.

	Exactly how a user indicates his or her evaluation of an article is
a user interface question left up the designer of the news reader program.
The news reader designer can choose whatever mechanism fits
in best with the existing user interface.  %could mention auto-voting based
					   %on observations here
When the news reader program invokes
the interface module to enter a vote into the collaborative system, the
news reader specifies where the vote should be sent.  The vote can either
be sent on to a vote server where the vote will be combined with other
votes and distributed to the entire network, or the vote can be placed in a
vote file.

	When answering requests for information the interface module can
combine the votes from several sources, each source being a vote server or
a file of votes previously written by another interface module.  Combining
different vote sources in the interface module makes possible queries of
the form: ``Show me all the articles that Joe User or Jane User liked.''
The collaborative information returned by the interface module will consist
of lists of articles and the articles' accumulated evaluations.  The news
reader program can use this information as it sees fit to help the user
filter his or her news.

%	The interface module must be complicated enough to read and write
%files because this ability is what enables users of the collaborative
%system to choose their own moderators for a newsgroup.  Consider the choice
%of a user who has just cast a vote.  If she wishes her vote to be
%relatively anonymous, she sends it to the vote server where it will be
%combined with all the other votes from her site so that she can not be
%individually identified.  If she wishes to act as moderator for a
%newsgroup, she has her votes posted to a file.  Other users who want to see
%the newsgroup as if she were moderating it instruct their interface modules
%to gather collaborative information from the file she wrote out.  


\subsection{Application Program Interface}
	The application program interface (API) will provide a structured way
for the news reader program to communicate with the interface module.  The API
has two main parts.  The first section deals with how the news reader casts
votes for articles, and the second section deals with how the news reader gains
access to the collected information for an article.  A specification for the
procedures in the API can be found in table~\ref{api}.


\begin{table}
\rule{5in}{1pt}
\caption{Specification of the API to the interface module.}
\begin{small}
\begin{verbatim}
/******* Routines for casting votes *********/
/* return status: 0 = failed, 1 = successful */
int collab_enter_vote_anonymously(/* char *message_id, vote, char *group_name */);
int collab_enter_vote(/* char *message_id, vote, char *group_name */);
int collab_send_off_votes(/* char *file_name*/); 

/******* Routines for querying sources for votes and information *******/
char* collab_available_groups(/* int *num_groups */);
/* return a new string of the group names separated by \n characters;
   the number of groups will be left in the loc pointed to by num_groups */

vote_t* collab_get_votes_for_group(/* char *group_name */);
/* return a linked list of the votes for articles in group_name,
    return NULL if there are no votes available */

/**** change which sources we collect data from ****/
/* return status: 0 = failed, 1 = successful */
int collab_add_source(/* char *source_url */);
int collab_remove_source(/* char *source_url */);

/* free memory for the list returned from collab_get_votes_for_group */
void collab_destroy_votes(/* vote_t *votes */);

/* notify someone of problems in the collaborative system */
int issue_error(/* varargs : fmt, args */);
\end{verbatim}
\end{small}
\label{api}
\rule{5in}{1pt}
\end{table}

\begin{table}
\rule{5in}{1pt}
\caption{Structure of vote\_t record.}
\begin{small}
\begin{verbatim}
/****** Structure of a vote ******/
struct _vote_data_t {
  int 			num_high;
  int			num_med;
  int			num_low;
  int			num_terrible;
  char 			*stuff;
}; typedef struct _vote_data_t vote_data_t;

struct _vote_t {
  struct _vote_t	*next_vote;
  vote_data_t		*vote_data;
  char			*message_id;
  void			*internal_magic;
}; typedef struct _vote_t vote_t;
\end{verbatim}
\end{small}
\label{vote-t}
\rule{5in}{1pt}
\end{table}

	Casting a vote requires only that the news reader provide the interface
with information required to create the vote: the name of the newsgroup in
which the article appears, the message-id contained in the article, and the
user's evaluation of the article.  Before exiting, the news reader must call
the procedure {\sl collab\_send\_off\_votes()} to send the votes to the
collaborative system.  If the news reader provides a file name as an argument
to {\sl collab\_send\_off\_votes()}, any votes not cast anonymously will be
written to that file.

	Retrieving information about the votes available to help filter the
articles in a news group requires first setting up a list of sources (eg:
vote files and/or a vote server) that should be searched for votes.  The
{\sl collab\_add\_source()} and {\sl collab\_remove\_source()} routines are
provided for this purpose.  The strings passed in as source names should be
the URLs of the vote files or vote servers.

	Once a list of sources has been set, a string containing the names of
all the news groups which have votes in them can be obtained by calling {\sl
collab\_available\_groups().} The votes for the articles in a group can be
obtained by calling {\sl collab\_get\_votes\_for\_group().} This will return a
linked list of {\sl vote\_t} structures (see table~\ref{vote-t}) each of which
contains the vote information for an article.  The vote information tells the
number of {\sl terrible, ok, good,} and {\sl great} votes the article received.  The linked
list of {\sl vote\_t} structures should be destroyed when it is no longer
needed by calling {\sl collab\_destroy\_votes().}

\subsection{Vote server}

	The vote server is the part of the system responsible for accumulating,
marshaling and redistributing votes from one site to another.  Votes sent to
the server by an interface module are accumulated into the server's vote
database and marshaled into a packet with other votes to be sent on to
neighboring vote servers.  Since we have chosen to handle only numerical votes
in this thesis, we can perform the accumulation by simply adding the new votes
into running tallies of the number of {\sl terrible, ok, good,} and {\sl great}
votes an article has received.  If votes contained more diverse information
(perhaps character string keywords) the accumulation process could be extended
to include a process such as string concatenation.

	The vote servers exchange information in a fashion similar to the news
servers.  At regular intervals each vote server contacts its neighbors and
offers to exchange packets of votes with them.  Before contacting other vote
servers, a server combines all the votes received from local users since the
last information exchange into a single packet.  The vote server then offers to
send each neighbor the packet of new votes and any packets the server has
recently received from other neighbors.

	When a server receives a packet for the first time, the server combines
the votes from the packet into the server's vote database.  The server records
the packet as having been accepted so the server will not accept the same
packet twice.  The server continues to store the packet however, so that the
packet can be offered to the server's other neighbors.  Once the server has
offered the packet to all its neighbors, the packet will be deleted.
%If a reduction in network traffic is required,
%some servers could be designated to combine all the packets they receive to
%together and repost only this summary.

\subsection{Transport of information}
	
	So that the collaborative system will work at as many sites as
possible, we will attempt to keep the transport mechanisms as general as
possible.  For example, communication between the collaborative interface and
the vote server can take place via a remote file system abstraction; TCP/IP
connections; or email.  Any site receiving Net News will support at least one
of these communication mechanisms.  Likewise, information sent between vote
servers can be in the form of email or TCP/IP.

	To retrieve vote files created by users at other sites, the
interface module will use the URL provided as the name of the source to
create a local copy of the vote file.  For this thesis we will support the
afs and ftp protocols for accessing vote files.  See \cite{URL} for more
information about these protocols.


% The interface module then can
%combine the votes from the now local file with the votes from the local
%vote server or other vote files in order to compile collaborative
%information for a news reader program.  

\subsection{Security and privacy offered by the system}

	The primary security issue the collaborative system must protect
against is ballot-box stuffing.  If a large enough group of people start
using the system to filter their incoming news, stuffing the ballot-box for
or against articles will become a way of harassing and censoring groups of
users on the net.  To prevent the opinions of any individual from having
undue effect on the outcome of the collaborative filtering process, the
collaborative system should ensure that each individual can only cast one
vote for each article.  Unfortunately, there is an inherent trade off
between the desire to allow users to vote anonymously, and the desire to
prevent ballot-box stuffing.  If every vote is transported through the
system authenticated with a digital signature\cite{RSA} from the user
casting the vote, it is possible to prevent ballot-box stuffing completely
as the final consumer of the votes can ensure that each user votes only
once.  The computational and organizational costs of such an authentication
system are very high however, and gone is the ability for a user to vote
anonymously.  If the vote distribution system is completely anonymous, then
there is no way of stopping a person from voting as many times as they want
since there is no way to tell who has voted and who has not.
%\footnote{With
%the possible exception of an economic incentive for honesty in the form of a
%per vote poll tax.  If you're willing to pay enough, you can swing the vote as
%much as you want.}

	We propose a system which is not perfect, but should provide protection
against the most common means of attack.  If users have their votes posted
to vote files, then the existing systems for file protection can insure that
only authorized users read or write votes to the vote file.  We assume that
users will not deliberately alter their vote files to stuff the ballot box.  If
some users do alter their vote files, we rely on other users noticing this
tampering and exposing the tamperer.  Since vote files are associated with
individual people, the number of people who use a vote file to filter their Net
News will be a strong function the reputation of the vote file's owner.  Once a
person has been exposed as keeping a dishonest vote file, other users will
likely exclude the vote file from the list of sources they use for
collaborative information.  If no one uses the altered vote file to filter Net
News, it can not do any damage.

	Preventing a user from sending multiple votes for a single article to a
vote server is harder problem to solve.  Since the votes distributed by the
vote servers are anonymous, there is no way to detect if single user has voted
multiple times once the votes are entered into the stream of distributed votes.
However, some security {\em can} be provided if the communication between an
interface module and a vote server is not anonymous.  An honest vote server can
enforce a one vote per article per user rule by requiring users to authenticate
themselves to the vote server before transmitting any votes.  The local vote
server can record which users have cast votes for each article, and prohibit
the same user from casting a second vote for the article.  Under this system
the local vote server will know which articles a user has read.  Since the vote
server is local to the user's site, it can be trusted to obey the local policy
determining the privacy of personal information in electronic form.  Many sites
now have such policies which declare who on the system can legitimately access
system logs containing personal information.

	 Users will authenticate with their vote server by sending their user
identification number (uid) along with their votes to the vote server.  This
will prevent casual attempts at ballot-box stuffing, but would not stop a
person who writes a program which masquerades as an interface module to the
vote server.  This stuffing program could generate an arbitrary sequence of
numbers for uids --- effectively casting votes for all the users on the system.

	Vote servers can be configured to only exchange votes with specific
peers, so there is little risk of an individual inserting fake vote packets
into the stream of vote packets exchanged between honest vote servers.
Unfortunately, there is little that can be done to stop dishonest vote servers
from generating fake vote batches to stuff the ballot-box.  Once a dishonest
vote server is identified (most likely by human operators), the only option to
secure the system is for peers to derecognize the dishonest vote server and
cease exchanging votes with it.  This approach is currently used on Net News
itself to control users who forge messages.


\subsection{Type of information exchanged by vote servers}
	Will numerical rankings from users provide enough information to help
other users filter articles?  Is there other information we could gather
inoffensively which would be cheap to transport and useful for filtering?  This
is an important area of inquiry for tuning the collaborative filtering system
and making it work well.  However, this project will not directly address these
issues.  The vote transport system we create will be flexible enough to
encompass extra information as an add on.  Once some system is operational, it
will be possible to experiment with exchanging different types of information.

\subsection{Comparison of architecture and design criteria}

	The proposed architecture meets the requirements set out in the design
criteria section.  The vote servers keep the vote information in a separate
stream from the news articles so that system administrators do not have to
modify their current news servers.  By placing a vote server local to every
participating site and providing a means for them to exchange information
between themselves, there is no one central vote server authority.  Since votes
are stored separately from the articles, there is no change made to any
article.  Any user who wants to see all the articles posted to a newsgroup in
their original form can do so by simply ignoring the votes, so there is no
issue of censorship.  The API provides an interface between the collaborative
system and the news reader program, so modifying existing news readers to use
collaborative filtering should be easy.  Integrating collaborative filtering
into existing news reader programs will both prevent users from having to learn
a new user interface, and allow news readers to use the collaborative filtering
data in whatever way the program's user interface supports best.

\section{Evaluation Criteria}

	We propose to evaluate the success of this project along two
dimensions.  The first dimension will be the level of acceptance the system
achieves in the Usenet community.  The second dimension will be based on the
improvement the project provides to users.

	To evaluate to acceptance of the system we have several obvious
quantitative measures.  How many sites are using our system?  How many
votes are being cast?  Does the number of votes increase or decrease over
time?

	Evaluating the ability of the system to provide a useful service will
depend on both quantitative and qualitative measures.  We plan to take informal
surveys of users to find a rough estimate of user happiness with the filtering
system.  To quantitatively evaluate any improvement made by the system, we will
look for changes in the user behaviors measured before implementation of the
system.  Specifically, we hope to see an increase in the number of newsgroups a
user reads (on the assumption that the user now has an easier time finding
relevant articles in a newsgroup and so can broaden his or her search to
include more newsgroups).  We also hope to see an increase in the number of
groups in which the user reads some articles (this will demonstrate users are
not being overwhelmed by the number of articles available in the group).

\section{Timeline for Thesis Work}

	At present we have a small, working system consisting of a vote server
and two news reader programs which have been modified to send and request
votes.  This current system is deficient from the proposed one in four major
ways.  First, the interface between the collaborative system and the news
readers is {\it ad hoc} and poorly formed.  Second, the system does not permit
votes to be stored in or read from files.  Third, the system does not scale up
well.  Fourth, the system provides little of the privacy or security discussed
in the architecture proposal.  We plan to work on these problems in roughly the
order listed while integrating into the design new insights we gain from the
operation of our existing system.

	Since the available time may not permit the implementation of the
entire system, we will strive hardest to develop the vote server to the point
that it can be released to the Usenet network.  Once this server is available
for release, it will be possible to make some evaluation of the system no
matter how {\it ad hoc} the other components of the system are.

\begin{description}

\item[By Late October]  The API for the news reader programs will be
frozen.  The interface module will be able to save votes to files and retrieve
and combine votes from several sources.  We will have extended the system to
include the MIT Athena environment.

\item[By Late November]  A vote server which can exchange articles in
the fashion described by the architecture design will be ready for release.  We
will begin exporting the system to any site that will accept it.

\item[Through December]  Perform damage control and create bug fixes for
problems in the system unearthed by the release.  Re-release of software if
required.  Begin writing the text of the thesis.

\item[Through January]  Continue thesis write up.  Evaluate system
performance according to evaluation criteria.

\item[By May]  Continue to maintain released system, handing it off to
others if possible.  Complete thesis writing to the satisfaction of David
Goldberg and Karen Sollins.

\end{description}

\section{Conclusion}
	The system presented in this thesis proposal will become the first
attempt at providing collaborative filtering to users across the Usenet.
We believe our architecture will be accepted by the Net News community as
its design is based on the observed behaviors of Net News users and system
administrators.  Once operational, the system will provide an effective
way for users to locate information of greatest interest to them.
	
\begin{thebibliography}{00}

\bibitem{rfc1436:gopher} Anklesaria, F.; McCahill, M. et al.; ``The
Internet Gopher Protocol,'' RFC 1436, University of Minnesota, March 1993.

\bibitem{URL} Berners-Lee, Tim; ``Uniform Resource Locators,'' working
draft, CERN, July 1993. available via ftp from info.cern.ch as
/pub/www/url6.[txt,ps]

\bibitem{jol:survey} Jolicoeur, Lucie, Newsreaders survey RESULTS
(long!),\linebreak[0] Usenet news.software.readers, 10 Aug 93, Message-ID:
\id{1993Aug10.205558.18192@IRO.UMontreal.CA}; [185 lines].

\bibitem{rfc977:NNTP} Kantor, Brian; Lapsley, Phil; ``Network News Transfer
Protocol,'' RFC 977, U.C. San Diego, February 1986.

\bibitem{tapestry} Goldberg, David et al.; ``Using Collaborative Filtering to
Weave an Information Tapestry,'' {\em Communications of the ACM,} December
1992, Vol 35, No 12, pp. 61-70.

\bibitem{loeb:lyrictime} Loeb,
Shoshana;  ``Architecting Personalized Delivery of Multimedia Information,''
{\em Communications of the ACM,} December 1992, Vol 35, No 12, pp. 39-48.

\bibitem{uucp} O'Reilly, Tom; Todino, Grace; {\em Managing uucp and
Usenet,} O'Reilly \& Associates, Inc., 1992.

\bibitem{reid:aug93full} Reid, Brian; USENET Readership report for Aug 93,
Usenet news.lists, 9 Sep 1993, Message-ID:
\id{26m3ae\$n5k@usenet.pa.dec.com}; [2342 lines].

\bibitem{reid:aug93sum} Reid, Brian; USENET READERSHIP SUMMARY REPORT FOR AUG 93,
Usenet news.lists, 9 Sep 1993, Message-ID:
\id{26m39n\$n4l@usenet.pa.dec.com}; [394 lines].

\bibitem{reid:may93} Reid, Brian; USENET READERSHIP
SUMMARY REPORT FOR MAY 93, Usenet news.lists, 2 Jun 1993, Message-ID:
\id{1uibjb\$8de@usenet.pa.dec.com}; [402 lines].

% really, this should probably be a cite to the Diffie Hellman paper,
% since they described dig sigs and any implementation would work...
\bibitem{RSA} Rivest, R.; Shamir, A.; Adleman, L.; ``A Method for Obtaining
Digital Signatures and Public Key Cryptosystems,'' {\em Communications of
the ACM,} Feburary 1978, Vol 21, No 2, pp. 120-126.

\bibitem{netnews:sum} Weinstein, Sydney; ``Special Issue: Network News,''
{\em The C Users Journal,} February 1991, pp. 109-116.

\end{thebibliography}

\end{document}