Personalization and Users' Semantic Expectations
Raymond Suke Flournoy, Ryan Ginstrom, Kenichi Imai, Stefan Kaufmann, Genichiro Kikui, Stanley Peters,
Hinrich Schütze, Yasuhiro Takayama
Introduction
A fundamental expectation users have when they submit a query to a
retrieval system is that the words they submit have the meaning they
are familiar with. We refer to such expectations as semantic
expectations about query terms.
Unfortunately, semantic expectations
are often thwarted. The reason can be ambiguity. The user may only be
thinking of one of the senses of a term even if she knows other senses
and would readily recognize them when attention is drawn to the
ambiguity of the search term. Another reason for a thwarted semantic
expectation can be that there is a mismatch between the context of the
user and the context of the document collection that is searched. For
example, a New Zealander who launches a World Wide Web query "National
Party" may be unaware that Australia and Scotland also have National
Parties, so that many hits will be relevant to Australia and Scotland,
not to what the New Zealander (presumably) is looking for.
In this paper, we propose to better meet users' semantic expectations
by means of personalization. Instead of the customary
decontextualized representation of search terms as strings of letters,
we use word space vectors (or word vectors for short),
personalized representations of each search term which are derived
from the user's email. The word space vector of word w is
an aggregate representation of the set of words that w is
strongly associated with.
Using word vectors for retrieval is similar to two other strategies
for managing expectations: searcher profiling and query expansion. The
set of word vectors derived from a user's email is in a sense a very
detailed and complex profile of what this user is interested in.
Word vector retrieval has a flavor of query expansion in that we
use not only the word itself for searching but also the words that
are closely associated with it.
The following sections describe our retrieval model; subjects,
materials, and procedure of our initial experiment; and the analysis
we have done so far. We conclude with some general observations.
Retrieval Model
Term vectors are derived from a term-by-term matrix which records term
co-occurrence in the corpus of interest. Entry c_ij of the matrix
records the number of times that terms i and j co-occur in the corpus
within a window of 51 words.
This matrix is reduced to a low-dimensional space using SVD. The
vector of term t is the column corresponding to t in the reduced
matrix. We include both words and pairs (frequently occurring bigrams)
as terms.
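As a rough illustration (not the code used in our experiments), the
following Python sketch builds such a term-by-term co-occurrence matrix
and reduces it with SVD. The symmetric 51-word window (25 words on
either side of the target), the pre-merging of frequent bigrams into
single tokens, the number of dimensions, and the scaling of the reduced
vectors by the singular values are all assumptions made for this sketch.

  import numpy as np

  def term_vectors(tokens, vocab, half_window=25, dims=100):
      # tokens: the corpus as a list of terms, with frequent bigrams
      # already merged into single tokens; vocab: terms to build vectors for.
      index = {t: i for i, t in enumerate(vocab)}
      C = np.zeros((len(vocab), len(vocab)))
      for pos, tok in enumerate(tokens):
          if tok not in index:
              continue
          lo = max(0, pos - half_window)
          hi = min(len(tokens), pos + half_window + 1)
          for other in tokens[lo:pos] + tokens[pos + 1:hi]:
              if other in index:
                  C[index[tok], index[other]] += 1
      # SVD reduction; C is symmetric, so rows and columns coincide and
      # the vector of term t is its row in the reduced matrix.
      U, S, _ = np.linalg.svd(C)
      return {t: U[index[t], :dims] * S[:dims] for t in vocab}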
Queries and documents are represented as context vectors, that is,
as (normalized) vector sums (or centroids) of the terms that occur in
them. For example, the context vector of a query consisting of three
terms is computed by looking up the three corresponding term vectors,
computing their vector sum, and normalizing to unit length. Retrieval
proceeds by ranking document context vectors according to their
correlation (or cosine) with the query context vector. We eliminate
duplicates and near-duplicates from the retrieval results.
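Continuing the sketch (with hypothetical helper names; term_vecs is the
mapping from terms to the reduced vectors above), query and document
context vectors and the cosine ranking could look as follows; duplicate
elimination is omitted.

  import numpy as np

  def context_vector(terms, term_vecs):
      # Normalized centroid of the vectors of the terms that occur.
      vecs = [term_vecs[t] for t in terms if t in term_vecs]
      if not vecs:
          return None
      v = np.sum(vecs, axis=0)
      return v / np.linalg.norm(v)

  def rank_documents(query_terms, docs, term_vecs):
      # docs maps document ids to term lists; returns (cosine, id) pairs
      # sorted best-first. For unit vectors, the dot product is the cosine.
      q = context_vector(query_terms, term_vecs)
      scored = []
      for doc_id, doc_terms in docs.items():
          d = context_vector(doc_terms, term_vecs)
          if q is not None and d is not None:
              scored.append((float(np.dot(q, d)), doc_id))
      return sorted(scored, reverse=True)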
Our retrieval model is similar to LSI, except that we use a
word-by-word instead of a word-by-document matrix.
Subjects
Subjects were recruited at Stanford University by posting flyers on
bulletin boards, posting messages on electronic bulletin boards and by
telling colleagues and friends about the experiment. Prospective
subjects were given a description of the experiment and promised $25
for participation. Twenty people signed up as participants. So far we
have been able to complete the experiment with 11. All subjects have
some affiliation with Stanford, either as undergraduate or graduate
students or as staff.
Materials
A condition for participation was that subjects had several megabytes
of email and were willing to make this email available to the
experimenters. Participants were permitted to withhold any subset of
their email that they deemed too sensitive to share.
Three corpora were used in the sessions with a particular participant.
The first corpus was the participant's personal email. The second
corpus was the email of one of the authors (S. Peters). This corpus
was intended as a control to test true personalization (participant
and email are matched) vs. general effects of personal email. The third
corpus consisted of two years of the New York Times newswire (1995
and 1996).
Word vectors were derived for each corpus, so that each term had three
different representations, corresponding to the three different corpora.
Each participant was asked to compose ten queries. We encouraged
subjects to compose long queries, to make use of the ability of our
system to handle phrases, and to focus on specific content words.
In addition, participants were asked to fill out a questionnaire
recording data such as age, sex, and email usage.
Experiment
For each subject, the subject's 10 queries were run on the New York
Times corpus in three different systems, corresponding to three levels
of the factor PERSONALIZATION: one using the subject's personal
context vectors (PERSONAL, that is, personal and subject-matched), one
using S. Peters' personal context vectors (PETERS: personal, but not
subject-matched), and one using the general context vectors derived
from the New York Times newswire (NYT). (The same NYT corpus was used
for the derivation of general context vectors and as the corpus that
was searched for relevant articles by the three systems.)
For each of the subject's
queries, the ten top ranking articles from each of the three systems
were presented to the subject. The order of systems was random. For
each system, articles were ordered by retrieval rank (highest scoring
articles first). The subject was asked to rate each article on a
5-point scale.
Scores of the first five articles (those ranked highest
by the system) were then multiplied by 2. The resulting scores were
added, resulting in one score for each personalization-subject-query
triple. Giving higher weights to the first 5 articles rewards correct
ranking by the system while not giving as much weight to the ranking
as the measure of average precision would. Subjects needed two to
three hours to rate all 300 articles retrieved.
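For concreteness, the score of one personalization-subject-query triple
can be written as the following small function (a sketch in Python;
ratings is the list of the subject's ten 5-point ratings in retrieval
order):

  def query_score(ratings):
      # Double the ratings of the five highest-ranked articles, then sum.
      return 2 * sum(ratings[:5]) + sum(ratings[5:])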
Results
We analyzed the 330 scores using ANOVA as a 3-by-11 design with 10
replications (corresponding to the 10 queries for each subject) for
each combination of PERSONALIZATION (PERSONAL, PETERS, NYT) and
subject. The following labels were used for the 11 subjects: aa, ab,
ac, ad, ae, af, ag, ah, ai, aj, ak.
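As a minimal sketch of this analysis (not the original code), the
two-way ANOVA could be run with statsmodels, assuming the 330 scores
are in a table with columns score, personalization, and subject; the
file name below is hypothetical.

  import pandas as pd
  from statsmodels.formula.api import ols
  from statsmodels.stats.anova import anova_lm

  # Hypothetical file holding the 330 personalization-subject-query scores.
  df = pd.read_csv("scores.csv")
  model = ols("score ~ C(personalization) + C(subject)", data=df).fit()
  print(anova_lm(model))  # F statistics for the two main effects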
The effects of both PERSONALIZATION and subject were highly
significant (PERSONALIZATION: F=34.0, significant for alpha=0.001;
subject: F=16.3, significant for alpha=0.001). However, the
significant differences did not have the sign we expected. Here are
the means of the three levels of PERSONALIZATION.
mean of scores | level of factor PERSONALIZATION
          22.6 | PERSONAL
          12.8 | PETERS
          28.1 | NYT
So the experiment clearly demonstrates an effect of personalization.
In particular, personalization with another person's associations
leads to inferior retrieval results. However, contrary to the
experimenters' expectations, "matched" personalization significantly
decreased the quality of the returned results compared to general
associations, that is, compared to no personalization at all.
Analysis and Discussion
As a first step in trying to understand the unexpected inferior
performance of personalized retrieval, we looked at the 10 queries with
the highest positive difference PERSONAL-NYT and the 10 queries with
the highest negative difference PERSONAL-NYT. These are the 20 queries
for which personalization had the largest effect (either positive or
negative).
The following table lists the 20 queries together with the following
information: difference in scores PERSONAL-NYT (column 1), score of
PERSONAL (column 2), score of NYT (column 3), score of PETERS (column
4), subject (column 5), and query number (column 6). Phrases in
queries are shown as words joined by plus signs.
   1 |  2 |  3 |  4 |  5 | 6 | query
  47 | 59 | 12 | 44 | ak | 2 | superstring+theory
  43 | 45 |  2 |  0 | ak | 9 | monopole polikarpov qcd confinement qed
  25 | 49 | 24 | 10 | ae | 0 | academic literature criticism gossip phd scandal academy
  23 | 47 | 24 | 13 | ae | 8 | library funding politics ala collection books librarian
  20 | 39 | 19 |  9 | ak | 1 | interwest+partners jack+mcdonald
  19 | 51 | 32 | 16 | ae | 2 | university job+market tenure literature phd hire professor
  19 | 43 | 24 | 28 | ab | 0 | new+zealand election national+party
  17 | 51 | 34 | 38 | ag | 0 | nobel prize physics literature
  17 | 32 | 15 |  1 | ad | 7 | michael+flatley irish+dancing
  14 | 36 | 22 |  9 | af | 7 | recipes french+food cassoulet
 -46 |  0 | 46 |  0 | ab | 2 | socialized+medicine advantages costs
 -46 | 14 | 60 |  2 | ah | 5 | clayoquot forestry protest logging environment preserve tofino
 -41 |  0 | 41 |  5 | aj | 9 | ergonomics keyboards companies
 -30 |  0 | 30 |  0 | aj | 4 | automobiles reviews
 -29 |  5 | 34 |  1 | af | 3 | wine+bar equipment start+up+costs survival+rate+of+small+businesses
 -28 | 36 | 64 |  0 | ab | 3 | tour+de+france winner 1995
 -28 |  8 | 36 |  6 | aj | 1 | religion christianity judaism gnostic scripture
 -27 |  7 | 34 |  3 | ai | 4 | marriage satisfaction characteristics
 -26 | 19 | 45 | 21 | ac | 1 | military taiwan defense policy
 -26 |  2 | 28 |  0 | ac | 6 | france paris eiffel+tower reconstruction
One trend in the table seems to be that queries for
which personalization did well are more specific than queries for
which personalization did not do well. For example, superstring+theory
and polikarpov are very specific terms, whereas queries like
"automobiles reviews" contain much less specific terms.
This is contrary to what we expected. Recall that we were hoping that
unspecific, and hence ambiguous, terms would benefit most from
personalization by adding the context of the user's email to their
representation.
One possible explanation is that the
quality of word space vectors depends on a variety of factors
including the frequency with which the word occurs. If a word only
occurs once, then there is only one context to generate its vector
from, a context which may be unrepresentative. The specific terms are
all terms that are likely to have a higher relative frequency in the
personal email than in the New York Times corpus. Therefore, their
email representation could be better than their NYT representation.
An extreme case of a poor representation due to lack of data is when a
word does not occur at all in one of the corpora. Indeed, when
analyzing cases where PERSONAL did much worse than NYT, we found
many query words that had no occurrence in the personal email and hence a
zero vector under PERSONAL. Examples are "start-up-costs",
"survival-rate", "socialized medicine", and "ergonomics". A zero
vector results in a poor query representation and as a consequence in
poor retrieval results.
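A simple diagnostic for this failure mode (our own illustration, not
part of the experiment) is to flag query terms whose personal-email
vector is missing or all-zero, since such terms contribute nothing to
the query's context vector:

  import numpy as np

  def zero_vector_terms(query_terms, personal_vecs):
      # Terms with no personal vector (or an all-zero one) are flagged.
      return [t for t in query_terms
              if t not in personal_vecs or not np.any(personal_vecs[t])]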
For some queries, the system arguably performed correctly, but the
results were not what the subject was looking for. For example, in
subject ai's personal email, marriage is mainly mentioned in the
context of religion. As a result, most retrieval results over the New
York Times corpus with personal vectors were on religion. This example
shows that one of our basic assumptions in the experiment will often
not be true: that the current frame of mind of a user is well
represented by their email. Even if subject ai mostly thinks of marriage
as a religious institution, she apparently was looking for general
information on marriage in her query 4. This mismatch between a
particular query and the email associations suggests that the user
should have some way of interacting with the set of associations that is
chosen for a particular query.
Finally, the retrieval results were
confounded by a Heisenberg-type effect. The email corpus of at least
one subject contained the email with the queries that were used in the
experiment. This gave rise to false associations among the words in
all 10 queries of that subject. Since many of the words are rare, one
use in a random context like this can have a significant effect on the
word's associations.
Obviously, this is just a very preliminary analysis. Some other
factors we are planning to look at are the following.
- In the ANOVA, we neglected the interaction between the factors
PERSONALIZATION and subject. However, there is a clear interaction:
for example, PERSONAL did much better for subject ak than for subject
aj. The reason could be differences in subjects' email, such as the
proportion of non-personal email (talk announcements, mailing lists,
etc.). We will try to determine the proportion of truly personal email
and analyze its effect on retrieval quality.
- There is an asymmetry between NYT and PERSONAL in that word vectors
and retrieval corpus are "matched" in the first case (word vectors
derived from newswire, corpus of application is newswire); and not
matched in the second (word vectors derived from personal email,
corpus of application is newswire). Part of this mismatch for PERSONAL
("global" vs personal associations) is intended by experimental
design. However, there may be other types of mismatch, for example, in
the vocabulary used, that we need to understand better.
- In our experiment, we are looking at two techniques for meeting user
expectations at the same time: profiling users and query expansion.
It might have been better to separate the two and investigate each
question separately. For example, we asked users to give us specific
query terms, but expansion is least likely to be helpful with specific
terms. In fact, it can hurt retrieval performance for specific terms.
We will analyze the experimental data with the goal of separating the
effects of profiling and expansion.
- There is a substantial discrepancy in the sizes of the corpora
used for deriving "personal" word vectors and "general" word
vectors. Some of the email corpora only have a size of about 2
percent of the New York Times corpus. This imbalance could well
be one of the causes for the superior performance of NYT word vectors.
Related Work
Haystack is a project at MIT whose goal is
user-specific customization of information, repositories, and
retrieval processes. We share with Haystack the premise that the
information a person stores in their personal work space is likely a
good indicator of their knowledge and interests.
However,
Haystack draws personalization from the trail people follow
in finding documents. Word association is a complementary source of
information. It is also more easily transferable to other collections.
Document associations are harder to transfer than word associations
since the former can only be directly exploited within one collection.
Future Directions
Here are some plans for future work.
- We need to address the problem of words that do not occur at
all in a corpus.
- We should concentrate on types of queries that could benefit
from personalization. A substantial subset of queries is
handled well by word-match methods because they contain a good
content descriptor (for example, a proper noun like "Tour de France").
- We need to make sure to avoid the above-mentioned
Heisenberg-type effects. In general, to address the problem of
the fragility of associations of rare words, we may consider
looking at a fourth type of word vector representation: one
that is derived from a mixture of personal and more general text
(see the sketch after this list).
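The following Python sketch illustrates such a mixture representation
(a hypothetical illustration; the interpolation weight lam is an
arbitrary assumption): a term's personal vector is interpolated with
its general (e.g., NYT) vector, and the general vector is used alone
when the term never occurs in the personal corpus.

  import numpy as np

  def mixed_vector(term, personal_vecs, general_vecs, lam=0.5):
      # Fall back to the general vector if the term is absent (or has a
      # zero vector) in the personal corpus; otherwise interpolate.
      g = general_vecs.get(term)
      p = personal_vecs.get(term)
      if p is None or not np.any(p):
          return g
      if g is None:
          return p / np.linalg.norm(p)
      v = lam * p + (1 - lam) * g
      return v / np.linalg.norm(v)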
Conclusion
The work presented here is an initial and preliminary attempt to make
information about the larger context available for improving document
retrieval. We say "larger context" to set our approach apart from work
on exploiting the immediate context of a single user session.
What we present here looks at two points on the spectrum of
personalization: the user's personal email and a national US
newspaper, the New York Times (very much an impersonal corpus). We
are interested in developing a framework in which the user has control
over the degree of personalization appropriate for a query or other
information task. Depending on the desired degree of personalization,
the corpus that is used for personalizing word representations could be
- a specific email folder or a directory of personal files (e.g.,
those related to a particular project);
- all personal files, including email;
- the files of a workgroup;
- all files residing on the servers of an institution (a university, a
corporation);
- a corpus typical of the discourse of a particular country;
- a global corpus that would give rise to a "universal" word
representation, to the extent this is possible.
We think that this type of flexible personalization, if done right,
could be a quite powerful addition to the set of tools available today
for intelligent information access, an addition that would help IR
systems better meet users' semantic expectations.