Personalization and Users' Semantic Expectations
Raymond Suke Flournoy, Ryan Ginstrom, Kenichi Imai, Stefan Kaufmann, Genichiro Kikui, Stanley Peters,
Hinrich Schütze, Yasuhiro Takayama
Introduction
A fundamental expectation users have when they submit a query to a
retrieval system is that the words they submit have the meaning they
are familiar with. We refer to such expectations as semantic
expectations about query terms.
Unfortunately, semantic expectations
are often thwarted. The reason can be ambiguity. The user may only be
thinking of one of the senses of a term even if she knows other senses
and would readily recognize them when attention is drawn to the
ambiguity of the search term. Another reason for a thwarted semantic
expectation can be that there is a mismatch between the context of the
user and the context of the document collection that is searched. For
example, a New Zealander who launches a World Wide Web query "National
Party" may be unaware that Australia and Scotland also have National
Parties, so that many hits will be relevant to Australia and Scotland,
not to what the New Zealander (presumably) is looking for.
In this paper, we propose to better meet users' semantic expectations
by means of personalization. Instead of the customary
decontextualized representation of search terms as strings of letters,
we use word space vectors (or word vectors for short),
personalized representations of each search term which are derived
from the user's email. The word space vector of word w is
an aggregate representation of the set of words that w is
strongly associated with.
Using word vectors for retrieval is similar to two other strategies
for managing expectations: searcher profiling and query expansion. The
set of word vectors derived from a user's email is in a sense a very
detailed and complex profile of what this user is interested in.
Word vector retrieval has a flavor of query expansion in that we
use not only the word itself for searching but also the words that
are closely associated with it.
The following sections describe our retrieval model; subjects,
materials, and procedure of our initial experiment; and the analysis
we have done so far. We conclude with some general observations.
Retrieval Model
Term vectors are derived from a term-by-term matrix which records term
co-occurrence in the corpus of interest. Entry c_ij of the matrix
records the number of times that terms i and j co-occur in the corpus
within a window of 51 words.
This matrix is reduced to a low-dimensional space using SVD. The
vector of term t is the column corresponding to t in the reduced
matrix. We include both words and pairs (frequently occurring bigrams)
as terms.
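As a rough illustration (not the code used in our experiments), the
following Python sketch builds such a term-by-term co-occurrence matrix
and reduces it with SVD. The symmetric 51-word window (25 words on
either side of the target), the pre-merging of frequent bigrams into
single tokens, the number of dimensions, and the scaling of the reduced
vectors by the singular values are all assumptions made for this sketch.

  import numpy as np

  def term_vectors(tokens, vocab, half_window=25, dims=100):
      # tokens: the corpus as a list of terms, with frequent bigrams
      # already merged into single tokens; vocab: terms to build vectors for.
      index = {t: i for i, t in enumerate(vocab)}
      C = np.zeros((len(vocab), len(vocab)))
      for pos, tok in enumerate(tokens):
          if tok not in index:
              continue
          lo = max(0, pos - half_window)
          hi = min(len(tokens), pos + half_window + 1)
          for other in tokens[lo:pos] + tokens[pos + 1:hi]:
              if other in index:
                  C[index[tok], index[other]] += 1
      # SVD reduction; C is symmetric, so rows and columns coincide and
      # the vector of term t is its row in the reduced matrix.
      U, S, _ = np.linalg.svd(C)
      return {t: U[index[t], :dims] * S[:dims] for t in vocab}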
Queries and documents are represented as context vectors, that is,
as (normalized) vector sums (or centroids) of the terms that occur in
them. For example, the context vector of a query consisting of three
terms is computed by looking up the three corresponding term vectors,
computing their vector sum, and normalizing to unit length. Retrieval
proceeds by ranking document context vectors according to their
correlation (or cosine) with the query context vector. We eliminate
duplicates and near-duplicates from the retrieval results.
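Continuing the sketch (with hypothetical helper names; term_vecs is the
mapping from terms to the reduced vectors above), query and document
context vectors and the cosine ranking could look as follows; duplicate
elimination is omitted.

  import numpy as np

  def context_vector(terms, term_vecs):
      # Normalized centroid of the vectors of the terms that occur.
      vecs = [term_vecs[t] for t in terms if t in term_vecs]
      if not vecs:
          return None
      v = np.sum(vecs, axis=0)
      return v / np.linalg.norm(v)

  def rank_documents(query_terms, docs, term_vecs):
      # docs maps document ids to term lists; returns (cosine, id) pairs
      # sorted best-first. For unit vectors, the dot product is the cosine.
      q = context_vector(query_terms, term_vecs)
      scored = []
      for doc_id, doc_terms in docs.items():
          d = context_vector(doc_terms, term_vecs)
          if q is not None and d is not None:
              scored.append((float(np.dot(q, d)), doc_id))
      return sorted(scored, reverse=True)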
Our retrieval model is similar to LSI, except that we use a
word-by-word instead of a word-by-document matrix.
Subjects
Subjects were recruited at Stanford University by posting flyers on
bulletin boards, posting messages on electronic bulletin boards and by
telling colleagues and friends about the experiment. Prospective
subjects were given a description of the experiment and promised $25
for participation. Twenty people signed up as participants. So far we
have been able to complete the experiment with 11. All subjects have
some affiliation with Stanford, either as undergraduate or graduate
students or as staff.
Materials
A condition for participation was that subjects had several megabytes
of email and were willing to make this email available to the
experimenters. Participants were permitted to withhold any subset of
their email that they deemed too sensitive to share.
Three corpora were used in the sessions with a particular participant.
The first corpus was the participant's personal email. The second
corpus was the email of one of the authors (S. Peters). This corpus
was intended as a control to test true personalization (participant
and email are matched) vs. general effects of personal email. The third
corpus consisted of two years of the New York Times newswire (1995
and 1996).
Word vectors were derived for each corpus, so that each term had three
different representations, corresponding to the three different corpora.
Each participant was asked to compose ten queries. We encouraged
subjects to compose long queries, to make use of the ability of our
system to handle phrases, and to focus on specific content words.
In addition, participants were asked to fill out a questionnaire
recording data such as age, sex, and email usage.
Experiment
For each subject, the subject's 10 queries were run on the New York
Times corpus in three different systems, corresponding to three levels
of the factor PERSONALIZATION: one using the subject's personal
context vectors (PERSONAL, that is, personal and subject-matched), one
using S. Peters' personal context vectors (PETERS: personal, but not
subject-matched), and one using the general context vectors derived
from the New York Times newswire (NYT). (The same NYT corpus was used
for the derivation of general context vectors and as the corpus that
was searched for relevant articles by the three systems.)
For each of the subject's
queries, the ten top ranking articles from each of the three systems
were presented to the subject. The order of systems was random. For
each system, articles were ordered by retrieval rank (highest scoring
articles first). The subject was asked to rate each article on a
5-point scale.
Scores of the first five articles (those ranked highest
by the system) were then multiplied by 2. The resulting scores were
added, resulting in one score for each personalization-subject-query
triple. Giving higher weights to the first 5 articles rewards correct
ranking by the system while not giving as much weight to the ranking
as the measure of average precision would. Subjects needed two to
three hours to rate all 300 articles retrieved.
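For concreteness, the score of one personalization-subject-query triple
can be written as the following small function (a sketch in Python;
ratings is the list of the subject's ten 5-point ratings in retrieval
order):

  def query_score(ratings):
      # Double the ratings of the five highest-ranked articles, then sum.
      return 2 * sum(ratings[:5]) + sum(ratings[5:])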
Results
We analyzed the 330 scores using ANOVA as a 3-by-11 design with 10
replications (corresponding to the 10 queries for each subject) for
each combination of PERSONALIZATION (PERSONAL, PETERS, NYT) and
subject. The following labels were used for the 11 subjects: aa, ab,
ac, ad, ae, af, ag, ah, ai, aj, ak.
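As a minimal sketch of this analysis (not the original code), the
two-way ANOVA could be run with statsmodels, assuming the 330 scores
are in a table with columns score, personalization, and subject; the
file name below is hypothetical.

  import pandas as pd
  from statsmodels.formula.api import ols
  from statsmodels.stats.anova import anova_lm

  # Hypothetical file holding the 330 personalization-subject-query scores.
  df = pd.read_csv("scores.csv")
  model = ols("score ~ C(personalization) + C(subject)", data=df).fit()
  print(anova_lm(model))  # F statistics for the two main effects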
The effects of both PERSONALIZATION and subject were highly
significant (PERSONALIZATION: F=34.0, significant for alpha=0.001;
subject: F=16.3, significant for alpha=0.001). However, the
significant differences did not have the sign we expected. Here are
the means of the three levels of PERSONALIZATION.
mean of scores | level of factor PERSONALIZATION
          22.6 | PERSONAL
          12.8 | PETERS
          28.1 | NYT
So the experiment clearly demonstrates an effect of personalization.
In particular, personalization with another person's associations
leads to inferior retrieval results. However, contrary to the
experimenters' expectations, "matched" personalization significantly
decreased the quality of the returned results compared to general
associations, that is, compared to no personalization at all.
Analysis and Discussion
As a first step in trying to understand the unexpected inferior
performance of personalized retrieval, we looked at the 10 queries with
the highest positive difference PERSONAL-NYT and the 10 queries with
the highest negative difference PERSONAL-NYT. These are the 20 queries
for which personalization had the largest effect (either positive or
negative).
The following table lists the 20 queries together with the following
information: difference in scores PERSONAL-NYT (column 1), score of
PERSONAL (column 2), score of NYT (column 3), score of PETERS (column
4), subject (column 5), and query number (column 6). Phrases in
queries are shown as words joined by plus signs.
   1 |  2 |  3 |  4 |  5 | 6 | query
  47 | 59 | 12 | 44 | ak | 2 | superstring+theory
  43 | 45 |  2 |  0 | ak | 9 | monopole polikarpov qcd confinement qed
  25 | 49 | 24 | 10 | ae | 0 | academic literature criticism gossip phd scandal academy
  23 | 47 | 24 | 13 | ae | 8 | library funding politics ala collection books librarian
  20 | 39 | 19 |  9 | ak | 1 | interwest+partners jack+mcdonald
  19 | 51 | 32 | 16 | ae | 2 | university job+market tenure literature phd hire professor
  19 | 43 | 24 | 28 | ab | 0 | new+zealand election national+party
  17 | 51 | 34 | 38 | ag | 0 | nobel prize physics literature
  17 | 32 | 15 |  1 | ad | 7 | michael+flatley irish+dancing
  14 | 36 | 22 |  9 | af | 7 | recipes french+food cassoulet
 -46 |  0 | 46 |  0 | ab | 2 | socialized+medicine advantages costs
 -46 | 14 | 60 |  2 | ah | 5 | clayoquot forestry protest logging environment preserve tofino
 -41 |  0 | 41 |  5 | aj | 9 | ergonomics keyboards companies
 -30 |  0 | 30 |  0 | aj | 4 | automobiles reviews
 -29 |  5 | 34 |  1 | af | 3 | wine+bar equipment start+up+costs survival+rate+of+small+businesses
 -28 | 36 | 64 |  0 | ab | 3 | tour+de+france winner 1995
 -28 |  8 | 36 |  6 | aj | 1 | religion christianity judaism gnostic scripture
 -27 |  7 | 34 |  3 | ai | 4 | marriage satisfaction characteristics
 -26 | 19 | 45 | 21 | ac | 1 | military taiwan defense policy
 -26 |  2 | 28 |  0 | ac | 6 | france paris eiffel+tower reconstruction
One trend in the table seems to be that queries for
which personalization did well are more specific than queries for
which personalization did not do well. For example, superstring+theory
and polikarpov are very specific terms, whereas queries like
"automobiles reviews" contain much less specific terms.
This is contrary to what we expected. Recall that we were hoping that
unspecific, and hence ambiguous, terms would benefit most from
personalization by adding the context of the user's email to their
representation.
One possible explanation is that the
quality of word space vectors depends on a variety of factors
including the frequency with which the word occurs. If a word only
occurs once, then there is only one context to generate its vector
from, a context which may be unrepresentative. The specific terms are
all terms that are likely to have a higher relative frequency in the
personal email than in the New York Times corpus. Therefore, their
email representation could be better than their NYT representation.
An extreme case of a poor representation due to lack of data is when a
word does not occur at all in one of the corpora. Indeed, when
analyzing cases where PERSONAL did much worse than NYT, we found
many query words that had no occurrence in the personal email and hence a
zero vector under PERSONAL. Examples are "start-up-costs",
"survival-rate", "socialized medicine", and "ergonomics". A zero
vector results in a poor query representation and as a consequence in
poor retrieval results.
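A simple diagnostic for this failure mode (our own illustration, not
part of the experiment) is to flag query terms whose personal-email
vector is missing or all-zero, since such terms contribute nothing to
the query's context vector:

  import numpy as np

  def zero_vector_terms(query_terms, personal_vecs):
      # Terms with no personal vector (or an all-zero one) are flagged.
      return [t for t in query_terms
              if t not in personal_vecs or not np.any(personal_vecs[t])]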
For some queries, the system arguably performed correctly, but the
results were not what the subject was looking for. For example, in
subject ai's personal email, marriage is mainly mentioned in the
context of religion. As a result, most retrieval results over the New
York Times corpus with personal vectors were on religion. This example
shows that one of our basic assumptions in the experiment will often
not be true: that the current frame of mind of a user is well
represented by their email. Even if subject ai mostly thinks of marriage
as a religious institution, she apparently was looking for general
information on marriage in her query 4. This mismatch between a
particular query and the email associations suggests that the user
should have some way of interacting with the set of associations that is
chosen for a particular query.
Finally, the retrieval results were
confounded by a Heisenberg-type effect. The email corpus of at least
one subject contained the email with the queries that were used in the
experiment. This gave rise to false associations among the words in
all 10 queries of that subject. Since many of the words are rare, one
use in a random context like this can have a significant effect on the
word's associations.
Obviously, this is just a very preliminary analysis. Some other
factors we are planning to look at are the following.
- In the ANOVA, we neglected the interaction between the factors
PERSONALIZATION and subject. However, there is a clear interaction:
for example, PERSONAL did much better for subject ak than for subject
aj. The reason could be differences in subjects' email, such as the
proportion of non-personal email (talk announcements, mailing lists,
etc.). We will try to determine the proportion of truly personal email
and analyze its effect on retrieval quality.
- There is an asymmetry between NYT and PERSONAL in that word vectors
and retrieval corpus are "matched" in the first case (word vectors
derived from newswire, corpus of application is newswire); and not
matched in the second (word vectors derived from personal email,
corpus of application is newswire). Part of this mismatch for PERSONAL
("global" vs personal associations) is intended by experimental
design. However, there may be other types of mismatch, for example, in
the vocabulary used, that we need to understand better.
- In our experiment, we are looking at two techniques for meeting user
expectations at the same time: profiling users and query expansion.
It might have been better to separate the two and investigate each
question separately. For example, we asked users to give us specific
query terms, but expansion is least likely to be helpful with specific
terms. In fact, it can hurt retrieval performance for specific terms.
We will analyze the experimental data with the goal of separating the
effects of profiling and expansion.
- There is a substantial discrepancy in the sizes of the corpora
used for deriving "personal" word vectors and "general" word
vectors. Some of the email corpora only have a size of about 2
percent of the New York Times corpus. This imbalance could well
be one of the causes for the superior performance of NYT word vectors.
Related Work
Haystack is a project at MIT whose goal is
user-specific customization of information, repositories, and
retrieval processes. We share with Haystack the premise that the
information a person stores in their personal work space is likely a
good indicator of their knowledge and interests.
However,
Haystack draws personalization from the trail people follow
in finding documents. Word association is a complementary source of
information. It is also more easily transferable to other collections.
Document associations are harder to transfer than word associations
since the former can only be directly exploited within one collection.
Future Directions
Here are some plans for future work.
- We need to address the problem of words that do not occur at
all in a corpus.
- We should concentrate on types of queries that could benefit
from personalization. A substantial subset of queries is
handled well by word-match methods because they contain a good
content descriptor (for example, a proper noun like "Tour de France").
- We need to make sure to avoid the above-mentioned
Heisenberg-type effects. In general, to address the problem of
the fragility of associations of rare words, we may consider
looking at a fourth type of word vector representation: one
that is derived from a mixture of personal and more general text
(see the sketch after this list).
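The following Python sketch illustrates such a mixture representation
(a hypothetical illustration; the interpolation weight lam is an
arbitrary assumption): a term's personal vector is interpolated with
its general (e.g., NYT) vector, and the general vector is used alone
when the term never occurs in the personal corpus.

  import numpy as np

  def mixed_vector(term, personal_vecs, general_vecs, lam=0.5):
      # Fall back to the general vector if the term is absent (or has a
      # zero vector) in the personal corpus; otherwise interpolate.
      g = general_vecs.get(term)
      p = personal_vecs.get(term)
      if p is None or not np.any(p):
          return g
      if g is None:
          return p / np.linalg.norm(p)
      v = lam * p + (1 - lam) * g
      return v / np.linalg.norm(v)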
Conclusion
The work presented here is an initial and preliminary attempt to make
information about the larger context available for improving document
retrieval. We say "larger context" to set our approach apart from work
on exploiting the immediate context of a single user session.
What we present here looks at two points on the spectrum of
personalization: the user's personal email and a national US
newspaper, the New York Times (very much an impersonal corpus). We
are interested in developing a framework in which the user has control
over the degree of personalization appropriate for a query or other
information task. Depending on the desired degree of personalization,
the corpus that is used for personalizing word representations could be
- a specific email folder or a directory of personal files (e.g.,
those related to a particular project);
- all personal files, including email;
- the files of a workgroup;
- all files residing on the servers of an institution (a university, a
corporation);
- a corpus typical of the discourse of a particular country;
- a global corpus that would give rise to a "universal" word
representation, to the extent this is possible.
We think that this type of flexible personalization, if done right,
could be a quite powerful addition to the set of tools available today
for intelligent information access, an addition that would help IR
systems better meet users' semantic expectations.