


Phrase Recognition and Expansion for Short, Precision-biased Queries based on a Query Log

Erika F. de Lima
IMS - University of Stuttgart
Azenbergstrasse 12, 70174 Stuttgart, Germany
delima@ims.uni-stuttgart.de

Jan O. Pedersen
Infoseek Corporation
1399 Moffett Park Drive, Sunnyvale, CA 94089
jpederse@infoseek.com
Abstract

In this paper we examine the question of query parsing for World Wide Web queries and present a novel method for phrase recognition and expansion. Given a training corpus of approximately 16 million Web queries and a hand-written context-free grammar, the EM algorithm is used to estimate the parameters of a probabilistic context-free grammar (PCFG) with a system developed by Carroll [5]. We use the PCFG to compute the most probable parse for a user query, reflecting linguistic structure and word usage of the domain being parsed. The optimal syntactic parse for a user query thus obtained is employed for phrase recognition and expansion. Phrase recognition is used to increase retrieval precision; phrase expansion is applied to make the best use possible of very short Web queries.

1 Introduction

World Wide Web queries differ substantially from those studied in standard Information Retrieval settings. For example, TREC topic descriptions average 15 tokens in length [28, 27, 15], while analysis of Web search engine logs reveals an average query length of about 2.3 tokens [19, 26]. Moreover, World Wide Web searching is precision biased; of greatest concern are the results presented in the first page, typically the top ten. In contrast, standard IR evaluation measures, such as uninterpolated precision, are strongly influenced by precision at high recall points. This suggests that Web search engines should place a premium on extracting as much high-precision information as possible from the query the user is willing to offer. A natural step in this direction is automatic phrase extraction from free-text query input. Web search engines typically implement at their core some variant of vector-space retrieval; that is, documents are scored based on the number of query terms they contain, with rarer query terms having higher weight. In this context phrases, multiword search terms, are simply treated as rare, and hence highly weighted, terms.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR '99 8/99 Berkeley, CA, USA. Copyright 1999 ACM 1-58113-096-1/99/0007...$5.00

It is well known that the use of phrasal search terms increases precision, since phrases tend to be unambiguous concept markers (see e.g. [24]). For example, documents matching the phrase buy out will be a sharp subset of the documents containing either of the two constituents buy or out, and will more likely be relevant to the intended concept.

All major search engines permit users to employ query syntax to express explicit Boolean and proximity constraints. For example, it is fairly standard that the query syntax "+" denotes a required term and enclosing a phrase in quotes denotes required adjacency. However, analysis of Web search engine query logs reveals that very few users avail themselves of these advanced features, even though they are advertised as an easy and effective method to improve retrieval accuracy [26]. More typically, users enter free-text queries which are often noun phrases, occasionally sequences of keywords, and even more rarely full sentential expressions.

World Wide Web search engines vary widely in their treatment of free-text queries. For example, the query council of economic advisors will generate different results on the Alta Vista search engine than the query advisors economic of council [2], while one gets the same results for the two queries on the Excite engine [12]. This suggests that Alta Vista pays attention to word order, presumably to extract precision-enhancing phrases, while Excite does not. Given the premium placed on high-precision results, it is reasonable to hypothesize that Web search engine performance would be enhanced by automatic parsing of free-text query input for the purpose of recognizing and extracting phrasal query terms.

Numerous IR systems have employed automatic query parsing to extract from free-text, or natural language, input query words and phrases to construct queries in the vector-space model (see e.g. [28, 27, 15]). However, this work has been in the context of fairly lengthy query input, and hence the general conclusion that phrases offer a modest, but significant, positive performance benefit and that simple recognition and extraction methods, such as statistical phrases, suffice may be specific to that relatively information-rich query environment (see e.g. [22]). In this paper we explore the question of whether these results apply to the more demanding Web retrieval environment. In particular, we propose a novel method for high-accuracy phrase recognition and extraction which compares favorably with simpler methods in the Web environment. We employ a system for the acquisition of language models developed by Carroll [5] which makes use of the EM algorithm [10] in order to estimate the parameters of a probabilistic context-free grammar (PCFG), given a corpus of Web retrieval queries and a hand-written context-free grammar.


The PCFG is used to identify an optimal query parse reflecting word usage and linguistic structure of the domain being parsed. The optimal syntactic parse for a user query thus obtained is employed to identify putative phrases. While phrase recognition has been motivated by the enhancement of retrieval precision, our method additionally employs phrase expansion in order to extend short-length user queries. A word bigram is considered an expanded phrase if its elements are not adjacent words in the original user query. Consider, for instance, the noun phrase scheduled maintenance of aircraft. The prepositional phrase of aircraft modifies the noun maintenance. We use this linguistic relation in order to postulate, among others, the expanded phrase aircraft maintenance. Note that extended phrases are not available to simple extraction methods, such as statistical phrases.

This paper is organized as follows. Section 2 introduces the method used for phrase recognition and expansion. Section 3 describes the application of the method to a training corpus of approximately 16 million Web queries, and Section 4 presents an evaluation of the results obtained. Conclusions are presented in Section 5.

2 Method

We address the problem of query parsing. We assume the system is presented with free-text input, that is, a sequence of characters denoting query terms expressed in natural language without reference to special query syntax. The goal is to generate a formal query, using the supported operator set of the underlying retrieval engine, that will best match documents reflecting the query concept. For example, given the user input council of economic advisors one might generate the formal query "council" AND "economic" AND "advisors", or +council +economic +advisors in search engine standard notation, which will match only documents containing all the query terms and then score them in some engine-specific fashion. Alternatively, the query parser may note that economic advisors is a phrase and hence that a better query might be +council +"economic advisor". Clearly many variations are possible, including using OR rather than AND, and adding the phrasal constituents economic and advisor to the query; however, the key issue is the handling of multi-word query terms such as economic advisors.

A standard approach to the recognition and extraction of phrases from free-text input is the use of statistical phrases, which considers every pair of contiguous non-function words as a candidate phrase; however, only those that occur with frequency above a given threshold in a relevant collection [22] are retained. An alternative method involves the computation of the mutual information measure between word combinations using co-occurrence frequency statistics [14]. Yet another approach has employed tagging and linguistic rules in order to identify syntactic phrases [11]. However, empirical studies have generally found no performance advantage in the application of more sophisticated methods for phrase recognition and extraction (see e.g. [22]), at least in standard, long-query IR settings, such as TREC. We hypothesize that more sophisticated, and presumably higher-accuracy, phrase extraction methods can pay off in the Web retrieval environment. To that end we propose a method that exploits both grammatical and statistical information and compare it to simpler statistical phrases. The first, novel, method has access to more information and hence should be capable of producing more numerous and more accurate phrasal query terms.

In addition, linguistic structure may be exploited to generate expansion phrases whose constituents are not adjacent. In particular, we consider the following linguistic phenomena:

Modification: correct inference of phrases such as advanced system from advanced traffic management system.

Prepositional Modification: correct inference of the phrase economic advisors council from the input council of economic advisors.

Entity Phrases: correct inference of surface forms such as Orlando, Valerio and V. Orlando from the recognized name Valerio Orlando.

Given a hand-written context-free grammar that describes our domain, Web queries, and a training corpus, we train a probabilistic context-free grammar. We then use probabilistic parsing to identify the most likely linguistic parse for a previously unseen user query. Given the optimal parse for the user query, we apply a method for phrase identification and expansion and subsequent formal query generation. The grammar formalism is described in section 2.2, training in section 2.3, probabilistic parsing in section 2.4, and the treatment of phrases in section 2.5.

2.1 The Gramotron lexicon acquisition and robust parsing software

We use the Gramotron software system [6, 5] in order to estimate the parameters of the query PCFG and to produce the most probable parse for each query. The Gramotron system is a toolbox for the acquisition of syntactic language models based on the EM algorithm. The tool set includes the parser supar to produce estimated unlexicalized rule frequencies (the E step of the EM algorithm), the parser ultra to bootstrap a lexicalized model, and hypar to produce estimated lexicalized rule frequencies and to produce most probable parses, given a syntactic model. The toolbox also contains genDists, a program for estimating distributions (the M step of the EM algorithm), as well as charge, a program for chart visualization used for grammar development.

2.2 A Query Grammar

The query grammar is an X-bar context-free grammar [17] in the formalism of Carroll and Rooth [7] and based on the grammar in [7]. The grammar defines linguistic chunks [1] and phrases, such as noun phrases. For instance, the topmost NC1 node in figure 1 corresponds to the entire noun phrase national association of emergency room physicians. Chunks are distinguished from phrases in that, for instance, they do not include modifiers to the right of their main lexical item. In figure 1, national association is an example of a noun chunk. Examples of grammar rules are given below:
N1 → Noun Sg'
N1 → Noun Pl'
NC → N1 N1'
NC1 → NC' PC1

According to the above rules, an N1 may be expanded to a singular or plural noun. A noun chunk NC may be expanded to two adjacent N1 chunks, and a noun phrase NC1 may be expanded to a noun chunk NC followed by a prepositional phrase PC1. A single quote is used to mark the head of each rule. A head, roughly, is the most important lexical item in a chunk or phrase.
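As an illustration only, rules in this head-marked notation can be read into a simple data structure. The sketch below is ours: the rule syntax with "->" and underscores, and all names, are assumptions for exposition and not the Gramotron input format.

# Minimal sketch: parse head-marked rules such as "NC1 -> NC' PC1" into
# (left-hand side, right-hand side, head index) triples. Illustrative only.
from dataclasses import dataclass
from typing import List

@dataclass
class Rule:
    lhs: str         # left-hand-side category, e.g. "NC1"
    rhs: List[str]   # right-hand-side categories, e.g. ["NC", "PC1"]
    head: int        # index into rhs of the head daughter

def parse_rule(line: str) -> Rule:
    # A trailing apostrophe marks the head daughter; multi-word category
    # names are written with underscores in this sketch.
    lhs, rhs_str = (part.strip() for part in line.split("->"))
    rhs, head = [], 0
    for i, symbol in enumerate(rhs_str.split()):
        if symbol.endswith("'"):
            symbol, head = symbol[:-1], i
        rhs.append(symbol)
    return Rule(lhs, rhs, head)

rules = [parse_rule(r) for r in
         ["N1 -> Noun_Sg'", "N1 -> Noun_Pl'", "NC -> N1 N1'", "NC1 -> NC' PC1"]]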


Figure 1: A noun phrase (parse tree for the query national association of emergency room physicians; tree rendering not reproduced)

For instance, the head of a noun chunk is its main noun; the head of the noun chunk emergency room physician is the rightmost noun physician. Heads are used in the lexicalization of the model and in phrase expansion, described in section 2.5.

The grammar consists of approximately 300 rules. Constructs covered by the grammar include noun phrases, prepositional phrases, determiner and adjectival phrases. In addition to the canonical phrasal constituents in [7], the grammar defines entity phrases. Entity phrases are defined in the current implementation as person and company names. An example of an entity phrase is shown in figure 2.

2.3 Training

In an initial preprocessing step, each token in the training corpus is normalized with respect to capitalization and tagged with the set of possible part-of-speech tags using a commercially available morphological analyzer. For instance, the query national association of emergency room physicians is tagged as follows:
national: Adj, Noun Sg
association: Noun Sg
of: Prep
emergency: Noun Sg
room: Verb Pres Non3sg, Noun Sg
physicians: Noun Pl

Starting from some initial random parameter setting associated with the hand-written grammar described in section 2.2, the query log is parsed, producing a set of estimated rule frequencies. Using these frequencies, the parameters are adjusted, so that the likelihood of the training data increases. In the first training stage, an unlexicalized model is estimated with the parser supar [6]. With this model, the part-of-speech tags associated with each token in the input are considered to be the terminal symbols of the grammar. The unlexicalized model is used as a starting point for a second training stage, in which a lexicalized model is constructed with the parsers ultra and hypar [6]. The lexicalized model makes use of statistics on the behavior of individual words. Consider, for instance, the queries shown in figure 3. An unlexicalized model could not be used to assign a correct optimal parse to both queries shown in this figure; the queries display identical part-of-speech tags. If we collect statistics on the behavior of words, we would possibly find that, in the training corpus, hard modifies disk more often than utilities, and that international is more likely to modify associations than nursing. The lexicalized model adopted is similar to that in [8]. For a formal description of the model, see [7]. For a description of the Gramotron software and training regimes used, see [5, 7].
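As a rough illustration, the parameter-adjustment step amounts to normalizing the estimated rule frequencies over all rules sharing a left-hand side; alternating this normalization with re-parsing the query log is the EM iteration described above. The sketch below is ours, with invented toy counts; in the experiment the expected frequencies come from the Gramotron parsers, not from this code.

# Minimal sketch of the M step: turn expected rule frequencies into rule
# probabilities by normalizing per left-hand-side category.
from collections import defaultdict

def reestimate(expected_counts):
    # expected_counts: dict mapping (lhs, rhs) rules to estimated frequencies
    # produced by the E step (here played by supar/ultra/hypar).
    totals = defaultdict(float)
    for (lhs, rhs), count in expected_counts.items():
        totals[lhs] += count
    return {(lhs, rhs): count / totals[lhs]
            for (lhs, rhs), count in expected_counts.items()}

# Toy example with invented frequencies: the two N1 rules end up summing to 1.
counts = {("N1", ("Noun_Sg",)): 7.5,
          ("N1", ("Noun_Pl",)): 2.5,
          ("NC", ("N1", "N1")): 4.0}
print(reestimate(counts))
# {('N1', ('Noun_Sg',)): 0.75, ('N1', ('Noun_Pl',)): 0.25, ('NC', ('N1', 'N1')): 1.0}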

2.4 Probabilistic Parsing

We use probabilistic parsing in order to address the problem of syntactic ambiguity, i.e., alternative analyses licensed by the grammar. For instance, an optimal parse for the query national association of emergency room physicians is shown in figure 1. Note that the token national has been assigned the part-of-speech Adj, rather than the alternative Noun Sg, and the token room has been assigned the part-of-speech Noun Sg, rather than the alternative Verb Pres Non3sg. Further, note that the parse contains the structure emergency [room physicians], rather than the competing analysis [emergency room] physicians. We use the PCFG to rank alternative analyses according to their probabilities; we employ the analysis with highest probability as an approximation of the linguistic structure of a user query.
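To make the notion of a most probable parse concrete, the following is a minimal sketch of Viterbi (CKY-style) parsing with a PCFG, assuming a binarized grammar; the toy rule probabilities and category names are invented and do not come from the trained query model.

# Minimal sketch: Viterbi parsing with a binarized PCFG.
import math
from collections import defaultdict

def viterbi_parse(tokens, lexical, binary):
    """Return (log probability, tree) of the most probable parse of tokens,
    or None if the grammar licenses no analysis spanning the whole input.
    lexical: word -> list of (tag, prob); binary: (left, right) -> list of
    (parent, prob)."""
    n = len(tokens)
    chart = defaultdict(dict)            # (i, j) -> {category: (logprob, tree)}
    for i, word in enumerate(tokens):
        for tag, p in lexical.get(word, []):
            chart[(i, i + 1)][tag] = (math.log(p), (tag, word))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for lcat, (lscore, ltree) in chart[(i, k)].items():
                    for rcat, (rscore, rtree) in chart[(k, j)].items():
                        for parent, p in binary.get((lcat, rcat), []):
                            score = lscore + rscore + math.log(p)
                            best = chart[(i, j)].get(parent)
                            if best is None or score > best[0]:
                                chart[(i, j)][parent] = (score, (parent, ltree, rtree))
    cell = chart[(0, n)]
    if not cell:
        return None
    best_cat = max(cell, key=lambda c: cell[c][0])
    return cell[best_cat]

# Toy fragment for "hard disk utilities"; probabilities are invented. With
# these weights the best parse groups hard disk first, mirroring the analysis
# discussed for figure 3.
lexical = {"hard": [("Adj", 1.0)], "disk": [("Noun", 1.0)], "utilities": [("Noun", 1.0)]}
binary = {("Adj", "Noun"): [("NC", 0.6)], ("Noun", "Noun"): [("NC", 0.4)],
          ("NC", "Noun"): [("NC1", 0.5)], ("Adj", "NC"): [("NC1", 0.5)]}
print(viterbi_parse("hard disk utilities".split(), lexical, binary))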

2.5 Phrase Recognition and Expansion

Phrase recognition and expansion are applied to the most likely syntactic parse obtained for a user query according to the PCFG estimated from the query log. In the current implementation, only noun phrases are considered for phrase recognition and expansion.

2.5.1 Recognition

As in other approaches to syntactic phrasing (cf. [22]), we consider phrases to be pairs of single tokens within noun phrases of length greater than 1. However, unlike [22], only pairs compatible with the internal structure of the noun phrase are considered.


Figure 2: Noun phrase containing an entity phrase (EC1) (parse tree for the query john f. kennedy airport; tree rendering not reproduced)

Figure 3: Structurally ambiguous queries (parse trees for hard disk utilities and international nursing associations; tree renderings not reproduced)

For example, consider the noun phrase hard disk utilities, shown in figure 3. According to the structure in figure 3, hard is an adjective modifying the noun disk. The noun chunk N1 yielding these two tokens in turn modifies the noun utilities. We use the head of the modifier chunk, hard, and the head of the modified chunk, disk, to produce the phrase hard disk. Analogously, modifier/modified heads are employed to produce the phrase disk utilities. Since hard does not modify utilities, the pair hard utilities is not considered a phrase (cf. [22]).

2.5.2 Expansion

A word bigram is considered an expanded phrase if its elements are not adjacent words in the original phrase. Expanded phrases are extracted from modification structural configurations and entity phrases.

Modification. Nominal and adjectival modification are employed in the generation of expanded phrases. As an example, consider the noun phrase international nursing associations, whose structure is shown in figure 3. According to this structure, the noun nursing modifies the noun associations. The noun chunk N1 yielding nursing associations is in turn modified by the adjective international. Using this modification structure, we infer an expanded phrase consisting of the head of the modified noun chunk, in this case associations, and the head of the modifier, in this case international, yielding the extended phrase international associations.
Prepositional Modification. Consider the structure shown in figure 1. The prepositional phrase of emergency room physicians modifies the noun association. Using this modification structure, we infer an expanded phrase consisting of the head of the modified noun chunk, in this case association, and the head of the noun phrase within the prepositional phrase, in this case physicians, yielding the extended phrase physicians association.

Entity Phrases. Consider for instance the entity phrase Valerio Orlando. This entity may occur in a text collection as V. Orlando and Orlando, Valerio and so forth. The recognition and expansion of an entity phrase allows the generation of these variations, in addition to the straightforward phrase Valerio Orlando.
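To illustrate how modifier-head pairs fall out of a head-annotated parse, here is a minimal sketch; the tree encoding, the names, and the treatment of prepositional chunks are simplifying assumptions of ours rather than the actual implementation, and entity-phrase variants (initials, inversion) are not covered.

# Minimal sketch: collect (modifier head, modified head) pairs from a
# head-annotated parse tree. Adjacent pairs correspond to recognized
# phrases, non-adjacent ones to expanded phrases (simplified).
from dataclasses import dataclass
from typing import List

@dataclass
class Node:
    label: str
    children: List["Node"]
    head: int                     # index of the head daughter
    word: str = None              # set for leaves only

def head_word(node):
    # Follow head daughters down to a leaf; for a prepositional chunk or
    # phrase (PC/PC1) we descend into its noun phrase instead of the
    # preposition, mirroring the use of the head of the NP within the PP.
    if node.word is not None:
        return node.word
    if node.label.startswith("PC"):
        for child in node.children:
            if child.label.startswith("NC") or child.label.startswith("N1"):
                return head_word(child)
    return head_word(node.children[node.head])

def phrase_pairs(node, pairs=None):
    if pairs is None:
        pairs = []
    if node.word is not None:
        return pairs
    h = head_word(node.children[node.head])
    for i, child in enumerate(node.children):
        if i != node.head:
            pairs.append((head_word(child), h))
        phrase_pairs(child, pairs)
    return pairs

# Toy tree for "hard disk utilities": NC -> [N1 [A1 hard] [N1 disk]] [N1 utilities]
leaf = lambda tag, w: Node(tag, [], 0, w)
tree = Node("NC", [Node("N1", [Node("A1", [leaf("Adj", "hard")], 0),
                               Node("N1", [leaf("Noun", "disk")], 0)], 1),
                   Node("N1", [leaf("Noun", "utilities")], 0)], 1)
print(phrase_pairs(tree))   # [('disk', 'utilities'), ('hard', 'disk')]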

2.6 Related Work

The application of probabilistic parsing to the problem of syntactic disambiguation has been investigated, for instance, in [8, 9, 20]. Unlike previous approaches, we estimate the parameters of the PCFG based on a Web retrieval query log, rather than "standard" English corpora, such as the Penn Wall Street Journal Treebank [21]. Our approach to phrase recognition is similar to Mitra et al. [22], Strzalkowski et al. [11], Fagan [13] and Jacquemin et al. [18] in that we make use of linguistic structure in order to identify phrases. Unlike [22, 11, 18], we do not make use of a part-of-speech tagger as a preprocessing step; tagging is considered a subtask of the parsing task. Unlike [22], which uses a grammar containing 17 rules in order to identify nominal phrases, we employ a complex context-free grammar with over 300 rules.


Further, unlike [22], which generates all lexicographically ordered word bigrams within nominal constituents as putative phrases, we consider word order to be relevant to phrase recognition. Additionally, rather than producing all possible word pairs within a nominal constituent, we consider only word pairs which are compatible with a constituent's linguistic structure to be possible phrases. Our approach to phrase expansion is similar to [11, 13, 18] in that it makes use of head-modifier relations. Unlike previous work, which uses head-modifier relations in order to normalize phrases, we use modification structures in order to expand phrases at query time, in order to deal with very short Web retrieval queries. Our handling of structural ambiguity is similar to [11] in that we make use of corpus-based disambiguation. However, unlike [11], we do not disambiguate alternative analyses in a post-parsing step, but produce only the highest ranked analysis directly with a PCFG. Further, unlike [11], we do not consider entities to be simply sequences of proper nouns, but define CFG rules which are able to handle entities displaying complex internal structure.

3 Experiment

The methodology outlined above was applied to a hand-constructed context-free query grammar of approximately 300 rules and a corpus of Web queries. The resulting PCFG was used to parse randomly selected test queries. The parses were evaluated for linguistic correctness and were used to generate a variety of formal queries. We considered three experimental conditions: baseline query generation without phrase recognition, queries including phrase terms, and queries including extended phrases. In addition, we used statistical phrases to generate a simple phrase baseline. These queries were run against the Infoseek search engine [16] and the results judged for relevance.

The Web query corpus consisted of 19,933,187 queries extracted from the Infoseek [16] query log of July 1998. A subset of roughly 580,000 queries was extracted per day after eliminating queries containing query syntax. The queries from the first three weeks of the month were used for training (15,820,330 queries; 36,075,443 tokens) and the queries from the last week of the month for testing (4,112,857 queries; 8,048,488 tokens). Silverstein et al. [26] suggest that Web query log statistics are quite stable over time and broadly similar across search engines, hence we do not believe the choice of training set or the training/test split to be critical to this experiment.

Test queries were analyzed by the probabilistic parser and the most likely parse returned. This parse was then used to drive a formal query generator using a bison grammar [3]. Again, we considered four query generators: one using no phrases, one using statistical phrases, one using phrases from the PCFG, and one using extended phrases from the PCFG. Generated queries are a sequence of terms. Terms, which may be words or phrases, are enclosed in quotes to prevent further query processing by the retrieval engine. Statistical phrases were recognized by considering all consecutive token pairs and retaining only those pairs that appear more than 25 times in the training corpus. This is similar to the definition of statistical phrases in [22]; however, we use the statistics of a query corpus rather than a document corpus for filtering.

Once generated, sets of formal queries were run automatically against the Infoseek query engine using a simple program that simulates a search session.
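The statistical-phrase baseline and the +term / +"phrase" query notation lend themselves to a short sketch. The 25-occurrence threshold is the one stated above; the greedy, non-overlapping pairing and all function names are illustrative assumptions, not the generator actually used.

# Sketch of the statistical-phrase baseline and formal query generation.
from collections import Counter

def statistical_phrases(query_log, threshold=25):
    # Count consecutive token pairs over the query log and keep those seen
    # more than `threshold` times.
    counts = Counter()
    for query in query_log:
        words = query.lower().split()
        counts.update(zip(words, words[1:]))
    return {pair for pair, count in counts.items() if count > threshold}

def formal_query(tokens, phrases):
    # Emit +term constraints, quoting recognized adjacent pairs as phrases.
    terms, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            terms.append('+"%s %s"' % (tokens[i], tokens[i + 1]))
            i += 2   # greedy, non-overlapping pairing (a simplification)
        else:
            terms.append("+" + tokens[i])
            i += 1
    return " ".join(terms)

print(formal_query("university of pennsylvania course guide".split(),
                   {("course", "guide")}))
# +university +of +pennsylvania +"course guide"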

The top 10 results for each formal query were evaluated for relevance by a volunteer assessor.

4 Results

We considered two classes of evaluation: linguistic correctness of the most probable parse and retrieval effectiveness of formal queries using phrase extraction and expansion.

4.1 Linguistic Structure

In order to evaluate the linguistic structure of the optimal parses produced by the system, eight evaluation sets were constructed from the test corpus:

random: a random set of 100 queries
common: the 100 most frequent queries
random3: a random set of 100 queries of at least length three
common3: the 100 most common queries of at least length three
random6: a random set of 100 queries of at least length six
common6: the 100 most common queries of at least length six
nyse: 100 random company names extracted from the company listings of the New York Stock Exchange [23]
name: 100 random queries with at least one token tagged as first or last name

The evaluation sets were manually tagged with part-of-speech (PoS) information and manually bracketed. The results produced by the system were compared to the manually annotated corpus. The results for tagging accuracy, syntactic structure, and grammar coverage are shown in table 1.

4.1.1 PoS Tagging

In the tagging scheme employed in the manually annotated corpus, only major categories were used. Categories produced by the system were mapped accordingly. So, for instance, Noun Sg and Noun Pl were both considered to be Noun. The tagging set used in the evaluation consists of 32 unique tags. The percentage of PoS tags produced by the system which matched that of the hand-tagged evaluation sets was computed. The results are shown in table 1, together with the average number of tags per token.

Most problems in tagging stemmed from the difficulty in differentiating between adjectives and nouns, and between participles, adjectives and nouns. Note that the training data is untagged, and that structural configurations are often shared by these categories. The tagging accuracy figures are lower than those reported for state-of-the-art taggers [4, 25]. However, our initial experiments indicate that deploying a state-of-the-art HMM-based tagger produced poor tagging results for this data. We believe this is due to the fact that taggers are trained on full sentences while query input consists mostly of noun phrases, usually with no specifiers.

4.1.2 Labeled bracketing

In order to evaluate the linguistic structure of the optimal parses produced by the system, labeled precision was computed with respect to the hand-bracketed corpus. Precision is the number of correct chunks (phrases) found by the parser, summed over all queries in each evaluation set, divided by the total number of chunks (phrases) postulated by the parser. A chunk (phrase) is considered to be correct if it starts with the correct token, terminates with the correct token, and is labeled correctly. Labeled precision for the evaluation sets according to this bracketing scheme is shown in table 1.
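The labeled precision computation just described reduces to a few lines; encoding each constituent as a (start, end, label) triple is an assumption made here for illustration, not the format of the evaluation data.

# Sketch of labeled precision over an evaluation set.
def labeled_precision(predicted_parses, gold_parses):
    # Each parse is a collection of (start, end, label) constituents; a
    # predicted constituent is correct if the identical triple occurs in the
    # hand-bracketed gold parse for the same query.
    correct = postulated = 0
    for predicted, gold in zip(predicted_parses, gold_parses):
        correct += len(set(predicted) & set(gold))
        postulated += len(predicted)
    return correct / postulated if postulated else 0.0

# e.g. one query: gold {(0,2,"NC"), (0,3,"NC1")}, prediction {(0,2,"NC"), (1,3,"NC")}
print(labeled_precision([{(0, 2, "NC"), (1, 3, "NC")}],
                        [{(0, 2, "NC"), (0, 3, "NC1")}]))   # 0.5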


Evaluation Set | Average Query Length | Grammar Coverage (%) | Tagging Accuracy (%) | Average Tags/Token | Labeled Precision (%)
random | 1.88 | 91.6 | 94.77 | 1.41 | 79.83
common | 1.21 | 95.6 | 93.64 | 1.43 | 82.00
random3 | 3.72 | 78.0 | 92.46 | 1.52 | 81.39
common3 | 3.46 | 85.2 | 93.20 | 1.58 | 80.49
random6 | 7.01 | 61.2 | 92.53 | 1.55 | 67.91
common6 | 7.12 | 61.8 | 92.68 | 1.53 | 68.12
nyse | 3.10 | 90.1 | 97.68 | 1.22 | 87.86
name | 3.89 | 80.0 | 92.41 | 1.54 | 77.70

Table 1: Linguistic Evaluation

Major error sources included:

- Entity recognition involving multi-word lexemes, such as new jersey in new jersey voter registration. This entity is misidentified by the system as a noun chunk, since its constituents new and jersey are tagged as adjective and common noun, respectively.
- Long noun/adjective sequences, such as twin turbo cowl induction hood scoop. Even if noun chunks, such as cowl induction, are correctly identified, their modification structure within the larger noun phrase is often erroneous.
- Participles used verbally, adjectivally or nominally. As an example, the phrase christian dating service is assigned an analysis in which dating is tagged as a progressive verb, leading to a verbal chunk dating service rather than a nominal one.
- Prepositional phrase attachment ambiguity. For instance, in the phrase schools of nursing in the uk, both prepositional phrases of nursing and in the uk modify the noun schools; the most likely analysis according to the model contained the incorrect structure nursing in the uk.
- Words unknown to the morphology and spelling errors, leading to the postulation of incorrect entity chunks. Also, queries in the evaluation sets were assigned structures ignoring spelling errors, for instance NC N1 amstredam canal photographs. Ambiguous phrases containing spelling errors were often analyzed incorrectly by the system.

4.1.3 Grammar Coverage

Table 1 displays grammar coverage figures for each of the evaluation sets, i.e., the percentage of queries in the set with at least one parse according to the grammar described in section 2.2. Over 90% of random queries (random) are assigned a parse by the system, and over 95% of common queries (common). The high coverage for these sets reflects the usually very short length of Web retrieval queries. As expected, the coverage of the grammar decreases as query length increases. Sentence-like queries are not covered by the grammar, which in the current implementation contains rules for noun phrases only. This is one factor responsible for the lower coverage for the random6 and common6 sets. The higher coverage for the set nyse is a consequence of its fairly uniform syntactic structure.

4.2 Retrieval Effectiveness

To evaluate retrieval effectiveness we extracted three query sets from the test corpus, consisting of queries with at least three and at most five tokens for which the system produced at least one parse. The length restriction reflects the interesting range of application for our grammar: short enough to reflect the brevity of Web queries but long enough to have exploitable linguistic structure. The queries extracted were random, with the exception of queries containing vulgarities, which were deleted from the evaluation sets. The set prep, used to test prepositional modification, consists of 15 queries containing the preposition of. The set mod, used to test adjectival/nominal modification, consists of 15 queries with at least three adjacent nouns and/or adjectives.
Finally, the set ent, used to test the handling of entities, consists of the first 15 queries in the test corpus containing a person name. Examples of queries in the evaluation sets prep, mod and ent are shown below:
american academy of orthopedic surgeons
transmission model of communications
history of stock market
free java games
work flow management
indian baby names
lenny von dohlen
howard m. dean
eric robert rudolph

For each evaluation set we generated the following formal query variations: base, queries containing single tokens with no phrases; stat, queries containing single tokens and recognized statistical phrases; phrase, single tokens and phrases recognized by the PCFG; and expanded, the terms for phrase plus expanded phrases as defined in section 2.5. The top ten results for each query were independently assessed for relevance by one judge. We then computed average precision for the top ten as our evaluation measure, reflecting the precision bias prevailing in Web retrieval.

The average precision at ten for base, stat, phrase and expanded is presented in table 2. Percentage improvement over the baseline is given in parentheses. These results are preliminary since the query sets are small; however, the direction is clear and quite striking. As expected, statistical phrases offer a modest, positive benefit. However, it appears that the PCFG does indeed recognize a greater number of high-quality phrases and hence enjoys a significantly greater performance increase over the baseline for all query sets.
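The evaluation measure and the parenthesized improvements in table 2 below amount to the following computation (a sketch; the relevance lists and names are ours).

# Sketch: average precision at the top 10 results, plus the percentage
# improvement over the baseline reported in parentheses in table 2.
def precision_at_10(relevance):
    # relevance: one list of 0/1 judgments per query for its top 10 results.
    per_query = [sum(judged[:10]) / 10 for judged in relevance]
    return sum(per_query) / len(per_query)

def improvement(score, baseline):
    # Percentage improvement over the baseline.
    return round(100 * (score - baseline) / baseline)

print(improvement(0.54, 0.34))   # 59, matching the mod/phrase cell of table 2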


Evaluation Set | base | stat | phrase | expanded
prep | .42 | .40 (-5%) | .66 (+60%) | .68 (+65%)
mod | .34 | .37 (+9%) | .54 (+59%) | .49 (+44%)
name | .58 | .63 (+9%) | .73 (+26%) | .65 (+12%)

Table 2: Average precision at top 10 documents
Consider for example the query university of pennsylvania course guide. Only the pair course guide is a statistical phrase, while the PCFG also recognizes university of pennsylvania. It may be possible to improve the performance of statistical phrases by lowering the frequency filtering threshold, but then one risks introducing noise terms.

Extended phrases appear only to benefit the prep queries. This may reflect the fact that the parsing error rate for adjectival/nominal modification structures is much greater than the error rate for prepositional modification structures. Also, the use of adjectival/nominal modification may lead to decreased precision due to overgeneralization. For instance, the query free java games leads to the expanded phrase free games, which is structurally correct, but leads to links which do not necessarily pertain to java, affecting precision. For the name set, the decrease in precision with extended phrases may be related to the systematic introduction of name variation with initials. Detailed per-query analysis indicates that last name/first name inversion works well; however, introducing initials is counterproductive. In general the per-query results for the name set were very uneven; there were relatively many queries for which phrases made no difference, and a few for which phrases saved the day. Unlike the other sets, the results were frequently either all relevant or all irrelevant. This suggests looking at "common" and "rare" names separately.

5 Summary and Future Work

In this paper we present evidence that the automatic recognition and extraction of phrasal query terms from free-text query input can significantly improve performance for short, precision-biased queries, as measured by average precision at the top ten. Moreover, there is evidence that linguistic phrases, ones extracted from a trained PCFG, outperform statistical phrases in this context. The use of extended phrases is not clearly advantageous, although there is some evidence of their usefulness in last name/first name inversion for proper names and for inversion in prepositional modification.

We present a novel method for phrase recognition and expansion. Given a hand-written context-free grammar, tailored for Web queries, and a Web query log, we estimate the parameters of a probabilistic context-free grammar. Note that this work varies from the literature in being optimized for the relatively short expressions found in Web queries. The PCFG is used to determine the most likely parse for a Web query, which is then employed to recognize and expand phrases. Phrase recognition is used in order to increase retrieval precision; phrase expansion is applied in order to make the best use possible of very short Web retrieval queries.

In the experiment described in this paper, we used a very small section of the Infoseek [16] query log in order to estimate the model used for parsing. Increasing the size of the query log segment used for training should lead to improved tagging and labeled precision figures. Currently, our CFG grammar is limited to the definition of noun phrases. Extending the grammar to cover natural language questions and sentence-like queries should open new possibilities for phrase recognition and expansion.

6 Acknowledgments

We would like to thank Mats Rooth, Glenn Carroll, Franz Beil, Detlef Prescher and the Gramotron project at the University of Stuttgart for their support in the development of the query model used in our experiment. We would especially like to thank Glenn Carroll for sharing his experience in training stochastic CFGs and for his assistance.

References

[1] S. Abney. Parsing by chunks. In D. Bouchard and K. Leffel, editors, Views on Phrase Structure. Kluwer Academic Publishers, 1991.
[2] AltaVista. http://www.altavista.com.
[3] Bison, GNU Project, Free Software Foundation. http://www.fsf.org.
[4] Eric Brill. A simple rule-based part of speech tagger. In Proceedings of the Third Conference on Applied Natural Language Processing, Trento, Italy, 1992.
[5] G. Carroll. The Slam tools guide. Reference LE-2111 WP7 D7.1, University of Stuttgart, http://www.ims.uni-stuttgart.de/projekte/sparkle, 1997.
[6] G. Carroll. Manual pages for supar, ultra, hypar, genDists, charge and cf2gram. Manuscript, IMS, University of Stuttgart, 1997-98.
[7] G. Carroll and M. Rooth. Valence induction with a head-lexicalized PCFG. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing (EMNLP-2), 1998.
[8] E. Charniak. Statistical parsing with a context-free grammar and word statistics. In Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI-97), 1997.
[9] M.J. Collins. A new statistical parser based on bigram lexical dependencies. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, 1996.
[10] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38, 1977.
[11] T. Strzalkowski et al. Natural language information retrieval: TREC-6 report. In Proceedings of the Sixth Text REtrieval Conference (TREC-6), 1998.


[12] Excite. http://www.excite.com.
[13] J.L. Fagan. Experiments in Automatic Phrase Indexing For Document Retrieval: A Comparison of Syntactic and Non-Syntactic Methods. PhD thesis, Department of Computer Science, Cornell University, 1987.
[14] F.G. Gey and A. Chen. Phrase discovery for English and cross-language retrieval at TREC-6. In Proceedings of the Sixth Text REtrieval Conference (TREC-6), 1997.
[15] D. K. Harman. Overview of the fourth Text REtrieval Conference (TREC-4). In Proceedings of the Fourth Text REtrieval Conference (TREC-4), 1996.
[16] Infoseek. http://www.infoseek.com.
[17] R. Jackendoff. X-bar Syntax: A Study in Phrase Structure. MIT Press, Cambridge, MA, 1977.
[18] C. Jacquemin, J.L. Klavans, and Evelyne Tzoukermann. Expansion of multi-word terms for indexing and retrieval using morphology and syntax. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, 1997.
[19] B.J. Jansen, A. Spink, J. Bateman, and T. Saracevic. Real life information retrieval: A study of user queries on the web. SIGIR Forum, 32(1), 1998.
[20] D.M. Magerman. Statistical decision-tree models for parsing. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, 1995.
[21] M.P. Marcus, B. Santorini, and M.A. Marcinkiewicz. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19, 1993.
[22] M. Mitra, C. Buckley, A. Singhal, and C. Cardie. An analysis of statistical and syntactic phrases. In Proceedings of the Fifth RIAO Conference (RIAO-97), 1997.
[23] New York Stock Exchange. http://www.nyse.com.
[24] G. Salton. Automatic Text Processing: the Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley Publishing Co., Reading, MA, 1989.
[25] Helmut Schmid. Improvements in part-of-speech tagging with an application to German. In Proceedings of the ACL SIGDAT Workshop, pages 47-50, 1995.
[26] C. Silverstein, M. Henzinger, H. Marais, and M. Moricz. Analysis of a very large Alta Vista query log. Technical Report 1998-014, Digital Systems Research Center, 1998.
[27] E. Voorhees and D. K. Harman. Overview of the fifth Text REtrieval Conference (TREC-5). In Proceedings of the Fifth Text REtrieval Conference (TREC-5), 1997.
[28] E. Voorhees and D. K. Harman. Overview of the sixth Text REtrieval Conference (TREC-6). In Proceedings of the Sixth Text REtrieval Conference (TREC-6), 1998.


