Accession Number : ADA631830


Title : Improving English and Chinese Ad-hoc Retrieval: Tipster Text Phase 3


Descriptive Note : Conference paper


Corporate Author : QUEENS COLL FLUSHING NY DEPT OF COMPUTER SCIENCES


Personal Author(s) : Kwok, Kui-Lam


Full Text : http://www.dtic.mil/dtic/tr/fulltext/u2/a631830.pdf


Report Date : Oct 1998


Pagination or Media Count : 11


Abstract : We investigated both English and Chinese ad-hoc information retrieval (IR). One of our objectives is to study the use of term, phrasal and topical concept level evidence, either individually or in combination, to improve retrieval accuracy. For short queries, we studied five term level techniques that together lead to improvements of some 20% to 40% over standard ad-hoc 2-stage retrieval for TREC5 & 6 experiments. For long queries, we studied linguistic phrases as evidence to re-rank outputs of term level retrieval. This brings small improvements in both TREC5 & 6 experiments, but needs further confirmation. We also investigated clustering of output documents from term level retrieval. Our aim is to separate relevant and irrelevant documents into different clusters, and to re-rank the output list by groups based on query and cluster-profile matching. Investigation is still ongoing. For Chinese IR, many results were confirmed or discovered. For example, accurate word segmentation is not as important as first thought, but short-word segmentation is preferable to long-word (phrase) segmentation. Simple bigram representation can give very good retrieval. A stopword list is not necessary, and the presence of non-content terms does not hurt evaluation results much. One need only screen out statistical stopwords of high frequency. Character indexing by itself is not competitive, but is useful for augmenting short-words or bigrams. Best results were obtained by combining retrievals of bigram and short-word with character representation. Chinese IR returns better precision than English, and it is not clear whether this is a language-related or collection-related phenomenon.
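
Illustrative note (not part of the original record): the abstract's Chinese IR findings mention bigram representation, character indexing as an augmentation, and screening out high-frequency statistical stopwords. The sketch below shows one way these ideas could fit together. It is not the paper's implementation: the function names, the document-frequency criterion, and the `max_doc_fraction` threshold are assumptions made here for illustration, and bigrams are formed across punctuation for simplicity.

```python
from collections import Counter


def char_unigrams(text):
    """Split text into single characters, ignoring whitespace."""
    return [ch for ch in text if not ch.isspace()]


def char_bigrams(text):
    """Overlapping two-character terms: 'ABCD' -> ['AB', 'BC', 'CD']."""
    chars = char_unigrams(text)
    return [a + b for a, b in zip(chars, chars[1:])]


def index_terms(docs, max_doc_fraction=0.8):
    """Return bigram + character terms for each document.

    Terms that appear in more than `max_doc_fraction` of the documents
    are treated as statistical stopwords and dropped. The threshold is
    a hypothetical value, not one taken from the paper.
    """
    per_doc = [char_bigrams(d) + char_unigrams(d) for d in docs]
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(t for terms in per_doc for t in set(terms))
    cutoff = max_doc_fraction * len(docs)
    return [[t for t in terms if doc_freq[t] <= cutoff] for terms in per_doc]


if __name__ == "__main__":
    sample = ["信息检索", "信息系统", "中文信息"]
    for doc, terms in zip(sample, index_terms(sample)):
        print(doc, terms)
```

In this toy collection the bigram 信息 and its two characters occur in every document, so they are screened out without any hand-built stopword list, while the rarer bigrams and characters remain as index terms.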


Descriptors :   *CHINESE LANGUAGE , *ENGLISH LANGUAGE , *INFORMATION RETRIEVAL , ACCURACY , CLUSTERING , DOCUMENTS , HIGH FREQUENCY , INDEXES , INTERROGATION , LINGUISTICS , OUTPUT , PRECISION , SEGMENTED , SHORT RANGE(TIME) , TEXT PROCESSING , WORD ORGANIZED STORAGE , WORKSHOPS


Subject Categories : Information Science
      Linguistics
      Computer Programming and Software


Distribution Statement : APPROVED FOR PUBLIC RELEASE