A Proposal for an Efficient Substring Index Using Frequency Information for Japanese Document Retrieval

小川, 泰嗣

A substring index, which records the occurrence of substrings in documents, suits Japanese document retrieval because it requires no linguistic processing to segment text into words. However, because it discards positional information, it produces false drops, and retrieval time grows with the length of the query; mitigating either problem has required a larger index. In other words, it has been difficult to build a substring index that performs well on false-drop rate, retrieval time, and index size at the same time. This paper proposes an efficient substring index that preferentially speeds up the queries users are most likely to issue, shortening the average retrieval time without worsening the false-drop rate or enlarging the index. The method uses frequency information at two levels: characters and substrings. Character-level occurrence frequencies are used in the hashing that maps character pairs to index entries, which both speeds up retrieval and reduces false drops. Substring-level occurrence frequencies are used to select long substrings as independent index entries, which speeds up the processing of long, frequently occurring query terms. Evaluation experiments on 100,000 patent abstracts (14 MB), measuring retrieval time, retrieval accuracy, and index size, confirmed the effectiveness of the method.
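As an illustration only, the following minimal Python sketch shows a generic position-free character-pair index and why it yields false drops. It is not the paper's implementation: the frequency-based hashing and the long-substring entries described in the abstract are not reproduced, and every name here (`bigrams`, `build_index`, `candidates`, `search`, and the sample `docs`) is a hypothetical choice made for the example.

```python
from collections import defaultdict


def bigrams(text):
    """Return the set of adjacent character pairs in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def build_index(docs):
    """Map each character pair to the ids of documents containing it.

    Occurrence positions are deliberately not stored (assumption made
    for this sketch, mirroring the position-free index in the abstract).
    """
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for pair in bigrams(text):
            index[pair].add(doc_id)
    return index


def candidates(index, query):
    """Intersect the posting sets of every character pair in the query.

    The number of lookups grows with the query length.
    """
    pairs = bigrams(query)
    if not pairs:
        raise ValueError("query must have at least two characters")
    result = None
    for pair in pairs:
        postings = index.get(pair, set())
        result = postings if result is None else result & postings
    return result


def search(docs, index, query):
    """Return (true hits, false drops) for the query."""
    cands = candidates(index, query)
    hits = {d for d in cands if query in docs[d]}  # verify by scanning
    return hits, cands - hits


# Hypothetical sample documents.
docs = ["京都市の観光", "京都の都市計画", "大阪市の観光"]
index = build_index(docs)
hits, false_drops = search(docs, index, "京都市")
print(hits, false_drops)
```

In this sketch the second document contains both 京都 and 都市 but not the string 京都市, so the index alone cannot rule it out; it is a false drop that only a scan of the document text removes.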
