Probabilistic Term Classification by Analyzing the Wikipedia Category Graph and Its Applications

白川 真澄, 中山 浩太郎, 原 隆浩, 西尾 章治郎

Concept dictionaries (taxonomies), which classify terms into categories (topics), are needed as a foundational resource for many applications, including text classification. WordNet, a representative taxonomy, covers general vocabulary comprehensively but contains few named entities, technical terms, or new words. Wikipedia, a large-scale free online encyclopedia, defines many such terms and also has a category structure for classifying them. However, because that category structure is a network that permits multiple parents and loops, it is hard to determine which categories a given term belongs to. In this paper, we propose a method that analyzes the Wikipedia category network on the basis of graph theory and classifies terms probabilistically. We also propose a text classification method that uses the results of the probabilistic term classification as training data for a Naive Bayes classifier. We evaluated the approach on two tasks, classifying Web search snippets into 8 representative categories and classifying science news snippets into 8 fields, and confirmed the effectiveness of the proposed method.
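The abstract does not spell out the graph analysis or the Naive Bayes formulation, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes probability mass is split evenly among a category's parents and absorbed at a fixed set of target categories, with a step limit to cope with loops, and that the resulting term probabilities serve as fractional counts with a uniform class prior. The function names `classify_term` and `classify_text`, the parameters `max_steps` and `smoothing`, and the toy category names are all hypothetical.

```python
import math
from collections import defaultdict


def classify_term(start_categories, parents, targets, max_steps=10):
    """Estimate P(target category | term) by pushing mass up the graph.

    Assumption: mass is split evenly among a node's parents and is
    absorbed once it sits on a target category. The step limit stops
    loops from circulating mass forever; mass that never reaches a
    target is discarded before normalization.
    """
    mass = {c: 1.0 / len(start_categories) for c in start_categories}
    absorbed = defaultdict(float)
    for _ in range(max_steps):
        next_mass = defaultdict(float)
        for node, p in mass.items():
            if node in targets:
                absorbed[node] += p
                continue
            ups = parents.get(node, [])
            for parent in ups:
                next_mass[parent] += p / len(ups)
        mass = next_mass
        if not mass:
            break
    # Absorb mass that landed on a target during the final step.
    for node, p in mass.items():
        if node in targets:
            absorbed[node] += p
    total = sum(absorbed.values())
    return {c: p / total for c, p in absorbed.items()} if total else {}


def classify_text(tokens, term_probs, targets, smoothing=1.0):
    """Naive Bayes with soft counts taken from term_probs.

    Assumption: each known term contributes its category probability
    as a fractional count, and the class prior is uniform (so it is
    left out of the score). Unknown tokens are skipped.
    """
    vocab_size = len(term_probs)
    totals = {
        c: sum(probs.get(c, 0.0) for probs in term_probs.values())
        for c in targets
    }
    scores = {}
    for c in targets:
        score = 0.0
        for w in tokens:
            if w not in term_probs:
                continue
            numerator = term_probs[w].get(c, 0.0) + smoothing
            denominator = totals[c] + smoothing * vocab_size
            score += math.log(numerator / denominator)
        scores[c] = score
    return max(scores, key=scores.get)
```

In this sketch a loop in the category graph only delays absorption: mass that cycles back to an earlier category is split again on the next pass, and whatever is still circulating after `max_steps` is dropped before normalization.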
