Named Entity Recognition with a Structured Perceptron That Accounts for the Category Hierarchy (カテゴリ階層を考慮した構造化パーセプトロンによる固有表現抽出)

東山 翔平, ブロンデル マチュー, 関 和広, 上原 邦昭

Named entity recognition (NER) is a fundamental natural language processing task: identifying expressions in text, such as person names, and classifying them into predefined categories (e.g., person, organization, location). Existing NER systems usually target around ten categories and often ignore the relations between them. In many cases, however, the categories belong to a predefined hierarchy. The distance between categories in that hierarchy is then a source of information that can be exploited during recognition, and intuitively it is particularly useful when the categories are numerous. In this paper, we target named entities whose categories form a hierarchy and define a cost function that measures how close two categories are in it. We introduce this cost function into the structured perceptron so that category predictions far from the correct category in the hierarchy are penalized more strongly, and propose an NER method that takes the category hierarchy into account. Experiments on the GENIA biomedical text corpus show that, compared with methods that ignore the category hierarchy, the proposed method reduces the severity of recognition errors and improves recognition accuracy.
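The abstract does not give the exact form of the cost function, so the following is only a minimal sketch of the general idea, under stated assumptions. The cost is taken to be the number of edges on the tree path between the correct and the predicted category. The toy hierarchy and the names `PARENT`, `hierarchy_cost`, and `cost_augmented_predict` are hypothetical and are not taken from the paper or from the GENIA ontology. The last function shows one common way to use such a cost with a perceptron-style learner (cost-augmented decoding during training), simplified to a single label rather than a full label sequence. Whether the paper uses exactly this scheme is not stated in the abstract.

```python
# Toy category hierarchy (hypothetical, NOT the GENIA ontology): child -> parent.
PARENT = {
    "protein": "amino_acid",
    "peptide": "amino_acid",
    "DNA": "nucleic_acid",
    "RNA": "nucleic_acid",
    "amino_acid": "substance",
    "nucleic_acid": "substance",
    "substance": None,  # root
}


def path_to_root(category):
    """Return the nodes from `category` up to the root, inclusive."""
    path = []
    node = category
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def hierarchy_cost(gold, predicted):
    """Assumed cost: number of edges on the tree path between two categories."""
    gold_path = path_to_root(gold)
    pred_depth = {node: i for i, node in enumerate(path_to_root(predicted))}
    # The first node on the gold path that also lies on the predicted path
    # is the lowest common ancestor.
    for i, node in enumerate(gold_path):
        if node in pred_depth:
            return i + pred_depth[node]
    raise ValueError("categories do not share a root")


def cost_augmented_predict(scores, gold):
    """Pick the category maximizing model score plus hierarchy cost.

    Used at training time only: categories far from `gold` are favored as
    the 'most violating' prediction, so the update pushes them down hardest.
    """
    return max(scores, key=lambda c: scores[c] + hierarchy_cost(gold, c))


if __name__ == "__main__":
    print(hierarchy_cost("protein", "peptide"))  # siblings
    print(hierarchy_cost("protein", "DNA"))      # different subtrees
```

With this toy tree, confusing "protein" with its sibling "peptide" costs less than confusing it with "DNA", which sits in a different subtree. This matches the intuition in the abstract that errors far away in the hierarchy should be penalized more strongly.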
