Resolving the Word-Segmentation Problem in Japanese Named Entity Extraction (日本語固有表現抽出におけるわかち書き問題の解決)

Masayuki Asahara, Yuji Matsumoto

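Methods proposed for Japanese named entity extraction generally combine morphological analysis with chunking. When the output of the morphological analyzer is fed directly to the chunker, it is difficult to extract named entities that are smaller than the units the analyzer produces. We therefore propose a method that performs chunking at the character level. First, a statistical morphological analyzer analyzes the input sentence redundantly. Next, the input sentence is split into characters, and each character is annotated with the character itself, its character type, and information such as the part-of-speech tags from the top n analyses. Finally, using these as features, a chunker based on support vector machines deterministically estimates the word boundaries of the named entities. In an evaluation on the CRL named entity data (5-fold cross-validation), the method achieved a high F-value of 0.87.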

Named Entity (NE) extraction is a task in which proper nouns and numerical information are extracted from texts. A method that cascades morphological analysis and chunking is usually used for NE extraction in Japanese. However, such a method cannot extract NE units smaller than the units output by the morphological analyzer. To cope with this unit problem, we propose a character-based chunking method. First, input sentences are redundantly analyzed by a statistical morphological analyzer. Second, the input sentences are segmented into characters. The characters are annotated with their character types and with the POS tags of the top n-best answers given by the statistical morphological analyzer. Finally, we perform chunking deterministically based on support vector machines. We apply our method to the IREX NE task using the CRL Named Entities data. The cross-validation result, an F-value of 0.87, shows the effectiveness of the method.
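The character-level annotation step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. The sentence, the two hand-written analyses, the POS tag names, and the B/I/E/S position scheme are all assumptions made for illustration. A real system would obtain the n-best analyses from a statistical morphological analyzer and pass the resulting features to an SVM-based chunker, neither of which is shown here.

```python
from typing import Dict, List, Tuple

Morpheme = Tuple[str, str]  # (surface form, POS tag); tag names are hypothetical


def char_type(ch: str) -> str:
    """Coarse character type based on Unicode block ranges."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "HIRAGANA"
    if 0x30A0 <= code <= 0x30FF:
        return "KATAKANA"
    if 0x4E00 <= code <= 0x9FFF:
        return "KANJI"
    if ch.isdigit():
        return "DIGIT"
    if ch.isascii() and ch.isalpha():
        return "ALPHA"
    return "OTHER"


def position_tags(analysis: List[Morpheme]) -> List[str]:
    """One tag per character: position inside its morpheme plus the morpheme's POS.

    Assumed scheme: S = single-character morpheme, B = begin, I = inside, E = end.
    """
    tags: List[str] = []
    for surface, pos in analysis:
        n = len(surface)
        for i in range(n):
            if n == 1:
                place = "S"
            elif i == 0:
                place = "B"
            elif i == n - 1:
                place = "E"
            else:
                place = "I"
            tags.append(f"{place}-{pos}")
    return tags


def char_features(sentence: str, nbest: List[List[Morpheme]]) -> List[Dict]:
    """Annotate each character with itself, its type, and tags from each analysis."""
    for analysis in nbest:
        if "".join(surface for surface, _ in analysis) != sentence:
            raise ValueError("analysis does not cover the sentence exactly")
    per_analysis = [position_tags(analysis) for analysis in nbest]
    return [
        {"char": ch, "type": char_type(ch), "pos": [tags[i] for tags in per_analysis]}
        for i, ch in enumerate(sentence)
    ]


# Hypothetical example: in the 1-best analysis, 訪朝 ("visit to North Korea") is
# one morpheme, so the location 朝 cannot be extracted at the morpheme level.
sentence = "首相が訪朝"
nbest = [
    [("首相", "NOUN"), ("が", "PARTICLE"), ("訪朝", "NOUN")],               # 1-best
    [("首相", "NOUN"), ("が", "PARTICLE"), ("訪", "VERB"), ("朝", "NOUN")],  # 2-best
]
features = char_features(sentence, nbest)

# Character-level target labels that a chunker would be trained to predict.
labels = ["O", "O", "O", "O", "B-LOCATION"]

for feat, label in zip(features, labels):
    print(feat["char"], feat["type"], feat["pos"], label)
```

Because the labels are attached to characters rather than morphemes, the final character can be tagged as a location even though the 1-best analysis places it inside a longer morpheme. The tags from the 2-best analysis give the chunker a hint that a boundary is plausible there.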
