Abstract
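Pantel et al. proposed CBC, a clustering method that first builds highly similar sub-clusters and then uses them to perform stable merging and a re-merging step that takes word senses into account. Building on CBC, this study clusters nouns using dependency patterns, with the aim of acquiring clusters of synonyms and near-synonyms. We propose a Japanese noun clustering method that replaces CBC's original similarity formula with one based on probability distributions (Jensen-Shannon) and introduces a new scoring method for selecting sub-cluster candidates. Using one year of Mainichi Shimbun articles from 1994, we compare the similarity formula used in CBC with Jensen-Shannon and show the effectiveness of Jensen-Shannon. We also propose and compare several variants of the scoring formula to find one that selects sub-cluster candidates appropriately.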
In this paper we propose a noun clustering approach based on CBC, which was proposed by Pantel. CBC is a clustering approach that carefully extracts clusters by first finding sub-clusters, regarded as committees of words with the same meaning, and then tries to extract unknown clusters from the remaining elements. In preliminary experiments on Japanese noun clustering, however, we found that CBC does not work well in two respects: the basic similarity measure between words represented as context vectors, and the scoring method that decides which sub-clusters to merge. To address these problems, we propose applying the Jensen-Shannon formula as the similarity measure, together with a new scoring method. In experiments constructing sub-clusters of Japanese nouns from newspaper articles, we show that the proposed approaches outperform those of CBC in clustering accuracy.
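The abstract names the Jensen-Shannon formula as the similarity measure between words but does not give its exact form or the feature weighting used. The following is only a minimal sketch of the standard Jensen-Shannon divergence applied to two nouns' dependency-pattern counts. The function names, the pattern labels, the counts, and the conversion to a similarity as 1 minus the divergence are all assumptions made for illustration, not the authors' implementation.

```python
import math


def normalize(counts):
    """Turn raw co-occurrence counts into a probability distribution."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    return {k: v / total for k, v in counts.items()}


def kl_divergence(p, m):
    """KL(p || m) in bits; m must be positive wherever p is positive."""
    return sum(pv * math.log2(pv / m[k]) for k, pv in p.items() if pv > 0)


def js_divergence(p_counts, q_counts):
    """Standard Jensen-Shannon divergence; with log base 2 it lies in [0, 1]."""
    p = normalize(p_counts)
    q = normalize(q_counts)
    keys = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def js_similarity(p_counts, q_counts):
    """Assumed conversion to a similarity: 1 minus the divergence."""
    return 1.0 - js_divergence(p_counts, q_counts)


# Hypothetical dependency-pattern counts for three nouns (made-up data).
coffee = {"drink-OBJ": 8, "brew-OBJ": 4, "buy-OBJ": 2}
tea = {"drink-OBJ": 7, "brew-OBJ": 5, "buy-OBJ": 1}
car = {"drive-OBJ": 9, "buy-OBJ": 3, "park-OBJ": 2}

print(js_similarity(coffee, tea))
print(js_similarity(coffee, car))
```

Under these made-up counts, the pair sharing most of its dependency patterns scores higher than the pair sharing only one. Unlike the plain KL divergence, this measure is symmetric and stays finite when a pattern occurs with only one of the two nouns, which is convenient for sparse context vectors.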
Journal
IEICE technical report. Natural language understanding and models of communication 108(408), 31-35, 2009-01-19
The Institute of Electronics, Information and Communication Engineers