正規化頻度による形態素境界の推定 (Statistical Estimation of Word Boundaries Using Normalized Frequency) [in Japanese]

Abstract

This paper describes an improvement to a statistical method, previously proposed by the authors, for automatically estimating word (morpheme) boundaries in unsegmented Japanese text. Recognizing words in Japanese text normally requires morphological analysis, but unregistered words remain a problem, and resolving them requires dictionaries and complex heuristics. Our method is statistical and does not rely on traditional morphological analysis. It consists of two steps: (1) calculating the Normalized Frequency of each substring in the text from N-gram statistics (frequency, expected frequency, and variance), and (2) determining the boundaries between words using the Normalized Frequency. We found that revising the second step, by changing the conditions under which the solution is searched for, improves the accuracy of the method. In experiments on the EDR Japanese corpus, the estimated boundaries matched the morpheme boundaries with a precision of 82.00% and a recall of 82.20%. Because the method does not overlap with existing morphological analysis techniques, combining the two can be expected to improve accuracy further.
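The two steps named in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula for Normalized Frequency or the search procedure, so everything below is an assumption made for illustration: the score is taken to be a z-score of a substring's frequency against the mean and variance of all substrings of the same length, and boundaries are chosen by a dynamic program that maximizes the summed score. The function names and the `unseen_score` parameter are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def ngram_counts(text, max_n):
    """Count every substring of length 1..max_n in the text."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[n][text[i:i + n]] += 1
    return counts


def normalized_frequency(counts):
    """ASSUMED definition: z-score of each n-gram's frequency relative to
    the mean and variance of all n-grams of the same length."""
    scores = {}
    for counter in counts.values():
        if not counter:
            continue
        freqs = list(counter.values())
        mean = sum(freqs) / len(freqs)
        variance = sum((f - mean) ** 2 for f in freqs) / len(freqs)
        std = sqrt(variance)
        for gram, freq in counter.items():
            scores[gram] = (freq - mean) / std if std > 0 else 0.0
    return scores


def segment(sentence, scores, max_n, unseen_score=-1.0):
    """ASSUMED search: pick the segmentation whose pieces have the highest
    total score, using dynamic programming over end positions."""
    best = [float("-inf")] * (len(sentence) + 1)
    back = [0] * (len(sentence) + 1)
    best[0] = 0.0
    for end in range(1, len(sentence) + 1):
        for n in range(1, min(max_n, end) + 1):
            piece = sentence[end - n:end]
            candidate = best[end - n] + scores.get(piece, unseen_score)
            if candidate > best[end]:
                best[end] = candidate
                back[end] = end - n
    # Walk the back-pointers to recover the pieces in order.
    words = []
    end = len(sentence)
    while end > 0:
        words.append(sentence[back[end]:end])
        end = back[end]
    return words[::-1]


if __name__ == "__main__":
    corpus = "abab"  # toy stand-in for unsegmented text
    scores = normalized_frequency(ngram_counts(corpus, max_n=2))
    print(segment("abab", scores, max_n=2))
```

In this toy input the bigram "ab" occurs more often than "ba", so it receives the higher score and the search prefers to keep it together. A real experiment would use a large corpus and the scoring and search conditions defined in the paper itself.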


Journal

  • IPSJ SIG Notes

    IPSJ SIG Notes 1996(56(1996-NL-113)), 13-18, 1996-05-28

    Information Processing Society of Japan (IPSJ)

References:  15

Cited by:  6

Codes

  • NII Article ID (NAID)
    110002934917
  • NII NACSIS-CAT ID (NCID)
    AN10115061
  • Text Lang
    JPN
  • Article Type
    Journal Article
  • Data Source
    CJP  CJPref  NII-ELS  IPSJ 