日本語の情報量の上限の推定 / An Estimate of an Upper Bound for the Entropy of Japanese  [in Japanese]

Author(s)

    • 森 信介 MORI Shinsuke
    • 京都大学工学研究科電子通信工学専攻 Department of Electronics and Communication, Kyoto University
    • 山地 治 YAMAJI Osamu
    • 京都大学工学研究科電子通信工学専攻 Department of Electronics and Communication, Kyoto University

Abstract

In this paper we present an estimate of an upper bound for the entropy of Japanese using morpheme n-gram models (1 ≤ n ≤ 16). To cope with data sparseness, each n-gram model is interpolated with lower-order n-gram models; the interpolation coefficients are estimated by the deleted interpolation method, which is considered the most effective. In our experiments, the model parameters were estimated from about 90% of the EDR corpus and the entropy was computed on the remaining 10%. The minimum entropy, 4.30330 bits per character, was obtained at n = 16. Examining how the entropy varies with the size of the training corpus and the order of the model, we found that raising the model order decreases the entropy only slightly, whereas enlarging the training corpus decreases it considerably. We also discuss the relation between the number of parameters and the entropy, which serves as a guideline for choosing an appropriate value of n when applying n-gram models to practical Japanese language processing.
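The abstract above describes interpolating each n-gram model with lower-order models and measuring cross-entropy in bits per character on held-out text. A minimal character-level bigram sketch of that idea, with a toy string standing in for the EDR corpus and hand-picked interpolation weights standing in for the deleted-interpolation estimates (both are illustrative assumptions, not the paper's setup):

```python
import math
from collections import Counter

# Toy character-level stand-in for the training and held-out corpora
# (the paper trains on ~90% of the EDR corpus and tests on the rest).
train = list("すもももももももものうち")
test = list("もものうち")

uni = Counter(train)                    # unigram counts
bi = Counter(zip(train, train[1:]))     # bigram counts
N = len(train)                          # training size
V = len(uni)                            # vocabulary size

# Interpolation weights for uniform / unigram / bigram components.
# The paper estimates these by deleted interpolation; here they are
# fixed by hand purely for illustration.
l0, l1, l2 = 0.1, 0.3, 0.6

def p(w, prev):
    """Interpolated bigram probability P(w | prev)."""
    p_uni = uni[w] / N
    p_bi = bi[(prev, w)] / uni[prev] if uni.get(prev) else 0.0
    return l0 / V + l1 * p_uni + l2 * p_bi  # uniform floor keeps p > 0

# Cross-entropy on held-out text, in bits per character.
H = 0.0
for prev, w in zip(test, test[1:]):
    H -= math.log2(p(w, prev))
H /= len(test) - 1
print(f"{H:.3f} bits per character")
```

The uniform floor plays the role of the lowest-order interpolation term, so every held-out character receives nonzero probability and the cross-entropy stays finite; extending the same recursion to higher orders (up to n = 16, as in the paper) only adds more interpolated terms.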

Journal

  • Transactions of Information Processing Society of Japan

    38(11), 2191-2199, 1997-11-15

    Information Processing Society of Japan (IPSJ)

References:  13

Cited by:  19

Codes

  • NII Article ID (NAID)
    110002721669
  • NII NACSIS-CAT ID (NCID)
    AN00116647
  • Text Lang
    JPN
  • Article Type
    Journal Article
  • ISSN
    1882-7764
  • NDL Article ID
    4332450
  • NDL Source Classification
    ZM13 (Science and technology -- Science and technology in general -- Data processing; computers)
  • NDL Call No.
    Z14-741
  • Data Source
    CJP  CJPref  NDL  NII-ELS  IPSJ 