Improving the Performance of Text Categorization Using Low Frequency Terms (Information Retrieval) [in Japanese; original title: 低頻度語の利用によるテキスト分類性能の改善と評価(情報検索)]

Abstract

This paper investigates the use of low frequency terms in text categorization and its effect. To exploit the many low frequency terms in text as evidence, we focus on simple text categorization methods based on linear decision functions and aim to improve performance through (1) a term weighting measure based on the probability-weighted amount of information, (2) discounting techniques from probabilistic language modeling, and (3) compound noun extraction based on part-of-speech tags generated by a standard morphological analyzer. In the experiments, we use a simple vector method and support vector machines, both of which scale well, to examine the improvements and characteristics on a large-scale text categorization problem.
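The abstract names its ingredients but gives no formulas, so the following is only a minimal sketch of how a discounted, information-based term weight could feed a linear decision function. Everything in it is an assumption for illustration: the weight `P(t,c) * log(P(t,c) / (P(t) * P(c)))`, the use of absolute discounting with a `discount` of 0.5, and the function names `train_weights` and `classify` are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def train_weights(docs, discount=0.5):
    """Build per-category term weights from (tokens, category) pairs.

    Assumed weight (not the paper's exact formula):
        w(t, c) = P(t, c) * log(P(t, c) / (P(t) * P(c)))
    where P(t, c) is estimated from an absolutely discounted count.
    """
    joint = defaultdict(Counter)  # joint[c][t] = occurrences of t in category c
    for tokens, cat in docs:
        joint[cat].update(tokens)

    term_total = Counter()
    for counts in joint.values():
        term_total.update(counts)
    n = sum(term_total.values())  # total number of term occurrences

    weights = {}
    for cat, counts in joint.items():
        p_c = sum(counts.values()) / n
        w = {}
        for term, freq in counts.items():
            # Absolute discounting: subtract a constant from each observed
            # count so that rare terms are trusted less than frequent ones.
            discounted = max(freq - discount, 0.0)
            if discounted == 0.0:
                continue
            p_tc = discounted / n
            p_t = term_total[term] / n
            w[term] = p_tc * math.log(p_tc / (p_t * p_c))
        weights[cat] = w
    return weights


def classify(tokens, weights):
    """Linear decision function: sum the weights of the observed terms."""
    scores = {
        cat: sum(w.get(t, 0.0) for t in tokens) for cat, w in weights.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy data, invented for this sketch only.
    docs = [
        (["stock", "market", "price"], "finance"),
        (["stock", "bond"], "finance"),
        (["goal", "match", "team"], "sports"),
        (["team", "league"], "sports"),
    ]
    model = train_weights(docs)
    print(classify(["stock", "bond", "yield"], model))
```

In this sketch the discount shrinks the contribution of terms seen only once, which is one simple way to keep many low frequency terms as evidence without letting any single rare term dominate the score. A compound noun extraction step would run before `train_weights`, merging adjacent noun tokens into single terms; it is omitted here because it depends on the morphological analyzer used.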

Journal

IPSJ Journal

IPSJ Journal 44(7), 1720-1730, 2003-07-15

Information Processing Society of Japan (IPSJ)

References:  27

Cited by:  4

Codes

  • NII Article ID (NAID) :
    110002711767
  • NII NACSIS-CAT ID (NCID) :
    AN00116647
  • Text Lang :
    JPN
  • Article Type :
    Journal Article
  • ISSN :
    03875806
  • NDL Article ID :
    6643827
  • NDL Source Classification :
    ZM13 (Science and Technology -- Science and Technology in General -- Data Processing and Computers)
  • NDL Call No. :
    Z14-741
  • Databases :
    CJP  CJPref  NDL  NII-ELS