Conditional Random Fieldsを用いた日本語形態素解析  [in Japanese] Applying Conditional Random Fields to Japanese Morphologiaical Analysis  [in Japanese]

Access this Article

Search this Article

Author(s)

    • 工藤 拓 KUDO Taku
    • 奈良先端科学技術大学院大学 情報科学研究科 Graduate School of Information Science, Nara Institute of Science and Technology
    • 松本 裕治 MATSUMOTO Yuji
    • 奈良先端科学技術大学院大学 情報科学研究科 Graduate School of Information Science, Nara Institute of Science and Technology

Abstract

本稿では Conditonal Random Fields (CRF) に基づく日本語形態素解析を提案する. CRFを適用したこれまでの研究の多くは 単語の境界位置が既知の状況を想定していた. しかし 日本語には明示的な単語境界が無く 単語境界同定と品詞同定を同時に行うタスクである日本語形態素解析にCRFを直接適用することは困難である. 本稿ではまず 単語境界が存在する問題に対するCRFの適用方法について述べる. さらに CRFが既存手法(HMM MEMM) の問題点を自然にかつ有効に解決することを実データを用いた実験と共に示す. CRFは 階層構造を持つ品詞体系や文字種の情報に対して柔軟な素性設計を可能にし label biasやlength biasを低減する効果を持つ. 前者はHMM の欠点であり 後者はMEMMの欠点である. また 2つの正則化手法(L1-CRF/L2-CRF) を適用し それぞれの性質について論じる.This paper presents Japanese morphological analysis based on Conditional Random Fields (CRF). Previous work in CRF assumed that observation sequence (word) boundaries were fixed. However, word boundaries are not clear in Japanese, and hence a straightforward application of CRF is not possible. We show how CRF can be applied to situations where word boundary ambiguity exists. CRF offer an elegant solution to the long-standing problems in Japanese morphological analysis using HMM or MEMM. First, flexible feature designs for hierarchical tagsets become possible. Second, influences of label and length bias are minimized. The former compensate weakness in HMM, while the latter overcomes noticed problems in MEMM. We experiment with CRF, HMM, and MEMM on Japanese annotated corpora, and CRF outperform the other approaches.

This paper presents Japanese morphological analysis based on Conditional Random Fields (CRF). Previous work in CRF assumed that observation sequence (word) boundaries were fixed. However, word boundaries are not clear in Japanese, and hence a straightforward application of CRF is not possible. We show how CRF can be applied to situations where word boundary ambiguity exists. CRF offer an elegant solution to the long-standing problems in Japanese morphological analysis using HMM or MEMM. First, flexible feature designs for hierarchical tagsets become possible. Second, influences of label and length bias are minimized. The former compensate weakness in HMM, while the latter overcomes noticed problems in MEMM. We experiment with CRF, HMM, and MEMM on Japanese annotated corpora, and CRF outperform the other approaches.

Journal

  • IPSJ SIG Notes

    IPSJ SIG Notes 2004(47(2004-NL-161)), 89-96, 2004-05-14

    Information Processing Society of Japan (IPSJ)

References:  21

Cited by:  26

Codes

  • NII Article ID (NAID)
    110002911717
  • NII NACSIS-CAT ID (NCID)
    AN10115061
  • Text Lang
    JPN
  • Article Type
    Journal Article
  • ISSN
    09196072
  • NDL Article ID
    6985183
  • NDL Source Classification
    ZM13(科学技術--科学技術一般--データ処理・計算機)
  • NDL Call No.
    Z14-1121
  • Data Source
    CJP  CJPref  NDL  NII-ELS  IPSJ 
Page Top