Identifying a cross-document relation between sentences (異なる文書中の文間関係の特定) [in Japanese]

Author(s)

    • MIYABE Yasunari (宮部 泰成)
    • Department of Computational Intelligence and Systems Science, Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology
    • OKUMURA Manabu (奥村 学)
    • Precision and Intelligence Laboratory, Tokyo Institute of Technology

Abstract

We propose a machine-learning-based method that identifies equivalence relations between sentences in different newspaper articles on the same topic, that is, whether two sentences from different articles state the same content. The method divides the sentence pairs into several classes according to the similarity of the two sentences and trains a classifier for each class with features suited to that class. We showed that this yields better results than training without dividing the data. In addition, in classes where the similarity between the two sentences is not very high, the number of sentence pairs in an equivalence relation is very small compared with the total number of sentence pairs in the class. As a result, the classifier sometimes fails to identify an equivalence relation even when the two sentences state the same content. To solve this problem, we use "relations similar to equivalence", which hold between sentences in different articles that describe the same content more briefly or in more detail. We propose a two-stage method that first identifies a coarse class that includes both equivalence and "relations similar to equivalence", and then identifies equivalence alone within that coarse class. We showed that combining these two methods yields high accuracy.
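
The two ideas in the abstract, routing each sentence pair to a similarity class and then deciding in two stages (coarse class first, equivalence second), can be illustrated with a minimal Python sketch. The abstract does not state the similarity measure, the class boundaries, the features, or the learning algorithm, so everything concrete here is an assumption for illustration: bag-of-words cosine similarity, the thresholds 0.3 and 0.6, and the toy rule-based stand-ins for the trained classifiers. The sketch also applies the two-stage decision in every similarity class for simplicity, which the abstract does not confirm.

```python
from collections import Counter
from math import sqrt
from typing import Callable, Sequence

# A pair classifier takes two sentences and returns a yes/no decision.
# In the paper these would be trained models; here they are placeholders.
PairClassifier = Callable[[str, str], bool]


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (assumed measure, whitespace tokens)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def similarity_class(sim: float, thresholds: Sequence[float] = (0.3, 0.6)) -> int:
    """Map a similarity score to a class index (0 = least similar).

    The thresholds are hypothetical; the abstract gives no boundaries.
    """
    return sum(sim >= t for t in thresholds)


def identify_equivalence(
    s1: str,
    s2: str,
    coarse: Sequence[PairClassifier],
    fine: Sequence[PairClassifier],
    thresholds: Sequence[float] = (0.3, 0.6),
) -> bool:
    """Two-stage decision using the classifiers of the pair's similarity class.

    Stage 1: is the pair in the coarse class (equivalence or a relation
             similar to equivalence)?
    Stage 2: if so, is it an equivalence relation specifically?
    """
    k = similarity_class(cosine_similarity(s1, s2), thresholds)
    if not coarse[k](s1, s2):
        return False
    return fine[k](s1, s2)


if __name__ == "__main__":
    # Toy stand-ins for trained classifiers (purely illustrative rules).
    def toy_coarse(a: str, b: str) -> bool:
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) >= min(len(wa), len(wb)) / 2

    def toy_fine(a: str, b: str) -> bool:
        return abs(len(a.split()) - len(b.split())) <= 2

    s1 = "the mayor opened the new bridge on monday"
    s2 = "the new bridge was opened by the mayor"
    print(identify_equivalence(s1, s2, [toy_coarse] * 3, [toy_fine] * 3))
```

Keeping one classifier pair per similarity class mirrors the abstract's point that each class is learned with its own features, and the coarse first stage gives the sparse "equivalence" label more positive examples to learn from in the less similar classes.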

Journal

  • IPSJ SIG Notes

    IPSJ SIG Notes 2005(73(2005-NL-168)), 35-42, 2005-07-22

    Information Processing Society of Japan (IPSJ)

References:  14

Cited by:  9

Codes

  • NII Article ID (NAID)
    110002952440
  • NII NACSIS-CAT ID (NCID)
    AN10115061
  • Text Lang
    JPN
  • Article Type
    Journal Article
  • ISSN
    0919-6072
  • NDL Article ID
    7385603
  • NDL Source Classification
    ZM13 (Science and technology -- Science and technology in general -- Data processing and computers)
  • NDL Call No.
    Z14-1121
  • Data Source
    CJP  CJPref  NDL  NII-ELS  IPSJ 