新聞・ニュース文の大語彙連続音声認識  Large-vocabulary speech recognition experiments using newspaper and broadcast news corpora  [in Japanese]

Author(s)

Abstract

In this study, we used large text and speech corpora to investigate how to build accurate N-gram language models for speech recognition. Because an N-gram language model is task-dependent, it needs to be built from a large database related to the task. Using a newspaper article text database, we therefore examined a task-adaptation method that uses past articles of the same genre, and assessed its effectiveness. In addition, because newspaper articles contain many frequently used (special) expressions and fixed expressions consisting of multiple morphemes, we examined a method that extracts them automatically and treats each as a single morpheme when building the N-gram language model. We applied these language models to read speech and evaluated them. Furthermore, as another available large text corpus, we used the NHK news manuscript corpus to build a language model, applied it to spontaneous speech, and carried out a comparative evaluation.

In this paper, we describe a method that constructs language models using a task-adaptation strategy and idiomatic expressions of news articles. To build an effective N-gram language model, as much training data as possible must be prepared. However, for a given task or topic, it is very difficult to prepare a sufficient amount of data. First, we investigated the effect of a task-adaptation method for N-gram language models using a limited amount of target articles. Second, we investigated the effect of using idiomatic expressions as morpheme units, since some specific expressions and idiomatic expressions are frequently observed in news articles. Experiments using a read-speech database of news articles were conducted to investigate the effectiveness of these methods for constructing N-gram language models. Experimental results using broadcast news speech (spontaneous speech) and text corpora are also presented.
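
The two ideas named in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the extraction criterion for fixed expressions or the adaptation formula, so everything here is an assumption made for illustration: a simple frequency threshold on adjacent morpheme pairs stands in for the extraction step, and linear interpolation of bigram estimates stands in for task adaptation. The function names (`merge_frequent_pairs`, `bigram_probs`, `interpolate`) and the parameters `min_count` and `weight` are hypothetical and do not come from the paper.

```python
from collections import Counter


def merge_frequent_pairs(sentences, min_count=2, joiner="_"):
    """Merge adjacent morpheme pairs seen at least min_count times into one unit.

    Assumption: a raw frequency threshold is used as the extraction criterion.
    """
    pair_counts = Counter(
        (a, b) for sent in sentences for a, b in zip(sent, sent[1:])
    )
    frequent = {pair for pair, count in pair_counts.items() if count >= min_count}
    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            # Greedy left-to-right merge of a frequent pair into a single token.
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) in frequent:
                out.append(sent[i] + joiner + sent[i + 1])
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged


def bigram_probs(sentences):
    """Maximum-likelihood bigram estimates P(w | prev), without smoothing."""
    bigrams, history = Counter(), Counter()
    for sent in sentences:
        for prev, word in zip(sent, sent[1:]):
            bigrams[(prev, word)] += 1
            history[prev] += 1
    return {bg: count / history[bg[0]] for bg, count in bigrams.items()}


def interpolate(general, target, weight=0.5):
    """Linear interpolation: weight * target + (1 - weight) * general.

    Assumption: this stands in for task adaptation. For a history seen in
    only one model, the interpolated values sum to less than 1.
    """
    keys = set(general) | set(target)
    return {
        k: weight * target.get(k, 0.0) + (1 - weight) * general.get(k, 0.0)
        for k in keys
    }


# Toy data: a small "general" corpus and a smaller "target genre" corpus.
general_corpus = [["a", "b"], ["a", "c"]]
target_corpus = [["a", "b"]]
adapted = interpolate(bigram_probs(general_corpus), bigram_probs(target_corpus))
```

A real system would add smoothing and back-off, and would run a morphological analyzer to obtain the morpheme sequences; the sketch only shows where merged units and an adaptation weight would enter the model.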

Journal

  • IPSJ SIG Notes

    IPSJ SIG Notes 1998(49(1998-SLP-021)), 97-104, 1998-05-28

    Information Processing Society of Japan (IPSJ)

References:  20

Cited by:  9

Codes

  • NII Article ID (NAID)
    110002917063
  • NII NACSIS-CAT ID (NCID)
    AN10442647
  • Text Lang
    JPN
  • Article Type
    Journal Article
  • ISSN
    09196072
  • Data Source
    CJP  CJPref  NII-ELS  IPSJ 