Transcription Cost Reduction for Acoustic Model Construction by Speech Data Selection Based on Acoustic Likelihoods (音響尤度を用いた書き起こしデータ選択による音響モデル構築コストの削減) [in Japanese]

Abstract

Advances in speech recognition technology have made fast, highly accurate recognition possible, and the technology has now entered the stage of practical use. To achieve high-accuracy recognition in a speaker-independent system, it is essential to train the acoustic model on speech collected by the target system together with its transcriptions. However, transcription requires a great deal of time and labor and is therefore costly. This is one factor hindering the spread of speaker-independent systems. By selecting in advance the speech data that is effective for model construction, the transcription cost can be reduced. This report proposes a method that reduces model construction cost by selecting the training data set in advance and transcribing only the selected data. From speech automatically collected by the real-environment spoken information guidance system "Takemaru-kun", we automatically selected the training data set on the basis of acoustic likelihood, reducing the amount of transcription by 50% to 90%, and evaluated the accuracy of the resulting acoustic models. When the amount of collected data is small, selecting the training data gives performance equal to or better than using all of the data. When the amount of collected data is large, performance is lower than with all of the data, but the amount of transcription could be reduced to 30% with a drop of less than 1% in recognition accuracy.

Research on automatic speech recognition (ASR) has advanced dramatically in recent years, moving the technology toward practical use. To achieve the peak performance of an ASR system, we need acoustic and language models adapted to the specific task. Training task-adapted models requires collecting actual data samples and their transcriptions. However, transcribing utterances is a time-consuming and laborious process, and this burden has become a critical obstacle to the practical use of ASR systems. If we select the utterances that are effective for training before transcribing them, we can reduce the transcription effort. This paper describes a likelihood-based method that reduces the transcription effort required to construct task-adapted acoustic models. In the proposed method, we automatically select informative training samples to be transcribed from a large speech corpus, based on acoustic likelihood. To demonstrate the effectiveness of the proposed method, we perform several experimental evaluations in the framework of 'Takemaru-kun', a practical speech-oriented guidance system. The results show that the number of utterances to be transcribed can be reduced to 30% with less than 1% deterioration in recognition performance.
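
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the likelihoods are computed or whether high- or low-likelihood utterances are preferred. The sketch assumes that each untranscribed utterance already has a per-frame average log-likelihood under some existing acoustic model, and that the lowest-scoring utterances are sent for transcription first. The function name, the utterance IDs, and the scores are all hypothetical.

```python
from typing import Dict, List


def select_for_transcription(
    avg_loglik: Dict[str, float],
    keep_fraction: float,
) -> List[str]:
    """Return the IDs of the utterances to send for transcription.

    avg_loglik maps a (hypothetical) utterance ID to its per-frame average
    log-likelihood under an existing acoustic model.

    ASSUMPTION: the lowest-likelihood utterances are picked first. The
    abstract only says that selection is "based on acoustic likelihood",
    so the ranking direction here is an illustrative choice.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, round(len(avg_loglik) * keep_fraction))
    # Sort by score, breaking ties by ID so the result is deterministic.
    ranked = sorted(avg_loglik, key=lambda utt: (avg_loglik[utt], utt))
    return ranked[:n_keep]


# Made-up scores for ten untranscribed utterances.
scores = {
    "utt01": -62.1, "utt02": -75.4, "utt03": -58.9, "utt04": -81.0,
    "utt05": -66.3, "utt06": -70.2, "utt07": -59.5, "utt08": -77.8,
    "utt09": -64.0, "utt10": -68.7,
}

# Keep 30% of the data, matching the reduction quoted in the abstract.
selected = select_for_transcription(scores, keep_fraction=0.3)
print(selected)
```

Only the utterances in `selected` would then be transcribed and used for training. Changing `keep_fraction` corresponds to trading transcription effort against recognition accuracy, as the reported experiments do.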

Journal

  • IPSJ SIG Notes

    IPSJ SIG Notes 2005(127(2005-SLP-059)), 229-234, 2005-12-22

    Information Processing Society of Japan (IPSJ)

References:  12

Codes

  • NII Article ID (NAID)
    110003494761
  • NII NACSIS-CAT ID (NCID)
    AN10442647
  • Text Lang
    JPN
  • Article Type
    Technical Report
  • ISSN
    09196072
  • NDL Article ID
    7768384
  • NDL Source Classification
    ZM13(科学技術--科学技術一般--データ処理・計算機)
  • NDL Call No.
    Z14-1121
  • Data Source
    CJP  NDL  NII-ELS  IPSJ 