Spoken dialogue system robust against speech variations based on massively parallel computing  [in Japanese]

Author(s)

Abstract

Robustness of speech recognition improves by preparing models suited to acoustic and linguistic variations when those variations can be predicted in advance, and by incrementally adapting the models when they are difficult to predict. To combine these computationally expensive methods and apply them to spoken dialogue systems, which require real-time processing, this paper investigates the use of a massively parallel computer. An architecture that runs multiple speech recognizers in parallel, selects the recognition hypothesis with the maximum likelihood, and performs adaptation in the background has been implemented on a GRID computing system. In a restaurant information retrieval task, multiple language models representing linguistic variations of the input speech according to utterance content (topics and utterance categories) and multiple speaker-dependent acoustic models representing speaker variations were used. Evaluation experiments on pre-recorded dialogue utterances show that, with 75 recognition nodes and 15 speaker-adaptation nodes, the proposed system reduces the keyword recognition error rate by 25.5% compared with a conventional system using a single acoustic model and a single language model.
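The selection step the abstract describes can be illustrated with a minimal sketch: drive several recognizers on the same utterance concurrently and keep the hypothesis with the highest likelihood. This is not the paper's implementation; the recognizer stubs, their `(transcript, log_likelihood)` return shape, and the scores below are hypothetical placeholders for real decoders with different acoustic/language models.

```python
from concurrent.futures import ThreadPoolExecutor

def select_best_hypothesis(recognizers, utterance):
    """Run every recognizer on the utterance in parallel and
    return the (transcript, log_likelihood) pair with the
    maximum likelihood, mirroring the selection rule in the text."""
    with ThreadPoolExecutor(max_workers=len(recognizers)) as pool:
        results = list(pool.map(lambda rec: rec(utterance), recognizers))
    return max(results, key=lambda hyp: hyp[1])

# Hypothetical recognizer stubs, each standing in for a decoder
# built from a different language/acoustic model combination.
recognizers = [
    lambda u: ("italian restaurants nearby", -120.4),
    lambda u: ("italian restaurant near me", -98.7),
    lambda u: ("a tall yen rest or on", -210.9),
]

best = select_best_hypothesis(recognizers, "audio-frames")
print(best[0])  # → italian restaurant near me
```

In the system described above, the same fan-out/select pattern runs across 75 recognition nodes, while speaker adaptation proceeds on separate background nodes rather than blocking the real-time selection path.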

Journal

  • IPSJ SIG Notes

    IPSJ SIG Notes 2005(127(2005-SLP-059)), 91-96, 2005-12-22

    Information Processing Society of Japan (IPSJ)

References:  15

Codes

  • NII Article ID (NAID)
    110003494732
  • NII NACSIS-CAT ID (NCID)
    AN10442647
  • Text Lang
    JPN
  • Article Type
    Technical Report
  • ISSN
    09196072
  • NDL Article ID
    7768015
  • NDL Source Classification
ZM13 (Science and technology -- Science and technology in general -- Data processing and computers)
  • NDL Call No.
    Z14-1121
  • Data Source
    CJP  NDL  NII-ELS  IPSJ 