A stream-weight optimization method for audio-visual speech recognition in real environments [in Japanese]

Author(s)

    • TAMURA Satoshi (田村 哲嗣)
      Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology
    • IWANO Koji (岩野 公司)
      Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology
    • FURUI Sadaoki (古井 貞煕)
      Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology

Abstract

Multimodal speech recognition, which uses visual information from lip images alongside acoustic information, has been actively investigated as a way to make automatic speech recognition (ASR) more robust. To improve the performance of multimodal ASR in real environments, it is essential to optimize the stream weights of multi-stream HMMs automatically, even when only a small amount of adaptation data is available. Building on our earlier likelihood-ratio maximization method, this paper proposes a new stream-weight optimization method based on an output likelihood normalization (OLN) criterion: the stream weights are adjusted so that the mean output log-likelihood values are equalized across all HMMs. Recognition experiments were conducted on real-environment audio-visual data recorded with an in-vehicle camera in a moving car. Applying the OLN stream-weight optimization under unsupervised conditions improved recognition accuracy by about 16% over an audio-only baseline. Combining the proposed method with MLLR adaptation yielded an improvement of about 23%.
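
As a rough illustration of the idea in the abstract, the Python sketch below combines audio and visual log-likelihoods with stream weights that sum to one, then picks the audio weight that makes the per-model mean log-likelihoods most nearly equal. It is only a sketch under stated assumptions: the abstract does not give the OLN update rule, so the variance objective, the grid search, and every name here (combined_log_likelihood, spread_of_model_means, pick_audio_weight, toy_frames) are hypothetical stand-ins rather than the authors' algorithm, and the numbers are made-up toy values.

```python
from statistics import mean, pvariance


def combined_log_likelihood(log_b_audio, log_b_visual, audio_weight):
    """Multi-stream score: w_A * log b_A + w_V * log b_V, with w_A + w_V = 1."""
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * log_b_audio + (1.0 - audio_weight) * log_b_visual


def spread_of_model_means(frames_by_model, audio_weight):
    """Variance, across models, of the mean combined log-likelihood.

    Assumption: 'equalizing mean log likelihood values' is approximated
    here by minimizing this variance. The paper's criterion may differ.
    """
    means = [
        mean(combined_log_likelihood(a, v, audio_weight) for a, v in frames)
        for frames in frames_by_model.values()
    ]
    return pvariance(means)


def pick_audio_weight(frames_by_model, steps=20):
    """Grid search over audio weights in [0, 1] (a stand-in optimizer)."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: spread_of_model_means(frames_by_model, w))


# Toy data: per model, a list of (audio, visual) log-likelihood pairs.
toy_frames = {
    "model_a": [(-10.0, -4.0), (-12.0, -6.0)],
    "model_b": [(-6.0, -8.0), (-8.0, -10.0)],
}

if __name__ == "__main__":
    print(pick_audio_weight(toy_frames))
```

A single global weight is used to keep the sketch short. The search needs only likelihood values, not transcriptions, which is in the spirit of the unsupervised condition mentioned in the abstract.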

Journal

  • IPSJ SIG Notes

    IPSJ SIG Notes 2005(12(2004-SLP-055)), 29-34, 2005-02-04

    Information Processing Society of Japan (IPSJ)

References:  9

Codes

  • NII Article ID (NAID)
    110002950635
  • NII NACSIS-CAT ID (NCID)
    AN10442647
  • Text Lang
    JPN
  • Article Type
    Technical Report
  • ISSN
    0919-6072
  • NDL Article ID
    7278441
  • NDL Source Classification
    ZM13 (Science and Technology -- General Science and Technology -- Data Processing and Computers)
  • NDL Call No.
    Z14-1121
  • Data Source
    CJP, NDL, NII-ELS, IPSJ