From Reductionism to Holism -Holistic Approach for Speech Information Processing- (original title: 要素論から全体論へ~全体から入る音声情報処理への招待~)

Abstract

A language generally has several tens of phonemes. The acoustic realization of a phoneme varies with its phonemic environment (the preceding and following context), and these context-dependent variants are called allophones. Allophones are far more numerous than phonemes and correspond to more concrete acoustic phenomena. Strangely, however, when these sound events are transcribed with symbols, the acoustic variation caused by non-linguistic factors such as gender, age, microphone, room, and recording or transmission equipment is ignored completely, however large that variation may be. Acoustic modeling for speech recognition models sound events roughly equivalent to allophones as triphones, and it implements this disregard of non-linguistic variation by collecting speech samples from an enormous number of speakers (tens of thousands) in varied environments and training statistical models of the individual allophones. This paper shows mathematically that the disregard can be realized not by collecting samples but by capturing the timbral differences between sound events. The possibility of speaker-independent speech recognition with only a very small number of training speakers is then examined experimentally. The proposed framework models not the elementary sound substances (reductionism) but the holistic sound structure composed exclusively of timbral differences (holism).
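The claim that differences between sound events can be insensitive to non-linguistic variation can be illustrated with a small numerical sketch. This is not the paper's derivation. It assumes, purely for illustration, that each sound event is a one-dimensional Gaussian, that a non-linguistic factor acts as a common invertible affine map x -> a*x + b on the feature, and that the contrast between two events is measured with the Bhattacharyya distance. The event names, the numbers, and the helper functions (`bhattacharyya_1d`, `affine`, `structure`) are all hypothetical.

```python
import math
from itertools import combinations


def bhattacharyya_1d(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two 1-D Gaussians N(mu, var)."""
    avg_var = 0.5 * (var1 + var2)
    mean_term = 0.125 * (mu1 - mu2) ** 2 / avg_var
    var_term = 0.5 * math.log(avg_var / math.sqrt(var1 * var2))
    return mean_term + var_term


def affine(event, a, b):
    """Apply x -> a*x + b to a Gaussian event given as (mean, variance)."""
    mu, var = event
    return (a * mu + b, a * a * var)


def structure(events):
    """Distances between every pair of events: the 'differences only' view."""
    return {
        (p, q): bhattacharyya_1d(*events[p], *events[q])
        for p, q in combinations(sorted(events), 2)
    }


# Hypothetical sound events of one speaker, each given as (mean, variance).
speaker_a = {"a": (1.0, 0.20), "i": (-0.5, 0.10), "u": (0.3, 0.15)}

# Assumption: a second speaker or channel is modeled as one affine map
# applied to every event of the first speaker.
speaker_b = {name: affine(ev, a=1.7, b=-2.0) for name, ev in speaker_a.items()}

struct_a = structure(speaker_a)
struct_b = structure(speaker_b)

# The individual events differ between the two speakers, but the pairwise
# distances agree up to floating-point rounding.
max_gap = max(abs(struct_a[k] - struct_b[k]) for k in struct_a)
print(max_gap)
```

Under these assumptions the absolute position and spread of every event change from one speaker to the other, while the set of pairwise distances stays the same. A model built only on those distances would therefore not need samples from many speakers to absorb this kind of variation.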

Published in

  • 情報処理学会研究報告音声言語情報処理(SLP) (IPSJ SIG technical report series on Spoken Language Processing)

    2007(75(2007-SLP-067)), 75-80, 2007-07-21

    Information Processing Society of Japan (IPSJ)

References: 30 in total

Cited by: 3 in total

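Codes

  • NII Article ID (NAID)
    110006381953
  • NII Bibliographic ID (NCID)
    AN10442647
  • Text language code
    JPN
  • Material type
    Technical Report
  • ISSN
    09196072
  • NDL Article ID
    8858059
  • NDL journal classification
    ZM13 (Science and Technology -- Science and Technology in General -- Data Processing and Computers)
  • NDL call number
    Z14-1121
  • Data sources
    CJP bibliography, CJP citations, NDL, NII-ELS, IPSJ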