Discrimination between Singing and Speaking Voices Using Local and Global Characteristics (局所的・大局的な特徴を利用した歌声と朗読音声の識別)

Authors

Abstract

This report examines how to discriminate between singing and speaking voices using local and global characteristics of the voice signal. Listening experiments show that human listeners can discriminate singing and speaking voices with more than 70.0% and 99.7% accuracy from 200 ms and one-second signals, respectively. Based on these results, and assuming that different features are effective for short-term and long-term signals, we designed two measures: one based on the spectral envelope (MFCC) and one based on the fundamental frequency (F0, perceived as pitch) contour. Experimental results show that the F0 measure performs better than the spectral envelope measure when the input voice signal is longer than one second. In particular, it discriminates singing and speaking voices with 85.0% accuracy for the first two seconds of an utterance. Conversely, when the input signal is shorter than one second, the spectral envelope measure performs better than the F0 measure. Finally, simply combining the two measures yields 87.5% accuracy for two-second signals.
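The abstract does not say how the F0 contour is parameterized or how the two measures are combined, so the following is only a minimal sketch under stated assumptions. It assumes the F0 contour is summarized by frame-to-frame pitch changes in cents, that each measure outputs a score whose positive values favor singing (for example, a log-likelihood ratio), and that "simply combining" means a weighted sum. The names `delta_f0_cents`, `combine_scores`, `classify`, `weight`, and `threshold` are hypothetical and do not come from the report.

```python
import math


def delta_f0_cents(f0_track_hz):
    """Frame-to-frame F0 change in cents.

    Assumption: unvoiced frames are marked with 0 Hz, and a delta is
    only computed when both neighbouring frames are voiced.
    """
    deltas = []
    for prev, cur in zip(f0_track_hz, f0_track_hz[1:]):
        if prev > 0 and cur > 0:
            # 1200 cents per octave on a log-frequency scale
            deltas.append(1200.0 * math.log2(cur / prev))
    return deltas


def combine_scores(spectral_score, f0_score, weight=0.5):
    """Weighted sum of the two measures (hypothetical fusion rule)."""
    return weight * spectral_score + (1.0 - weight) * f0_score


def classify(spectral_score, f0_score, weight=0.5, threshold=0.0):
    """Label a signal from the fused score; positive favors singing."""
    fused = combine_scores(spectral_score, f0_score, weight)
    return "singing" if fused > threshold else "speaking"
```

For example, `classify(-0.4, 1.2)` fuses a spectral score that weakly favors speaking with an F0 score that strongly favors singing. With equal weights the F0 evidence dominates. In practice the weight would likely depend on signal length, since the abstract reports that the spectral envelope measure is stronger below one second and the F0 measure above it.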

Published in

  • 情報処理学会研究報告音楽情報科学(MUS) [IPSJ SIG Technical Reports: Music and Computer (MUS)]

    2005(82(2005-MUS-061)), 1-6, 2005-08-04

    一般社団法人情報処理学会 (Information Processing Society of Japan)

Cited by: 3 articles

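Codes

  • NII Article ID (NAID)
    110002952202
  • NII NACSIS-CAT ID (NCID)
    AN10438388
  • Text language code
    JPN
  • Article type
    Technical Report
  • ISSN
    09196072
  • NDL Article ID
    7421959
  • NDL Source Classification
    ZM13 (Science and technology -- General science and technology -- Data processing and computers)
  • NDL Call No.
    Z14-1121
  • Data source
    CJP citations  NDL  NII-ELS  IPSJ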