Discrimination between Singing and Speaking Voices Using a Spectral Envelope and a Fundamental Frequency Derivative  [in Japanese] (original title: スペクトル包絡と基本周波数の時間変化を利用した歌声と朗読音声の識別)

Abstract

This paper discusses discrimination between singing and speaking voices using the spectral envelope and the derivative of the fundamental frequency (F0, perceived as pitch) of voice signals. In preliminary listening experiments, listeners distinguished singing from speaking voices with an accuracy of 70.0% for 200 ms signals and 99.7% for 1 s signals. To examine which acoustic cues drive this judgment, we then conducted listening experiments with singing and speaking stimuli whose voice quality (short-term spectral features) and prosody were systematically distorted by signal processing. The results suggested that spectral and prosodic cues contribute complementarily to the perceptual judgments. Hypothesizing that listeners rely on different cues depending on stimulus length, we propose an automatic vocal style discriminator that distinguishes singing from speaking voices using two measures: a spectral envelope (MFCC) and an F0 derivative. In our experiments, the F0-based measure outperforms the MFCC-based measure for voice signals longer than one second, whereas the MFCC-based measure outperforms the F0-based measure for signals shorter than one second. For two-second signals measured from the onset of phonation, the F0-based measure alone achieves 85.0% accuracy, and a simple combination of the two measures raises this by 2.3 percentage points to 87.3%.
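
The abstract does not specify how the F0 derivative is computed or how the two measures are combined, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes an F0 contour given as one value in Hz per frame (with 0 marking unvoiced frames), takes the frame-to-frame difference in semitones as a stand-in for the F0 derivative, and combines two hypothetical per-utterance scores (positive meaning "singing") by a weighted sum. The function names, the `weight` parameter, and the score convention are all illustrative.

```python
import math


def f0_delta_semitones(f0_hz):
    """First difference of an F0 contour, in semitones per frame.

    Assumption: frames with f0 <= 0 are unvoiced. A delta is emitted only
    when both neighbouring frames are voiced.
    """
    deltas = []
    for prev, cur in zip(f0_hz, f0_hz[1:]):
        if prev > 0 and cur > 0:
            deltas.append(12.0 * math.log2(cur / prev))
    return deltas


def mean_abs(values):
    """Mean absolute value, or 0.0 for an empty sequence."""
    return sum(abs(v) for v in values) / len(values) if values else 0.0


def combine_scores(mfcc_score, f0_score, weight=0.5):
    """Weighted sum of two scores (hypothetical convention: positive = singing)."""
    return weight * mfcc_score + (1.0 - weight) * f0_score


def classify(mfcc_score, f0_score, weight=0.5):
    """Label an utterance from the combined score."""
    combined = combine_scores(mfcc_score, f0_score, weight)
    return "singing" if combined > 0 else "speaking"
```

In such a sketch, `mean_abs(f0_delta_semitones(contour))` gives one crude summary of how much the pitch moves between frames. Producing the MFCC-based and F0-based scores themselves would require trained models, which the abstract does not describe.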


Journal

  • IPSJ journal 47(6), 1822-1830, 2006-06-15

    Information Processing Society of Japan (IPSJ)

References:  12

Cited by:  9

Codes

  • NII Article ID (NAID)
    110004729744
  • NII NACSIS-CAT ID (NCID)
    AN00116647
  • Text Lang
    JPN
  • Article Type
    Journal Article
  • ISSN
    1882-7764
  • NDL Article ID
    7993331
  • NDL Source Classification
    ZM13 (Science and Technology -- General Science and Technology -- Data Processing and Computers)
  • NDL Call No.
    Z14-741
  • Data Source
    CJP  CJPref  NDL  NII-ELS  IPSJ 