A study of high quality speech synthesis based on the analysis of the randomness in speech signals 音声信号におけるランダムネスの解析に基づいた高品質音声合成に関する研究


Author

    • Aoki, Naofumi (青木, 直史)

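Bibliographic information

Title

A study of high quality speech synthesis based on the analysis of the randomness in speech signals

Alternative title

音声信号におけるランダムネスの解析に基づいた高品質音声合成に関する研究

Author name

青木, 直史 (Aoki, Naofumi)

Alternative author name

アオキ, ナオフミ (katakana reading)

Degree-granting university

Hokkaido University (北海道大学)

Degree

Doctor of Engineering (博士 (工学))

Degree number

甲第5113号 (Kō No. 5113)

Date of degree conferral

2000-03-24

Notes / Abstract

Doctoral thesis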

Randomness observed in human speech signals is considered to be a key factor in the naturalness of human speech. This research project investigated the characteristics of several kinds of randomness observed in speech signals phonated by normal speakers. Based on the results of the analysis, advanced techniques for artificially reproducing such randomness were developed with the aim of enhancing the voice quality of synthesized speech. The types of randomness investigated in this project were: (1) amplitude fluctuation, (2) period fluctuation, (3) waveform fluctuation, (4) random fractalness of the source signals obtained by linear predictive analysis, and (5) unvoiced characteristics, namely, aperiodicity observed in voiced consonants. A simple model of each form of randomness was built from its statistical characteristics and evaluated for how much it could contribute to high quality speech synthesis systems based on the LPC (linear predictive coding) vocoder.

Normal sustained vowels always contain cycle-to-cycle changes in maximum peak amplitude and pitch period, even when these values seem to be quite stable. This project investigated the statistical characteristics of these fluctuations, labeled amplitude fluctuation and period fluctuation, respectively. Since the frequency characteristics of these fluctuation sequences appeared to roughly follow a 1/f power law, the author concluded that, as a preliminary model, amplitude and period fluctuation could be modeled as 1/f fluctuations. Psychoacoustic experiments performed in this study indicated that differences in the frequency characteristics of the amplitude and period fluctuation could influence the voice quality of synthesized speech.
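To make the idea of a 1/f fluctuation sequence concrete, the following is a minimal sketch, not the thesis's own implementation. It assumes a spectral-synthesis approach: each frequency bin k is given amplitude k^(-beta/2) and a random phase, so the power spectrum falls off as 1/f^beta. The function names, the `depth` parameter, and the mean period of 80 samples are illustrative assumptions.

```python
import cmath
import math
import random


def power_law_fluctuation(n, beta, rng):
    """Return n real samples whose power spectrum falls off as 1/f**beta.

    Assumed method (spectral synthesis): bin k gets amplitude k**(-beta/2)
    and a random phase; a direct O(n^2) inverse DFT builds the sequence.
    The DC and Nyquist bins are left empty, so the sequence has zero mean.
    """
    half = n // 2
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(half)]
    out = []
    for t in range(n):
        acc = 0.0
        for k in range(1, half):
            amp = k ** (-beta / 2.0)
            acc += amp * math.cos(2.0 * math.pi * k * t / n + phases[k])
        out.append(acc)
    return out


def dft_power(x, k):
    """Squared magnitude of DFT bin k, used to check the spectral slope."""
    n = len(x)
    s = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
    return abs(s) ** 2


def perturb_pitch_periods(mean_period, depth, n_cycles, beta=1.0, seed=0):
    """Pitch periods (in samples) fluctuating around mean_period.

    'depth' is the relative RMS size of the fluctuation (hypothetical
    parameter, not taken from the thesis).
    """
    rng = random.Random(seed)
    fluct = power_law_fluctuation(n_cycles, beta, rng)
    rms = math.sqrt(sum(v * v for v in fluct) / n_cycles)
    return [mean_period * (1.0 + depth * v / rms) for v in fluct]


# 64 pitch cycles with a 1 % (RMS) 1/f period fluctuation around 80 samples.
periods = perturb_pitch_periods(80.0, 0.01, 64)
print(min(periods), max(periods))
```

Setting `beta` to 0, 2, or 3 gives the white, 1/f^2, and 1/f^3 alternatives that the abstract compares against; the same sequence could equally be applied to peak amplitudes instead of periods.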
Compared with 1/f^0 (white noise), 1/f^2, and 1/f^3 fluctuation models, amplitude and period fluctuation modeled as 1/f fluctuations produced voice quality more similar to that of human speech phonated by normal speakers.

Normal sustained vowels also always contain cycle-to-cycle changes in the waveform itself, even during their steadiest parts. This project investigated the statistical characteristics of the waveform fluctuations extracted from the residual signals of the LPC vocoder. Since the frequency characteristics of the waveform fluctuations appeared to follow a 1/f^2 power law, the author concluded that, as a preliminary model, the waveform fluctuations could be modeled as 1/f^2 fluctuations. Psychoacoustic experiments performed in this study indicated that differences in the frequency characteristics of waveform fluctuations could influence the voice quality of synthesized speech. Compared with 1/f^0 (white noise), 1/f, and 1/f^3 fluctuation models, waveform fluctuations modeled as 1/f^2 fluctuations produced voice quality more similar to that of human speech phonated by normal speakers.

Theoretically, the source signals of the LPC vocoder are characterized by a spectral decay of −6 dB/oct in the frequency domain when the −12 dB/oct glottal vibration and the +6 dB/oct mouth radiation characteristics are taken into consideration simultaneously. Since this frequency characteristic is equivalent to a 1/f^2 decay of the power spectrum, the source signals of the LPC vocoder can potentially be classified as Brownian motion from the viewpoint of random fractal theory. This project employed a multiresolution analysis method based on the Schauder expansion to statistically investigate the time domain characteristics of the source signals. The results of the analysis indicated that random fractalness was clearly observed, particularly when a large resolution level was chosen.
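As an illustration of how a Schauder-type multiresolution analysis can expose Brownian scaling, here is a minimal sketch. It uses the standard Faber-Schauder coefficient, the deviation of a midpoint from the average of its two dyadic neighbours, and applies it to a synthetic Gaussian random walk rather than to real LPC residuals. The helper names and the test signal are assumptions; the thesis's actual procedure may differ.

```python
import random


def schauder_coefficients(x, half_width):
    """Midpoint deviations x[m] - (x[m-h] + x[m+h]) / 2 on a dyadic grid.

    half_width (h) is the distance from each midpoint to its neighbours,
    so larger h corresponds to a coarser resolution level.
    """
    h = half_width
    return [x[m] - 0.5 * (x[m - h] + x[m + h])
            for m in range(h, len(x) - h, 2 * h)]


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def random_walk(n_steps, seed=0):
    """Discrete Brownian motion: cumulative sum of unit-variance Gaussians."""
    rng = random.Random(seed)
    x = [0.0]
    for _ in range(n_steps):
        x.append(x[-1] + rng.gauss(0.0, 1.0))
    return x


# For Brownian motion the coefficient variance is expected to grow in
# proportion to h (about h / 2 for unit-variance steps).
walk = random_walk(2 ** 14)
for h in (1, 2, 4, 8):
    coeffs = schauder_coefficients(walk, h)
    print(h, len(coeffs), round(variance(coeffs), 3))
```

A signal whose coefficient variance roughly doubles each time `h` doubles behaves like Brownian motion at those scales; a different growth rate would point to a different fractal exponent.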
The author also found that the size of the discontinuities in source signal waveforms obtained from human speech is limited. Based on the results of the analysis, a new technique was developed with the aim of enhancing the voice quality of synthesized speech produced by the conventional impulse train. This study concluded that the buzzer-like degraded voice quality resulting from the impulse train could be improved by removing the extremely large discontinuities of its waveform. The developed technique also included a method called random fractal interpolation for restoring power in the high frequency region, which is undesirably reduced when the sharpness of the impulse train is removed.

The author implemented two applications that exemplify the effectiveness of the techniques developed through this research. One was a real-time vocoder system implemented on a digital signal processor (DSP) evaluation module (Texas Instruments, TMS320C62EVM); the other was a Japanese rule-based speech synthesis system implemented on a personal computer (Apple, Macintosh Quadra 840AV). Both applications used as their speech synthesizer a modified LPC vocoder that fully implemented the features investigated in this research. In addition, these applications demonstrated how the voice quality of voiced consonants was enhanced by a MELP (mixed excitation linear prediction) scheme. Since voiced consonants are a mixture of a periodic component attributed to voiced characteristics and an aperiodic component attributed to unvoiced characteristics, the waveforms of voiced consonants, which seem basically periodic because they reflect the voiced feature, are disturbed in detail by the unvoiced feature.
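The abstract does not spell out how the random fractal interpolation mentioned above works. One common way to realize the idea is random midpoint displacement, sketched below under that assumption: each refinement level inserts midpoints equal to the neighbour average plus Gaussian noise whose standard deviation shrinks by 2^(-H) per level. The parameters `hurst`, `scale`, and `levels` are hypothetical and not taken from the thesis.

```python
import random


def random_fractal_interpolate(coarse, levels, hurst=0.5, scale=1.0, seed=0):
    """Refine 'coarse' by random midpoint displacement (assumed method).

    Each level roughly doubles the number of samples. The original samples
    are kept unchanged; only the inserted midpoints are randomized, which
    adds fine-scale (high frequency) detail to a smooth waveform.
    """
    rng = random.Random(seed)
    x = list(coarse)
    sigma = scale
    for _ in range(levels):
        sigma *= 0.5 ** hurst          # smaller displacements at finer scales
        refined = []
        for left, right in zip(x, x[1:]):
            refined.append(left)
            refined.append(0.5 * (left + right) + rng.gauss(0.0, sigma))
        refined.append(x[-1])
        x = refined
    return x


# A smoothed (triangular) pulse, refined by three levels of interpolation.
smooth_pulse = [0.0, 1.0, 0.0, -1.0, 0.0]
detailed = random_fractal_interpolate(smooth_pulse, levels=3, scale=0.1)
print(len(detailed))
```

With `hurst=0.5` the inserted detail scales like Brownian motion, which matches the 1/f^2 source characteristic discussed earlier in the abstract; setting `scale=0.0` reduces the routine to plain linear interpolation.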
Psychoacoustic experiments conducted in this research clarified that synthesized voiced consonants produced by the conventional LPC vocoder tended to degrade in voice quality, since such a vocoder completely disregards the unvoiced feature of voiced consonants. An advanced technique, employing a wavelet transform for subband decomposition and reconstruction, was developed as a method for mixing the unvoiced component with the voiced component in the desired bands. It was concluded that synthesized voiced consonants in which the unvoiced feature was incorporated at high frequency subbands were perceived as having a more natural voice quality than those of the conventional LPC vocoder.

This project reached the following two major conclusions: (1) the voice quality of synthesized speech can be enhanced by including randomness that is artificially produced by adequate models; and (2) the knowledge acquired through the techniques developed in this project can be applied to the design of LPC-vocoder-based high quality speech synthesis systems that can be expected to produce more realistic, human-like natural speech.
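The subband mixing described above can be illustrated with a one-level Haar wavelet transform written in lifting form (the table of contents mentions a lifting wavelet transform, but the specific wavelet, the number of levels, and the mixing rule used here are assumptions). The sketch keeps the low band of a periodic pulse excitation and adds noise only to the high band.

```python
import random


def haar_lifting_forward(x):
    """One level of the Haar transform via lifting (len(x) must be even)."""
    even, odd = x[0::2], x[1::2]
    detail = [o - e for e, o in zip(even, odd)]            # predict step
    approx = [e + 0.5 * d for e, d in zip(even, detail)]   # update step
    return approx, detail


def haar_lifting_inverse(approx, detail):
    """Undo the lifting steps in reverse order and re-interleave samples."""
    even = [a - 0.5 * d for a, d in zip(approx, detail)]
    odd = [d + e for e, d in zip(even, detail)]
    out = []
    for e, o in zip(even, odd):
        out.extend((e, o))
    return out


def mixed_excitation(pulse, noise, noise_gain=1.0):
    """Add the noise's high band to the pulse excitation's high band.

    The low band comes from the pulse alone, so the periodic (voiced)
    structure is preserved at low frequencies. 'noise_gain' is a
    hypothetical mixing parameter.
    """
    pulse_low, pulse_high = haar_lifting_forward(pulse)
    _, noise_high = haar_lifting_forward(noise)
    mixed_high = [p + noise_gain * n for p, n in zip(pulse_high, noise_high)]
    return haar_lifting_inverse(pulse_low, mixed_high)


# Impulse train with a period of 8 samples, plus Gaussian noise.
rng = random.Random(0)
pulse = [1.0 if t % 8 == 0 else 0.0 for t in range(32)]
noise = [rng.gauss(0.0, 0.1) for _ in range(32)]
excitation = mixed_excitation(pulse, noise, noise_gain=0.5)
print(len(excitation))
```

Applying the forward transform recursively to the approximation band would give more subbands, so that noise could be confined to progressively narrower high-frequency regions.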

Table of contents

  1. Abstract
  2. Contents
  3. 1 Introduction
  4. 1.1 Motivation of the project
  5. 1.2 Overview of the following chapters
  6. 2 Analysis and perception of amplitude fluctuation and period fluctuation
  7. 2.1 Introduction
  8. 2.2 Speech analysis
  9. 2.3 Psychoacoustic experiments
  10. 2.4 Objective evaluation on the psychoacoustic experiments
  11. 2.5 Discussion
  12. 2.6 Conclusions
  13. 3 Analysis and perception of waveform fluctuation
  14. 3.1 Introduction
  15. 3.2 Speech analysis
  16. 3.3 Psychoacoustic experiments
  17. 3.4 Objective evaluation on the psychoacoustic experiments
  18. 3.5 Discussion
  19. 3.6 Conclusions
  20. 4 Analysis and perception of the random fractalness in the source signals of LPC vocoder
  21. 4.1 Introduction
  22. 4.2 Speech analysis
  23. 4.3 Psychoacoustic experiments
  24. 4.4 Discussion
  25. 4.5 Conclusions
  26. 5 Development of two speech synthesis applications
  27. 5.1 Introduction
  28. 5.2 Implementation of MELP vocoder using lifting wavelet transform
  29. 5.3 Development of a rule-based speech synthesis system for the Japanese language using a MELP vocoder
  30. 5.4 Conclusions
  31. 6 Conclusions
  32. 6.1 Summary
  33. 6.2 Original contributions
  34. 6.3 Future work
  35. A Fractional Brownian motion
  36. B Theoretical definition of the source signals of the LPC vocoder
  37. C C program lists
  38. C.1 Linear predictive analysis
  39. C.2 Linear predictive synthesis
  40. Acknowledgements
  41. References
  42. Contribution
  43. Paper
  44. International conference
  45. Technical report
  46. Presentation

Codes

  • NII article ID (NAID)
    • 500000189556
  • NII author ID (NRID)
    • 8000000189839
  • DOI (NDL)
  • Text language code
    • eng
  • NDL bibliographic ID
    • 000000353870
  • Data source
    • Institutional repository
    • NDL ONLINE
    • NDL Digital Collections