Abstract
Speech-to-speech translation (S2ST) is a technology that translates speech across languages, which can remove barriers in cross-lingual communication. In conventional S2ST systems, the linguistic meaning of speech is translated, but paralinguistic information conveying other features of the speech, such as emotion or emphasis, is ignored. In this paper, we propose a method to translate paralinguistic information, specifically focusing on emphasis. The method consists of a series of components that can accurately translate emphasis using all acoustic features of speech. First, linear-regression hidden semi-Markov models (LR-HSMMs) are used to estimate a real-numbered emphasis value for every word in an utterance, resulting in a sequence of values for the utterance. Next, the emphasis translation module translates the estimated emphasis sequence into a target-language emphasis sequence using a conditional random field model that considers emphasis levels, words, and part-of-speech tags as features. Finally, the speech synthesis module synthesizes emphasized speech with LR-HSMMs, taking into account the translated emphasis sequence and the transcription. The results indicate that our translation model can translate emphasis information, correctly emphasizing words in the target language with 91.6% F-measure in an objective evaluation. A listening test with human subjects further showed that listeners could identify the emphasized words with 87.8% F-measure, and that the naturalness of the audio was preserved.
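The abstract reports word-level F-measure but does not say how it was computed. The following is a minimal sketch of the standard F-measure (harmonic mean of precision and recall), assuming each word carries a binary label (1 = emphasized, 0 = not emphasized). The function name `emphasis_f_measure` and the toy label sequences are hypothetical illustrations, not data or code from the paper.

```python
def emphasis_f_measure(reference, predicted):
    """F-measure over word-level binary emphasis labels (1 = emphasized).

    Assumption: emphasis is scored as a binary decision per word; the
    abstract does not specify the exact evaluation protocol.
    """
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")

    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # correctly emphasized
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)  # wrongly emphasized
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)  # missed emphasis

    if tp == 0:
        return 0.0  # no correct emphasized words: define the score as zero

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example (made up): six words, two emphasized in the reference.
reference = [0, 1, 0, 0, 1, 0]
predicted = [0, 1, 0, 1, 1, 0]
score = emphasis_f_measure(reference, predicted)
print(f"F-measure: {score:.3f}")
```

In this toy case both reference emphases are found and one extra word is emphasized, so recall is 1 and precision is 2/3.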
Journal
- IEEE/ACM Transactions on Audio, Speech, and Language Processing
IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (3), 544-556, 2016-12-21
IEEE
Keywords
- hidden Markov models
- language translation
- LRHSMMs
- S2ST systems
- acoustic features
- conditional random field model
- cross-lingual communication
- emphasis translation module
- linear-regression hidden semi-Markov models
- paralinguistic information translation
- part-of-speech tags
- speech synthesis module
- speech-to-speech translation
- target language emphasis sequence
- word-level emphasis preservation
- Acoustics
- Estimation
- Feature extraction
- speech synthesis
- regression analysis
- Speech
- Speech processing
- Speech recognition
- Emphasis estimation
- emphasis translation
- intent
- word-level emphasis
Details
- CRID: 1050295834376626432
- NII Article ID: 120006226308
- NII Book ID: AA12669539
- ISSN: 23299304
- HANDLE: 10061/11397
- Text Lang: en
- Article Type: journal article
- Data Source: IRDB, CiNii Articles