Emotional Speech Synthesis Based on Prosodic Feature Modification

Abstract

The synthesis of emotional speech has wide applications in fields such as human-computer interaction, medicine, and industry. In this work, an emotional speech synthesis system is proposed based on prosodic feature modification and the Time-Domain Pitch-Synchronous Overlap-Add (TD-PSOLA) waveform concatenation algorithm. The system produces synthesized speech with four types of emotion: angry, happy, sad, and bored. Experimental results show that the proposed system achieves good performance: the synthesized utterances exhibit clear emotional expression, and subjective listening tests reach high classification accuracy across the different types of synthesized emotional speech.
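The approach described above modifies prosodic features (pitch and duration) and resynthesizes the waveform with TD-PSOLA. The sketch below illustrates the TD-PSOLA step in Python; it is a minimal illustration, not the authors' implementation. It assumes pitch marks (glottal pulse positions) are already available from a separate pitch-marking stage, and the function name td_psola and the parameters pitch_factor and time_factor are illustrative.

```python
import numpy as np

def td_psola(x, marks, pitch_factor=1.0, time_factor=1.0):
    """Minimal TD-PSOLA sketch (illustrative, not the paper's code).

    x            -- 1-D float array holding the speech waveform
    marks        -- ascending sample indices of pitch marks (glottal pulses)
    pitch_factor -- > 1 raises F0, < 1 lowers it
    time_factor  -- > 1 lengthens the utterance, < 1 shortens it
    """
    marks = np.asarray(marks)
    periods = np.diff(marks)                 # local analysis pitch periods
    out_len = int(len(x) * time_factor) + 1
    y = np.zeros(out_len)

    t_out = marks[0] * time_factor           # first synthesis pitch mark
    while t_out < out_len - 1:
        # Map the synthesis instant back to analysis time and pick
        # the nearest analysis pitch mark.
        i = int(np.argmin(np.abs(marks - t_out / time_factor)))
        i = min(max(i, 1), len(marks) - 2)   # keep one period on each side
        period = int(periods[min(i, len(periods) - 1)])
        c = int(marks[i])

        # Extract a two-period, Hann-windowed grain centred on the mark
        # and overlap-add it at the synthesis mark.
        if period > 0 and c - period >= 0 and c + period <= len(x):
            grain = x[c - period : c + period] * np.hanning(2 * period)
            start = int(t_out) - period
            lo, hi = max(start, 0), min(start + 2 * period, out_len)
            y[lo:hi] += grain[lo - start : hi - start]

        # Advance by the rescaled period: closer synthesis marks -> higher F0.
        t_out += max(1.0, period / pitch_factor)
    return y
```

In a setup of this kind, an "angry" or "happy" rendering would typically raise pitch and shorten duration (e.g. td_psola(x, marks, pitch_factor=1.3, time_factor=0.9)), while "sad" or "bored" speech would use the opposite settings; the specific modification factors used in the paper are not reproduced here.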

Share and Cite:

He, L., Huang, H. and Lech, M. (2013) Emotional Speech Synthesis Based on Prosodic Feature Modification. Engineering, 5, 73-77. doi: 10.4236/eng.2013.510B015.

Conflicts of Interest

The authors declare no conflicts of interest.

