An Intonation Speech Synthesis Model for Indonesian Using Pitch Pattern and Phrase Identification


Prosody in speech synthesis systems (text-to-speech) is a determinant of tone, duration, and loudness of speech sound. Intonation is a part of prosody which determines the speech tone. In Indonesian, intonation is determined by the structure of sentences, types of sentences, and also the position of the word in a sentence. In this study, a model of speech synthesis that focuses on its intonation is proposed. The speech intonation is determined by sentence structure, intonation patterns of the example sentences, and general rules of Indonesian pronunciation. The model receives texts and intonation patterns as inputs. Based on the general principle of Indonesian pronunciation, a prosody file was made. Based on input text, sentence structure is determined and then interval among parts of a sentence (phrase) can be determined. These intervals are used to correct the duration of the initial prosody file. Furthermore, the frequencies in prosody file were corrected using intonation patterns. The final result is prosody file that can be pronounced by speech engine application. Experiment results of studies using the original voice of radio news announcer and the speech synthesis show that the peaks of F0 are determined by general rules or intonation patterns which are dominant. Similarity test with the PESQ method shows that the result of the synthesis is 1.18 at MOS-LQO scale.

Share and Cite:

Suyanto, Y. ,  , S. , Harjoko, A. and Hartati, S. (2014) An Intonation Speech Synthesis Model for Indonesian Using Pitch Pattern and Phrase Identification. Journal of Signal and Information Processing, 5, 80-88. doi: 10.4236/jsip.2014.53010.

Conflicts of Interest

The authors declare no conflicts of interest.


[1] Schroder, M. (2001) Emotional Speech Synthesis: A Review. Proceedings of Eurospeech 2001, 1, 561-564.
[2] Vroomen, J., Collier, R. and Mozziconacci, S. (1993) Duration and Intonation in Emotional Speech. Proceedings of the Third European Conference on Speech Communication and Technology, Berlin, 22-25 September 1993, 577-580.
[3] Campbell, W.N., Isard, S., Monaghan, A.L.C. and Verhoeven, J. (1990) Duration, Pitch and Diphones in the CSTR TTS System. Proceedings of the International Conference on Spoken Language Processing, Kobe, 1 January 1990, 825-828.
[4] Tritoasmoro, I.I. (2006) Text-to-Speech Bahasa Indonesia Menggunakan Concatenation Synthesizer Berbasis Fonem. Seminar Nasional Sistem dan Informatika, Bali, 17 November 2006, 171-176.
[5] Laksman, M. (1995) Realisasi Tekanan Kata dalam Bahasa Indonesia. PELLBA 8, pages 179{215. Lembaga Bahasa Unika Atmajaya, Jakarta, 1995.
[6] Halim, A. (1975) Intonation in Relation to Syntax in Bahasa Indonesia. Proyek Pengembangan Bahasa dan Sastra Indonesia dan Daerah, Departemen Pendidikan dan Kebudayaan.
[7] van Lieshout, P.P. (2003) PRAAT Short Tutorial. University of Toronto, Graduate Department of Speech-Language Pathology, Faculty of Medicine, Oral Dynamics Lab (ODL), Toronto.
[8] Heuven, V.J.V. and Zanten, E.V. (2007) Prosody in Indonesian Languages. LOT, Utrecht.
[9] Sakri, A. (1994) Bangun Kalimat Bahasa Indonesia. 2nd Edition, Penerbit ITB, Bandung.
[10] McCune, K.M. (1985) The Internal Structure of Indonesian Roots. Number v. 2 in the Internal Structure of Indonesian Roots. Badan Penyelenggara Seri Nusa, Universitas Katolik Indonesia Atma Jaya, Jakarta.
[11] Vamarasi, M.K. (1986) Grammatical Relations in Bahasa Indonesia. Cornell University, Ithaca.
[12] Mbrola, T. (2009) The MBROLA Home Page.
[13] Boril, H. and Pollák, P. (2004) Direct Time Domain Fundamental Frequency Estimation of Speech in Noisy Conditions. EUSIPCO 2004, 2, 1003-1006.
[14] ITU (2001) ITU-T Recommendation P.862: Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment of Narrow-Band Telephone Networks and Speech Codecs. Technical Report, ITU.
[15] Rix, A.W., Beerends, J.G., Hollier, M.P. and Hekstra, A.P. (2001) Perceptual Evaluation of Speech Quality (PESQ)— A New Method for Speech Quality Assessment of Telephone Networks and Codecs. IEEE International Conference on Acoustics, Speech, and Signal Processing, Salt Lake City, 7-11 May 2001, 749-752.

Copyright © 2020 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.