A Study of Bilinear Models in Voice Conversion
Victor Popa, Jani Nurminen, Moncef Gabbouj

Abstract

This paper presents a voice conversion technique based on bilinear models and introduces the concept of contextual modeling. The bilinear approach reformulates the spectral envelope representation, mapping line spectral frequency (LSF) features to a two-factor parameterization corresponding to speaker identity and phonetic information, the so-called style and content factors. This decomposition offers a flexible representation well suited to voice conversion and enables efficient training algorithms based on singular value decomposition. In the contextual approach, bilinear models are trained on subsets of the training data selected on the fly at conversion time, according to the characteristics of the feature vector to be converted. The performance of bilinear models and contextual modeling is evaluated in objective and perceptual tests, by comparison with the popular GMM-based voice conversion method, for several sizes and types of training data.
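As an illustration of the decomposition described above, the sketch below (Python with NumPy; all names and the data layout are hypothetical) shows how an asymmetric bilinear model in the spirit of Tenenbaum and Freeman's style-content framework can be fit with a single singular value decomposition and then used to re-render content under a new style. It is a minimal sketch of the general technique, not the implementation evaluated in the paper.

    import numpy as np

    # Asymmetric bilinear model y = A^s b^c: each observation is a
    # style-specific linear map A^s applied to a style-independent
    # content vector b^c. Assumed layout for this sketch: mean feature
    # vectors (e.g. K-dimensional LSF vectors) for S styles (speakers)
    # and C content classes, stacked into a (S*K) x C matrix Y with
    # rows grouped by style.

    def fit_bilinear(Y, S, K, J):
        """Fit style maps A (S x K x J) and content codes B (J x C)
        in closed form from the rank-J truncated SVD of Y."""
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        A = (U[:, :J] * s[:J]).reshape(S, K, J)  # style-specific maps
        B = Vt[:J, :]                            # shared content factors
        return A, B

    def convert(A_src, A_tgt, y):
        """Estimate the content factor of y under the source speaker's
        style map, then re-synthesize with the target speaker's map."""
        b, *_ = np.linalg.lstsq(A_src, y, rcond=None)
        return A_tgt @ b

In the paper's contextual variant, such a model would be re-estimated at conversion time on a data subset selected by the incoming feature vector; the closed-form SVD fit is what keeps that repeated training tractable.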

Share and Cite:

V. Popa, J. Nurminen and M. Gabbouj, "A Study of Bilinear Models in Voice Conversion," Journal of Signal and Information Processing, Vol. 2, No. 2, 2011, pp. 125-139. doi: 10.4236/jsip.2011.22017.

Conflicts of Interest

The authors declare no conflicts of interest.


Copyright © 2024 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.