An Empirical Study of Good-Turing Smoothing for Language Models on Different Size Corpora of Chinese

Abstract

Data sparseness is an inherent problem of statistical language models, and smoothing methods are commonly used to resolve the zero-count problem. In this paper, we empirically study and analyze two well-known smoothing methods, Good-Turing and advanced Good-Turing, for language models built on a large Chinese corpus. Ten models are generated sequentially on corpora of increasing size, from 30 M to 300 M Chinese words of the CGW corpus. In our experiments, Good-Turing and advanced Good-Turing smoothing are evaluated with both inside testing and outside testing. Based on the experimental results, we further analyze the perplexity trends of the two smoothing methods, which are useful for choosing an effective smoothing method to alleviate data sparseness in language models of various sizes. Finally, some helpful observations are described in detail.
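To make the two quantities named in the abstract concrete, the following is a minimal sketch of basic Good-Turing count adjustment, r* = (r + 1) N_{r+1} / N_r, together with a perplexity computation. It is not the authors' implementation and does not reproduce the advanced Good-Turing variant, which the abstract does not define. The function names, the fallback to the raw count when N_{r+1} is zero, and the toy counts are assumptions made for illustration only.

```python
from collections import Counter
import math


def good_turing_counts(counts):
    """Basic Good-Turing: return adjusted counts and the mass reserved for unseen events.

    counts maps each observed event (e.g. an n-gram) to its raw count r.
    """
    freq_of_freq = Counter(counts.values())  # N_r: number of events seen exactly r times
    total = sum(counts.values())             # N: total number of observations
    adjusted = {}
    for event, r in counts.items():
        n_r = freq_of_freq[r]
        n_r1 = freq_of_freq.get(r + 1, 0)
        # Assumption: keep the raw count when N_{r+1} = 0, since the formula
        # would otherwise give zero. Practical systems smooth N_r instead.
        adjusted[event] = (r + 1) * n_r1 / n_r if n_r1 > 0 else float(r)
    unseen_mass = freq_of_freq.get(1, 0) / total  # N_1 / N
    return adjusted, unseen_mass


def perplexity(token_probs):
    """Perplexity of a test sequence from the model probability of each token."""
    log_sum = sum(math.log2(p) for p in token_probs)
    return 2 ** (-log_sum / len(token_probs))


# Hypothetical toy counts, not data from the paper.
toy_counts = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 3}
toy_adjusted, toy_unseen_mass = good_turing_counts(toy_counts)
```

Because of the fallback for the highest counts, the adjusted counts in this sketch would still need renormalizing before being used as probabilities. Lower perplexity on held-out text indicates a better-fitting model, which is the measure the study tracks across corpus sizes.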

Share and Cite:

Huang, F., Yu, M. and Hwang, C. (2013) An Empirical Study of Good-Turing Smoothing for Language Models on Different Size Corpora of Chinese. Journal of Computer and Communications, 1, 14-19. doi: 10.4236/jcc.2013.15003.

Conflicts of Interest

The authors declare no conflicts of interest.


Copyright © 2024 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.