Journal of Computer and Communications
Volume 1, Issue 5 (October 2013)
ISSN Print: 2327-5219 ISSN Online: 2327-5227
Google-based Impact Factor: 1.12
An Empirical Study of Good-Turing Smoothing for Language Models on Different Size Corpora of Chinese
ABSTRACT
Data sparseness is an inherent issue of statistical language models, and smoothing methods are usually used to resolve the zero-count problem. In this paper, we empirically studied and analyzed the well-known Good-Turing and advanced Good-Turing smoothing methods for language models on large Chinese corpora. Ten models were generated sequentially on corpora of various sizes, from 30 M to 300 M Chinese words of the CGW corpus. In our experiments, the Good-Turing and advanced Good-Turing smoothing methods were evaluated with both inside testing and outside testing. Based on the experimental results, we further analyzed the perplexity trends of the smoothing methods, which are useful for choosing effective smoothing methods to alleviate data sparseness in language models of various sizes. Finally, some helpful observations are described in detail.
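The paper does not reproduce its implementation here, but the classic Good-Turing re-estimation it builds on can be sketched as follows: a word observed r times is given the adjusted count r* = (r + 1) N_{r+1} / N_r, where N_r is the number of distinct words seen exactly r times, and the mass N_1 / N is reserved for unseen events. The helper names and the fallback for empty N_{r+1} bins below are illustrative assumptions, not the authors' code.

```python
from collections import Counter

def good_turing_adjusted_counts(counts):
    """Classic Good-Turing re-estimation: r* = (r + 1) * N_{r+1} / N_r.

    `counts` maps each word to its raw frequency r; N_r is the number of
    distinct words with frequency exactly r (the frequency-of-frequencies).
    """
    freq_of_freq = Counter(counts.values())  # N_r table
    adjusted = {}
    for word, r in counts.items():
        n_r = freq_of_freq[r]
        n_r_plus_1 = freq_of_freq.get(r + 1, 0)
        # If no word occurs exactly r+1 times, the raw estimate degenerates;
        # a simple (assumed) fallback is to keep the unadjusted count.
        adjusted[word] = (r + 1) * n_r_plus_1 / n_r if n_r_plus_1 > 0 else float(r)
    return adjusted

def unseen_probability_mass(counts):
    """Total probability mass Good-Turing reserves for unseen words: N_1 / N."""
    n_1 = sum(1 for r in counts.values() if r == 1)
    n_total = sum(counts.values())
    return n_1 / n_total
```

For example, on the toy corpus "a a a b b c", the word "c" (seen once, with one word seen twice) gets the adjusted count 2 * N_2 / N_1 = 2.0, and N_1 / N = 1/6 of the probability mass is set aside for unseen words. Advanced variants smooth the N_r table itself (e.g. by fitting a curve) before applying the formula, which matters when high-count bins are sparse.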
Copyright © 2024 by authors and Scientific Research Publishing Inc.
This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.