Journal of Computer and Communications

Volume 1, Issue 5 (October 2013)

ISSN Print: 2327-5219   ISSN Online: 2327-5227

Google-based Impact Factor: 1.12

An Empirical Study of Good-Turing Smoothing for Language Models on Different Size Corpora of Chinese

PP. 14-19
DOI: 10.4236/jcc.2013.15003

ABSTRACT

Data sparseness is an inherent problem of statistical language models, and smoothing methods are usually used to resolve the zero-count problem. In this paper, we empirically study and analyze two well-known smoothing methods, Good-Turing and advanced Good-Turing, for language models on large Chinese corpora. Ten models are generated sequentially on corpora of various sizes, from 30 M to 300 M Chinese words of the CGW corpus. In our experiments, the Good-Turing and advanced Good-Turing smoothing methods are evaluated on inside testing and outside testing. Based on the experimental results, we further analyze the perplexity trends of the smoothing methods, which are useful for choosing effective smoothing methods to alleviate data sparseness in language models of various sizes. Finally, some helpful observations are described in detail.
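As a rough illustration of the basic Good-Turing idea mentioned in the abstract, the sketch below computes adjusted counts r* = (r + 1) * N_{r+1} / N_r, where N_r is the number of distinct n-grams seen exactly r times, together with the probability mass N_1 / N reserved for unseen n-grams. It is not the authors' implementation: the function names are hypothetical, the fallback to the raw count when N_{r+1} is zero is a simplifying assumption, and the "advanced Good-Turing" variant studied in the paper is not reproduced here.

```python
from collections import Counter


def good_turing_counts(ngram_counts):
    """Return Good-Turing adjusted counts r* = (r + 1) * N_{r+1} / N_r.

    Assumption (not from the paper): when N_{r+1} is zero, the raw
    count r is kept unchanged instead of being smoothed.
    """
    # N_r: how many distinct n-grams occur exactly r times.
    freq_of_freq = Counter(ngram_counts.values())
    adjusted = {}
    for ngram, r in ngram_counts.items():
        n_r = freq_of_freq[r]
        n_r_plus_1 = freq_of_freq.get(r + 1, 0)
        if n_r_plus_1 > 0:
            adjusted[ngram] = (r + 1) * n_r_plus_1 / n_r
        else:
            adjusted[ngram] = float(r)
    return adjusted


def unseen_mass(ngram_counts):
    """Probability mass reserved for unseen n-grams: N_1 / N."""
    total = sum(ngram_counts.values())
    n_1 = sum(1 for count in ngram_counts.values() if count == 1)
    return n_1 / total if total else 0.0
```

For a toy count table with three n-grams seen once, two seen twice, and one seen three times, the singletons are discounted to 4/3 * ... in the sense that r = 1 maps to 2 * 2 / 3, r = 2 maps to 3 * 1 / 2, and 3 of the 10 observed tokens' worth of probability mass is set aside for unseen n-grams.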

Share and Cite:

Huang, F. , Yu, M. and Hwang, C. (2013) An Empirical Study of Good-Turing Smoothing for Language Models on Different Size Corpora of Chinese. Journal of Computer and Communications, 1, 14-19. doi: 10.4236/jcc.2013.15003.

Copyright © 2024 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.