An Empirical Study of Good-Turing Smoothing for Language Models on Different Size Corpora of Chinese - Journal of Computer and Communications

Journal of Computer and Communications

Volume 1, Issue 5 (October 2013)

ISSN Print: 2327-5219 ISSN Online: 2327-5227

Google-based Impact Factor: 1.12

PP. 14-19

DOI: 10.4236/jcc.2013.15003

Author(s)

Feng-Long Huang, Ming-Shing Yu, Chien-Yo Hwang

Affiliation(s)

Department of Computer Science and Information Engineering, National United University, Miaoli, Chinese Taipei.
Department of Computer Science, National Chung-Hsing University, Taichung, Chinese Taipei.

ABSTRACT

Data sparseness is an inherent issue of statistical language models, and smoothing methods are usually used to resolve the zero-count problem. In this paper, we empirically study and analyze two well-known smoothing methods, Good-Turing and advanced Good-Turing, for language models built on large Chinese corpora. Ten models are generated sequentially on corpora of increasing size, from 30 M to 300 M Chinese words of the CGW corpus. In our experiments, Good-Turing and advanced Good-Turing smoothing are evaluated on both inside testing and outside testing. Based on the experimental results, we further analyze the perplexity trends of the two smoothing methods, which are useful for choosing an effective smoothing method to alleviate data sparseness in language models of various sizes. Finally, some helpful observations are described in detail.
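
For readers unfamiliar with the technique, the following is a minimal sketch of basic Good-Turing smoothing and perplexity on a unigram model. It is not the authors' implementation, and it does not cover the advanced Good-Turing variant, which the abstract does not specify. The function names, the toy corpus, and the `vocab_size` parameter are assumptions made for illustration. The adjusted count is r* = (r + 1) N_{r+1} / N_r, where N_r is the number of word types seen exactly r times, and the probability mass N_1 / N is reserved for unseen words.

```python
import math
from collections import Counter


def good_turing_probs(tokens, vocab_size):
    """Unigram probabilities with basic Good-Turing smoothing.

    vocab_size is an assumed, known vocabulary size; the reserved
    unseen mass is shared equally among the unseen word types.
    Returns (probs for seen words, probability of one unseen word).
    """
    counts = Counter(tokens)
    n = sum(counts.values())                 # total tokens N
    freq_of_freq = Counter(counts.values())  # N_r for each observed r

    def adjusted(r):
        # r* = (r + 1) * N_{r+1} / N_r; keep r when N_{r+1} is zero
        n_r1 = freq_of_freq.get(r + 1, 0)
        return (r + 1) * n_r1 / freq_of_freq[r] if n_r1 else float(r)

    unseen_types = vocab_size - len(counts)
    # mass reserved for unseen words: N_1 / N (only if any are unseen)
    p0 = freq_of_freq.get(1, 0) / n if unseen_types > 0 else 0.0
    raw = {w: adjusted(r) / n for w, r in counts.items()}
    # renormalize seen words so that all probabilities sum to 1
    scale = (1.0 - p0) / sum(raw.values())
    probs = {w: p * scale for w, p in raw.items()}
    p_unseen = p0 / unseen_types if unseen_types > 0 else 0.0
    return probs, p_unseen


def perplexity(tokens, probs, p_unseen):
    """Perplexity 2^(-(1/M) * sum(log2 p(w))) over M test tokens."""
    log_sum = sum(math.log2(probs.get(w, p_unseen)) for w in tokens)
    return 2 ** (-log_sum / len(tokens))


# Toy data (hypothetical); assumes some words occur more than once,
# otherwise all mass goes to unseen words and log2(0) would fail.
train = "a a a b b c d".split()
probs, p_unseen = good_turing_probs(train, vocab_size=6)
print(perplexity(train, probs, p_unseen))       # test text seen in training
print(perplexity(["a", "e"], probs, p_unseen))  # test text with an unseen word
```

The two `print` calls loosely mirror the inside and outside testing mentioned in the abstract, on the assumption that inside testing scores text drawn from the training data and outside testing scores held-out text. Without smoothing, the unseen word `e` would receive zero probability and the second perplexity would be undefined.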

KEYWORDS

Good-Turing Methods; Smoothing; Language Models; Perplexity

Share and Cite:

Huang, F., Yu, M. and Hwang, C. (2013) An Empirical Study of Good-Turing Smoothing for Language Models on Different Size Corpora of Chinese. Journal of Computer and Communications, 1, 14-19. doi: 10.4236/jcc.2013.15003.
