An Empirical Study of Good-Turing Smoothing for Language Models on Different Size Corpora of Chinese

Data sparseness is an inherent issue of statistical language models, and smoothing methods are usually used to resolve the zero-count problem. In this paper, we empirically study and analyze the well-known Good-Turing and advanced Good-Turing smoothing methods for language models on a large Chinese corpus. Ten models are generated sequentially on corpora of various sizes, from 30 M to 300 M Chinese words of the CGW corpus. In our experiments, Good-Turing and advanced Good-Turing smoothing are evaluated on inside testing and outside testing. Based on the experimental results, we further analyze the perplexity trends of the smoothing methods, which are useful for choosing an effective smoothing method to alleviate data sparseness in language models of various sizes. Finally, some helpful observations are described in detail.


Introduction
Speech processing (SP) studies speech signals and the methods for processing these digital signals. It is often combined with natural language processing (NLP). As the technology spreads day by day, information systems with speech services have become an important trend. Speech processing can be divided into two broad domains: speech recognition and speech synthesis. The former recognizes a speech signal and outputs the corresponding text; the latter synthesizes speech with fluent prosody from input text or articles.
In many domains of NLP, such as speech recognition [1] and machine translation [2], statistical language models (LMs) [3] play an important role.

Language Models
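Statistical language models have been widely used in NLP. Suppose that $W = w_1, w_2, w_3, \ldots, w_n$, where $w_i$ and $n$ denote the $i$th Chinese character and the number of characters in a sentence ($1 \le i \le n$). The probability $P(W) = P(w_1, w_2, \ldots, w_n)$ can be calculated by using the chain rule:
$$P(W) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 \ldots w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_1 \ldots w_{i-1})$$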

N-Gram Models
Basically, an N-gram model is a so-called $(N-1)$th-order Markov model, which calculates the conditional probability of successive events: the probability of the $N$th event given that the preceding $N-1$ events have occurred.
An N-gram language model is simply expressed as follows:
$$P(w_i \mid w_{i-N+1} \ldots w_{i-1}) = \frac{C(w_{i-N+1} \ldots w_{i-1} w_i)}{C(w_{i-N+1} \ldots w_{i-1})} \quad (3)$$
where $C(w)$ denotes the count of event $w$ occurring in the dataset.
In formula (3), the obtained probability $P(\cdot)$ is the so-called maximum likelihood estimate (MLE). When predicting the pronunciation category, we can predict based on the probability of each category $t$ ($1 \le t \le T$), where $T$ denotes the number of categories for the polyphonic character. The category with the maximum probability $P_{max}(\cdot)$ is the target, and the correct pronunciation of the polyphonic character can then be decided.
As shown in Equation (3), $C(\cdot)$ of a novel (unknown) event, one that does not occur in the training corpus, may be zero because the training data are limited while language is infinite and keeps expanding. It is always hard to collect sufficient data. Smoothing methods are therefore needed to alleviate the zero-count issue for statistical language models.
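The zero-count problem of the maximum likelihood estimate can be illustrated with a minimal Python sketch for the bigram case. The toy token list and the helper name `mle_bigram_model` are hypothetical and are not taken from the experiments; an unseen history is simply reported as probability 0 here.

```python
from collections import Counter

def mle_bigram_model(tokens):
    """Return a function computing the MLE bigram probability P(w | prev)."""
    history_counts = Counter(tokens[:-1])               # C(prev), counted as a history
    bigram_counts = Counter(zip(tokens, tokens[1:]))    # C(prev w)

    def prob(prev, w):
        if history_counts[prev] == 0:
            return 0.0  # unseen history: the MLE is undefined, reported as 0 here
        return bigram_counts[(prev, w)] / history_counts[prev]

    return prob

# Toy corpus (hypothetical), used only to show the effect.
tokens = ["a", "b", "a", "b", "a", "c"]
p = mle_bigram_model(tokens)
print(p("a", "b"))  # seen bigram: positive probability
print(p("b", "c"))  # unseen bigram: probability 0 without smoothing
```

Any test sentence containing the unseen bigram would receive probability zero under this estimate, which is the situation smoothing is meant to avoid.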

Processes of Smoothing Methods
As described above, the zero-count issue [4] of unknown events degrades language models; therefore, smoothing methods are needed to alleviate it. The idea of smoothing is to move part of the total probability of seen events to unseen events, leaving some probability mass (the so-called escape probability, $P_{esc}$) for all the unseen events.
Smoothing algorithms [5,6] can be regarded as discounting some counts of seen events in order to obtain the escape probability $P_{esc}$, which is then assigned to unseen events according to the smoothing algorithm. The adjustment of the smoothed probability for all possible events involves two processes, discounting and redistributing:

Discounting Process
Because probabilities must be normalized, the probabilities of all seen and unseen (unknown) events sum to unity (one). The first operation of a smoothing method is the discounting process, which discounts the probability of all seen events; that is, the probability of each seen event is decreased slightly.

Redistributing Process
In this operation, the escape probability discounted from all seen events is redistributed to the unseen events. The escape probability is usually shared by all unseen events; that is, it is redistributed uniformly, so each unseen event receives $P_{esc}/U$, where $U$ is the number of unseen events. In other words, each unseen event obtains the same probability under this smoothing criterion.
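The two processes can be sketched as follows, assuming a fixed escape probability (the hypothetical parameter `p_esc`) and proportional discounting of the seen events. Actual smoothing methods derive the discount from the counts themselves, so this is only an illustration of the mechanism.

```python
def discount_and_redistribute(counts, vocab, p_esc=0.1):
    """Return smoothed probabilities over all events in vocab.

    counts: dict mapping seen events to positive counts.
    vocab:  all possible events (seen and unseen).
    p_esc:  escape probability reserved for unseen events (hypothetical value).
    """
    total = sum(counts.values())
    unseen = [w for w in vocab if counts.get(w, 0) == 0]
    if not unseen:
        # Nothing to redistribute to: keep the unsmoothed probabilities.
        return {w: counts[w] / total for w in vocab}
    # Discounting: scale every seen probability down by (1 - p_esc).
    probs = {w: (1 - p_esc) * c / total for w, c in counts.items() if c > 0}
    # Redistributing: each unseen event receives the same share P_esc / U.
    for w in unseen:
        probs[w] = p_esc / len(unseen)
    return probs

# Toy example (hypothetical counts): "c" and "d" are unseen.
probs = discount_and_redistribute({"a": 3, "b": 1}, ["a", "b", "c", "d"], p_esc=0.2)
print(probs)
```

The smoothed probabilities still sum to one, and both unseen events receive the same share of the escape probability.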

The Smoothing Methods
In this section, the well-known Good-Turing and advanced Good-Turing smoothing methods are presented; they are evaluated in the next section.

Good-Turing Smoothing
The Good-Turing smoothing method was first published by I. J. Good in 1953 [7], who credited the idea to A. M. Turing; it originated in their work on deciphering the German Enigma code during World War II. Some previous works can be found in [8,9]. The notation $n_c$ denotes the number of n-grams with exactly count $c$ in the corpus. For example, $n_0$ is the number of n-grams with zero count and $n_1$ is the number of n-grams that occur exactly once in the training data. Therefore, $n_c$ is described as:
$$n_c = \sum_{w:\,C(w) = c} 1 \quad (4)$$
where $w$ denotes a bigram in the training corpus. Based on Good-Turing smoothing, the redistributed count $c^*$ is expressed in terms of $n_c$, $n_{c+1}$, and $c$ as:
$$c^* = (c+1)\,\frac{n_{c+1}}{n_c} \quad (5)$$
Note that the numerator in Equation (3) is replaced by $c^*$ of Equation (5). In other words, the count $c$ of an event is now adjusted by the smoothing method. For bigram models, $P(\cdot)$ of Equation (3) is modified as:
$$P_{GT}(w_i \mid w_{i-1}) = \frac{c^*(w_{i-1} w_i)}{C(w_{i-1})} \quad (6)$$
The probability of Equation (6) is called the Good-Turing estimator.
Similarly, the recount $c^*$ for n-grams of other orders can be derived from Equation (5). As shown in Equation (6), Good-Turing smoothing employs only the n-gram model itself to smooth the probability, rather than interpolating higher- and lower-order models (such as $(n-1)$-grams). Hence, Good-Turing is usually used as a tool by other smoothing methods.
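The Good-Turing recount can be sketched in Python as follows, assuming the standard form $c^* = (c+1)\,n_{c+1}/n_c$. The function name and the toy bigram counts are hypothetical; counts for which $n_{c+1}$ is zero are returned as `None`, because the plain estimate is not usable there.

```python
from collections import Counter

def good_turing_recounts(event_counts):
    """Map each observed count c to c* = (c + 1) * n_{c+1} / n_c."""
    # n[c] is the number of distinct events seen exactly c times.
    n = Counter(event_counts.values())
    recounts = {}
    for c in sorted(n):
        if n[c + 1] == 0:
            recounts[c] = None  # n_{c+1} = 0: no usable plain Good-Turing estimate
        else:
            recounts[c] = (c + 1) * n[c + 1] / n[c]
    return recounts

# Toy bigram counts (hypothetical): n_1 = 3, n_2 = 2, n_3 = 1.
toy_counts = {"ab": 1, "bc": 1, "cd": 1, "de": 2, "ef": 2, "fg": 3}
recounts = good_turing_recounts(toy_counts)
print(recounts)
```

In this toy case the largest count has no recount at all, which illustrates why the estimate is rarely applied unchanged to every count.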
The Good-Turing method does not consider the situation $n_{c+1} = 0$. Katz [10] proposed a revised method for calculating $c^*$ as follows:
$$c^* = \frac{(c+1)\,\dfrac{n_{c+1}}{n_c} - c\,\dfrac{(k+1)\,n_{k+1}}{n_1}}{1 - \dfrac{(k+1)\,n_{k+1}}{n_1}}, \quad \text{for } 1 \le c \le k \quad (7)$$
The formula uses a threshold value $k$. Only for events with a count between 1 and $k$ ($1 \le c \le k$) is the adjusted count $c^*$ calculated according to the formula; the count of an event larger than $k$ is not changed ($c^* = c$ for all $c > k$). Katz suggested setting the threshold $k$ to 5. Several previous works can be found in [2,7]. The influence of the threshold $k$ on Good-Turing is further evaluated in Section 4.
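Katz's revised recount can be sketched as below, assuming the commonly cited form $c^* = \left[(c+1)\,n_{c+1}/n_c - c\,a\right] / (1 - a)$ with $a = (k+1)\,n_{k+1}/n_1$ for $1 \le c \le k$. The function name and the counts-of-counts are hypothetical and chosen only for illustration.

```python
def katz_recount(c, n, k=5):
    """Katz-style recount for an observed count c >= 1.

    n maps a count r to n_r, the number of events seen exactly r times.
    Assumes n_1 > 0, n_c > 0 and (k + 1) * n_{k+1} != n_1.
    """
    if c < 1:
        raise ValueError("c must be at least 1")
    if c > k:
        return float(c)  # counts above the threshold are kept unchanged
    a = (k + 1) * n.get(k + 1, 0) / n[1]
    good_turing = (c + 1) * n.get(c + 1, 0) / n[c]
    return (good_turing - c * a) / (1 - a)

# Hypothetical counts-of-counts, for illustration only.
n = {1: 1000, 2: 400, 3: 200, 4: 120, 5: 80, 6: 60}
for c in range(1, 8):
    print(c, katz_recount(c, n))

# Total count mass removed from events with 1 <= c <= k.
total_discount = sum(n[c] * (c - katz_recount(c, n)) for c in range(1, 6))
print(total_discount)
```

With these toy values the total discount over $1 \le c \le k$ equals $n_1$, the count mass that Good-Turing reserves for unseen events, while counts above $k$ are untouched.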

Advanced Issues of Good-Turing Method
Good-Turing smoothing has been employed in many natural language applications. Previous works [3,11,12] discussed the related parameters, such as the cut-off $k$ in the Good-Turing method. However, these works used English corpora only. In this section, we focus on the Good-Turing method for a Mandarin corpus and further analyze the problems of Good-Turing for Chinese corpora of various sizes and different cut-off values $k$.
As shown in Equation (8), Good-Turing re-estimates the count $c^*$ of all events in terms of the original count $c$ and the event numbers $n_c$ and $n_{c+1}$. In practice, the discounted count $c^*$ is not used for all counts $c$, because larger counts are assumed to be much more reliable. The recounts $c^*$ are set as proposed by Katz [5], where $c$ denotes the count of an event, $c^*$ denotes the recount of an event as suggested by Katz (1987) for English data, $n_i$ denotes the number of bigrams with count $i$, and $k$ denotes the cut-off value.
Good-Turing was first applied as a smoothing method for n-gram models by Katz [5]. Until now, few papers have discussed the relationship between the cut-off $k$ and entropy for Mandarin corpora, or even for English. Katz suggested a cut-off of $k = 5$ as the threshold for English corpora. Another important parameter of Good-Turing, not discussed in previous works, is the best cut-off $k_b$ in terms of the training size $N$.
For the Chinese character unigram model, we first calculate the recount $c^*$ ($c \ge 0$). According to the empirical results, some recounts $c^*$ are negative ($< 0$). A negative recount leads to a negative probability $P$ and thus violates the axioms of probability. For instance, for $c = 8$, $n_8 = 106$, $n_9 = 67$, and $k = 10$, the calculated recount $c^*$ is $-20.56$.
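The arithmetic behind this example can be checked as a rough sketch, assuming the recount was computed with Katz's revised formula. The values of $n_1$ and $n_{11}$ are not stated here, so the ratio $a = (k+1)\,n_{k+1}/n_1$ is left symbolic; it is an assumption of this sketch, not a reported quantity.

```latex
\begin{align*}
c^* &= \frac{(c+1)\,\dfrac{n_{c+1}}{n_c} - c\,a}{1 - a},
  \qquad a = \frac{(k+1)\,n_{k+1}}{n_1} = \frac{11\,n_{11}}{n_1}, \\
(c+1)\,\frac{n_{c+1}}{n_c} &= 9 \cdot \frac{67}{106} \approx 5.69, \\
c^* &\approx \frac{5.69 - 8a}{1 - a}, \\
c^* < 0 &\iff 0.711 < a < 1, \\
c^* = -20.56 &\implies a \approx \frac{5.69 + 20.56}{8 + 20.56} \approx 0.919.
\end{align*}
```

Under this assumption, a negative recount arises whenever $11\,n_{11}$ is close to, but smaller than, $n_1$, that is, when relatively few events occur exactly once.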

Models Evaluation-Cross Entropy and Perplexity
Two commonly used measures for evaluating the quality of a language model (LM) are entropy and perplexity [13,14]. Suppose that a sample $T$ consists of $m$ events (strings) $e_1, e_2, \ldots, e_m$. The probability $P$ of a given testing sample $T$ is calculated as follows:
$$P(T) = \prod_{i=1}^{m} P(e_i)$$
where $P(e_i)$ is the probability of the event $e_i$, and $E(T)$ can be regarded as the coded length for all events in the testing dataset. $E(T)$ and $PP(T)$ denote the entropy (log model probability) and perplexity for the testing dataset $T$, respectively, and $E_{min}$ stands for the minimum entropy for a model. The perplexity PP is usually regarded as the average number of possible candidates that can follow a known sequence. When a language model is employed to predict the next word in a given context, the perplexity is adopted to compare and evaluate n-gram statistical language models.
In general, lower entropy $E$ leads to lower PP for a language model, and the lower the PP, the better the performance of the language model. Perplexity is therefore a quality measure for LMs. When two language models, $LM_1$ and $LM_2$, are compared, the one with lower perplexity is the better representation of the language and commonly provides higher performance.
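A minimal sketch of this evaluation is given below, assuming the common definitions $E(T) = -\frac{1}{m}\sum_{i=1}^{m} \log_2 P(e_i)$ and $PP(T) = 2^{E(T)}$. The exact normalization used in the experiments is not shown in this section, so these definitions and the function name are assumptions.

```python
import math

def entropy_and_perplexity(event_probs):
    """Return (E, PP): per-event entropy in bits and perplexity 2 ** E."""
    m = len(event_probs)
    if m == 0:
        raise ValueError("the test sample is empty")
    if any(p <= 0.0 for p in event_probs):
        # A zero probability makes the entropy infinite; smoothing prevents this.
        raise ValueError("every event needs a positive probability")
    entropy = -sum(math.log2(p) for p in event_probs) / m
    return entropy, 2.0 ** entropy

# Four test events, each assigned probability 1/4 by a hypothetical model.
entropy, perplexity = entropy_and_perplexity([0.25, 0.25, 0.25, 0.25])
print(entropy, perplexity)
```

A model that assigns every test event probability 1/4 behaves as if it were choosing among four equally likely candidates, which matches the reading of perplexity as an average number of candidates.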
In fact, the true probability distribution of the test data is usually unknown. The cross entropy (CE) is another measure for evaluating a language model. A model that better predicts the next event achieves lower cross entropy. In general, $CE \ge E$, where $E$ denotes the entropy obtained when the same language model $M$ is used for training and testing. The cross entropy can be expressed as:

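$$CE(p, M) = -\lim_{n \to \infty} \frac{1}{n} \sum_{w_1 \ldots w_n} p(w_1 w_2 \ldots w_n)\,\log M(w_1 w_2 \ldots w_n) \quad (12)$$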
Based on the Shannon-McMillan-Breiman theorem [7], formula (12) can be simplified as follows:
$$CE(p, M) = -\lim_{n \to \infty} \frac{1}{n} \log M(w_1 w_2 \ldots w_n)$$
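The relation between cross entropy and entropy can be illustrated on a small discrete distribution. This is only a sketch: the distributions `p` (true) and `q` (model) are hypothetical, and single events stand in for word sequences.

```python
import math

def cross_entropy(p, q):
    """Cross entropy in bits of model q measured against distribution p."""
    return -sum(p[x] * math.log2(q[x]) for x in p if p[x] > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}   # hypothetical true distribution
q = {"a": 0.25, "b": 0.25, "c": 0.5}   # hypothetical mismatched model

entropy = cross_entropy(p, p)  # the model equals the true distribution
ce = cross_entropy(p, q)       # the model differs from the true distribution
print(entropy, ce)
```

The mismatched model needs more bits per event than the matched one, so its cross entropy is never below the entropy.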

Experiments and Evaluation
Chinese Giga Word (CGW) is a Chinese corpus collected from several world news sources and issued by the Linguistic Data Consortium (LDC). In this paper, we adopted the newest version, CGW 3.0, published in September 2009. The CGW news sources are Agence France-Presse, the Central News Agency of Taiwan, the Xinhua News Agency of Beijing, and the Zaobao newspaper of Singapore.
In this paper, we create 10 unigram language models of Chinese words for the experiments. First, we read articles of Chinese words from the CGW corpus at random, and a language model $LM_1$ is created from the first $3 \times 10^7$ (30 M) Chinese words. Next, a new model $LM_2$ is created by adding the next $3 \times 10^7$ Chinese words. In other words, $LM_2$ consists of the first $6 \times 10^7$ (60 M) Chinese words of CGW, the first half of which is also used to create $LM_1$. The remaining models, up to $LM_{10}$ with 300 M words, are created in the same way.
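The incremental construction of the models can be sketched as follows. The function name, the toy word list, and the step of two words are hypothetical stand-ins for the $3 \times 10^7$-word increments of the experiments.

```python
from collections import Counter
from itertools import islice

def cumulative_unigram_models(words, step, num_models):
    """Build unigram count models over the first step, 2*step, ... words."""
    stream = iter(words)
    counts = Counter()
    models = []
    for _ in range(num_models):
        # Add the next `step` words to the counts of the previous model.
        counts.update(islice(stream, step))
        models.append(counts.copy())  # snapshot: model i covers the first (i+1)*step words
    return models

# Toy word stream (hypothetical); each model extends the previous one.
words = "a b c a b c a b".split()
models = cumulative_unigram_models(words, step=2, num_models=3)
for i, model in enumerate(models, start=1):
    print(i, sum(model.values()), dict(model))
```

Each model contains all the data of its predecessors, so the training sets are nested in the same way as described for $LM_1$ and $LM_2$.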
In this paper, the 10 language models created from corpora of different sizes are evaluated sequentially for inside testing. As shown in Table 1, the x-axis and y-axis represent the training models (TrM) and testing models (TeM), respectively. In each row of Table 1, the testing models are used to evaluate the 10 training models TrM. Conversely, the 10 testing models TeM are each used to evaluate one of the 10 training models for GT smoothing. Figure 1 presents the results of Table 1 in three dimensions.
The perplexity (PP) results for Good-Turing smoothing are drawn in Figure 2. Note that the smaller the testing model, the lower the perplexity, and the larger the training model, the higher the perplexity. For each row, the lowest PP, shown in bold on the diagonal of the table, is achieved when the training and testing models are the same. All the results are consistent with the relation $CE \ge E$ described in Section 3.3.

Advanced Issues in Good-Turing
As described in Section 3.1, the situation $n_{c+1} = 0$ is not handled by the Good-Turing method. Katz [5] proposed a revised method for calculating $c^*$ based on Equation (7). In our experiments, Good-Turing (GT) smoothing is superior to advanced Good-Turing (AGT) smoothing for all the testing models.
On the other hand, AGT sets the threshold $k$ ($k = 5$) and avoids the issue caused by the situation $n_{c+1} = 0$, but it degrades performance (higher perplexity) for all training models.
Figure 3 presents the perplexity PP of Good-Turing smoothing. A comparison of Good-Turing (GT) and advanced Good-Turing (AGT) smoothing is displayed in Figure 4. GT is superior to AGT for all 10 TrM models. Our observation is that the cut-off $k$ proposed by Katz for Good-Turing smoothing does not achieve better performance, although it avoids the situation that would lead to a violation of probability. In Figure 4, the same two observations hold for both GT and AGT smoothing:
1) The lowest PP is achieved on $TrM_{120M}$ for the average PP of the lower triangle of each training model.
2) The lowest PP is also achieved on the models $TrM_{180M}$ and $TrM_{210M}$ for the average PP of each training model. The perplexity for TrM with more than 180 M words gradually increases.
Based on the PP evaluation results for the smoothing methods, we conclude that, in the general case of average perplexity, the lowest PP is achieved on model $TrM_{120M}$. The experimental results suggest that models created from corpora larger than 180 M words do not achieve better performance. On the other hand, our experiments indicate that a model built from a middle-sized corpus of 180 M Chinese words can achieve the best language model performance.

Evaluation of Outside Testing
In the following experiments, text from the ASBC corpus is used as the outside dataset.
Ten Chinese language models, $LM_1$, $LM_2$, ..., $LM_{10}$, which contain different numbers of Chinese words from CGW 3.0, are evaluated for outside testing. In our experiments, the perplexity is calculated on these 10 models. Based on the empirical results, the influence of the cut-off $k$ for AGT smoothing varies with the size of the testing model TeM. The smaller the cut-off $k$, the lower the perplexity for all TeMs.
The same observation also holds for GT smoothing. Figure 5 presents the PP for GT smoothing with various $k$ in outside testing. Note that the PP decreases gradually as the size of the model increases, which means that perplexity is affected by the model size.

Conclusion
In this paper, we empirically studied and analyzed well-known smoothing methods for language models on a large Chinese corpus. The smoothing methods, Good-Turing and advanced Good-Turing, were evaluated for inside testing and outside testing. We further analyzed the experimental results, which are helpful for choosing an effective smoothing method to alleviate data sparseness for training corpora of various sizes. Some helpful observations are described in detail for both GT and AGT smoothing.

In general, unigram, bigram, and trigram models ($N \le 3$) are generated. An N-gram model calculates $P(\cdot)$ of the $N$th event from the preceding $N - 1$ events, rather than from the whole preceding string $w_1, w_2, w_3, \ldots, w_{k-1}$.

Figure 2. Results of perplexity PP of Good-Turing.

Figure 4. Average perplexity PP of GT and AGT smoothing.

Figure 5. PP trend for GT (upper) and AGT (lower) smoothing on various k, outside testing.

Academia Sinica Balanced Corpus version 3.0 (ASBC) is composed of 9228 text files distributed over different fields, occupying 118 MB and containing nearly 5 million Chinese words labeled with POS tags. The contents and article distributions of ASBC are listed in Table 5.