An Empirical Study of Good-Turing Smoothing for Language Models on Different Size Corpora of Chinese

Copyright © 2013 SciRes. JCC

3.2. Advanced Issues of Good-Turing Method

Good-Turing smoothing has been employed in many natural language applications. Previous works [3,11,12] discussed related parameters, such as the cut-off k in the Good-Turing method. However, these works employed English corpora only. In this section, we will focus on the Good-Turing method on Mandarin corpora and further analyze the problems of Good-Turing for Chinese corpora of various sizes and with different cut-off values k.

As shown in Equation (8), Good-Turing re-estimates the count c* of every event in terms of the original count c and the event numbers n_c and n_{c+1}. In practice, the discounted count c* is not used for all counts c: larger counts c are assumed to be reliable as they stand. The recounts c* are therefore set by Katz [5], where c denotes the count of an event, c* denotes the recount of an event as suggested by Katz (1987) for English data, n_i denotes the number of bigrams with i counts, and k denotes the cut-off value.
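For reference, the recount rule suggested by Katz (1987) is commonly written as follows; Equation (8) itself lies outside this excerpt, so the standard form is quoted here as an assumption:

```latex
c^* \;=\; \frac{(c+1)\,\dfrac{n_{c+1}}{n_c} \;-\; c\,\dfrac{(k+1)\,n_{k+1}}{n_1}}
              {1 \;-\; \dfrac{(k+1)\,n_{k+1}}{n_1}}
\quad \text{for } 1 \le c \le k,
\qquad
c^* = c \quad \text{for } c > k.
```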

Good-Turing was first applied as a smoothing method

for n-gram models by Katz [5]. To date, few papers have discussed the relationship between the cut-off k and entropy for Mandarin corpora, or even for English ones. Katz suggested a cut-off k of 5 as the threshold for English corpora. Another important parameter of Good-Turing is the best cut-off k_b (never discussed in previous works) as a function of the training size N.

For the Chinese character unigram model, we first calculate the recount c* (c >= 0). According to the empirical results, some recounts c* are negative (< 0). In such cases, the recount further leads to a negative probability P, which violates statistical principles. For instance, with c = 8, n_8 = 106, n_9 = 67, and k = 10, the calculated recount c* is negative: −20.56.
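The negative-recount problem can be reproduced with a small sketch of Katz's recount formula. The values n_8 = 106 and n_9 = 67 come from the example above, but n_1 and n_{11} are hypothetical fillers (the paper does not list them), so the exact value differs from the reported −20.56; the sign of the result is the point:

```python
def katz_recount(c, n, k):
    """Katz (1987) discounted recount c* for an original count c,
    given the count-of-counts table n (n[i] = number of events seen
    i times) and the cut-off k. Counts above k stay undiscounted."""
    if c > k:
        return float(c)
    ratio = (k + 1) * n[k + 1] / n[1]
    numerator = (c + 1) * n[c + 1] / n[c] - c * ratio
    denominator = 1 - ratio
    return numerator / denominator

# n8 = 106 and n9 = 67 are from the paper's example; n1 and n11 are
# invented here only to demonstrate the failure mode.
n = {1: 1000, 8: 106, 9: 67, 11: 82}
c_star = katz_recount(8, n, k=10)
print(c_star < 0)  # True: the recount is negative, an invalid "count"
```

A negative c* arises whenever the numerator and denominator of the formula take opposite signs, which depends entirely on the shape of the count-of-counts table.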

3.3. Model Evaluation: Cross Entropy and Perplexity

Two commonly used schemes for evaluating the quality of a language model (LM) are entropy and perplexity [13,14]. Suppose that a sample T consists of several events e_1, e_2, \ldots, e_m of m strings. The probability P for a given testing sample T is calculated as follows:

P(T) = \prod_{i=1}^{m} P(e_i)    (9)

where P(e_i) is the probability of the event e_i, and the entropy E(T) can be regarded as the coded length for all events in the testing dataset:

E(T) = -\sum_{x} P(x) \log_2 P(x) = -\sum_{i=1}^{m} P(e_i) \log_2 P(e_i)    (10)

PP(T) = 2^{E(T)}    (11)

where E(T) and PP(T) denote the entropy (log model probability) and the perplexity of the testing dataset T, respectively, and E_min stands for the minimum entropy of a model. The perplexity PP is usually interpreted as the average number of possible candidates for the next item given a known sequence. When a language model is employed to predict the next word in the current context, perplexity is adopted to compare and evaluate n-gram statistical language models.
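Equations (10) and (11) can be illustrated with a short sketch. As an assumption on our part, E is computed here as the per-event average log probability (the form typically used when scoring a finite test sample) rather than as a sum over the whole event space:

```python
import math

def entropy_and_perplexity(event_probs):
    """Per-event cross entropy E of a test sample and the corresponding
    perplexity PP = 2^E. `event_probs` holds the model probability
    P(e_i) assigned to each event in the sample."""
    m = len(event_probs)
    entropy = -sum(math.log2(p) for p in event_probs) / m
    perplexity = 2 ** entropy
    return entropy, perplexity

# Toy sample: four events, each assigned probability 0.25 by the model.
e, pp = entropy_and_perplexity([0.25, 0.25, 0.25, 0.25])
print(e, pp)  # 2.0 4.0 — a uniform choice among 4 candidates
```

The perplexity of 4.0 matches the intuition above: on average the model is choosing among four equally likely candidates.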

In general, lower entropy E leads to lower PP for a language model; the lower the PP, the better the performance of the language model. Therefore, perplexity is a quality measurement for LMs. When two language models, LM1 and LM2, are compared, the one with the lower perplexity is the better language representation and commonly provides higher performance.

In fact, the probability distribution for testing language models is usually unknown. The cross entropy (CE) is another measure for evaluating a language model. The model that better predicts the next occurring event always achieves lower cross entropy. In general, CE(p, M) ≥ E, where E denotes the entropy obtained when the same language model M is used for both training and testing. The cross entropy can be expressed as:

CE(p, M) = -\lim_{n \to \infty} \frac{1}{n} \sum_{W \in L} p(w_1 w_2 w_3 \ldots w_n) \log M(w_1 w_2 w_3 \ldots w_n)    (12)

Based on the Shannon–McMillan–Breiman theorem [7], Formula (12) can be simplified as follows:

CE(p, M) = -\lim_{n \to \infty} \frac{1}{n} \log M(w_1 w_2 w_3 \ldots w_n)    (13)
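A rough sketch of estimating Equation (13) on a finite held-out sample: the limit is approximated by a fixed n, and a simple add-one-smoothed unigram model stands in for the Good-Turing-smoothed model used in the paper (both substitutions are assumptions for illustration):

```python
import math
from collections import Counter

def unigram_cross_entropy(train_words, test_words):
    """Approximate CE(p, M) = -(1/n) * log2 M(w_1 ... w_n) on a finite
    held-out sample, using an add-one-smoothed unigram model M so that
    unseen test words receive nonzero probability."""
    counts = Counter(train_words)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen words

    def prob(w):
        return (counts[w] + 1) / (total + vocab)  # add-one smoothing

    n = len(test_words)
    return -sum(math.log2(prob(w)) for w in test_words) / n

train = "the cat sat on the mat".split()
test = "the cat sat".split()
ce = unigram_cross_entropy(train, test)
print(ce)  # bits per word on the held-out sample
```

Comparing this estimate across models trained on corpora of different sizes is exactly the evaluation carried out in the experiments below.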

4. Experiments and Evaluation

Chinese Giga Word (CGW) is the Chinese corpus col-

lected from several world news databases and issued by

Linguistic Data Consortium (LDC). In the paper, we

adopted the CGW 3.0 of newest version published on

September 2009. The CGW news sources are Agence

France-Presse, Central News Agency of Taiwan, Xinhua

News Agency of Beijing, aned Zaobao Newspaper of

Singapore.

In this paper, we create 10 unigram language models with Chinese words for the experiments. First, Chinese words are read at random from the CGW corpus, and a language model LM1 is created from the first 3 × 10^7 (30 M) Chinese words. A new model LM2 is then created by adding the next 3 × 10^7 Chinese words. In other words, LM2 consists of the first 6 × 10^7 (60 M) Chinese words of CGW, the first half of which is also used to create LM1.
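The nested construction of LM1, LM2, … can be sketched as follows; the toy corpus and slice size are placeholders for the 3 × 10^7-word slices drawn from CGW:

```python
from collections import Counter

def cumulative_unigram_models(words, slice_size, n_models):
    """Build nested unigram count models: model i covers the first
    i * slice_size words, so each model contains all the data of its
    predecessor (mirroring how LM2 extends LM1 in the paper)."""
    return [Counter(words[: i * slice_size])
            for i in range(1, n_models + 1)]

# Toy demo with a tiny "corpus" of single characters.
corpus = list("ABABCABCD")
lm1, lm2 = cumulative_unigram_models(corpus, slice_size=4, n_models=2)
print(lm1["A"], lm2["A"])  # 2 3 — LM2 includes LM1's data plus more
```

The raw counts collected this way are what the Good-Turing recounting of Section 3.2 would then discount.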

In this paper, the 10 language models created from corpora of different sizes are evaluated sequentially for inside