1. Introduction
When testing whether two nominal-scale attributes (one having $I$ discrete values, the other one $J$) are independent, the traditional (Pearson's) test uses the following test statistic
$$\sum_{i=1}^{I}\sum_{j=1}^{J}\frac{\left(X_{ij}-n\,\hat{p}_{i}\,\hat{q}_{j}\right)^{2}}{n\,\hat{p}_{i}\,\hat{q}_{j}}\qquad\qquad(1)$$
where $n$ is the total number of observations, $X_{ij}$ is the number of those resulting in the $i^{\text{th}}$ value of the first attribute and in the $j^{\text{th}}$ value of the second attribute, and $\hat{p}_{i}$ and $\hat{q}_{j}$ are the estimated marginal probabilities defined below. The null hypothesis claims that the probability of an observation being of the $(i,j)$ type is $p_{i}\,q_{j}$, where $\sum_{i=1}^{I}p_{i}=1$ and $\sum_{j=1}^{J}q_{j}=1$; note that the multinomial probability mass function of the $X_{ij}$'s is therefore
$$\frac{n!}{\prod_{i=1}^{I}\prod_{j=1}^{J}X_{ij}!}\,\prod_{i=1}^{I}\prod_{j=1}^{J}\left(p_{i}\,q_{j}\right)^{X_{ij}}\qquad\qquad(2)$$
having $I+J-2$ algebraically independent parameters [1].
This model implies that the conditional probability of an observation's second attribute being equal to $j$, given that the value of its first attribute is $i$, is equal to $q_{j}$ (for any $i$ and $j$), i.e. that the attributes are independent of each other. The test also assumes that the $p_{i}$ and $q_{j}$ probabilities are unknown and are estimated by their maximum-likelihood estimators, namely
$$\hat{p}_{i}=\frac{\sum_{j=1}^{J}X_{ij}}{n}\qquad\text{and}\qquad\hat{q}_{j}=\frac{\sum_{i=1}^{I}X_{ij}}{n}$$
respectively. The alternate hypothesis allows the probabilities of the $X_{ij}$'s to have any non-negative values, as long as these add up to 1, thus resulting in $I\cdot J-1$ algebraically independent parameters. When the null hypothesis fails, the above test statistic tends to increase in value; the critical region is therefore always an upper tail of the corresponding distribution.
Another test statistic, proposed decades later for testing the same hypothesis, is (using the same notation, for future convenience)
$$2\sum_{i=1}^{I}\sum_{j=1}^{J}X_{ij}\,\ln\frac{X_{ij}}{n\,\hat{p}_{i}\,\hat{q}_{j}}\qquad\qquad(3)$$
of the so-called G test [2] (note that $X_{ij}=0$, which happens occasionally, makes the corresponding term of the last expression equal to 0). It is well known (something we verify shortly) that, in the $n\to\infty$ limit, both Formula (1) and Formula (3) converge to the same expression, whose asymptotic distribution is chi-squared with $(I-1)(J-1)$ degrees of freedom [3].
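For readers wishing to experiment with the two statistics, the following minimal Python sketch (not part of the original article; the 3-by-3 table of counts and all variable names are hypothetical) computes Formula (1) and Formula (3) for a given table and the corresponding chi-squared tail probabilities with $(I-1)(J-1)$ degrees of freedom.

```python
import numpy as np
from scipy.stats import chi2

def pearson_and_g(table):
    """Return the Pearson statistic (1) and the G statistic (3) for a table of counts."""
    X = np.asarray(table, dtype=float)
    n = X.sum()
    p_hat = X.sum(axis=1) / n          # estimated row-marginal probabilities
    q_hat = X.sum(axis=0) / n          # estimated column-marginal probabilities
    E = n * np.outer(p_hat, q_hat)     # expected counts under independence
    pearson = ((X - E) ** 2 / E).sum()
    mask = X > 0                       # empty cells contribute 0 to the G statistic
    g = 2 * (X[mask] * np.log(X[mask] / E[mask])).sum()
    return pearson, g

table = [[12, 7, 9], [8, 15, 6], [10, 9, 14]]        # hypothetical 3-by-3 counts
U1, U2 = pearson_and_g(table)
df = (3 - 1) * (3 - 1)
print(U1, U2, chi2.sf(U1, df), chi2.sf(U2, df))      # statistics and tail probabilities
```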
The main objective of this article is to find the $\frac{1}{n}$-proportional correction to this distribution, separately for each of our two test statistics.
Next comes a brief review of the basic mathematical tool used throughout this article.
2. Kronecker Product
In this section $\mathbb{A}$ and $\mathbb{B}$ are matrices of any dimensions (a dimension equal to 1 implies a column or row vector; both dimensions equal to 1 represent a scalar). The Kronecker product $\mathbb{A}\otimes\mathbb{B}$ of two matrices is created by multiplying every element of $\mathbb{A}$ by the full matrix $\mathbb{B}$ and organizing the resulting blocks into a single matrix [4]. This product is clearly non-commutative, associative and distributive over addition; furthermore (assuming conformable dimensions for each matrix multiplication)
$$\left(\mathbb{A}\otimes\mathbb{B}\right)\cdot\left(\mathbb{C}\otimes\mathbb{D}\right)=\mathbb{A}\mathbb{C}\otimes\mathbb{B}\mathbb{D}$$
where both $\cdot$ and simple juxtaposition indicate matrix multiplication; note that juxtaposition takes precedence over $\otimes$, which, in turn, takes precedence over $\cdot$ (this is the reason for the duplicate notation). When $\mathbb{A}$ and $\mathbb{B}$ are square matrices ($a$ by $a$ and $b$ by $b$ respectively)
$$\det\left(\mathbb{A}\otimes\mathbb{B}\right)=\left(\det\mathbb{A}\right)^{b}\left(\det\mathbb{B}\right)^{a}$$
We also need the following Woodbury's identity [5]
$$\left(\mathbb{I}+\mathbb{A}\mathbb{B}\right)^{-1}=\mathbb{I}-\mathbb{A}\left(\mathbb{I}+\mathbb{B}\mathbb{A}\right)^{-1}\mathbb{B}$$
where $\mathbb{I}$ stands for a (conformable) identity matrix, and the following Sylvester's identity [6]
$$\det\left(\mathbb{I}+\mathbb{A}\mathbb{B}\right)=\det\left(\mathbb{I}+\mathbb{B}\mathbb{A}\right)$$
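The following short numerical check (not from the article; the matrix sizes and random seed are arbitrary) verifies the mixed-product rule, the determinant rule for Kronecker products of square matrices, and the two identities quoted above, using NumPy's kron.

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.normal(size=(3, 3)), rng.normal(size=(4, 4))
C, D = rng.normal(size=(3, 3)), rng.normal(size=(4, 4))
I12 = np.eye(12)

# mixed-product rule: (A (x) B) . (C (x) D) = AC (x) BD
print(np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D)))

# determinant of a Kronecker product of square matrices (here a = 3, b = 4)
print(np.isclose(np.linalg.det(np.kron(A, B)),
                 np.linalg.det(A) ** 4 * np.linalg.det(B) ** 3))

# Woodbury-type identity: (I + AB)^(-1) = I - A (I + BA)^(-1) B
A2, B2 = rng.normal(size=(12, 2)), rng.normal(size=(2, 12))
print(np.allclose(np.linalg.inv(I12 + A2 @ B2),
                  I12 - A2 @ np.linalg.inv(np.eye(2) + B2 @ A2) @ B2))

# Sylvester's identity: det(I + AB) = det(I + BA)
print(np.isclose(np.linalg.det(I12 + A2 @ B2), np.linalg.det(np.eye(2) + B2 @ A2)))
```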
3. Asymptotic Theory
In this section we investigate the asymptotic distribution of Formula (1) and Formula (3) when $n\to\infty$, getting the same answer for both of these [7].
3.1. Normal Limit of Formula (2)
To make our task easier, we assume that the probabilities of the Multinomial distribution defined in Formula (2) add up to a total $c$ slightly smaller than 1, thus allowing for the possibility of an extra outcome whose probability is $1-c$, denoting its observed total by $X_{0}$. This substantially simplifies subsequent development by removing the singularity of the variance-covariance (V-C) matrix of the $X_{ij}$ variables, while still letting us reach correct conclusions once $c$ is eventually set back to 1. Correspondingly modifying Formula (2) is easy: extend the bottom row of the multinomial coefficient by $X_{0}$ and further multiply by $\left(1-c\right)^{X_{0}}$. Since all variables must add up to $n$, the resulting $(IJ+1)$ by $(IJ+1)$ V-C matrix is singular, but the singularity disappears when replacing $X_{0}$ by $n-\sum_{i,j}X_{ij}$ and considering the distribution of the remaining $X_{ij}$'s only. The corresponding moments are
$$\mathbb{E}\left(X_{ij}\right)=n\,p_{i}q_{j}\qquad\quad\operatorname{Cov}\left(X_{ij},X_{k\ell}\right)=n\left(p_{i}q_{j}\,\delta_{ik}\delta_{j\ell}-p_{i}q_{j}\,p_{k}q_{\ell}\right)$$
which remain correct even with the extra $X_{0}$. We then define
$$Z_{ij}=\frac{X_{ij}-n\,p_{i}q_{j}}{\sqrt{n}}\qquad\qquad(4)$$
and substitute
$$X_{ij}=n\,p_{i}q_{j}+\sqrt{n}\,Z_{ij}$$
into the natural logarithm of Formula (2), after it has been modified by introducing the extra $X_{0}$. Adding the logarithm of the corresponding Jacobian (namely $\frac{IJ}{2}\ln n$, since the distribution of the $Z_{ij}$'s is $IJ$ dimensional) and taking the $n\to\infty$ limit of the resulting expression (all limits of this article are routinely done by a computer equipped with Mathematica or similar software) yields
$$-\frac{1}{2}\sum_{i=1}^{I}\sum_{j=1}^{J}\frac{Z_{ij}^{2}}{p_{i}q_{j}}-\frac{\left(\sum_{i,j}Z_{ij}\right)^{2}}{2\left(1-c\right)}-\frac{IJ}{2}\ln\left(2\pi\right)-\frac{1}{2}\ln\left[\left(1-c\right)\prod_{i,j}p_{i}q_{j}\right]\qquad\qquad(5)$$
This identifies the asymptotic distribution of the $Z_{ij}$'s to be multivariate Normal, with a probability density function (PDF) whose natural logarithm is given by the last expression. Its linear and quadratic moments agree with those of the original discrete distribution, being equal to 0 for the expected value of each $Z_{ij}$ and to
$$p_{i}q_{j}\,\delta_{ik}\delta_{j\ell}-p_{i}q_{j}\,p_{k}q_{\ell}\qquad\qquad(6)$$
for the covariance between $Z_{ij}$ and $Z_{k\ell}$ (for any combination of indices); the corresponding V-C matrix is thus
$$\mathbb{V}=\mathbb{P}\otimes\mathbb{Q}-\mathbf{p}\mathbf{p}^{T}\otimes\mathbf{q}\mathbf{q}^{T}\qquad\qquad(7)$$
where $\mathbf{p}$ and $\mathbf{q}$ are column vectors whose elements are the $p_{i}$ and $q_{j}$ probabilities (respectively), and $\mathbb{P}$ and $\mathbb{Q}$ are diagonal matrices with the $p_{i}$ and $q_{j}$ probabilities (respectively) on the main diagonal. The PDF can then be written, in correspondence with Formula (5), as
$$\frac{\exp\left(-\frac{1}{2}\,\mathbf{Z}^{T}\mathbb{V}^{-1}\mathbf{Z}\right)}{\sqrt{\left(2\pi\right)^{IJ}\det\mathbb{V}}}\qquad\qquad(8)$$
where $\mathbf{Z}$ is a column vector whose $IJ$ elements are the $Z_{ij}$'s (organized in accordance with our Kronecker product, i.e. $Z_{11},Z_{12},\ldots,Z_{1J},Z_{21},\ldots,Z_{IJ}$),
$$\mathbb{V}^{-1}=\mathbb{P}^{-1}\otimes\mathbb{Q}^{-1}+\frac{\mathbf{1}_{I}\mathbf{1}_{I}^{T}\otimes\mathbf{1}_{J}\mathbf{1}_{J}^{T}}{1-c}\qquad\qquad(9)$$
(with $\mathbf{1}_{I}$ and $\mathbf{1}_{J}$ denoting column vectors of $I$ and $J$ ones) and
$$\det\mathbb{V}=\left(1-c\right)\prod_{i=1}^{I}\prod_{j=1}^{J}p_{i}q_{j}\qquad\qquad(10)$$
the last two implied by Formula (5).
Alternatively, they can be derived from Formula (7) using the Woodbury and Sylvester identities. Bypassing the details, we only verify that Formula (9) is the inverse of Formula (7), by multiplying the two matrices and reducing the product to $\mathbb{I}_{I}\otimes\mathbb{I}_{J}$, where $\mathbb{I}_{I}$ and $\mathbb{I}_{J}$ are the $I$ by $I$ and $J$ by $J$ identity matrices respectively.
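As a numerical sanity check of this algebra (the specific $I$, $J$ and the randomly generated probabilities below are arbitrary, and the factor keeping the marginal sums slightly below 1 plays the role of the extra outcome), one can confirm that Formula (9) indeed inverts Formula (7) and that Formula (10) gives its determinant:

```python
import numpy as np

rng = np.random.default_rng(1)
I, J = 3, 4
p = rng.dirichlet(np.ones(I)) * 0.98       # marginal probabilities deliberately summing
q = rng.dirichlet(np.ones(J)) * 0.99       # to slightly less than 1 (the extra outcome)
c = p.sum() * q.sum()                      # total probability of the I*J cells

# V-C matrix (7), its claimed inverse (9) and determinant (10)
V = np.kron(np.diag(p), np.diag(q)) - np.kron(np.outer(p, p), np.outer(q, q))
V_inv = np.kron(np.diag(1 / p), np.diag(1 / q)) + np.ones((I * J, I * J)) / (1 - c)

print(np.allclose(V @ V_inv, np.eye(I * J)))                              # (9) inverts (7)
print(np.isclose(np.linalg.det(V),
                 (1 - c) * np.prod(np.outer(p, q)), rtol=1e-9, atol=0))   # (10)
```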
3.2. Asymptotic Distribution of U
We now take the $n\to\infty$ limit of Formula (1) and of Formula (3), first replacing $X_{ij}$ with $n\,p_{i}q_{j}+\sqrt{n}\,Z_{ij}$ by using Formula (4); this results in the same answer in both cases, namely in
$$U=\sum_{i=1}^{I}\sum_{j=1}^{J}\frac{\left(Z_{ij}-q_{j}Z_{i\bullet}-p_{i}Z_{\bullet j}\right)^{2}}{p_{i}q_{j}}\qquad\qquad(11)$$
where $Z_{i\bullet}=\sum_{j=1}^{J}Z_{ij}$ and $Z_{\bullet j}=\sum_{i=1}^{I}Z_{ij}$. To get the corresponding moment generating function (MGF) of $U$, we need to compute the expected value of $\exp\left(t\,U\right)$, i.e.
(12)
where
(13)
which is the matrix version of Formula (11); to see this, it helps to re-write Formula (11) by expanding its square and collecting the result into a quadratic form in $\mathbf{Z}$.
To find the result of Formula (12) necessitates computing the determinant of the matrix appearing in the exponent of its integrand; this matrix can be expressed in the following form
(14)
which follows from Formula (9) and Formula (13); the square brackets indicate that the matrix consists of two (and subsequently also of four) blocks.
Pre-multiplying Formula (14) by a matrix whose determinant is easily evaluated leaves us with the task of finding the determinant of
(15)
This needs to be done in two steps: we first find the determinant of one of the two factors of Formula (15), whose inverse equals
(16)
easily verifiable by simple multiplication. Sylvester's identity then implies that the determinant of Formula (15) is the product of this determinant and the determinant of (at this point, we can start replacing $c$ by 1)
(17)
where $\mathbf{0}$ is a zero column-vector of indicated length and 1 is a scalar. The determinant of the last matrix then follows, since the determinant of each of its main-diagonal blocks can be evaluated by a further application of Sylvester's identity.
In this limit, Formula (12), whose value is given by the reciprocal square root of the resulting determinant, then equals
$$\left(1-2t\right)^{-\frac{\left(I-1\right)\left(J-1\right)}{2}}\qquad\qquad(18)$$
proving the well-known result that, to this level of approximation, the distribution of $U$ is chi-squared with $(I-1)(J-1)$ degrees of freedom.
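The chi-squared limit itself is easy to confirm by simulation; the sketch below (illustrative only, with arbitrary table dimensions and a large $n$) compares empirical upper-tail probabilities of the Pearson statistic with those of the chi-squared distribution with $(I-1)(J-1)$ degrees of freedom.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
I, J, n, reps = 4, 5, 5000, 20000
cell_probs = np.full(I * J, 1 / (I * J))   # uniform cell probabilities (for simplicity)

stats = np.empty(reps)
for r in range(reps):
    X = rng.multinomial(n, cell_probs).reshape(I, J).astype(float)
    E = np.outer(X.sum(axis=1), X.sum(axis=0)) / n      # expected counts
    stats[r] = ((X - E) ** 2 / E).sum()                 # the Pearson statistic (1)

df = (I - 1) * (J - 1)
for x in chi2.ppf([0.90, 0.95, 0.99], df):              # empirical vs chi-squared tails
    print(round((stats > x).mean(), 4), round(chi2.sf(x, df), 4))
```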
3.3. Inverting Formula (14)
Making the integrand of Formula (12) into a PDF of a Normal distribution results in
(19)
whose higher (quartic and hexic in particular) moments are key to finding the $\frac{1}{n}$-proportional corrections to the MGF of each of Formula (1) and Formula (3). Furthermore, these moments are polynomial functions of its second moments (a special property of a multivariate Normal distribution), which in turn are simply the elements of the inverse of the Formula (14) matrix.
To compute this inverse, we first write Formula (14) as a product of a matrix whose inverse is easy and of the matrix introduced in Formula (15). From Formula (16) we already know the inverse of one factor of the latter, as $c$ is no longer needed and can be set equal to 1.
To complete the exercise, we need to simplify the expression obtained from Woodbury's identity as a part of inverting the remaining factor. The last line of Formula (17) has already done exactly that for the matrix in parentheses; the corresponding inverse is easily verifiable by simple matrix multiplication (note that both matrices of the resulting expression are idempotent).
This inverse still needs to be pre-multiplied and post-multiplied by the remaining matrices indicated by Woodbury's identity; subtracting the result from the identity matrix then yields the inverse of Formula (15). Finally, post-multiplying by the inverse of the first (easy) factor results in the explicit inverse of Formula (14) and the corresponding formula for computing the desired second moments, namely
(20)
(note the expected agreement with Formula (6) when $t=0$).
Consequently,
(21)
provide the remaining moments needed to proceed with computing the desired $\frac{1}{n}$-proportional corrections. Note that the fourth-order moments are then found from
$$\mathbb{E}\left(Z_{a}Z_{b}Z_{c}Z_{d}\right)=\mathbb{E}\left(Z_{a}Z_{b}\right)\mathbb{E}\left(Z_{c}Z_{d}\right)+\mathbb{E}\left(Z_{a}Z_{c}\right)\mathbb{E}\left(Z_{b}Z_{d}\right)+\mathbb{E}\left(Z_{a}Z_{d}\right)\mathbb{E}\left(Z_{b}Z_{c}\right)$$
(each single subscript standing for one of the $(i,j)$ index pairs, with subsequent special cases obtained by letting some of the four arguments coincide), and the sixth-order moments from
$$\mathbb{E}\left(Z_{a}Z_{b}Z_{c}Z_{d}Z_{e}Z_{f}\right)=\mathbb{E}\left(Z_{a}Z_{b}\right)\mathbb{E}\left(Z_{c}Z_{d}\right)\mathbb{E}\left(Z_{e}Z_{f}\right)+\cdots$$
(where the RHS consists of 15 terms corresponding to all possible ways of pairing the six arguments), usually needed in special forms in which some of the arguments coincide.
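These pairing formulas can be spot-checked by simulation; in the sketch below (not part of the article, with an arbitrary covariance matrix) a small recursive function sums the products of covariances over all pairings, reproducing a fourth-order moment (3 pairings) and a sixth-order moment (15 pairings) to within Monte-Carlo error.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
A = rng.normal(size=(d, d))
cov = A @ A.T / d                          # an arbitrary positive-definite V-C matrix
Z = rng.multivariate_normal(np.zeros(d), cov, size=2_000_000)

def pairings(cov, idx):
    """Sum, over all ways of pairing the listed indices, of products of covariances."""
    if len(idx) == 0:
        return 1.0
    first, rest = idx[0], idx[1:]
    return sum(cov[first, rest[k]] * pairings(cov, rest[:k] + rest[k + 1:])
               for k in range(len(rest)))

# a fourth-order moment (3 pairings) and a sixth-order moment (15 pairings)
print((Z[:, 0] * Z[:, 1] * Z[:, 2] ** 2).mean(), pairings(cov, (0, 1, 2, 2)))
print((Z[:, 0] ** 2 * Z[:, 1] ** 2 * Z[:, 2] * Z[:, 3]).mean(),
      pairings(cov, (0, 0, 1, 1, 2, 3)))
```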
4. $\frac{1}{n}$-Proportional Corrections
To find the desired correction to the chi-squared distribution of the Pearson test, we must include the $\frac{1}{\sqrt{n}}$- and $\frac{1}{n}$-proportional terms in the Formula (13) expansion. Similarly, we need the extra terms in the expansion of the natural logarithm of the RHS of Formula (2). Combining these two, we expand the Formula (12) integrand up to and including the $\frac{1}{n}$-proportional terms, getting
(22)
(having discarded the $\frac{1}{\sqrt{n}}$-proportional terms, whose expected values are equal to zero).
What remains to be done is to take the expected value of Formula (22); this is done by first expanding every sum in this expression, then finding the expected value of each term of these expansions, and finally carrying out the summation itself (over up to four indices). This results in scores of individual contributions, even though, rather surprisingly, most of them cancel each other upon subsequent simplification. This computation can be successfully completed only with a computer's help; the actual Mathematica program required many lines of code and is available upon request.
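The bookkeeping just described, expanding the sums, replacing each expected value by its pairing formula, and then re-summing over the indices, can be illustrated on a toy example (unrelated to Formula (22) itself; the covariance matrix and the vectors a and b below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 5
M = rng.normal(size=(d, d))
V = M @ M.T / d                            # covariance of a zero-mean normal vector Z
a, b = rng.normal(size=d), rng.normal(size=d)

# E[(a.Z)^2 (b.Z)^2]: expand the sums, apply the fourth-moment (pairing) formula
# to every term, then carry out the summation over the four indices
total = 0.0
for i in range(d):
    for j in range(d):
        for k in range(d):
            for l in range(d):
                m4 = V[i, j] * V[k, l] + V[i, k] * V[j, l] + V[i, l] * V[j, k]
                total += a[i] * a[j] * b[k] * b[l] * m4

# Monte-Carlo check of the same expected value
Z = rng.multivariate_normal(np.zeros(d), V, size=1_000_000)
print(total, ((Z @ a) ** 2 * (Z @ b) ** 2).mean())
```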
As an example from the actual computation, we demonstrate how to find the expected value of one of the Formula (22) terms. We start by ignoring the triple summations, multiplying only the first term of one factor by the other factor (note that this requires using a different summation index for each) and expanding the resulting expected value, having already discarded terms containing a zero factor. We then evaluate the remaining expected values based on Formula (20) and Formula (21), and complete the corresponding summations. To include the second term of the same factor, we simply interchange the roles of the two attributes in the previous answer (their indices, and the $p_{i}$ versus the $q_{j}$ probabilities) and add the result to the previous answer.
When this is completed for all terms of Formula (22), adding them up and simplifying (many terms cancel out, as mentioned already) results in
(23)
which involves two auxiliary quantities, one a function of the $p_{i}$ and the other of the $q_{j}$ probabilities; note that when $p_{i}=\frac{1}{I}$ for all $i$ values, the former is equal to zero, and similarly $q_{j}=\frac{1}{J}$ for all $j$ values makes the latter equal to zero.
The last expression, further divided by $n$, is the actual correction to the MGF of the test statistic Formula (13), implying that the corresponding PDF correction is a linear combination of chi-squared distributions; to get an explicit formula for this correction, make the following replacement in Formula (23)
$$\left(1-2t\right)^{-\frac{k}{2}}\;\longrightarrow\;f_{k}(x)$$
where $f_{k}(x)$ is the PDF of the chi-squared distribution with $k$ degrees of freedom. Note that each of the three corrections integrates to zero (the main reason for keeping them in this form).
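A quick way to see why such a correction leaves the total probability unchanged is to note that every term is built from differences of chi-squared PDFs; the sketch below (with made-up coefficients, not those of Formula (23)) confirms numerically that such a combination integrates to zero.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import chi2

df = 9                                      # e.g. (I-1)(J-1) for a 4-by-4 table

def correction(x):
    # made-up coefficients; any combination of PDF differences integrates to zero
    return 0.3 * (chi2.pdf(x, df + 2) - chi2.pdf(x, df)) \
         - 0.1 * (chi2.pdf(x, df + 4) - chi2.pdf(x, df + 2))

print(quad(correction, 0, np.inf))          # (0.0, ...) up to numerical error
```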
These results are in perfect agreement with those obtained, using a different technique, by [8].
G-Test Results
This time we start from the extra terms in the expansion of Formula (3), and then repeat the steps of the previous section, getting a surprisingly simple correction to the corresponding PDF.
5. Monte-Carlo Verification and Conclusion
To demonstrate the utility of our correction formulas (but also their limitations, in the case of the G-test), we generate a million random samples from a contingency-table model and compare the results with the chi-squared approximation, both without and with the $\frac{1}{n}$-proportional correction.
Our first example assumes that the two attributes have 10 equally likely levels each (for a total of 100 cells), and that the number of observations is equal to 150. The results are displayed in Figure 1 for the Pearson test and in Figure 2 for the G-test (the dotted lines show the basic chi-squared approximation, the solid lines include the respective corrections). From these graphs it is quite obvious that, in the former case, the correction terms are quite capable of removing the (still rather serious) error, while the G-test is so hopelessly inaccurate that repairing it becomes impossible (the coefficient of our correction formula has a value bigger than 5 and is thus no longer 'small', which makes the 'corrected' PDF partly negative; a strong indication that the error is too large to be dealt with in this manner). The benefit of our correction is then to clearly indicate the presence of an unacceptable error in the usual chi-squared approximation (finding a remedy would require a separate study). We conclude that the G-test should not be used with a large number of cells (unless the number of observations is correspondingly large).
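A simplified version of this first experiment (Pearson and G statistics for a 10-by-10 table of equally likely cells with n = 150) can be reproduced with the sketch below; it uses fewer replicates than the article's one million and simply compares empirical 5% exceedance rates with the nominal chi-squared value, rather than plotting the full PDFs.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
I = J = 10
n, reps = 150, 100_000                      # the article uses one million replicates
probs = np.full(I * J, 1 / (I * J))         # 100 equally likely cells
df = (I - 1) * (J - 1)

pearson, g = [], []
for _ in range(reps):
    X = rng.multinomial(n, probs).reshape(I, J).astype(float)
    rows, cols = X.sum(axis=1), X.sum(axis=0)
    if (rows == 0).any() or (cols == 0).any():
        continue                            # skip the (extremely rare) degenerate table
    E = np.outer(rows, cols) / n
    pearson.append(((X - E) ** 2 / E).sum())
    mask = X > 0
    g.append(2 * (X[mask] * np.log(X[mask] / E[mask])).sum())

x = chi2.ppf(0.95, df)                      # nominal 5% critical value
print((np.array(pearson) > x).mean(), (np.array(g) > x).mean(), chi2.sf(x, df))
```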
Figure 1. Pearson’s test correction (n = 150).
Figure 2. G-test and its error (n = 150).
Nevertheless, we can still confirm that our G-test correction works when the attributes have only 3 levels each (we will still make them equally likely) and $n$ is equal to 35; this is indicated in Figure 3.
Figure 3. G-test corrected (n = 35).
Note that the empirical distribution (represented by a histogram) has some minor irregularities due to the exact distribution's discrete nature (correcting for these would require a new study). Nevertheless, the graph indicates that the right-tail region is not affected by this phenomenon (this is particularly true for the Pearson test, as seen in Figure 4). We should also mention that, in this same situation and using the same data, the Pearson test is still more accurate than the G-test, both when using the basic chi-squared approximation and when using its corrected version. In conclusion, we find no reason for recommending the G-test over the Pearson test of independence.
Figure 4. Correcting Pearson’s test (n = 35).