1/n-Proportional Correction to Tests of Independence

Abstract

It is known that both the Pearson and the G-test of independence have the same asymptotic distribution, namely $\chi^2$ with $(K-1)(M-1)$ degrees of freedom, where $K$ and $M$ are the number of levels, respectively, of two attributes. However, when extending the accuracy of this approximation by $\frac{1}{n}$-proportional terms, the resulting corrections differ quite dramatically. The purpose of this article is to derive each of these corrections and demonstrate their use in practical application.


1. Introduction

When testing whether two nominal-scale attributes (one having $K$ discrete values, the other $M$) are independent, the traditional (Pearson's) test uses the following test statistic

$$U := \sum_{i=1}^{K}\sum_{j=1}^{M}\frac{\left(X_{i,j}-\frac{X_{i,\cdot}X_{\cdot,j}}{n}\right)^{2}}{\frac{X_{i,\cdot}X_{\cdot,j}}{n}} \tag{1}$$

where $n$ is the total number of observations, $X_{i,j}$ is the number of those resulting in the $i^{\text{th}}$ value of the first attribute and in the $j^{\text{th}}$ value of the second attribute, and $X_{i,\cdot} := \sum_{j=1}^{M}X_{i,j}$ and $X_{\cdot,j} := \sum_{i=1}^{K}X_{i,j}$ are the corresponding marginal totals. The null hypothesis claims that the probability of an observation being of the $(i,j)^{\text{th}}$ type is $p_iq_j$, where $\sum_{i=1}^{K}p_i=1$ and $\sum_{j=1}^{M}q_j=1$; note that the multinomial probability mass function of the $X_{i,j}$'s is therefore

$$f(\mathbf{x})=\binom{n}{x_{1,1},x_{1,2},\ldots,x_{K,M}}\,(p_1q_1)^{x_{1,1}}(p_1q_2)^{x_{1,2}}\cdots(p_Kq_M)^{x_{K,M}} \tag{2}$$

having $K+M-2$ algebraically independent parameters [1].

This model implies that the conditional probability of an observation's second attribute being equal to $j$, given that the value of its first attribute is $i$, equals $q_j$ (for any $j$ and $i$), i.e. that the attributes are independent of each other. The test also assumes that the $p_i$ and $q_j$ probabilities are unknown and are estimated by their maximum-likelihood estimators, namely $\frac{X_{i,\cdot}}{n}$ and $\frac{X_{\cdot,j}}{n}$ respectively. The alternative hypothesis allows the probabilities of the $X_{i,j}$'s to have any non-negative values, as long as these add up to 1, thus resulting in $KM-1$ algebraically independent parameters. When the null hypothesis fails, the above test statistic tends to increase in value; the critical region is therefore always an upper tail of the corresponding distribution.

Another test statistic, proposed decades later for testing the same hypothesis, is (using the same notation, for future convenience)

$$U := 2\sum_{i=1}^{K}\sum_{j=1}^{M}X_{i,j}\ln\!\left(\frac{n\,X_{i,j}}{X_{i,\cdot}X_{\cdot,j}}\right) \tag{3}$$

of the so-called G test [2] (note that $X_{i,j}=0$, which happens occasionally, makes the corresponding term of the last expression equal to 0). It is well known (something we verify shortly) that, in the $n\to\infty$ limit, both Formula (1) and Formula (3) converge to the same expression, whose asymptotic distribution is chi-squared with $(K-1)(M-1)$ degrees of freedom [3].
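Both statistics are easy to evaluate in practice; the following sketch (plain NumPy; the function name and the toy table are our own illustrative choices, not anything from the original) computes Formula (1) and Formula (3) from a $K$ by $M$ table of counts, with zero cells contributing 0 to the G statistic as just described.

```python
import numpy as np

def independence_stats(table):
    """Pearson's U of Formula (1) and the G statistic of Formula (3)."""
    X = np.asarray(table, dtype=float)
    n = X.sum()
    E = np.outer(X.sum(axis=1), X.sum(axis=0)) / n    # X_{i,.} X_{.,j} / n
    U = ((X - E) ** 2 / E).sum()                      # Formula (1)
    nz = X > 0                                        # 0 * ln(0) terms -> 0
    G = 2.0 * (X[nz] * np.log(X[nz] / E[nz])).sum()   # Formula (3)
    return U, G

U, G = independence_stats([[12, 7, 9], [8, 11, 6]])   # toy 2-by-3 table
```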

The main objective of this article is to find the $\frac{1}{n}$-proportional correction to this distribution, separately for each of our two test statistics.

Next comes a brief review of the basic mathematical tool used throughout this article.

2. Kronecker Product

In this section $a$, $b$, $c$ and $d$ are matrices of any dimensions (a dimension equal to 1 implies a column or row vector; both dimensions equal to 1 represent a scalar). The Kronecker product $a\otimes b$ of two matrices is created by multiplying every element of $a$ by the full matrix $b$ and organizing the resulting blocks into a single matrix [4]. This product is clearly non-commutative, associative and distributive over addition; furthermore (assuming conformable dimensions for each matrix multiplication)

$$(a\otimes b)^{T}=a^{T}\otimes b^{T}$$

$$(a\otimes b)^{-1}=a^{-1}\otimes b^{-1}$$

$$a\otimes b\cdot c\otimes d = ac\otimes bd$$

where both juxtaposition and $\cdot$ indicate matrix multiplication; note that juxtaposition takes precedence over $\otimes$ which, in turn, takes precedence over $\cdot$ (this is the reason for the duplicate notation). When $a$ and $b$ are square matrices ($K$ by $K$ and $M$ by $M$ respectively)

$$\det(a\otimes b)=\det(a)^{M}\det(b)^{K}$$

We also need the following Woodbury’s identity [5]

$$(a+b\cdot c)^{-1}=a^{-1}-a^{-1}b\left(I+ca^{-1}b\right)^{-1}ca^{-1}$$

where I stands for a (conformable) identity matrix, and the following Sylvester’s identity [6]

$$\det(a+b\cdot c)=\det(a)\det\left(I+ca^{-1}b\right)$$
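All of these identities are easy to confirm numerically; here is a minimal sketch using NumPy's `kron` (the dimensions, the seed, and the diagonal shift that keeps the test matrices well conditioned are all arbitrary assumptions of ours).

```python
import numpy as np

rng = np.random.default_rng(1)
K, M = 3, 4
a, c = rng.random((K, K)), rng.random((K, K))
b, d = rng.random((M, M)), rng.random((M, M))

# mixed-product rule: (a (x) b)(c (x) d) = (ac) (x) (bd)
assert np.allclose(np.kron(a, b) @ np.kron(c, d), np.kron(a @ c, b @ d))

# determinant rule: det(a (x) b) = det(a)^M det(b)^K
assert np.isclose(np.linalg.det(np.kron(a, b)),
                  np.linalg.det(a) ** M * np.linalg.det(b) ** K)

# Woodbury's identity for a low-rank update b c
A = rng.random((5, 5)) + 5 * np.eye(5)
B, C = rng.random((5, 2)), rng.random((2, 5))
Ai = np.linalg.inv(A)
rhs = Ai - Ai @ B @ np.linalg.inv(np.eye(2) + C @ Ai @ B) @ C @ Ai
assert np.allclose(np.linalg.inv(A + B @ C), rhs)

# Sylvester's determinant identity
assert np.isclose(np.linalg.det(A + B @ C),
                  np.linalg.det(A) * np.linalg.det(np.eye(2) + C @ Ai @ B))
```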

3. Asymptotic Theory

In this section we investigate the asymptotic distribution of Formula (1) and Formula (3) when $n\to\infty$, getting the same answer for both of these [7].

3.1. Normal Limit of Formula (2)

To make our task easier, we assume that the sum of all probabilities of the multinomial distribution defined in Formula (2) equals $r<1$, thus allowing for the possibility of an extra outcome whose probability is $1-r$; we denote its observed total by $W$. This substantially simplifies subsequent development by removing the singularity of the variance-covariance (V-C) matrix of the $X_{i,j}$ variables, while letting us reach correct conclusions in the $r\to 1$ limit. Correspondingly modifying Formula (2) is easy: extend the bottom row of the multinomial coefficient by $w=n-\sum_{i=1}^{K}\sum_{j=1}^{M}x_{i,j}$ and further multiply by $(1-r)^{w}$. Since all variables must add up to $n$, the resulting $KM+1$ by $KM+1$ V-C matrix is singular, but the singularity disappears when replacing $W$ by $n-\sum_{i=1}^{K}\sum_{j=1}^{M}X_{i,j}$ and considering the distribution of the remaining $X_{i,j}$'s only. The corresponding moments are

$$E(X_{i,j})=np_iq_j$$

$$\operatorname{Var}(X_{i,j})=np_iq_j(1-p_iq_j)$$

$$\operatorname{Cov}(X_{i,j},X_{k,m})=-np_iq_jp_kq_m \quad\text{when } i\neq k \text{ or } j\neq m$$

which remains correct even with the extra W . We then define

$$Y_{i,j}:=\frac{X_{i,j}-np_iq_j}{\sqrt{n}} \tag{4}$$

and substitute

$$x_{i,j}=np_iq_j+y_{i,j}\sqrt{n}$$

$$w=n(1-r)-\sqrt{n}\sum_{i=1}^{K}\sum_{j=1}^{M}y_{i,j}$$

into the natural logarithm of Formula (2), after it has been modified by introducing the extra $W$. Adding the logarithm of the corresponding Jacobian (namely $\frac{KM}{2}\ln n$, since the distribution of the $Y_{i,j}$'s is $KM$ dimensional) and taking the $n\to\infty$ limit of the resulting expression (all limits of this article are routinely done by a computer equipped with Mathematica or similar software) yields

$$-\frac{1}{2}\sum_{i=1}^{K}\sum_{j=1}^{M}\frac{y_{i,j}^{2}}{p_iq_j}-\frac{\left(\sum_{i=1}^{K}\sum_{j=1}^{M}y_{i,j}\right)^{2}}{2(1-r)}-\frac{KM}{2}\ln(2\pi)-\frac{M}{2}\sum_{i=1}^{K}\ln p_i-\frac{K}{2}\sum_{j=1}^{M}\ln q_j-\frac{\ln(1-r)}{2} \tag{5}$$

This identifies the asymptotic distribution of the Y i,j ’s to be multivariate Normal, with a probability density function (PDF) whose natural logarithm is given by the last expression. Its linear and quadratic moments agree with those of the original discrete distribution, being equal to 0 for the expected value of each Y i,j and to

$$p_i\delta_{i,k}\,q_j\delta_{j,m}-p_ip_kq_jq_m \tag{6}$$

for the covariance between $Y_{i,j}$ and $Y_{k,m}$ (for any combination of indices); the corresponding V-C matrix is thus

$$V=\mathcal{P}\otimes\mathcal{Q}-pp^{T}\otimes qq^{T} \tag{7}$$

where $p$ and $q$ are column vectors whose elements are the $p_i$ and $q_j$ probabilities (respectively), and $\mathcal{P}$ and $\mathcal{Q}$ are diagonal matrices with the $p_i$ and $q_j$ probabilities (respectively) on the main diagonal. The PDF can then be written, in correspondence with Formula (5), as

$$\frac{\sqrt{\det A}}{(2\pi)^{KM/2}}\exp\!\left(-\frac{Y^{T}AY}{2}\right) \tag{8}$$

where $Y$ is a column vector whose $KM$ elements are the $Y_{i,j}$ (organized in accordance with the Kronecker product, i.e. $Y_{1,1},Y_{1,2},\ldots,Y_{1,M},Y_{2,1},\ldots,Y_{K,M}$),

$$A=\mathcal{P}^{-1}\otimes\mathcal{Q}^{-1}+\frac{\mathbf{1}_k\mathbf{1}_k^{T}\otimes\mathbf{1}_m\mathbf{1}_m^{T}}{1-r}=\mathcal{P}^{-1}\otimes\mathcal{Q}^{-1}+\frac{\mathbf{1}_k\otimes\mathbf{1}_m\cdot\mathbf{1}_k^{T}\otimes\mathbf{1}_m^{T}}{1-r} \tag{9}$$

and

$$\det(A)=\det\left(\mathcal{P}^{-1}\otimes\mathcal{Q}^{-1}\right)\det\!\left(1+\frac{\mathbf{1}_k^{T}\otimes\mathbf{1}_m^{T}\cdot\mathcal{P}\otimes\mathcal{Q}\cdot\mathbf{1}_k\otimes\mathbf{1}_m}{1-r}\right)=\frac{1}{\left(\prod_{i=1}^{K}p_i\right)^{M}\left(\prod_{j=1}^{M}q_j\right)^{K}(1-r)} \tag{10}$$

the last two implied by Formula (5).

Alternatively, they can be derived from Formula (7) using the Woodbury and Sylvester identities. Bypassing the details, we only verify that $A$ is the inverse of $V$, by multiplying

$$\left(\mathcal{P}^{-1}\otimes\mathcal{Q}^{-1}+\frac{\mathbf{1}_k\mathbf{1}_k^{T}\otimes\mathbf{1}_m\mathbf{1}_m^{T}}{1-r}\right)\left(\mathcal{P}\otimes\mathcal{Q}-pp^{T}\otimes qq^{T}\right)=I_k\otimes I_m-\mathbf{1}_kp^{T}\otimes\mathbf{1}_mq^{T}+\frac{\mathbf{1}_kp^{T}\otimes\mathbf{1}_mq^{T}-r\left(\mathbf{1}_kp^{T}\otimes\mathbf{1}_mq^{T}\right)}{1-r}=I_k\otimes I_m$$

where $I_k$ and $I_m$ are $K$ by $K$ and $M$ by $M$ identity matrices respectively.
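This verification can also be replicated numerically for any $r<1$; in the sketch below (our own illustration, not part of the original derivation) we pick $p$ and $q$ so that $\left(\sum p_i\right)\left(\sum q_j\right)=r$, build $V$ and $A$ from Formula (7) and Formula (9), and confirm both $AV=I$ and the determinant value of Formula (10).

```python
import numpy as np

rng = np.random.default_rng(2)
K, M, r = 3, 4, 0.9
p = rng.random(K); p *= np.sqrt(r) / p.sum()   # (sum p)(sum q) = r < 1
q = rng.random(M); q *= np.sqrt(r) / q.sum()
P, Q = np.diag(p), np.diag(q)
onek, onem = np.ones((K, 1)), np.ones((M, 1))

V = np.kron(P, Q) - np.kron(np.outer(p, p), np.outer(q, q))      # (7)
A = np.kron(np.linalg.inv(P), np.linalg.inv(Q)) \
    + np.kron(onek @ onek.T, onem @ onem.T) / (1 - r)            # (9)

assert np.allclose(A @ V, np.eye(K * M))                         # A = V^{-1}
# (10): det(A) = 1 / (prod(p)^M prod(q)^K (1-r))
assert np.isclose(np.linalg.det(A),
                  1 / (np.prod(p) ** M * np.prod(q) ** K * (1 - r)))
```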

3.2. Asymptotic Distribution of U

We now take the $n\to\infty$ limit of Formula (1) and of Formula (3), first replacing $X_{i,j}$ with $Y_{i,j}$ by using Formula (4); this results in the same answer in both cases, namely in

$$\sum_{i=1}^{K}\sum_{j=1}^{M}\frac{Y_{i,j}^{2}}{p_iq_j}-\sum_{i=1}^{K}\frac{Y_{i,\cdot}^{2}}{p_i}-\sum_{j=1}^{M}\frac{Y_{\cdot,j}^{2}}{q_j} \tag{11}$$

where $Y_{\cdot,j}:=\sum_{i=1}^{K}Y_{i,j}$ and $Y_{i,\cdot}:=\sum_{j=1}^{M}Y_{i,j}$. To get the corresponding moment generating function (MGF) of $U$, we need to compute the expected value of $\exp(tU)$, i.e.

$$\int\frac{\sqrt{\det A}}{(2\pi)^{KM/2}}\exp\!\left(-\frac{Y^{T}(A-2tU)Y}{2}\right)\mathrm{d}Y \tag{12}$$

where

$$U=\mathcal{P}^{-1}\otimes\mathcal{Q}^{-1}-\mathcal{P}^{-1}\otimes\mathbf{1}_m\mathbf{1}_m^{T}-\mathbf{1}_k\mathbf{1}_k^{T}\otimes\mathcal{Q}^{-1} \tag{13}$$

which is the matrix version of Formula (11); to see this, it helps to re-write Formula (11) as follows

$$\sum_{i=1}^{K}\sum_{k=1}^{K}\sum_{j=1}^{M}\sum_{m=1}^{M}Y_{i,j}\left(\frac{\delta_{i,k}\delta_{j,m}}{p_iq_j}-\frac{\delta_{i,k}}{p_i}-\frac{\delta_{j,m}}{q_j}\right)Y_{k,m}$$

To find the result of Formula (12) necessitates computing the determinant of A2tU ; this matrix can be expressed in the following form

$$(1-2t)\,\mathcal{P}^{-1}\otimes\mathcal{Q}^{-1}+\frac{\mathbf{1}_k\mathbf{1}_k^{T}\otimes\mathbf{1}_m\mathbf{1}_m^{T}}{1-r}+2t\left[\,\mathcal{P}^{-1}\otimes\mathbf{1}_m \;\;\; \mathbf{1}_k\otimes\mathcal{Q}^{-1}\,\right]\left[\begin{array}{c}I_k\otimes\mathbf{1}_m^{T}\\ \mathbf{1}_k^{T}\otimes I_m\end{array}\right] \tag{14}$$

which follows from Formula (9) and Formula (13); the square brackets indicate that the matrix consists of two (and subsequently also of four) blocks.

Pre-multiplying Formula (14) by $\frac{\mathcal{P}\otimes\mathcal{Q}}{1-2t}$, i.e. factoring out a matrix whose determinant is

$$D_1:=\frac{(1-2t)^{KM}}{\left(\prod_{i=1}^{K}p_i\right)^{M}\left(\prod_{j=1}^{M}q_j\right)^{K}}$$

leaves us with the task of finding the determinant of

$$\mathcal{E}:=I_k\otimes I_m+\frac{p\mathbf{1}_k^{T}\otimes q\mathbf{1}_m^{T}}{(1-r)(1-2t)}+\frac{2t}{1-2t}\left[\,I_k\otimes q \;\;\; p\otimes I_m\,\right]\left[\begin{array}{c}I_k\otimes\mathbf{1}_m^{T}\\ \mathbf{1}_k^{T}\otimes I_m\end{array}\right]:=F+\frac{2t}{1-2t}G \tag{15}$$

This needs to be done in two steps: for the determinant of F , we get

$$D_2:=\det F=1+\frac{\mathbf{1}_k^{T}\otimes\mathbf{1}_m^{T}\cdot p\otimes q}{(1-r)(1-2t)}=1+\frac{r}{(1-r)(1-2t)}=\frac{1-2t(1-r)}{(1-r)(1-2t)}$$

while its inverse equals

$$F^{-1}=I_k\otimes I_m-\frac{p\mathbf{1}_k^{T}\otimes q\mathbf{1}_m^{T}}{1-2t(1-r)} \tag{16}$$

easily verifiable by simple multiplication, utilizing $p\mathbf{1}_k^{T}\otimes q\mathbf{1}_m^{T}\cdot p\mathbf{1}_k^{T}\otimes q\mathbf{1}_m^{T}=r\left(p\mathbf{1}_k^{T}\otimes q\mathbf{1}_m^{T}\right)$.

Sylvester's identity implies that the determinant of $\mathcal{E}$ is a product of $\det F$ and the determinant of (at this point, we can start replacing $r$ by 1)

$$\begin{aligned}
&I_{k+m}+\frac{2t}{1-2t}\left[\begin{array}{c}I_k\otimes\mathbf{1}_m^{T}\\ \mathbf{1}_k^{T}\otimes I_m\end{array}\right]\left(I_k\otimes I_m-p\mathbf{1}_k^{T}\otimes q\mathbf{1}_m^{T}\right)\left[\,I_k\otimes q \;\;\; p\otimes I_m\,\right]\\
&=I_{k+m}+\frac{2t}{1-2t}\left[\begin{array}{cc}I_k\otimes 1 & p\otimes\mathbf{1}_m^{T}\\ \mathbf{1}_k^{T}\otimes q & 1\otimes I_m\end{array}\right]-\frac{2t}{1-2t}\left[\begin{array}{cc}p\mathbf{1}_k^{T}\otimes 1 & p\otimes\mathbf{1}_m^{T}\\ \mathbf{1}_k^{T}\otimes q & 1\otimes q\mathbf{1}_m^{T}\end{array}\right]\\
&=\frac{1}{1-2t}\left[\begin{array}{cc}\left(I_k-2t\,p\mathbf{1}_k^{T}\right)\otimes 1 & \mathbf{0}_k\mathbf{0}_m^{T}\\ \mathbf{0}_m\mathbf{0}_k^{T} & 1\otimes\left(I_m-2t\,q\mathbf{1}_m^{T}\right)\end{array}\right]
\end{aligned} \tag{17}$$

where 0 is a zero column-vector of indicated length and 1 is a scalar. The determinant of the last matrix is

$$D_3:=\frac{(1-2t)^{2}}{(1-2t)^{K+M}}$$

since the determinant of each main-diagonal block is equal to $1-2t$, due to Sylvester's identity.

In the $r\to 1$ limit Formula (12), whose value is given by $\sqrt{\frac{\det A}{\det(A-2tU)}}=\sqrt{\frac{\det A}{D_1D_2D_3}}$, then equals

$$\left(\frac{1}{1-2t}\right)^{(K-1)(M-1)/2} \tag{18}$$

proving the well-known result that, to this level of approximation, the distribution of $U$ is $\chi^2$ with $(K-1)(M-1)$ degrees of freedom.
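The determinant bookkeeping behind Formula (18) can be spot-checked numerically: build $A$ and $U$ from Formula (9) and Formula (13) with $r$ very close to 1 and compare $\sqrt{\det A/\det(A-2tU)}$ with $(1-2t)^{-d/2}$. (A sketch under these assumptions; the particular $K$, $M$ and $t$ below are arbitrary choices of ours.)

```python
import numpy as np

rng = np.random.default_rng(3)
K, M, t, r = 3, 4, 0.17, 1 - 1e-6
p = rng.random(K); p *= np.sqrt(r) / p.sum()
q = rng.random(M); q *= np.sqrt(r) / q.sum()
Pi, Qi = np.diag(1 / p), np.diag(1 / q)
Jk, Jm = np.ones((K, K)), np.ones((M, M))          # 1 1^T blocks

A = np.kron(Pi, Qi) + np.kron(Jk, Jm) / (1 - r)    # (9)
U = np.kron(Pi, Qi) - np.kron(Pi, Jm) - np.kron(Jk, Qi)   # (13)

d = (K - 1) * (M - 1)
ratio = np.linalg.det(A) / np.linalg.det(A - 2 * t * U)
assert np.isclose(np.sqrt(ratio), (1 - 2 * t) ** (-d / 2))  # (18)
```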

3.3. Inverting $A-2tU$

Making the integrand of Formula (12) into a PDF of a Normal distribution results in

$$\frac{\sqrt{\det(A-2tU)}}{(2\pi)^{KM/2}}\exp\!\left(-\frac{Y^{T}(A-2tU)Y}{2}\right) \tag{19}$$

whose higher (quartic and hexic in particular) moments are key to finding the $\frac{1}{n}$-proportional corrections to the MGF of each of Formula (1) and Formula (3). Furthermore, these moments are polynomial functions of its second moments (a special property of a multivariate Normal distribution), which in turn are simply the elements of $(A-2tU)^{-1}$.

To compute this inverse, we first write Formula (14) as a product of $(1-2t)\,\mathcal{P}^{-1}\otimes\mathcal{Q}^{-1}$ (whose inverse is easy) and of $\mathcal{E}=F+\frac{2t}{1-2t}G$, introduced in Formula (15). From Formula (16) we know that

$$F^{-1}=I_k\otimes I_m-p\mathbf{1}_k^{T}\otimes q\mathbf{1}_m^{T}$$

as r is no longer needed and can be set equal to 1.

To complete the exercise, we need to simplify (writing $G=BC$, where $B:=\left[\,I_k\otimes q \;\;\; p\otimes I_m\,\right]$ and $C:=\left[\begin{array}{c}I_k\otimes\mathbf{1}_m^{T}\\ \mathbf{1}_k^{T}\otimes I_m\end{array}\right]$)

$$F^{-1}-\frac{2t}{1-2t}F^{-1}B\left(I_{k+m}+\frac{2t}{1-2t}CF^{-1}B\right)^{-1}CF^{-1}$$

obtained from Woodbury's identity as a part of inverting $\mathcal{E}$. The last line of Formula (17) has already done exactly that for the matrix in parentheses; the corresponding inverse is

$$(1-2t)\left[\begin{array}{cc}I_k\otimes 1 & \mathbf{0}_k\mathbf{0}_m^{T}\\ \mathbf{0}_m\mathbf{0}_k^{T} & 1\otimes I_m\end{array}\right]+2t\left[\begin{array}{cc}p\mathbf{1}_k^{T}\otimes 1 & \mathbf{0}_k\mathbf{0}_m^{T}\\ \mathbf{0}_m\mathbf{0}_k^{T} & 1\otimes q\mathbf{1}_m^{T}\end{array}\right]$$

easily verifiable by simple matrix multiplication (note that both matrices in the last expression are idempotent).

This inverse still needs to be pre-multiplied by $B=\left[\,I_k\otimes q \;\;\; p\otimes I_m\,\right]$ and post-multiplied by $C=\left[\begin{array}{c}I_k\otimes\mathbf{1}_m^{T}\\ \mathbf{1}_k^{T}\otimes I_m\end{array}\right]$, thus getting

$$(1-2t)\left(I_k\otimes q\mathbf{1}_m^{T}+p\mathbf{1}_k^{T}\otimes I_m\right)+4t\,p\mathbf{1}_k^{T}\otimes q\mathbf{1}_m^{T}$$

and further pre and post-multiplied by F 1 , resulting in

$$(1-2t)\left(I_k\otimes q\mathbf{1}_m^{T}+p\mathbf{1}_k^{T}\otimes I_m-2\,p\mathbf{1}_k^{T}\otimes q\mathbf{1}_m^{T}\right)+4t\,\mathbf{0}_k\mathbf{0}_k^{T}\otimes\mathbf{0}_m\mathbf{0}_m^{T}$$

Multiplying the last expression by $\frac{2t}{1-2t}$ and subtracting the result from $F^{-1}$ yields $\mathcal{E}^{-1}$, namely

$$I_k\otimes I_m+(4t-1)\,p\mathbf{1}_k^{T}\otimes q\mathbf{1}_m^{T}-2t\left(I_k\otimes q\mathbf{1}_m^{T}+p\mathbf{1}_k^{T}\otimes I_m\right)$$

Finally, post-multiplying by $\frac{\mathcal{P}\otimes\mathcal{Q}}{1-2t}$ results in the following expression for $(A-2tU)^{-1}$

$$\frac{\mathcal{P}\otimes\mathcal{Q}}{1-2t}+\left(\frac{1}{1-2t}-2\right)pp^{T}\otimes qq^{T}+\left(1-\frac{1}{1-2t}\right)\left(\mathcal{P}\otimes qq^{T}+pp^{T}\otimes\mathcal{Q}\right)$$

and the corresponding formula for computing the desired second moments, namely

$$E(Y_{i,j}Y_{k,m})=-2p_ip_kq_jq_m+\delta_{i,k}p_iq_jq_m+\delta_{j,m}p_ip_kq_j+\frac{p_i(\delta_{i,k}-p_k)\,q_j(\delta_{j,m}-q_m)}{1-2t} \tag{20}$$

(note the expected agreement with Formula (6) when $t=0$).
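Formula (20) can likewise be checked against a brute-force matrix inversion; the sketch below (again our own illustration) compares every element of $(A-2tU)^{-1}$, computed with $r=1-10^{-6}$, to the closed form.

```python
import numpy as np

rng = np.random.default_rng(4)
K, M, t, r = 3, 3, 0.12, 1 - 1e-6
p = rng.random(K); p *= np.sqrt(r) / p.sum()
q = rng.random(M); q *= np.sqrt(r) / q.sum()
Pi, Qi = np.diag(1 / p), np.diag(1 / q)
Jk, Jm = np.ones((K, K)), np.ones((M, M))

A = np.kron(Pi, Qi) + np.kron(Jk, Jm) / (1 - r)
U = np.kron(Pi, Qi) - np.kron(Pi, Jm) - np.kron(Jk, Qi)
S = np.linalg.inv(A - 2 * t * U)          # second moments E(Y_ij Y_km)

def moment(i, j, k, m):                   # Formula (20)
    dik, djm = float(i == k), float(j == m)
    return (-2 * p[i] * p[k] * q[j] * q[m] + dik * p[i] * q[j] * q[m]
            + djm * p[i] * p[k] * q[j]
            + p[i] * (dik - p[k]) * q[j] * (djm - q[m]) / (1 - 2 * t))

for i in range(K):
    for j in range(M):
        for k in range(K):
            for m in range(M):
                assert np.isclose(S[i * M + j, k * M + m],
                                  moment(i, j, k, m), atol=1e-4)
```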

Consequently,

$$\begin{aligned}
E(Y_{i,j}Y_{k,\cdot})&=p_iq_j(\delta_{i,k}-p_k)\\
E(Y_{i,j}Y_{\cdot,m})&=p_iq_j(\delta_{j,m}-q_m)\\
E(Y_{i,\cdot}Y_{k,\cdot})&=p_i(\delta_{i,k}-p_k)\\
E(Y_{\cdot,j}Y_{\cdot,m})&=q_j(\delta_{j,m}-q_m)\\
E(Y_{i,\cdot}Y_{\cdot,j})&=0
\end{aligned} \tag{21}$$

provide the remaining moments needed to proceed with computing the desired $\frac{1}{n}$-proportional corrections. Note that fourth-order moments are then found from

$$E(abcd)=E(ab)E(cd)+E(ac)E(bd)+E(ad)E(bc)$$

(with subsequent special cases, such as $E(a^2b^2)=2E(ab)^2+E(a^2)E(b^2)$ etc.), and the sixth-order moments from

$$E(abcdef)=E(ab)E(cd)E(ef)+\cdots+E(af)E(cd)E(be)$$

(where the RHS consists of 15 terms corresponding to all possible ways of pairing the six arguments), usually needed in its special form such as

$$E(a^3b^3)=9E(a^2)E(b^2)E(ab)+6E(ab)^3$$

etc.
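These are the standard Isserlis (Wick) identities for jointly Normal, zero-mean variables; a quick Monte-Carlo sanity check of the two special cases, under an arbitrarily chosen covariance of our own, looks like this.

```python
import numpy as np

rng = np.random.default_rng(5)
cov = np.array([[2.0, 0.7],
                [0.7, 1.5]])          # any valid covariance will do
a, b = rng.multivariate_normal([0, 0], cov, size=2_000_000).T

Eaa, Ebb, Eab = cov[0, 0], cov[1, 1], cov[0, 1]
# E(a^2 b^2) = 2 E(ab)^2 + E(a^2) E(b^2)
print((a**2 * b**2).mean(), 2 * Eab**2 + Eaa * Ebb)
# E(a^3 b^3) = 9 E(a^2) E(b^2) E(ab) + 6 E(ab)^3
print((a**3 * b**3).mean(), 9 * Eaa * Ebb * Eab + 6 * Eab**3)
```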

4. 1/n-Proportional Corrections

To find the desired correction to the $\chi^2$ distribution of the Pearson test, we must include the $\frac{1}{\sqrt{n}}$- and $\frac{1}{n}$-proportional terms in the expansion of Formula (1), thus getting (listing the extra terms only)

$$\begin{aligned}
U=\cdots&+\frac{\displaystyle\sum_{i=1}^{K}\frac{Y_{i,\cdot}^{3}}{p_i^{2}}+\sum_{j=1}^{M}\frac{Y_{\cdot,j}^{3}}{q_j^{2}}}{\sqrt{n}}+\frac{\displaystyle\sum_{i=1}^{K}\sum_{j=1}^{M}\left(\frac{2Y_{i,j}Y_{i,\cdot}Y_{\cdot,j}}{p_iq_j}-\frac{Y_{i,j}^{2}Y_{i,\cdot}}{p_i^{2}q_j}-\frac{Y_{i,j}^{2}Y_{\cdot,j}}{p_iq_j^{2}}\right)}{\sqrt{n}}\\
&+\frac{\displaystyle\sum_{i=1}^{K}\sum_{j=1}^{M}\left(-\frac{2Y_{i,j}Y_{i,\cdot}^{2}Y_{\cdot,j}}{p_i^{2}q_j}-\frac{2Y_{i,j}Y_{i,\cdot}Y_{\cdot,j}^{2}}{p_iq_j^{2}}+\frac{Y_{i,j}^{2}Y_{i,\cdot}^{2}}{p_i^{3}q_j}+\frac{Y_{i,j}^{2}Y_{\cdot,j}^{2}}{p_iq_j^{3}}+\frac{Y_{i,j}^{2}Y_{i,\cdot}Y_{\cdot,j}}{p_i^{2}q_j^{2}}+\frac{Y_{i,\cdot}^{2}Y_{\cdot,j}^{2}}{p_iq_j}\right)}{n}\\
&-\frac{\displaystyle\sum_{i=1}^{K}\frac{Y_{i,\cdot}^{4}}{p_i^{3}}+\sum_{j=1}^{M}\frac{Y_{\cdot,j}^{4}}{q_j^{3}}}{n}\;:=\;\frac{u_1}{\sqrt{n}}+\frac{u_2}{\sqrt{n}}+\frac{u_3}{n}+\frac{u_4}{n}
\end{aligned}$$

Similarly, we need the extra terms in the expansion of the natural logarithm (call it $\Lambda$) of the RHS of Formula (2); this time we get

$$\Lambda=\cdots+\frac{\displaystyle\sum_{i=1}^{K}\sum_{j=1}^{M}\left(\frac{Y_{i,j}^{3}}{6p_i^{2}q_j^{2}}-\frac{Y_{i,j}}{2p_iq_j}\right)}{\sqrt{n}}+\frac{1}{12n}-\frac{\displaystyle\sum_{i=1}^{K}\sum_{j=1}^{M}\left(\frac{Y_{i,j}^{4}}{12p_i^{3}q_j^{3}}-\frac{Y_{i,j}^{2}}{4p_i^{2}q_j^{2}}+\frac{1}{12p_iq_j}\right)}{n}:=\frac{a_1}{\sqrt{n}}+\frac{1}{12n}+\frac{a_2}{n}$$

Combining these two, we expand $\exp(\Lambda+tU)$ up to and including the $\frac{1}{n}$-proportional terms, getting

$$\frac{\left(u_1^{2}+u_2^{2}+2u_1u_2\right)\frac{t^{2}}{2}+t(u_3+u_4)+t(u_1+u_2)\,a_1+\frac{1}{2}a_1^{2}+a_2+\frac{1}{12}}{n} \tag{22}$$

(having discarded the $\frac{1}{\sqrt{n}}$-proportional terms, whose expected values are equal to zero).

What remains to be done is to take the expected value of Formula (22); this is done by first expanding every sum in this expression, then finding the expected value of each term of these expansions, and finally carrying out the summation itself (over up to four indices). This results in scores of individual contributions even though, rather surprisingly, most of them cancel each other upon subsequent simplification. The computation can be successfully completed only with a computer's help; the actual Mathematica program required many lines of code and is available upon request.

As an example, we demonstrate how to find $E(u_1u_2)$. We start by ignoring the triple summations, multiplying only the first term of $u_1$ by $u_2$ (note that this requires using a different index for $u_1$) and expanding the resulting expected value, thus

$$\begin{aligned}
&E\left(\frac{Y_{k,\cdot}^{3}}{p_k^{2}}\left(\frac{2Y_{i,j}Y_{i,\cdot}Y_{\cdot,j}}{p_iq_j}-\frac{Y_{i,j}^{2}Y_{i,\cdot}}{p_i^{2}q_j}-\frac{Y_{i,j}^{2}Y_{\cdot,j}}{p_iq_j^{2}}\right)\right)\\
&=-\frac{6E(Y_{i,\cdot}Y_{k,\cdot})E(Y_{i,j}Y_{k,\cdot})^{2}+3E(Y_{i,\cdot}Y_{k,\cdot})E(Y_{k,\cdot}^{2})E(Y_{i,j}^{2})+6E(Y_{i,j}Y_{i,\cdot})E(Y_{i,j}Y_{k,\cdot})E(Y_{k,\cdot}^{2})}{p_i^{2}q_jp_k^{2}}\\
&\quad-\frac{6E(Y_{i,j}Y_{\cdot,j})E(Y_{i,j}Y_{k,\cdot})E(Y_{k,\cdot}^{2})}{p_iq_j^{2}p_k^{2}}+\frac{6E(Y_{i,j}Y_{\cdot,j})E(Y_{i,\cdot}Y_{k,\cdot})E(Y_{k,\cdot}^{2})}{p_iq_jp_k^{2}}
\end{aligned}$$

where we have already discarded terms containing a zero factor such as E( Y i, Y ,j ) . We then evaluate the expected values based on Formula (20) and Formula (21), and complete the i,j and k summations, resulting in

$$-12+18K+9K^{2}-15P_0+\frac{3(M-1)(K^{2}-P_0)}{1-2t}$$

where $P_0:=\sum_{i=1}^{K}p_i^{-1}$. To include the second term of $u_1$, we simply interchange $K$ and $M$ in the previous answer, replace $P_0$ by $Q_0:=\sum_{j=1}^{M}q_j^{-1}$, and add the result to the previous answer.
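The same bookkeeping can be automated. The following sketch (our own illustration; the sign pattern of the closed form is as reconstructed above) evaluates the five expected-value terms via Formula (20) and Formula (21) for randomly chosen probabilities, carries out the $i$, $j$ and $k$ summations by brute force, and compares the total with the closed-form answer.

```python
import numpy as np

rng = np.random.default_rng(6)
K, M, t = 3, 4, 0.1
p = rng.random(K); p /= p.sum()          # the r = 1 limit
q = rng.random(M); q /= q.sum()

def EYY(i, j, k, m):                     # Formula (20)
    dik, djm = float(i == k), float(j == m)
    return (-2*p[i]*p[k]*q[j]*q[m] + dik*p[i]*q[j]*q[m] + djm*p[i]*p[k]*q[j]
            + p[i]*(dik - p[k])*q[j]*(djm - q[m])/(1 - 2*t))

total = 0.0
for i in range(K):
    for j in range(M):
        for k in range(K):
            Eik  = p[i]*((i == k) - p[k])          # E(Y_i. Y_k.)
            Ekk  = p[k]*(1 - p[k])                 # E(Y_k.^2)
            Eijk = p[i]*q[j]*((i == k) - p[k])     # E(Y_ij Y_k.)
            Eiji = p[i]*q[j]*(1 - p[i])            # E(Y_ij Y_i.)
            Eijj = p[i]*q[j]*(1 - q[j])            # E(Y_ij Y_.j)
            Eij2 = EYY(i, j, i, j)                 # E(Y_ij^2)
            total += (-(6*Eik*Eijk**2 + 3*Eik*Ekk*Eij2 + 6*Eiji*Eijk*Ekk)
                      / (p[i]**2 * q[j] * p[k]**2)
                      - 6*Eijj*Eijk*Ekk / (p[i]*q[j]**2*p[k]**2)
                      + 6*Eijj*Eik*Ekk / (p[i]*q[j]*p[k]**2))

P0 = (1/p).sum()
closed = -12 + 18*K + 9*K**2 - 15*P0 + 3*(M-1)*(K**2-P0)/(1-2*t)
print(total, closed)                     # the two numbers agree
```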

When this is completed for all terms of Formula (22), adding them and simplifying (many terms cancel out, as mentioned already) results in

$$\begin{aligned}
&\frac{d}{2n}\left(\frac{1}{1-2t}-1\right)\\
&+\frac{\tilde{P}\tilde{Q}-2(M-1)\tilde{P}-2(K-1)\tilde{Q}-2(K+M-3)d}{8n}\left(\frac{1}{(1-2t)^{2}}-\frac{2}{1-2t}+1\right)\\
&+\frac{5\tilde{P}\tilde{Q}+2(M-1)(M-2)\tilde{P}+2(K-1)(K-2)\tilde{Q}+2d(K-2)(M-2)}{24n}\\
&\quad\times\left(\frac{1}{(1-2t)^{3}}-\frac{3}{(1-2t)^{2}}+\frac{3}{1-2t}-1\right)
\end{aligned} \tag{23}$$

where $\tilde{P}:=P_0-K^{2}$, $\tilde{Q}:=Q_0-M^{2}$ and $d:=(K-1)(M-1)$; note that when $p_i=\frac{1}{K}$ for all $i$ values, $\tilde{P}$ is then equal to zero, and similarly $q_j=\frac{1}{M}$ implies $\tilde{Q}=0$.

The last expression, further divided by $(1-2t)^{d/2}$, is the actual correction to the MGF of the test statistic of Formula (1), implying that the corresponding PDF correction is a linear combination of $\chi^2$ distributions; to get an explicit formula for this correction, make the following replacements in Formula (23)

$$\left(\frac{1}{1-2t}-1\right)\;\longrightarrow\;\left(\frac{x}{d}-1\right)\chi_d^{2}(x)$$

$$\left(\frac{1}{(1-2t)^{2}}-\frac{2}{1-2t}+1\right)\;\longrightarrow\;\left(\frac{x^{2}}{d(d+2)}-\frac{2x}{d}+1\right)\chi_d^{2}(x)$$

$$\left(\frac{1}{(1-2t)^{3}}-\frac{3}{(1-2t)^{2}}+\frac{3}{1-2t}-1\right)\;\longrightarrow\;\left(\frac{x^{3}}{d(d+2)(d+4)}-\frac{3x^{2}}{d(d+2)}+\frac{3x}{d}-1\right)\chi_d^{2}(x)$$

where

$$\chi_d^{2}(x)=\frac{x^{d/2-1}\exp(-x/2)}{2^{d/2}\,\Gamma(d/2)}$$

is the PDF of the χ d 2 distribution. Note that each of the three corrections integrates to zero (the main reason for keeping them in this form).
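Assembled into code, the corrected PDF looks as follows (a sketch based on our reading of Formula (23); `scipy.stats.chi2` supplies the $\chi_d^2$ density). The final integral confirms that the corrections change the shape but not the total mass.

```python
import numpy as np
from scipy.stats import chi2

def corrected_pearson_pdf(x, K, M, p, q, n):
    """chi^2_d PDF plus the three 1/n-proportional terms of Formula (23)."""
    d = (K - 1) * (M - 1)
    P0, Q0 = (1 / np.asarray(p)).sum(), (1 / np.asarray(q)).sum()
    Pt, Qt = P0 - K**2, Q0 - M**2
    c1 = d / 2
    c2 = (Pt*Qt - 2*(M-1)*Pt - 2*(K-1)*Qt - 2*(K+M-3)*d) / 8
    c3 = (5*Pt*Qt + 2*(M-1)*(M-2)*Pt + 2*(K-1)*(K-2)*Qt
          + 2*d*(K-2)*(M-2)) / 24
    f = chi2.pdf(x, d)
    g1 = (x/d - 1) * f
    g2 = (x**2/(d*(d+2)) - 2*x/d + 1) * f
    g3 = (x**3/(d*(d+2)*(d+4)) - 3*x**2/(d*(d+2)) + 3*x/d - 1) * f
    return f + (c1*g1 + c2*g2 + c3*g3) / n

# each correction integrates to zero, so the total still integrates to 1
x = np.linspace(0, 200, 20001)
K = M = 10
pdf = corrected_pearson_pdf(x, K, M, np.full(K, .1), np.full(M, .1), 150)
print(np.trapz(pdf, x))   # ~ 1.0
```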

These results are in perfect agreement with those obtained, using a different technique, by [8].

4.1. G-Test Results

This time we get the following extra terms in the expansion of Formula (3)

$$U=\cdots+\frac{\displaystyle\sum_{i=1}^{K}\frac{Y_{i,\cdot}^{3}}{p_i^{2}}+\sum_{j=1}^{M}\frac{Y_{\cdot,j}^{3}}{q_j^{2}}}{3\sqrt{n}}-\frac{\displaystyle\sum_{i=1}^{K}\sum_{j=1}^{M}\frac{Y_{i,j}^{3}}{p_i^{2}q_j^{2}}}{3\sqrt{n}}-\frac{\displaystyle\sum_{i=1}^{K}\frac{Y_{i,\cdot}^{4}}{p_i^{3}}+\sum_{j=1}^{M}\frac{Y_{\cdot,j}^{4}}{q_j^{3}}}{6n}+\frac{\displaystyle\sum_{i=1}^{K}\sum_{j=1}^{M}\frac{Y_{i,j}^{4}}{p_i^{3}q_j^{3}}}{6n}:=\frac{u_1}{\sqrt{n}}+\frac{u_2}{\sqrt{n}}+\frac{u_3}{n}+\frac{u_4}{n}$$

We then repeat the steps of the previous section, getting the following (surprisingly simple) correction to the corresponding PDF

$$\frac{\tilde{P}\tilde{Q}+(M^{2}-1)\tilde{P}+(K^{2}-1)\tilde{Q}+(K^{2}-1)(M^{2}-1)}{12n}\left(\frac{x}{d}-1\right)\chi_d^{2}(x)$$

5. Monte-Carlo Verification and Conclusion

To demonstrate the utility of our correction formulas (but also their limitations, in the case of the G-test), we generate a million random samples from a contingency-table model and compare the resulting empirical distribution of each test statistic with its $\chi^2$ approximation, both without and with the $\frac{1}{n}$-proportional correction.

Our first example assumes that the two attributes have 10 equally likely levels each (for a total of 100 cells), and that the number of observations equals 150. The results are displayed in Figure 1 for the Pearson test and in Figure 2 for the G-test (the dotted lines show the basic $\chi^2$ approximation, the solid lines include the respective corrections). From these graphs it is quite obvious that, in the former case, the correction terms are quite capable of removing the (still rather serious) error, while the G-test is so hopelessly inaccurate that repairing it becomes impossible: the coefficient of our formula has a value bigger than 5 and is thus no longer 'small', making the 'corrected' PDF partly negative (a strong indication that the error is too large to be dealt with in this manner). The benefit of our correction is then to clearly indicate the presence of an unacceptable error in the usual $\chi^2$ approximation (finding a remedy would require a separate study). We conclude that the G-test should not be used with a large number of cells (unless the number of observations is correspondingly large).

Figure 1. Pearson’s test correction (n = 150).

Figure 2. G-test and its error (n = 150).
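The 'no longer small' coefficient mentioned above is easy to reproduce: with 10 equally likely levels for each attribute we get $\tilde{P}=\tilde{Q}=0$, and the G-test correction reduces to its last term (a small arithmetic sketch of ours follows).

```python
K = M = 10; n = 150                  # the Figure 2 scenario
# equally likely levels make P0 = K^2 and Q0 = M^2, i.e. P~ = Q~ = 0
coef = (K**2 - 1) * (M**2 - 1) / (12 * n)
print(coef)                          # 5.445..., indeed bigger than 5
```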

Nevertheless, we can still confirm that our G-test correction works when the attributes have only 3 levels each (we will still make them equally likely) and n is equal to 35; this is indicated in Figure 3.

Figure 3. G-test corrected (n = 35).

Note that the empirical distribution (represented by a histogram) has some minor irregularities due to the discrete nature of the exact distribution (correcting for these would require a new study). Nevertheless, the graph indicates that the right-tail region is not affected by this phenomenon (this is particularly true for the Pearson test, as seen in Figure 4). We should also mention that, in this same situation and using the same data, the Pearson test is still more accurate than the G-test, both when using the basic $\chi^2$ approximation and when using its corrected version. In conclusion, we find no reason to recommend the G-test over the Pearson test of independence.

Figure 4. Correcting Pearson’s test (n = 35).
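For readers who wish to repeat the experiment, here is a minimal re-creation of the Figure 3 setting (a sketch under our own assumptions about the simulation: equally likely levels, one million simulated tables, and a simple tail-probability comparison rather than a full histogram).

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
K = M = 3; n = 35; reps = 1_000_000
probs = np.full(K * M, 1 / (K * M))      # independent, equally likely levels

tables = rng.multinomial(n, probs, size=reps).reshape(reps, K, M)
rows = tables.sum(axis=2, keepdims=True)
cols = tables.sum(axis=1, keepdims=True)
E = rows * cols / n
with np.errstate(divide='ignore', invalid='ignore'):
    terms = np.where(tables > 0, tables * np.log(tables / E), 0.0)
G = 2 * terms.sum(axis=(1, 2))

d = (K - 1) * (M - 1)
# empirical upper-tail probability vs the nominal chi^2 level
print((G > chi2.ppf(0.95, d)).mean())    # compare with 0.05
```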

Conflicts of Interest

The author declares no conflicts of interest regarding the publication of this paper.

References

[1] Kendall, M.G. and Stuart, A. (1973) The Advanced Theory of Statistics, Vol. 2, Chapter 33. Harper Publishing Company.
[2] McDonald, J.H. (2014) G-Test of Goodness-of-Fit. Handbook of Biological Statistics, 3rd Edition, Sparky House Publishing, 53-58.
[3] Sokal, R.R. and Rohlf, F.J. (1981) Biometry: The Principles and Practice of Statistics in Biological Research. 2nd Edition, Freeman.
[4] Wikipedia. Kronecker Product.
[5] Wikipedia. Woodbury Matrix Identity.
[6] Wikipedia. Sylvester's Determinant Identity.
[7] Serfling, R.J. (1980) Approximation Theorems of Mathematical Statistics. Wiley. https://doi.org/10.1002/9780470316481
[8] Vrbik, J. (2014) Improving Accuracy of Test of Independence. Advances and Applications in Statistics, 39, 91-94.
