Exact Distribution of Difference of Two Sample Proportions and Its Inferences
1. Introduction
Comparing two population proportions, especially when the sample sizes are small, is a challenging problem in statistics with applications in many fields, and several procedures have been suggested. One of the most popular and longest-used is the Wald interval: owing to its simplicity and convenience, it is the first method that comes to mind for most statisticians. However, the Wald interval has some disadvantages. First, it is based on the normal approximation, and for this approximation to work well we need a large sample; unfortunately, large samples may be costly in practice. Second, the interval is liberal: its coverage probability with a nominal 95% confidence level can fall below 0.5 when the sample size is small, and even for large sample sizes the coverage probability remains below the nominal confidence level.
Agresti and Caffo (2000) [1] introduced the adjusted Wald confidence interval, which modifies the Wald interval slightly by adding one success and one failure to each group. They also showed that the coverage probability of the adjusted Wald interval is considerably better than that of the regular Wald interval. However, the Agresti-Caffo interval is still based on the normal approximation.
Newcombe (1998) [2] examined eleven different methods for estimating the difference between two population proportions. Some of them are conservative, like the score method, while others are liberal, like the Wald method.
The main purpose of this paper is to derive a closed-form expression for the exact distribution of the difference between two independent sample proportions and to use it for related inferences such as hypothesis testing. The rest of the paper is organized as follows. In Section 2, we derive the closed formula for the exact distribution of the difference between two independent sample proportions and break it into different cases. We obtain the support of the distribution in Section 3. In Section 4, we perform the hypothesis test; in Section 5, we compute its power. In Section 6, we compute the confidence interval and compare it to existing intervals. Section 7 summarizes the main findings and concludes the paper.
2. Exact Distribution of Difference of Two Sample Proportions
Let $X_1, X_2, \ldots, X_m$ and $Y_1, Y_2, \ldots, Y_n$ be iid Bernoulli random samples from two different populations with parameters $p_1$ and $p_2$, respectively, and let $\hat{p}_1 = \frac{1}{m}\sum_{i=1}^{m} X_i$ and $\hat{p}_2 = \frac{1}{n}\sum_{j=1}^{n} Y_j$ be the point estimates of the parameters $p_1$ and $p_2$, respectively. We denote the difference between $\hat{p}_1$ and $\hat{p}_2$ by $D = \hat{p}_1 - \hat{p}_2$.
To obtain the exact distribution of $D$, we first derive the probability generating function (pgf) of $W = mnD$ in the following lemma.
Lemma
Let $W = mnD = n\sum_{i=1}^{m} X_i - m\sum_{j=1}^{n} Y_j$; then the pgf of $W$ is given by
$$E\left(t^{\,W}\right) = \left(1 - p_1 + p_1 t^{\,n}\right)^{m} \left(1 - p_2 + p_2 t^{-m}\right)^{n}. \tag{1}$$
Now, let $P(D = d)$ denote the probability mass function (pmf) of $D$ at the point $d = \frac{kn - lm}{mn}$, for $k = 0, 1, \ldots, m$ and $l = 0, 1, \ldots, n$.
Theorem
Let the greatest common divisor of $m$ and $n$ be $g = \gcd(m, n)$, and let $m_1$ and $n_1$ be such that $m = g\,m_1$ and $n = g\,n_1$. The pmf of $D$ is given by
$$P(D = d) = \sum_{k=0}^{m} \sum_{l=0}^{n} \binom{m}{k} p_1^{k} (1-p_1)^{m-k} \binom{n}{l} p_2^{\,l} (1-p_2)^{n-l} \, I_{\{k n_1 - l m_1 = mnd/g\}}$$
for $d = \frac{kn - lm}{mn}$, $k = 0, 1, \ldots, m$ and $l = 0, 1, \ldots, n$, where $I_{\{k n_1 - l m_1 = mnd/g\}} = 1$ if $k n_1 - l m_1 = mnd/g$ and 0 otherwise.
From the theorem above, we derive the following results, corresponding to different relations between m and n.
Corollary 1
If $\gcd(m, n) = 1$, then the exact distribution of $D$ is given by:
for
, while
.
Corollary 2
If
and
then the exact distribution of D is given by
Corollary 3
The exact distribution of D is given by
for
and
where,
Corollary 4
The exact distribution of $D$ is symmetrical about zero if $m = n$ and $p_1 = p_2$.
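The theorem and its corollaries can be checked numerically by summing the joint binomial probabilities directly over all pairs $(k, l)$. A minimal Python sketch (the function name `exact_pmf` and the parameter values are our illustrative choices, not from the paper):

```python
from fractions import Fraction
from math import comb

def exact_pmf(m, n, p1, p2):
    """Exact pmf of D = p1_hat - p2_hat, where the success counts
    X ~ Binomial(m, p1) and Y ~ Binomial(n, p2) are independent.
    Returns a dict mapping each support point (a Fraction) to its
    probability, built from the double binomial sum."""
    pmf = {}
    for k in range(m + 1):                       # successes in sample 1
        pk = comb(m, k) * p1**k * (1 - p1)**(m - k)
        for l in range(n + 1):                   # successes in sample 2
            pl = comb(n, l) * p2**l * (1 - p2)**(n - l)
            d = Fraction(k, m) - Fraction(l, n)  # d = (kn - lm)/(mn)
            pmf[d] = pmf.get(d, 0.0) + pk * pl
    return pmf

pmf = exact_pmf(3, 5, 0.4, 0.6)
```

The probabilities sum to 1, and the extreme point $D = 1$ (all successes in the first sample, none in the second) has probability $p_1^{m}(1-p_2)^{n}$, as the formula requires.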
3. Support of the Distribution
For small values of $m$ and $n$, the support of the exact distribution can be derived manually. For larger values of $m$ and $n$, however, this becomes tedious and time-consuming, so software such as R is used instead.
For
. Where
and
.
Thus the support for
is
.
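The support for a small case can be generated in a few lines of Python (the choice $m = 2$, $n = 3$ here is ours, for illustration):

```python
from fractions import Fraction

def support(m, n):
    """All distinct values of D = k/m - l/n for k = 0..m, l = 0..n,
    i.e. the points (kn - lm)/(mn), in increasing order."""
    return sorted({Fraction(k, m) - Fraction(l, n)
                   for k in range(m + 1) for l in range(n + 1)})

S = support(2, 3)  # 11 points between -1 and 1, all multiples of 1/6
```

Note that the support is generally a strict subset of the multiples of $1/(mn)$ in $[-1, 1]$: for $m = 2$, $n = 3$ the points $\pm 5/6$ are not attainable, since $3k - 2l = 5$ has no solution with $0 \le k \le 2$ and $0 \le l \le 3$.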
The probability mass functions of the exact distribution of the difference of two sample proportions for m = n and p1 = p2 are plotted in Figure 1. These graphs (Figure 1) support Corollary 4.
Figure 1. Probability mass function of the exact distribution of the difference of two sample proportions for $m = n$ and $p_1 = p_2$.
4. Hypothesis Testing
To test $H_0\!: p_1 = p_2$ against $H_a\!: p_1 \neq p_2$, we use $D$ as the test statistic. Let $p$ denote the common value of $p_1$ and $p_2$ under $H_0$. Then the null distribution of $D$ is given by
$$P(D = d) = \sum_{k=0}^{m} \sum_{l=0}^{n} \binom{m}{k} \binom{n}{l} p^{\,k+l} (1-p)^{m+n-k-l} \, I_{\{kn - lm = mnd\}}$$
for $d = \frac{kn - lm}{mn}$, $k = 0, 1, \ldots, m$ and $l = 0, 1, \ldots, n$, where $I_{\{kn - lm = mnd\}} = 1$ if $kn - lm = mnd$ and 0 otherwise.
The critical region can be obtained by finding critical values $c_1$ and $c_2$ such that
$$P(D \le c_1 \mid H_0) \le \frac{\alpha}{2} \quad \text{and} \quad P(D \ge c_2 \mid H_0) \le \frac{\alpha}{2}.$$
This means that
$$c_1 = \max\left\{ d : P(D \le d \mid H_0) \le \frac{\alpha}{2} \right\} \quad \text{and} \quad c_2 = \min\left\{ d : P(D \ge d \mid H_0) \le \frac{\alpha}{2} \right\},$$
where the probabilities are computed from the null distribution of $D$.
Example: Gender Discrimination
The table below shows the gender distribution of the promoted files.
Data Source:
https://www2.stat.duke.edu/courses/Spring12/sta101.1/lec/lec14S.pdf.
In this example, we investigate whether or not gender discrimination is associated with the promotion of employees. In other words, we would like to conduct the following hypothesis test.
$H_0$: There is no gender discrimination in promotion vs. $H_a$: There is gender discrimination in promotion.
We run the R program for the exact distribution with the observed promotion counts for the two groups and obtain the test statistic and p-value as 0.291667 and 0.03286628, respectively. Since the p-value is less than $\alpha = 0.05$, we reject the null hypothesis and conclude that there is gender discrimination in promotion. However, the p-value is only slightly less than $\alpha$, so the evidence of gender discrimination in promotion is moderate.
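The R code is not reproduced here, but the computation can be sketched in Python. We assume the promotion counts behind the linked lecture are 21 promoted of 24 male files and 14 promoted of 24 female files (an assumption consistent with the reported test statistic $0.291667 = 21/24 - 14/24$), and we take the pooled estimate as the common null proportion and a two-sided rejection rule $|D| \ge |d_{\text{obs}}|$ — both of these conventions are our assumptions:

```python
from math import comb

def exact_test(x, m, y, n):
    """Exact two-sided test of H0: p1 = p2 based on D = X/m - Y/n.
    The common null proportion is taken to be the pooled estimate
    p = (x + y)/(m + n); the p-value is P(|D| >= |d_obs|) under H0."""
    p = (x + y) / (m + n)
    d_obs = x / m - y / n
    pval = 0.0
    for k in range(m + 1):
        pk = comb(m, k) * p**k * (1 - p)**(m - k)
        for l in range(n + 1):
            if abs(k / m - l / n) >= abs(d_obs) - 1e-12:
                pval += pk * comb(n, l) * p**l * (1 - p)**(n - l)
    return d_obs, pval

# Assumed promotion data: 21 of 24 males, 14 of 24 females promoted.
d_obs, pval = exact_test(21, 24, 14, 24)
```

Under these assumptions the sketch reproduces the test statistic $7/24 \approx 0.291667$ and a small two-sided p-value.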
5. Power Calculation
If
and
are the left and right critical values and if the null hypothesis is rejected for the observed test statistic,
then the power of the corresponding hypothesis test is given by:
where
Continuation of the example: Gender Discrimination
In this example, we have rejected the null hypothesis at the significance level $\alpha = 0.05$. Now we want to find the power of the hypothesis test at the specified alternative values of $p_1$ and $p_2$. We run the R program for the power calculation of the exact distribution and obtain that the power of the hypothesis test equals 0.5657226.
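The power calculation can likewise be sketched in Python: sum the exact probabilities of the rejection region under the alternative. The critical values ($\pm 7/24$) and the alternative proportions used below are illustrative placeholders of our own, not the values used in the paper:

```python
from math import comb

def exact_power(m, n, p1, p2, d_left, d_right):
    """Power of the exact test: P(D <= d_left or D >= d_right) when
    X ~ Binomial(m, p1) and Y ~ Binomial(n, p2) are independent."""
    power = 0.0
    for k in range(m + 1):
        pk = comb(m, k) * p1**k * (1 - p1)**(m - k)
        for l in range(n + 1):
            d = k / m - l / n
            if d <= d_left + 1e-12 or d >= d_right - 1e-12:
                power += pk * comb(n, l) * p2**l * (1 - p2)**(n - l)
    return power

# Illustrative alternative: the observed sample proportions, with
# critical values +/- 7/24 and m = n = 24 (our assumptions).
pw = exact_power(24, 24, 21 / 24, 14 / 24, -7 / 24, 7 / 24)
```

As expected, the power grows with the distance between the alternative values of $p_1$ and $p_2$.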
6. Confidence Interval
The point estimator of $p_1 - p_2$ is $D = \hat{p}_1 - \hat{p}_2$, which can be obtained from the given samples. Let
and
be the lower and upper bounds for the
confidence coefficient for
. We obtain
and
as follows:
Thus,
confidence interval for
is
.
A relatively easy way to compare two population proportions is through a confidence interval for their difference. We calculate the sample proportions $\hat{p}_1$ and $\hat{p}_2$ from the respective samples and use them to construct a confidence interval with nominal confidence coefficient $1 - \alpha$. If the confidence interval does not include 0, we reject the null hypothesis; otherwise, we fail to reject it.
Table 1. 95% confidence intervals for the Exact, Wald, Agresti-Caffo, and Score methods.
For the purpose of this comparison, we have constructed confidence intervals, including the respective confidence widths, for the Exact, Wald, Agresti-Caffo, and Score methods at the 95% confidence coefficient (Table 1). The last four columns of the table are the confidence widths for Exact, Wald, Agresti-Caffo, and Score; the width of the Exact interval is the smallest.
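For reference, the two normal-approximation competitors in Table 1 can be sketched as follows (95% level; the counts 21 of 24 and 14 of 24 are illustrative assumptions of ours, not taken from Table 1):

```python
from math import sqrt

Z975 = 1.959964  # 97.5th percentile of the standard normal

def wald_ci(x, m, y, n, z=Z975):
    """Wald interval for p1 - p2: d_hat +/- z * SE(d_hat)."""
    p1, p2 = x / m, y / n
    se = sqrt(p1 * (1 - p1) / m + p2 * (1 - p2) / n)
    return p1 - p2 - z * se, p1 - p2 + z * se

def agresti_caffo_ci(x, m, y, n, z=Z975):
    """Agresti-Caffo interval: add one success and one failure to each
    sample, then apply the Wald formula to the adjusted counts."""
    return wald_ci(x + 1, m + 2, y + 1, n + 2, z)

lo, hi = wald_ci(21, 24, 14, 24)             # excludes 0: reject H0
lo2, hi2 = agresti_caffo_ci(21, 24, 14, 24)
```

Both intervals exclude 0 for these counts, in agreement with the rejection rule described above.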
7. Conclusion
Inference on the difference of two population proportions is a very basic problem in statistics. The standard Wald interval has been used almost universally, yet it behaves erratically and has unacceptably poor coverage probabilities when the sample sizes are small or when one proportion is very large and the other very small. Several alternative intervals have been suggested, but their performance remains unsatisfactory when the sample size is small. We have shown that inference based on our exact distribution does not rely on a large-sample approximation, so its validity does not depend on the sample size. We have also shown that the exact interval has the smallest confidence width among the Wald, Agresti-Caffo, and Score intervals, so it is suitable for inference on the difference between population proportions regardless of sample size.
Appendix
Proof of lemma
If we define $U = \sum_{i=1}^{m} X_i$ and $V = \sum_{j=1}^{n} Y_j$, then $W$ can be written as $W = nU - mV$. The pgf of $W$ can be written as
$$E\left(t^{\,W}\right) = E\left(t^{\,nU}\right) E\left(t^{-mV}\right),$$
since the two samples are independent of each other and the observations in each sample are independent and identically distributed.
Since $X_i \sim \mathrm{Bernoulli}(p_1)$ for $i = 1, 2, \ldots, m$, then $U \sim \mathrm{Binomial}(m, p_1)$ and
$$E\left(t^{\,nU}\right) = \left(1 - p_1 + p_1 t^{\,n}\right)^{m}. \tag{2}$$
Similarly, since $Y_j \sim \mathrm{Bernoulli}(p_2)$ for $j = 1, 2, \ldots, n$, then $V \sim \mathrm{Binomial}(n, p_2)$ and
$$E\left(t^{-mV}\right) = \left(1 - p_2 + p_2 t^{-m}\right)^{n}. \tag{3}$$
We multiply the right-hand sides of (2) and (3) to obtain (1).
Proof of Theorem
Notice that, even though the supports of $D$ and $W$ are different, their pmfs have the same probabilities: $P(D = d) = P(W = mnd)$ for $d = \frac{kn - lm}{mn}$, $k = 0, 1, \ldots, m$ and $l = 0, 1, \ldots, n$. The pmf of $W$ can be obtained from the pgf (1) by expanding it and collecting the coefficient of $t^{w}$.
Therefore,
$$P(W = w) = \sum_{k=0}^{m} \sum_{l=0}^{n} \binom{m}{k} p_1^{k} (1-p_1)^{m-k} \binom{n}{l} p_2^{\,l} (1-p_2)^{n-l} \, I_{\{kn - lm = w\}}, \tag{4}$$
where $I_{\{kn - lm = w\}} = 1$ if $kn - lm = w$ and 0 otherwise.
To simplify formula (4), we use the fact that $kn - lm = w$ is equivalent to $g\,(k n_1 - l m_1) = w$, which, in its turn, is equivalent to $k n_1 - l m_1 = w/g$. From this last equality, we conclude that $k = k_0 + i\,m_1$ and $l = l_0 + i\,n_1$ for some integer $i$, where $(k_0, l_0)$ is any particular solution, because $m_1$ and $n_1$ are relatively prime to each other. The values of $i$ are hence obtained by solving the following system of inequalities: $0 \le k_0 + i\,m_1 \le m$ and $0 \le l_0 + i\,n_1 \le n$. This leads to the simplified bounds
$$\max\!\left(\left\lceil \tfrac{-k_0}{m_1} \right\rceil, \left\lceil \tfrac{-l_0}{n_1} \right\rceil\right) \le i \le \min\!\left(\left\lfloor \tfrac{m - k_0}{m_1} \right\rfloor, \left\lfloor \tfrac{n - l_0}{n_1} \right\rfloor\right),$$
which correspond to the values of $i$ that form the set of terms summed in the theorem.
Proof of Corollary 1
Since m and n are relatively prime to each other, the support of D becomes:
when
, we have
, hence
and
. Therefore
. Now from Theorem above we get,
when
, we have
and hence:
For this case,
is either 0 or −1 and
is either 0 or 1 so, now from the theorem we get,
Proof of Corollary 2
For
and
, the theorem reduces to,
where,
Now we replace
by u and obtain the following result:
.
Proof of Corollary 3
The exact distribution of D, using the lemma, is given by
where
if
and 0 otherwise. Let us define a set
as follows:
Thus,
Proof of Corollary 4
Using Corollary (3), the exact distribution of D for
and
is given by
where,
Since both k and l run from 0 to n so
.