In this paper we examine five indexes of a two-by-two table (Yule's two indexes, the chi square, the odds ratio and an elementary index), which estimate the correlation coefficient *ρ* in a bivariate Bernoulli distribution. We find compact expressions of the influence functions, which quantify the effect of an infinitesimal contamination of the probability of any pair of attributes of the bivariate random variable distributed according to this model. We prove that the only unbiased index is the chi square. In order to determine which indexes are less sensitive to contamination, we obtain the expressions of three synthetic measures of the influence function: the maximum contamination (gross-error sensitivity), the mean absolute deviation and the variance. Although these results do not allow a definitive assessment of the overall optimum properties of the five indexes, since not all of them are unbiased, they do allow us to appreciate the overall size of the effect of contaminations on the estimation of the parameter *ρ* of the bivariate Bernoulli distribution.

In this paper we analyze the influence of a minimal contamination of the bivariate Bernoulli distribution on the values of the indexes measuring association in a two-by-two table, in the setting of estimating the correlation parameter of that distribution.

Let us suppose that two dichotomous variables, denoted by X and Y, are relevant within a population. These variables take the values 1 and 0, depending on whether one of the dichotomous attributes is present or absent. The corresponding theoretical model is the bivariate Bernoulli distribution [

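Let α, β, γ and δ denote the probabilities of the pairs (0,0), (0,1), (1,0) and (1,1) of (X, Y), with α + β + γ + δ = 1. The mean values of the two variables are

$$E(X) = \gamma + \delta, \qquad E(Y) = \beta + \delta.$$

The variances of the two variables are

$$\mathrm{Var}(X) = (\alpha + \beta)(\gamma + \delta), \qquad \mathrm{Var}(Y) = (\alpha + \gamma)(\beta + \delta).$$

The covariance between the two variables is

$$\mathrm{Cov}(X, Y) = \delta - (\gamma + \delta)(\beta + \delta) = \alpha\delta - \beta\gamma.$$

The correlation coefficient is

$$\rho = \frac{\alpha\delta - \beta\gamma}{\sqrt{(\alpha + \beta)(\gamma + \delta)(\alpha + \gamma)(\beta + \delta)}}.$$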

Several indexes, suggested by various authors (Yule, Quetelet and others), are available for the sample estimation of the above-mentioned correlation coefficient. We refer to such indexes as R_{1}, R_{2}, etc. For a given index R_{h}, ranging between −1 and +1, we must take into account unbiasedness, i.e.

efficiency, i.e.

and a limited influence of small modifications of the model.

With regard to this last fundamental property Hampel [

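Basically the influence function of an index $R$ at the pair $(x, y)$ is

$$IF(x, y; R, F) = \lim_{\varepsilon \to 0} \frac{R(F_{\varepsilon}) - R(F)}{\varepsilon},$$

where $R(F)$ is the index computed for the non-contaminated bivariate Bernoulli distribution $F$ and $R(F_{\varepsilon})$ is the index computed for the contaminated distribution $F_{\varepsilon} = (1 - \varepsilon)F + \varepsilon\Delta_{(x,y)}$, in which a mass ε is placed on the pair $(x, y)$.

It is easily understood that such a function measures the effect of an infinitesimal contamination of the model on the value of the correlation index [

Attributes of variable X | Attributes of variable Y: 0 | Attributes of variable Y: 1 | Total |
---|---|---|---|
0 | α | β | α + β |
1 | γ | δ | γ + δ |
Total | α + γ | β + δ | 1 |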
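The contamination scheme described above can be checked numerically. The following is a minimal sketch, not the authors' procedure: it assumes the model is stored as the four cell probabilities in the order (α, β, γ, δ), mixes it with a point mass on one cell, and approximates the influence function by a finite difference. The helper names `contaminate` and `influence` are hypothetical, and Yule's Q in its standard form is used only as an example index.

```python
from typing import Callable, List, Sequence

Table = Sequence[float]  # (alpha, beta, gamma, delta), summing to 1


def contaminate(p: Table, cell: int, eps: float) -> List[float]:
    """Mix the model with a point mass on `cell`: (1 - eps) * p + eps * e_cell."""
    return [(1.0 - eps) * v + (eps if i == cell else 0.0) for i, v in enumerate(p)]


def influence(index: Callable[[Table], float], p: Table, cell: int,
              eps: float = 1e-6) -> float:
    """Finite-difference approximation of the influence function at `cell`."""
    return (index(contaminate(p, cell, eps)) - index(p)) / eps


def yule_q(p: Table) -> float:
    """Yule's Q in its standard form (assumed here)."""
    a, b, c, d = p
    return (a * d - b * c) / (a * d + b * c)


# Illustrative probabilities, not taken from the paper.
p = [0.4, 0.1, 0.2, 0.3]
values = [influence(yule_q, p, cell) for cell in range(4)]
print(values)  # cells 0 and 3 are concordance cells, 1 and 2 discordance cells
```

Any other index written as a function of the four cell probabilities can be passed to `influence` in the same way.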

Let us first consider the elementary index given by

A contamination in the cell (0,0) leads to the influence function value

while a contamination in the cell (1,1) leads to the influence function value

That is, a contamination in one of the two cells indicating concordance increases the value of the C index by a quantity proportional to the sum of the frequencies of the two discordance cells.

A contamination in the cell (0,1) leads to the influence function value

while a contamination in the cell (1,0) leads to the influence function value

That is, a contamination in one of the two cells indicating discordance decreases the value of the C index by a quantity proportional to the sum of the frequencies of the two concordance cells.

In short, the influence function can be displayed as

in which v, for a concordance cell, is equal to the sum of the discordance frequencies, while, for a discordance cell, it is equal to the sum of the concordance frequencies with the sign changed. In other words, the influence of a contamination in a concordance cell is directly proportional to the sum of the discordance frequencies and, vice versa, that of a contamination in a discordance cell is proportional to the sum of the concordance frequencies; it is positive for the concordance cells and negative for the discordance cells [
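A worked sketch of this result follows. It assumes that the elementary index is the difference between the concordance and discordance probabilities, $C = (\alpha + \delta) - (\beta + \gamma)$; this form is an assumption, chosen because it reproduces the behaviour just described. Contaminating a concordance cell with mass ε turns C into $(1 - \varepsilon)C + \varepsilon$, and contaminating a discordance cell turns it into $(1 - \varepsilon)C - \varepsilon$, so that, using α + β + γ + δ = 1:

```latex
\begin{aligned}
IF_{(0,0)} = IF_{(1,1)} &= \lim_{\varepsilon \to 0}
   \frac{(1-\varepsilon)C + \varepsilon - C}{\varepsilon} = 1 - C = 2(\beta + \gamma), \\
IF_{(0,1)} = IF_{(1,0)} &= \lim_{\varepsilon \to 0}
   \frac{(1-\varepsilon)C - \varepsilon - C}{\varepsilon} = -1 - C = -2(\alpha + \delta).
\end{aligned}
```

Under this assumption the compact form is $IF = 2v$, with v equal to β + γ for a concordance cell and to −(α + δ) for a discordance cell.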

Let us now consider Yule's 1900 index [

A contamination in the cell (0,0) leads to the following value of the influence function

while a contamination in the cell (1,1) leads to the value given by

That is, a contamination in one of the two cells indicating concordance increases the value of the index Q by a quantity, which is proportional to the product of the frequencies of the three non-contaminated cells.

On the other hand a contamination in the cell (0,1) leads to the value of the influence function given by

while a contamination in the cell (1,0) leads to the following value of the influence function

That is, a contamination in one of the two cells indicating discordance decreases the value of the index Q by a quantity, which is proportional to the product of the frequencies of the three non-contaminated cells.

In short, the influence function can be displayed as

in which v is equal to one of the frequencies with positive sign if it corresponds to a or to d, and to one of the frequencies with negative sign if it corresponds to b or to c. In other words, the influence of the contamination is inversely proportional to the frequency of the contaminated cell, provided that it is positive for the concordance cells and negative for the discordance cells.
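As a worked sketch, assume the standard form of Yule's Q written in the cell probabilities α, β, γ, δ of the model (the Greek notation is an assumption here). Because Q is unchanged when all four cells are rescaled by the same factor, the influence function at a cell reduces to the partial derivative of Q with respect to that cell's probability:

```latex
\begin{aligned}
Q &= \frac{\alpha\delta - \beta\gamma}{\alpha\delta + \beta\gamma}, \\
IF_{(0,0)} &= \frac{\partial Q}{\partial \alpha}
            = \frac{2\beta\gamma\delta}{(\alpha\delta + \beta\gamma)^{2}}, &
IF_{(1,1)} &= \frac{\partial Q}{\partial \delta}
            = \frac{2\alpha\beta\gamma}{(\alpha\delta + \beta\gamma)^{2}}, \\
IF_{(0,1)} &= \frac{\partial Q}{\partial \beta}
            = -\frac{2\alpha\gamma\delta}{(\alpha\delta + \beta\gamma)^{2}}, &
IF_{(1,0)} &= \frac{\partial Q}{\partial \gamma}
            = -\frac{2\alpha\beta\delta}{(\alpha\delta + \beta\gamma)^{2}}, \\
IF &= \frac{2\alpha\beta\gamma\delta}{(\alpha\delta + \beta\gamma)^{2}\, v}.
\end{aligned}
```

In the last line v is the frequency of the contaminated cell, taken with a positive sign for a concordance cell and a negative sign for a discordance cell.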

Let us consider now the other index proposed by Yule [

A contamination in the cell (0,0) leads to the following value of the influence function

while a contamination in the cell (1,1) leads to the value of the influence function given by

That is, a contamination in one of the two cells indicating concordance increases the value of the index Y by a quantity proportional to the root of the product of the frequencies of the three non-contaminated cells divided by the root of the frequency of the contaminated cell.

A contamination in the cell (0,1) leads to the value of the influence function given by

while a contamination in the cell (1,0) leads to the following value of the influence function

That is, a contamination in one of the two cells indicating discordance decreases the value of the index Y by a quantity proportional to the root of the product of the frequencies of the three non-contaminated cells divided by the root of the frequency of the contaminated cell.

In short, the influence function can be displayed as

in which v is equal to one of the frequencies with positive sign if it corresponds to a or to d, and to one of the frequencies with negative sign if it corresponds to b or to c. In other words, the influence of the contamination is inversely proportional to the frequency of the contaminated cell, provided that it is positive for the concordance cells and negative for the discordance cells.
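The same steps can be sketched for Y, assuming the standard form of Yule's coefficient of colligation in the cell probabilities α, β, γ, δ (again an assumed notation). Y is also invariant to a common rescaling of the cells, so the influence function is the partial derivative with respect to the contaminated cell:

```latex
\begin{aligned}
Y &= \frac{\sqrt{\alpha\delta} - \sqrt{\beta\gamma}}{\sqrt{\alpha\delta} + \sqrt{\beta\gamma}}, \\
IF_{(0,0)} &= \frac{\partial Y}{\partial \alpha}
   = \frac{\sqrt{\beta\gamma\delta}}
          {\sqrt{\alpha}\,\bigl(\sqrt{\alpha\delta} + \sqrt{\beta\gamma}\bigr)^{2}}, \\
IF_{(0,1)} &= \frac{\partial Y}{\partial \beta}
   = -\frac{\sqrt{\alpha\gamma\delta}}
           {\sqrt{\beta}\,\bigl(\sqrt{\alpha\delta} + \sqrt{\beta\gamma}\bigr)^{2}}, \\
IF &= \frac{\sqrt{\alpha\beta\gamma\delta}}
           {\bigl(\sqrt{\alpha\delta} + \sqrt{\beta\gamma}\bigr)^{2}\, v}.
\end{aligned}
```

The expressions for cells (1,1) and (1,0) follow by exchanging α with δ and β with γ; v is the signed frequency of the contaminated cell, as above.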

Let us examine the chi square index

A contamination in the cell (0,0) leads to the influence function value

while a contamination in the cell (1,1) leads to the influence function value

That is, a contamination in one of the two cells indicating concordance increases the value of the chi square index.

A contamination in the cell (0,1) leads to the influence function value

while a contamination in the cell (1,0) leads to the influence function value

That is, a contamination in one of the two cells indicating discordance decreases the value of the chi square index.

It is impossible to have a unique expression of the influence function as we had for the other indexes, because the expressions for the contaminated concordance cells differ from those related to the discordance ones.
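To illustrate why the two kinds of cells behave differently, the following sketch assumes that the chi square index is the signed coefficient $\varphi$, which ranges between −1 and +1; this identification is an assumption. Writing $D$ for the square root of the product of the four marginal probabilities, and using the invariance of $\varphi$ to a common rescaling of the cells:

```latex
\begin{aligned}
\varphi &= \frac{\alpha\delta - \beta\gamma}{D}, \qquad
D = \sqrt{(\alpha+\beta)(\gamma+\delta)(\alpha+\gamma)(\beta+\delta)}, \\
IF_{(0,0)} &= \frac{\partial \varphi}{\partial \alpha}
   = \frac{\delta}{D} - \frac{\varphi}{2}\left(\frac{1}{\alpha+\beta} + \frac{1}{\alpha+\gamma}\right), \\
IF_{(0,1)} &= \frac{\partial \varphi}{\partial \beta}
   = -\frac{\gamma}{D} - \frac{\varphi}{2}\left(\frac{1}{\alpha+\beta} + \frac{1}{\beta+\delta}\right).
\end{aligned}
```

The first term changes sign between concordance and discordance cells while the second does not, so the two cases cannot be merged into a single expression in a signed frequency v.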

Let us now examine the odds ratio index

A contamination in the cell (0,0) leads to the influence function value

while a contamination in the cell (1,1) leads to the influence function value

That is, a contamination in one of the two cells indicating concordance increases the value of the index θ.

A contamination in the cell (0,1) leads to the influence function value

while a contamination in cell (1,0) leads to the influence function value

That is, a contamination in one of the two cells indicating discordance decreases the value of the index θ.

In short, the influence function can be displayed as

in which v is equal to the frequency of the contaminated cell, provided that the sign is positive in case of contamination in a concordance cell, and negative in case of contamination in a discordance cell.
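A short worked sketch, assuming the odds ratio is written as θ in the cell probabilities α, β, γ, δ (an assumed notation). Since θ is invariant to a common rescaling of the cells, the influence function is again a partial derivative:

```latex
\begin{aligned}
\theta &= \frac{\alpha\delta}{\beta\gamma}, \\
IF_{(0,0)} &= \frac{\partial \theta}{\partial \alpha} = \frac{\delta}{\beta\gamma} = \frac{\theta}{\alpha}, &
IF_{(1,1)} &= \frac{\theta}{\delta}, \\
IF_{(0,1)} &= \frac{\partial \theta}{\partial \beta} = -\frac{\alpha\delta}{\beta^{2}\gamma} = -\frac{\theta}{\beta}, &
IF_{(1,0)} &= -\frac{\theta}{\gamma}, \\
IF &= \frac{\theta}{v}.
\end{aligned}
```

Here v is the frequency of the contaminated cell with the sign convention stated above.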

It should be recalled that an index $R_h$ is unbiased if $E(R_h) = \rho$. Let us now examine the unbiasedness of each index.

The index

has the mean

and therefore it is unbiased.

Indexes Q, Y, θ and C are biased.

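It should be noted that three of the considered indexes (Q, Y and θ) are functionally related, as shown below:

$$Q = \frac{\theta - 1}{\theta + 1}, \qquad Y = \frac{\sqrt{\theta} - 1}{\sqrt{\theta} + 1}, \qquad Q = \frac{2Y}{1 + Y^{2}}.$$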

The other 2 indexes (C and

Since the effects of contaminations in the various cells balance one another, it is necessary to evaluate their overall influence regardless of the sign. This can be done by considering the maximum of the absolute values of the influence, the mean absolute deviation or the variance of these values [

As, regardless of the sign, the influence function is equal to

the maximum of the influence function is therefore

As, regardless of the sign, the influence function is equal to

in which v is one of the four frequencies of the table, the maximum of the influence function is obtained for min(v); it is therefore

As, regardless of the sign, the influence function is equal to

in which v is one of the four frequencies of the table, the maximum of the influence function is obtained for min(v); it is therefore

An empirical analysis allows us to assess that the maximum absolute value of the influence function is obtained at the minimum frequency. Thus,

As, regardless of the sign, the influence function is equal to

in which v is one of the four frequencies of the table, the maximum of the influence function is obtained for min(v); it is therefore
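For the three indexes whose influence is inversely proportional to the frequency of the contaminated cell, the maximum at min(v) can be sketched in code. This is an illustration under stated assumptions, not the authors' formulas: it assumes the standard definitions of Yule's Q, Yule's Y and the odds ratio, takes the four cells as probabilities in the order (α, β, γ, δ), and uses a hypothetical helper name `gross_error_sensitivity`.

```python
from math import sqrt
from typing import Dict


def gross_error_sensitivity(a: float, b: float, c: float, d: float) -> Dict[str, float]:
    """Largest absolute influence over the four cells for Q, Y and the odds ratio.

    Assumes Q = (ad - bc)/(ad + bc),
            Y = (sqrt(ad) - sqrt(bc))/(sqrt(ad) + sqrt(bc)),
            theta = ad/(bc).
    For each of these the absolute influence is a constant divided by the
    frequency of the contaminated cell, so the maximum is reached at the
    smallest cell.
    """
    v_min = min(a, b, c, d)
    prod = a * b * c * d
    return {
        "Q": 2 * prod / ((a * d + b * c) ** 2 * v_min),
        "Y": sqrt(prod) / ((sqrt(a * d) + sqrt(b * c)) ** 2 * v_min),
        "theta": (a * d) / (b * c) / v_min,
    }


# Illustrative probabilities, not taken from the paper.
gs = gross_error_sensitivity(0.4, 0.1, 0.2, 0.3)
print(gs)
```

The odds ratio entry is on a different scale from the other two, because the odds ratio itself is not bounded by 1.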

A few algebraic steps allow us to obtain

It can be seen that the mean deviation for all indexes is a symmetric function either of the concordant frequencies or of the discordant frequencies.

Let us consider the asymptotic variance of the indexes. A few algebraic steps lead us to the following expression

It can be seen that the asymptotic variance is a symmetric function of the concordance and discordance frequencies as well.

Let us consider a practical example in which 1071 persons are classified according to two dichotomous characteristics, “does he/she smoke?” and “is he/she suffering from bronchitis?”, both with a yes or no response (see

Of the 1071 cases, 135 smoke and have bronchitis and 547 do not smoke and do not have bronchitis.

As can be seen, among the four indexes whose values range between −1 and +1, the least sensitive to contamination are C and the chi square index, while the most sensitive are Yule's indexes Q and Y. The greater sensitivity of the odds ratio is due to the fact that this index measures a function of the correlation of the model that ranges from 0 to ∞.

Smoke | Bronchitis: Yes | Bronchitis: No | Total |
---|---|---|---|
Yes | 135 | 287 | 422 |
No | 102 | 547 | 649 |
Total | 237 | 834 | 1071 |

Source: Survey at the University Hospital of Bari, Department of Pulmonology.
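The index values for these data can be recomputed with a few lines of code. The sketch below assumes the standard definitions of the odds ratio, Yule's Q and Yule's Y, and it takes the chi square index to be the signed coefficient φ (an assumption); the elementary index C is left out because its formula is not shown here. Counts are used directly, since all four indexes are unchanged when the table is rescaled.

```python
from math import sqrt

# Counts from the smoking/bronchitis table; the concordance cells are
# (smoker, bronchitis) and (non-smoker, no bronchitis).
yes_yes, yes_no = 135, 287
no_yes, no_no = 102, 547

conc = yes_yes * no_no  # product of the concordance cells
disc = yes_no * no_yes  # product of the discordance cells

theta = conc / disc                                         # odds ratio
q = (conc - disc) / (conc + disc)                           # Yule's Q
y = (sqrt(conc) - sqrt(disc)) / (sqrt(conc) + sqrt(disc))   # Yule's Y

# Assumption: the chi square index is the phi coefficient, i.e. the
# cross-product difference over the root of the product of the margins.
margins = (yes_yes + yes_no) * (no_yes + no_no) * (yes_yes + no_yes) * (yes_no + no_no)
phi = (conc - disc) / sqrt(margins)

print(f"theta={theta:.4f}  Q={q:.4f}  Y={y:.4f}  phi={phi:.4f}")
```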

In this paper we analyzed the indexes of a two-by-two table that allow the estimation of the correlation coefficient ρ in the bivariate Bernoulli model. More precisely, we considered Yule's two indexes, the chi square, the odds ratio and a further elementary index. We obtained, for these indexes, the compact expressions of the influence functions, which quantify the effect of an infinitesimal contamination of the probability of any pair of attributes of the bivariate random variable distributed according to the above-mentioned model.

In order to determine which indexes are less sensitive to contamination, we obtained the expressions of three synthetic measures of the influence function, specifically the maximum contamination (gross-error sensitivity), the mean absolute deviation and the variance. Although these expressions do not allow a definitive assessment of the overall optimum properties of the five indexes considered, since not all of them are unbiased, they do allow us to appreciate the overall size of the effect of contaminations on the estimation of the parameter ρ of the bivariate Bernoulli model.