The Influence Function of the Correlation Indexes in a Two-by-Two Table

In this paper we examine five indexes of a two-by-two table (the two Yule's indexes, the chi square, the odds ratio and an elementary index), which estimate the correlation coefficient ρ of a bivariate Bernoulli distribution. We derive compact expressions for their influence functions, which quantify the effect of an infinitesimal contamination of the probability of any pair of attributes of the bivariate random variable distributed according to this model. We prove that the only unbiased index is the chi square. In order to determine which indexes are less sensitive to contamination, we obtain the expressions of three synthetic measures of the influence function: the maximum contamination (gross-error sensitivity), the mean square deviation and the variance. Although these results do not allow a definitive assessment of the overall optimality of the five indexes, since not all of them are unbiased, they do allow an appraisal of the overall magnitude of the effect of contamination on the estimation of the parameter ρ of the bivariate Bernoulli distribution.


Introduction
In this paper we analyze the influence of a minimal contamination of the bivariate Bernoulli distribution on the values of the indexes measuring the association in a two-by-two table, in the setting of estimating the correlation parameter of that distribution.

Selecting a Template
Let us suppose that two dichotomous variables, denoted by X and Y, are relevant within a population. These variables take the values 1 and 0, depending on whether one of the dichotomous attributes is present or absent. The corresponding theoretical model is the bivariate Bernoulli distribution [1], reported in Table 1.

Properties of the Correlation Parameter Estimation
Several indexes, suggested by various authors (Yule, Quetelet and others), are available for the sample estimation of the above-mentioned correlation coefficient. We refer to such indexes as $R_1$, $R_2$, etc. For given indexes $R_h$, all varying between −1 and +1, we must take into account unbiasedness, i.e.
$E[R_h] = \rho$, and the limited influence of limited modifications of the model. With regard to this last fundamental property, Hampel [2] in 1974 suggested the influence function as a tool for evaluating the effect caused on the value of an index by a minimal contamination of the model. In our case the model is the bivariate Bernoulli distribution, the parameter is the correlation coefficient ρ and the indexes are those proposed by various authors over time.
Basically, the influence function is
$IF = \lim_{\varepsilon \to 0} \frac{R[H_\varepsilon] - R[H]}{\varepsilon}$,
where $R[H_\varepsilon]$ is the index computed for the contaminated bivariate Bernoulli distribution, $R[H]$ is the index computed for the non-contaminated bivariate Bernoulli distribution and ε is the weight of the contamination.
It is easily understood that such a function measures the effect of an infinitesimal contamination of the model on the value of the correlation index [3]. From now on we will denote by a, b, c and d the empirical frequencies of the four cells of the two-by-two table obtained for a sample of n units.
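As a rough numerical illustration of this definition, the limit can be approximated with a small finite ε. The sketch below is not taken from the paper: it assumes the usual contamination $H_\varepsilon = (1-\varepsilon)H + \varepsilon\,\delta_{cell}$ acting on the four cell probabilities, stores them under the keys a, b, c, d (a and d being the concordance cells), and uses Yule's Q, introduced later in the paper, only as a sample index. The function names and the probabilities are hypothetical.

```python
def contaminate(p, cell, eps):
    """Return (1 - eps) * p plus a point mass of weight eps on `cell`."""
    return {k: (1 - eps) * v + (eps if k == cell else 0.0) for k, v in p.items()}


def influence(index, p, cell, eps=1e-7):
    """Finite-difference approximation of (R[H_eps] - R[H]) / eps."""
    return (index(contaminate(p, cell, eps)) - index(p)) / eps


def yule_q(p):
    """Yule's Q computed from cell proportions."""
    ad, bc = p["a"] * p["d"], p["b"] * p["c"]
    return (ad - bc) / (ad + bc)


# Hypothetical cell probabilities (they sum to 1).
p = {"a": 0.4, "b": 0.1, "c": 0.2, "d": 0.3}

# One influence value per contaminated cell.
ifs = {cell: influence(yule_q, p, cell) for cell in "abcd"}
for cell, value in ifs.items():
    print(cell, round(value, 4))
```

With these illustrative numbers the values for the concordance cells come out positive and those for the discordance cells negative, which is the sign pattern the paper describes for each index.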

Influence Function of the Correlation Indexes
For the elementary index C, a contamination in the cell (0,0) or in the cell (1,1) leads to a positive value of the influence function. That is, a contamination in one of the two cells indicating concordance increases the value of the C index by a quantity proportional to the sum of the frequencies of the two discordance cells.
A contamination in the cell (0,1) or in the cell (1,0) leads to a negative value of the influence function. That is, a contamination in one of the two cells indicating discordance decreases the value of the C index by a quantity proportional to the sum of the frequencies of the two concordance cells.
In short, the influence function can be written in compact form in terms of a quantity v which, in the case of a concordance cell, is equal to the sum of the discordance frequencies and, in the case of a discordance cell, is equal to the sum of the concordance frequencies with the sign changed. In other words, the influence of the contamination for each concordance cell is directly proportional to the sum of the discordance frequencies and, vice versa, for each discordance cell it is directly proportional to the sum of the concordance frequencies; it is positive for the concordance cells and negative for the discordance cells [4].
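The behaviour described above can be reproduced with a short calculation. The sketch below rests on an assumption that should be stressed: the formula of the elementary index did not survive in this text, so the form C = (a + d − b − c)/n is assumed here only because it yields exactly the proportionalities stated above. The contamination $H_\varepsilon = (1-\varepsilon)H + \varepsilon\,\delta_{cell}$, the function names and the probabilities are likewise hypothetical.

```python
def c_index(p):
    """Assumed elementary index on cell proportions: concordance minus discordance."""
    return p["a"] + p["d"] - p["b"] - p["c"]


def influence_c(p, cell):
    """Exact limit of (C_eps - C) / eps under the assumed contamination.

    Contaminating a cell gives C_eps = (1 - eps) * C + eps for a concordance
    cell and (1 - eps) * C - eps for a discordance cell, hence +1 - C or -1 - C.
    """
    sign = 1.0 if cell in ("a", "d") else -1.0
    return sign - c_index(p)


# Hypothetical cell probabilities (they sum to 1).
p = {"a": 0.4, "b": 0.1, "c": 0.2, "d": 0.3}
values = {cell: influence_c(p, cell) for cell in "abcd"}
print(values)
```

Because the proportions sum to 1, 1 − C equals twice the sum of the discordance proportions and 1 + C equals twice the sum of the concordance proportions, which matches the verbal description.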

Yule's Q Index
Let us now consider Yule's 1900 index [5], given by
$Q = \frac{ad - bc}{ad + bc}$.
A contamination in the cell (0,0) or in the cell (1,1) leads to a positive value of the influence function. That is, a contamination in one of the two cells indicating concordance increases the value of the index Q by a quantity proportional to the product of the frequencies of the three non-contaminated cells.
On the other hand, a contamination in the cell (0,1) or in the cell (1,0) leads to a negative value of the influence function. That is, a contamination in one of the two cells indicating discordance decreases the value of the index Q by a quantity proportional to the product of the frequencies of the three non-contaminated cells.
In short, the influence function can be written in compact form in terms of a quantity v, which is equal to the frequency of the contaminated cell taken with positive sign if it corresponds to a or to d, and with negative sign if it corresponds to b or to c. In other words, the influence of the contamination is inversely proportional to the frequency of the contaminated cell; it is positive for the concordance cells and negative for the discordance cells.
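One way to obtain explicit expressions consistent with this description is to differentiate Q directly. The derivation below is a sketch under stated assumptions, not the paper's own formulas: a, b, c, d are read as relative frequencies summing to 1 (with counts, a factor n would appear), and the contamination is taken to be $H_\varepsilon = (1-\varepsilon)H + \varepsilon\,\delta_{cell}$.

```latex
% Q is unchanged when all four cells are rescaled by 1/(1-\varepsilon),
% so contaminating cell a amounts to replacing a by a + \varepsilon/(1-\varepsilon).
\begin{align*}
IF_a &= \frac{\partial Q}{\partial a} = \frac{2\,bcd}{(ad+bc)^2}, &
IF_d &= \frac{\partial Q}{\partial d} = \frac{2\,abc}{(ad+bc)^2},\\
IF_b &= \frac{\partial Q}{\partial b} = -\frac{2\,acd}{(ad+bc)^2}, &
IF_c &= \frac{\partial Q}{\partial c} = -\frac{2\,abd}{(ad+bc)^2},\\
IF(v) &= \frac{2\,abcd}{v\,(ad+bc)^2}, &
v &\in \{+a,\,-b,\,-c,\,+d\}.
\end{align*}
```

Each value is proportional to the product of the three non-contaminated frequencies, and the last line shows the inverse proportionality to the frequency of the contaminated cell.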

Yule's Y Index
Let us now consider the other index, proposed by Yule [6] in 1912,
$Y = \frac{\sqrt{ad} - \sqrt{bc}}{\sqrt{ad} + \sqrt{bc}}$.
A contamination in the cell (0,0) or in the cell (1,1) leads to a positive value of the influence function. That is, a contamination in one of the two cells indicating concordance increases the value of the index Y by a quantity proportional to the root of the product of the frequencies of the three non-contaminated cells divided by the root of the frequency of the contaminated cell.
A contamination in the cell (0,1) or in the cell (1,0) leads to a negative value of the influence function. That is, a contamination in one of the two cells indicating discordance decreases the value of the index Y by a quantity proportional to the root of the product of the frequencies of the three non-contaminated cells divided by the root of the frequency of the contaminated cell.
In short, the influence function can be written in compact form in terms of a quantity v, which is equal to the frequency of the contaminated cell taken with positive sign if it is a or d, and with negative sign if it is b or c. In other words, the influence of the contamination is inversely proportional to the frequency of the contaminated cell; it is positive for the concordance cells and negative for the discordance cells.
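The description above can be checked numerically. The following sketch is not the authors' computation: it assumes the contamination $H_\varepsilon = (1-\varepsilon)H + \varepsilon\,\delta_{cell}$ on cell proportions, and the closed form used in `influence_y_closed` is obtained here by differentiating Y, so it should be read as an assumption rather than as the paper's expression. All names and numbers are hypothetical.

```python
import math


def yule_y(p):
    """Yule's Y on a dict of cell proportions with keys a, b, c, d."""
    s, t = math.sqrt(p["a"] * p["d"]), math.sqrt(p["b"] * p["c"])
    return (s - t) / (s + t)


def influence_y_closed(p, cell):
    """Assumed closed form: sign * sqrt(abcd) / (v * (sqrt(ad) + sqrt(bc))**2)."""
    s, t = math.sqrt(p["a"] * p["d"]), math.sqrt(p["b"] * p["c"])
    sign = 1.0 if cell in ("a", "d") else -1.0
    # s * t equals sqrt(a * b * c * d); p[cell] is the contaminated frequency v.
    return sign * s * t / (p[cell] * (s + t) ** 2)


def influence_numeric(index, p, cell, eps=1e-7):
    """Finite-difference approximation of (R[H_eps] - R[H]) / eps."""
    q = {k: (1 - eps) * v + (eps if k == cell else 0.0) for k, v in p.items()}
    return (index(q) - index(p)) / eps


# Hypothetical cell probabilities (they sum to 1).
p = {"a": 0.4, "b": 0.1, "c": 0.2, "d": 0.3}
closed = {cell: influence_y_closed(p, cell) for cell in "abcd"}
numeric = {cell: influence_numeric(yule_y, p, cell) for cell in "abcd"}
for cell in "abcd":
    print(cell, round(closed[cell], 4), round(numeric[cell], 4))
```

The two columns should agree to several decimals, and the largest absolute value belongs to the cell with the smallest frequency.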

The Chi Square Index
Let us examine the chi square index. A contamination in the cell (0,0) or in the cell (1,1) leads to a positive value of the influence function. That is, a contamination in one of the two cells indicating concordance increases the value of $\chi^2$ by a quantity proportional to the product of the sums of the frequency of the other concordance cell with each of the frequencies of the discordance cells.
A contamination in the cell (0,1) or in the cell (1,0) leads to a negative value of the influence function. That is, a contamination in one of the two cells indicating discordance decreases the value of $\chi^2$ by a quantity proportional to the product of the sums of the frequency of the other discordance cell with each of the frequencies of the concordance cells.
It is not possible to give a single expression of the influence function, as we did for the other indexes, because the expressions for the contaminated concordance cells differ from those for the discordance cells.

Odds Ratio, θ
Let us now examine the odds ratio index
$\theta = \frac{ad}{bc}$.
A contamination in the cell (0,0) or in the cell (1,1) leads to a positive value of the influence function. That is, a contamination in one of the two cells indicating concordance increases the value of the index θ by a quantity that is proportional to the frequency of the other concordance cell and inversely proportional to the product of the frequencies of the discordance cells.
A contamination in the cell (0,1) or in the cell (1,0) leads to a negative value of the influence function. That is, a contamination in one of the two cells indicating discordance decreases the value of the index θ by a quantity that is proportional to the product of the frequencies of the concordance cells and inversely proportional to the product of the square of the frequency of the contaminated cell and the frequency of the other discordance cell.
In short, the influence function can be written in compact form in terms of a quantity v, which is equal to the frequency of the contaminated cell, with positive sign in case of contamination in a concordance cell and negative sign in case of contamination in a discordance cell.
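Explicit expressions matching this description follow from differentiating θ. As before, this is a sketch under assumptions rather than the paper's own formulas: a, b, c, d are read as relative frequencies summing to 1, and the contamination is $H_\varepsilon = (1-\varepsilon)H + \varepsilon\,\delta_{cell}$.

```latex
% \theta is unchanged when all four cells are rescaled by 1/(1-\varepsilon),
% so only the contaminated cell changes, by \varepsilon/(1-\varepsilon).
\begin{align*}
IF_a &= \frac{\partial \theta}{\partial a} = \frac{d}{bc} = \frac{\theta}{a}, &
IF_d &= \frac{\partial \theta}{\partial d} = \frac{a}{bc} = \frac{\theta}{d},\\
IF_b &= \frac{\partial \theta}{\partial b} = -\frac{ad}{b^2 c} = -\frac{\theta}{b}, &
IF_c &= \frac{\partial \theta}{\partial c} = -\frac{ad}{b c^2} = -\frac{\theta}{c},\\
IF(v) &= \frac{\theta}{v}, &
v &\in \{+a,\,-b,\,-c,\,+d\}.
\end{align*}
```

Under these assumptions the influence is simply the odds ratio divided by the signed frequency of the contaminated cell, so it grows without bound as that frequency approaches zero.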

Unbiasedness of the Indexes
Recall that an index $R_h$ is unbiased if $E[R_h] = \rho$. Let us now examine the unbiasedness of each index.

The Chi Square Index
The index has mean $\rho$ and is therefore unbiased.

Other Indexes
Indexes Q, Y, θ and C are biased.
It should be noted that three of the considered indexes (Q, Y and θ) are functionally related, as shown below:
$Q = \frac{\theta - 1}{\theta + 1}, \qquad Y = \frac{\sqrt{\theta} - 1}{\sqrt{\theta} + 1}$.
The other two indexes (C and $\chi^2$) cannot be expressed as functions of each other or of the above-mentioned ones. The five indexes estimate functions of the parameter ρ. More exactly, four of these indexes (C, Q, Y and $\chi^2$) are estimators of increasing functions of this parameter and, in particular at the points −1, 0 and +1, these functions coincide with their argument. The index θ can thus easily be converted into the two Yule's indexes, which recovers these characteristics.
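The functional link between θ and the two Yule's indexes can be verified on any table. The sketch below uses the standard identities Q = (θ − 1)/(θ + 1) and Y = (√θ − 1)/(√θ + 1), which follow from the definitions of the three indexes. The counts are hypothetical and chosen only for illustration.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio theta = ad / (bc)."""
    return (a * d) / (b * c)


def yule_q(a, b, c, d):
    """Yule's Q = (ad - bc) / (ad + bc)."""
    return (a * d - b * c) / (a * d + b * c)


def yule_y(a, b, c, d):
    """Yule's Y = (sqrt(ad) - sqrt(bc)) / (sqrt(ad) + sqrt(bc))."""
    s, t = math.sqrt(a * d), math.sqrt(b * c)
    return (s - t) / (s + t)


# Hypothetical cell counts of a two-by-two table.
a, b, c, d = 40, 10, 20, 30

theta = odds_ratio(a, b, c, d)
q_from_theta = (theta - 1) / (theta + 1)
y_from_theta = (math.sqrt(theta) - 1) / (math.sqrt(theta) + 1)

print(theta, yule_q(a, b, c, d), q_from_theta)
print(yule_y(a, b, c, d), y_from_theta)
```

Both transformations map θ = 1 to 0 and send θ towards 0 or ∞ to −1 or +1, which is why Q and Y live on the same scale as a correlation coefficient while θ does not.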

Influences of the Indexes
Since the effects of contaminations in the various cells balance each other, it is necessary to evaluate their overall influence regardless of the sign. This can be done by considering the maximum of the absolute values of the influence, the mean absolute deviation or the variance of these values [7].
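For the C index the absolute value of the influence function takes one of two values, according to whether the contaminated cell is a concordance or a discordance cell, and the maximum of the influence function is therefore the larger of the two. For Yule's Q index the absolute value of the influence function depends on v, which is one of the four frequencies of the table, and the maximum of the influence function is obtained for min(v).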
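The statement that the maximum is obtained for min(v) can be illustrated for Yule's Q. The sketch below is an assumption-laden illustration, not the paper's expression: it works with cell proportions, and its closed form for the influence function is the one obtained by differentiating Q under the contamination $H_\varepsilon = (1-\varepsilon)H + \varepsilon\,\delta_{cell}$. Function names and numbers are hypothetical.

```python
def influence_q(p, cell):
    """Assumed closed form for Q: sign * 2abcd / (v * (ad + bc)**2)."""
    ad, bc = p["a"] * p["d"], p["b"] * p["c"]
    sign = 1.0 if cell in ("a", "d") else -1.0
    return sign * 2.0 * ad * bc / (p[cell] * (ad + bc) ** 2)


# Hypothetical cell probabilities (they sum to 1).
p = {"a": 0.4, "b": 0.1, "c": 0.2, "d": 0.3}

# Gross-error sensitivity: the largest absolute influence over the four cells.
ges = max(abs(influence_q(p, cell)) for cell in p)

# The cell with the smallest frequency and the cell with the largest |IF|.
rarest = min(p, key=p.get)
most_influential = max(p, key=lambda cell: abs(influence_q(p, cell)))

# Probability-weighted sum of the influence values over the four cells.
balance = sum(p[cell] * influence_q(p, cell) for cell in p)

print(rarest, most_influential, round(ges, 4), balance)
```

In this example the rarest cell is also the most influential one, and the weighted sum of the signed influences is zero up to rounding, which is why a sign-free summary is needed.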

Yule's Y Index
As, regardless of the sign, the influence function is inversely proportional to v, where v is one of the four frequencies of the table, the maximum of the influence function is obtained for min(v). The asymptotic variance is
$ASV = \frac{ad(b + c) + bc(a + d)}{(\sqrt{ad} + \sqrt{bc})^4}$.
It can be seen that the asymptotic variance is a symmetric function of the concordance and discordance frequencies as well.

Example
Let us consider a practical example in which 1071 persons are classified according to two dichotomous characteristics, "does he/she smoke?" and "is he/she suffering from bronchitis?", both with a yes or no response (see Table 2). As can be seen, among the four indexes whose values lie between −1 and +1, the ones less sensitive to contamination are the C and chi square indexes, whereas the more sensitive ones are Yule's indexes Q and Y. The greater sensitivity of the odds ratio is due to the fact that this index measures a function of the correlation of the model that ranges from 0 to ∞.

Conclusions
In this paper we analyzed the indexes of a two-by-two table that allow the estimation of the correlation coefficient ρ in the bivariate Bernoulli model. More precisely, we considered the two Yule's indexes, the chi square, the odds ratio and a further index. We obtained, for these indexes, the compact expressions of the influence functions, which quantify the effect of an infinitesimal contamination of the probability of any pair of attributes of the bivariate random variable distributed according to the above-mentioned model. In order to determine which indexes are less sensitive to contamination, we obtained the expressions of three synthetic measures of the influence function, specifically the maximum contamination (gross-error sensitivity), the mean absolute deviation and the variance. Although these expressions do not allow a definitive assessment of the overall optimality of the five indexes considered, since not all of them are unbiased, they do allow an appraisal of the overall magnitude of the effect of contamination on the estimation of the parameter ρ of the bivariate Bernoulli model.

4.1. C Index
Let us first consider the elementary index, denoted by C.

6.1. Maximum of the Absolute Values of the Influence Function (Gross-Error Sensitivity)
6.1.1. C Index
As, regardless of the sign, the influence function is equal to

6.1.2. Yule's Q Index
As, regardless of the sign, the influence function is equal to

There were 1071 cases of which 135 smoke and have bronchitis and 547 don't smoke and don't have bronchitis.

Table 2. Smoke versus bronchitis. Survey at the University Hospital of Bari, Department of Pulmonology.