Estimation of a Linear Model in Terms of Intra-Class Correlations of the Residual Error and the Regressors
1. Introduction
Ordinary least squares (OLS) and generalized least squares (GLS) are basic methods to estimate coefficients of regressor variables in regression equations that are linear with respect to coefficients. Let OLSE and GLSE denote the corresponding estimators. OLS minimizes the sum of squared residuals (deviations from the estimated model), while GLS minimizes the estimation variances of the coefficient estimates if the residual errors (deviations from the true model) have non-equal variances and/or are correlated between different observations. OLS estimates are unbiased also for cases where GLS estimates have smaller variances, but the estimated variances of the coefficient estimates can be biased, leading to biased t- and F-tests. GLS estimation utilizes the inverse of the covariance matrix of residual errors. Thus, GLSE does not exist if the inverse does not exist, i.e., the covariance matrix is singular. If the inverse exists, then GLSE is the best linear unbiased estimator, BLUE. It requires deeper matrix algebra to derive BLUE when GLSE does not exist. Further complications are caused if there are linear dependencies between predictor variables, mathematically if the model matrix, to be presented shortly, does not have full column rank. In practice, such difficulties can be avoided by dropping linearly dependent predictors. It is assumed here that there are no such dependencies.
In the history of statistics, attention has been given to the necessary and sufficient conditions that guarantee that OLSE is BLUE. Reference [1] describes the influential role of C. R. Rao in this question.
When analyzing grouped data, a common model for correlated residual errors is to assume that there is a constant correlation between all observations of each group, and members of different groups are uncorrelated. Let ρ denote this intra-class correlation, let i denote the group, and let j denote the observation number within a group. A non-negative intra-class correlation of the residual error is implied by assuming a variance component model e_ij = g_i + ε_ij, where g_i is the random group effect with mean zero and variance σ_g², and ε_ij is a random individual effect with variance σ_ε², uncorrelated with g_i. Thus, ρ = σ_g²/(σ_g² + σ_ε²). The intra-class correlation of residual errors can also be negative. If ρ is positive, then two observations taken from a given group are, on average, more alike than two observations taken from different groups. When ρ is negative, then two observations taken from a given group are, on average, more different than two observations taken randomly from the whole population. Positive ρ implies that group averages have a larger variance than averages of random samples of equal size from the whole population, while negative ρ implies that group averages vary less than averages of samples from the whole population. In the extreme case, all group means are equal, and all the variation is between individuals within groups. When assuming a positive ρ, it is not necessary to specify the group size (of course, when dealing with data, group size plays an important role). The effect of a negative ρ is always dependent on the group size n, and it does not make sense to assume a constant negative ρ for different group sizes. The possibility of a negative ρ is too often ignored. Even if the overall ρ is positive due to natural variation between groups, competition between individuals for limited resources decreases ρ, and this competition should be described when analyzing grouped data. The lower limit of ρ can be obtained from the condition that the variance of a group sum cannot be negative, which implies that ρ ≥ −1/(n − 1). The upper limit of ρ is 1. This paper assumes a constant ρ and a constant group size n, which allows a straightforward treatment of negative ρ.
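These limits can be checked numerically. The following sketch (my own construction, not from the paper) verifies with numpy that the n × n compound symmetry correlation matrix is positive semi-definite exactly when −1/(n − 1) ≤ ρ ≤ 1:

```python
import numpy as np

def compound_symmetry(n, rho):
    """Return the n x n equicorrelation matrix with off-diagonal rho."""
    return (1.0 - rho) * np.eye(n) + rho * np.ones((n, n))

def is_psd(V, tol=1e-10):
    """True if the symmetric matrix V is positive semi-definite."""
    return np.min(np.linalg.eigvalsh(V)) >= -tol

n = 3
assert is_psd(compound_symmetry(n, 0.5))             # ordinary positive rho
assert is_psd(compound_symmetry(n, 1.0))             # upper limit rho = 1
assert is_psd(compound_symmetry(n, -1.0 / (n - 1)))  # lower limit -1/(n-1)
assert not is_psd(compound_symmetry(n, -0.6))        # below the lower limit
```

At ρ = −1/(n − 1) the smallest eigenvalue is exactly zero; the group sums then have zero variance, which is the boundary case treated in Section 5.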
Even if the matrix formulas of OLSE and GLSE are simple, it may remain intuitively unclear how the correlation structure of the residual errors and the values of the regressor variables interact. It is shown that this interaction can be described in terms of the empirical intra-class correlations of the regressor variables. Closed form algebraic formulas are first derived for a special case where a simple linear model is estimated. These derivations are then generalized to a model with several regressors. They give a concrete and easily understood explanation and illustration of the matrix theory results, which go beyond the mathematical expertise of most scientists who are just interested in applying statistical methods in research.
This paper discusses the following questions: 1) How do the variances of the GLS estimates depend on the intra-class correlation of the residuals and on the values of the x-variable in a simple linear model? 2) What are the ratios of the GLS variances and the OLS variances? 3) What is the bias in the estimated variances of the coefficient estimates when OLS is used? 4) How does GLS utilize the group means of the regressors and the deviations from the group means? 5) How does this formulation show why OLSE is sometimes equal to GLSE? 6) How can a BLUE be constructed when the correlation matrix of the residual errors is singular, i.e., ρ = 1 or ρ = −1/(n − 1), and the intra-class correlations of the regressors take any values? Similarly, as with ρ, the extreme values of the regressor correlations need to be considered separately. This last BLUE part may not be of interest to practicing scientists, but it may serve as an introduction to the BLUE problem for researchers who cannot fluently read matrix theory. For matrix theory experts this part may be trivial or self-evident.
Reference [2] compared the GLS estimation of a simple regression line with OLS estimation using differences in the predictor variable. This work is here put into a general context. The results reported in [3] regarding the bias of F-tests when OLS is used are exemplified and generalized. Algebraic formulas show how OLSE, GLSE, BLUE and singular correlation matrices are related, and thus illustrate the matrix results presented in [4], [5], and [6]. Reference [5] provides the matrix conditions that are used to show that a suggested estimator is BLUE.
Let us then move to more formal definitions. Let us assume a model y = Xβ + e, where y is an N-dimensional random vector, X is an N × p model matrix, β is a p-dimensional fixed coefficient vector, E(e) = 0, and cov(e) = σ²V. If β is estimated with the GLS estimator β_G = (X′V⁻¹X)⁻¹X′V⁻¹y, then cov(β_G) = σ²(X′V⁻¹X)⁻¹. If β is estimated with the OLS estimator β_O = (X′X)⁻¹X′y, then E(β_O) = β, and cov(β_O) = σ²(X′X)⁻¹X′VX(X′X)⁻¹. In the OLS regression, σ² is estimated with s² = ê′ê/(N − p), where ê is the residual vector ê = y − Xβ_O. Then, ê = (I − H)y, where H = X(X′X)⁻¹X′. The expected value of ê′ê is σ² tr((I − H)V). Thus, E(s²) = σ² tr((I − H)V)/(N − p).
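The two covariance formulas above can be compared numerically. The following sketch (the data and names are illustrative, not from the paper) builds a block diagonal V with compound symmetry blocks and confirms that the GLS covariance never exceeds the OLS covariance in the positive semi-definite ordering, i.e., that GLSE is BLUE here:

```python
import numpy as np

rng = np.random.default_rng(1)
K, n, rho, sigma2 = 5, 3, 0.4, 2.0   # arbitrary illustrative values
N = K * n

# Block-diagonal V with compound symmetry blocks.
block = (1 - rho) * np.eye(n) + rho * np.ones((n, n))
V = np.kron(np.eye(K), block)

x = rng.normal(size=N)
x -= x.mean()                         # centered regressor
X = np.column_stack([np.ones(N), x])  # intercept plus one regressor

Vinv = np.linalg.inv(V)
cov_gls = sigma2 * np.linalg.inv(X.T @ Vinv @ X)
XtX_inv = np.linalg.inv(X.T @ X)
cov_ols = sigma2 * XtX_inv @ X.T @ V @ X @ XtX_inv

# GLSE is BLUE: cov_ols - cov_gls is positive semi-definite.
diff = cov_ols - cov_gls
assert np.min(np.linalg.eigvalsh(diff)) >= -1e-10
```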
It is now assumed that the data consist of K groups, indexed with i, and that each group has n observations, indexed with j. It is denoted that N = Kn. V is block diagonal so that each block V_i has a compound symmetry correlation structure where the non-diagonal elements are equal to the intra-class correlation ρ. The term “intra-class correlation” is used for historical reasons, but otherwise the term “group” is used. The intercept is the first parameter in β, i.e., the first column in X is a vector of ones, 1_N. It is assumed further that the other columns are centered, i.e., the sum of the elements of each column is zero. This assumption makes the analysis simpler. Total and group averages are denoted as x̄, ȳ, x̄_i and ȳ_i.
Closed form algebraic equations are initially derived for the simple linear model y_ij = a + b x_ij + e_ij. A model with several regressors is presented later. Centering means that the working model is y_ij = μ + b(x_ij − x̄) + e_ij, where μ = a + b x̄. Estimates of μ and b are uncorrelated both in OLS and GLS. After estimating μ and b, an estimate of a is obtained from â = μ̂ − b̂x̄. Thereafter var(â) = var(μ̂) + x̄² var(b̂), and cov(â, b̂) = −x̄ var(b̂).
2. GLS in the Simple Linear Model
First, let us assume that −1/(n − 1) < ρ < 1, which means that V is non-singular. Then V_i⁻¹ is a matrix with diagonal elements:

(1 + (n − 2)ρ)/((1 − ρ)(1 + (n − 1)ρ)) (1)

The non-diagonal elements are equal to:

−ρ/((1 − ρ)(1 + (n − 1)ρ)) (2)
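Because the closed form of the inverse is used throughout, a quick numerical verification of (1) and (2) may be useful (an illustrative check, not part of the paper):

```python
import numpy as np

n, rho = 4, 0.3   # arbitrary illustrative values
Vi = (1 - rho) * np.eye(n) + rho * np.ones((n, n))  # compound symmetry block

# Closed-form inverse from (1) and (2).
denom = (1 - rho) * (1 + (n - 1) * rho)
diag = (1 + (n - 2) * rho) / denom
off = -rho / denom
Vi_inv_closed = (diag - off) * np.eye(n) + off * np.ones((n, n))

assert np.allclose(Vi_inv_closed, np.linalg.inv(Vi))
```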
Thereafter X′V⁻¹X is the diagonal matrix (the off-diagonal element vanishes because x is centered):

X′V⁻¹X = diag(N/(1 + (n − 1)ρ), (1/(1 − ρ))(Σ_i Σ_j x_ij² − ρn²Σ_i x̄_i²/(1 + (n − 1)ρ))) (3)

The second diagonal element in (3) can be expressed as Σ_i Σ_j x_ij²((1 − ρ) + (n − 1)ρ(1 − r_x))/((1 − ρ)(1 + (n − 1)ρ)) using the empirical intra-class correlation r_x of the regressor, which can be presented in two equivalent forms:

r_x = Σ_i Σ_{j≠k} x_ij x_ik/((n − 1)Σ_i Σ_j x_ij²) (4)

r_x = (n²Σ_i x̄_i²/Σ_i Σ_j x_ij² − 1)/(n − 1) (5)
Reference [7] explains how the intra-class correlation, dating back to Fisher, generalizes the correlation idea in order to measure the similarity of group members. Equation (4) appears as a covariance divided by a variance, and (5) is almost equal to the sample variance of the group means divided by the total sample variance, which resembles the definition of ρ. Equation (5) indicates that −1/(n − 1) ≤ r_x ≤ 1. The lower limit is obtained when all group means x̄_i are equal (to zero, as x is centered). The upper limit r_x = 1 is obtained when x_ij = x̄_i for all i and j. If r_x = 0, then the population variance of the group means x̄_i is equal to the population variance of x divided by n, resembling the variance of the mean of uncorrelated random variables.
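The equivalence of the two forms can be confirmed numerically; the following sketch (illustrative code, not from the paper) computes (4) and (5) for a centered x and checks the limits:

```python
import numpy as np

rng = np.random.default_rng(7)
K, n = 6, 3
x = rng.normal(size=(K, n))   # rows are groups
x -= x.mean()                 # center over all observations

S = np.sum(x**2)
# Form (4): within-group cross-products over (n-1) times the total sum of squares.
cross = sum(np.sum(np.outer(x[i], x[i])) - np.sum(x[i]**2) for i in range(K))
r4 = cross / ((n - 1) * S)
# Form (5): based on the group means.
r5 = (n**2 * np.sum(x.mean(axis=1)**2) / S - 1) / (n - 1)

assert np.isclose(r4, r5)
assert -1 / (n - 1) - 1e-12 <= r4 <= 1 + 1e-12
```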
The matrix σ²(X′V⁻¹X)⁻¹ is obtained by taking the reciprocals of the diagonal elements of (3) and multiplying by σ². The variances of the estimates of μ considered here are proportional to 1/N, and the variances of the estimates of b are proportional to 1/Σ_i Σ_j x_ij². When ρ = 0, var(μ̂) = σ²/N, and var(b̂) = σ²/Σ_i Σ_j x_ij². First, we note that the first diagonal element of (3) does not depend on the values of x. Thus, we directly derive that var_G(μ̂) = σ²(1 + (n − 1)ρ)/N. If ρ > 0, then var_G(μ̂) > σ²/N. For b, we obtain:
var_G(b̂) = σ²(1 − ρ)(1 + (n − 1)ρ)/(Σ_i Σ_j x_ij²((1 − ρ) + (n − 1)ρ(1 − r_x))) (6)
If ρ = 0 or r_x = ρ, then var_G(b̂) = σ²/Σ_i Σ_j x_ij². It holds that var_G(b̂) < σ²/Σ_i Σ_j x_ij², if ρ is between 0 and r_x. Equation (6) is an increasing function of the product ρr_x for a fixed ρ: it is an increasing function of r_x when ρ > 0, and a decreasing function of r_x when ρ < 0. When r_x = 1, then var_G(b̂) = σ²(1 + (n − 1)ρ)/Σ_i Σ_j x_ij². When r_x = −1/(n − 1), then var_G(b̂) = σ²(1 − ρ)/Σ_i Σ_j x_ij². The dependency of var_G(b̂) on ρ for different values of r_x is shown in Figure 1.
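The variance formula (6) can be checked against the direct matrix expression σ²(X′V⁻¹X)⁻¹; the sketch below (illustrative, not from the paper) does this for random grouped data:

```python
import numpy as np

rng = np.random.default_rng(3)
K, n, rho, sigma2 = 5, 3, 0.4, 1.0
x = rng.normal(size=(K, n))
x -= x.mean()                 # centered regressor, rows are groups

S = np.sum(x**2)
r_x = (n**2 * np.sum(x.mean(axis=1)**2) / S - 1) / (n - 1)   # form (5)
var_formula = sigma2 * (1 - rho) * (1 + (n - 1) * rho) / (
    S * ((1 - rho) + (n - 1) * rho * (1 - r_x)))             # equation (6)

# Direct computation: second diagonal element of sigma^2 (X'V^-1 X)^-1.
block = (1 - rho) * np.eye(n) + rho * np.ones((n, n))
V = np.kron(np.eye(K), block)
X = np.column_stack([np.ones(K * n), x.ravel()])
var_direct = sigma2 * np.linalg.inv(X.T @ np.linalg.inv(V) @ X)[1, 1]

assert np.isclose(var_formula, var_direct)
```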
3. OLS in the Simple Linear Model When ρ ≠ 0
What happens if OLS is used when ρ ≠ 0? Using cov(β_O) = σ²(X′X)⁻¹X′VX(X′X)⁻¹, provided in the introduction section, we obtain:
var_O(b̂) = σ²(1 + (n − 1)ρr_x)/Σ_i Σ_j x_ij² (7)
When ρ ≠ 0 and OLS is used to estimate the variances of the parameter estimates, the bias combines the bias of the estimate of σ² and the errors of the formulas σ²/N and σ²/Σ_i Σ_j x_ij², which indicate the variances implied by the OLS assumptions (these are not biases, as they are not related to the expected values). In our case:
E(s²) = σ²(N − 2 − (n − 1)ρ(1 + r_x))/(N − 2) (8)
It holds that E(s²) < σ² when ρ > 0, and E(s²) > σ² when ρ < 0, because 1 + r_x is always positive. Now,
Figure 1. The dependency of var_G(b̂) in (6) on ρ, the intra-class correlation of residual errors, for different values of r_x, the intra-class correlation of x (denoted on the curves), when n = 3. Blue lines are for negative values of r_x and red lines for positive values. The thick line is for r_x = 0. Note that the same variance is obtained for r_x = 1 with ρ and for r_x = −1/(n − 1) with −(n − 1)ρ.
s²(b̂) = s²/Σ_i Σ_j x_ij² (9)

In (9), s²(b̂) denotes the variance estimate we obtain in OLS. If it is combined with (8), we can obtain E(s²(b̂)) in terms of σ², ρ, r_x, n and N. The estimated variance is biased downwards relative to σ²/Σ_i Σ_j x_ij² if ρ > 0. If we similarly combine (8) and (7) we obtain:
E(s²(b̂))/var_O(b̂) = (N − 2 − (n − 1)ρ(1 + r_x))/((N − 2)(1 + (n − 1)ρr_x)) (10)
When K increases, the bias coefficient in (10) approaches 1/(1 + (n − 1)ρr_x), which again demonstrates the adverse effect of a positive ρr_x. The direction of the bias can be seen using the difference between the numerator and the denominator of (10). Thus

E(s²(b̂)) < var_O(b̂) if and only if ρ(1 + (N − 1)r_x) > 0 (11)

When K increases (and thus N also increases), the direction of the bias is dependent on the sign of ρr_x. The condition could be stated in a simpler form, but (11) is needed for comparison with [3], in which two inequalities are derived which, taken together, imply that the F-test in OLS leads to p-values that are too small. In our case, the conditions in [3] are ρ > 0 and r_x > 0, which together imply the validity of the inequality in (11). Moreover, (11) can solve the direction of the bias in the case where the sub-conditions indicate opposing directions. When s²(b̂) is used in a t-test, its bias also produces bias in the computed p-values. Reversing the inequality in (11), we obtain a condition for obtaining too large p-values.
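The bias results can be illustrated numerically. The sketch below (my own construction; the data are arbitrary) computes the exact expectation E(s²) through the trace formula from the introduction and confirms the direction of the bias given by (11):

```python
import numpy as np

rng = np.random.default_rng(5)
K, n, rho, sigma2 = 6, 3, 0.5, 1.0
N = K * n
# A group-level shift makes a positive r_x likely.
x = rng.normal(size=(K, n)) + 2.0 * rng.normal(size=(K, 1))
x -= x.mean()

S = np.sum(x**2)
r_x = (n**2 * np.sum(x.mean(axis=1)**2) / S - 1) / (n - 1)

V = np.kron(np.eye(K), (1 - rho) * np.eye(n) + rho * np.ones((n, n)))
X = np.column_stack([np.ones(N), x.ravel()])
H = X @ np.linalg.inv(X.T @ X) @ X.T

# Exact expectation of the OLS residual variance estimator.
E_s2 = sigma2 * np.trace((np.eye(N) - H) @ V) / (N - 2)
E_var_hat = E_s2 / S                               # expected estimated var(b)
true_var = sigma2 * (X.T @ V @ X)[1, 1] / S**2     # true OLS var(b), as in (7)

# With rho > 0 the condition in (11) holds whenever 1 + (N-1) r_x > 0,
# and then the estimated variance is biased downwards.
if rho * (1 + (N - 1) * r_x) > 0:
    assert E_var_hat < true_var
```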
4. BLUE for Several Regressors, Nonsingular V
Let us now assume a general linear model y = Xβ + e, the first column of X being 1_N and the other columns centered. It is here assumed that V is nonsingular, i.e., −1/(n − 1) < ρ < 1. Let x and z denote two regressors. Both the elements of V⁻¹ and the elements of X′V⁻¹X are inversely proportional to (1 − ρ)(1 + (n − 1)ρ). Thus, their absolute values increase when ρ approaches −1/(n − 1) or 1, leading to a singular V when either of these limits is obtained. Let us define W = (1 − ρ)(1 + (n − 1)ρ)V⁻¹. The diagonal elements of each block W_i are equal to 1 + (n − 2)ρ, and the nondiagonal elements are equal to −ρ. Let W combine all blocks. Using W, we obtain a linear estimator:
β̃ = (X′WX)⁻¹X′Wy (12)
The diagonal element of X′WX is for x:

x′Wx = (1 + (n − 2)ρ)Σ_i Σ_j x_ij² − ρΣ_i Σ_{j≠k} x_ij x_ik (13)

The first diagonal element of X′WX, corresponding to the intercept, is N(1 − ρ), and an element for x is:

x′Wx = (1 + (n − 1)ρ)Σ_i Σ_j (x_ij − x̄_i)² + (1 − ρ)Σ_i Σ_j x̄_i² (14)

Equation (14) explains why (6) is an increasing function of r_x when ρ > 0: with a large r_x, GLS places a large weight, 1 + (n − 1)ρ, on the within-group component, which then has small variation. The non-diagonal elements are zero for the first row and column, and generally:

x′Wz = (1 + (n − 1)ρ)Σ_i Σ_j (x_ij − x̄_i)(z_ij − z̄_i) + (1 − ρ)Σ_i Σ_j x̄_i z̄_i (15)

Note that in the second term the summation over j could be dropped by multiplying with n.
The first element of X′Wy is (1 − ρ)Σ_i Σ_j y_ij, and the others are:

x′Wy = (1 + (n − 1)ρ)Σ_i Σ_j (x_ij − x̄_i)(y_ij − ȳ_i) + (1 − ρ)Σ_i Σ_j x̄_i ȳ_i (16)
The averages x̄_i and ȳ_i do not contribute to the estimates, but they may increase (or decrease) understanding. They do not indicate centering of y, because they are not involved in the first element of X′Wy. In GLS, the same estimates are obtained with any scaling of V⁻¹; thus (12) equals the GLSE for a non-singular V, but it may be computationally more stable. Variances are then computed using

cov(β̃) = σ²(1 − ρ)(1 + (n − 1)ρ)(X′WX)⁻¹ (17)
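The claim that any scaling of V⁻¹ gives the same estimates can be verified numerically; the following sketch (illustrative, not from the paper) builds W directly from its elements and compares (12) with the GLSE:

```python
import numpy as np

rng = np.random.default_rng(11)
K, n, rho = 4, 3, 0.6
N = K * n

# W_i with diagonal 1+(n-2)rho and off-diagonal -rho.
Wi = (1 + (n - 2) * rho) * np.eye(n) - rho * (np.ones((n, n)) - np.eye(n))
W = np.kron(np.eye(K), Wi)
V = np.kron(np.eye(K), (1 - rho) * np.eye(n) + rho * np.ones((n, n)))

x = rng.normal(size=N); x -= x.mean()
z = rng.normal(size=N); z -= z.mean()
X = np.column_stack([np.ones(N), x, z])
y = 1.0 + 2.0 * x - z + rng.normal(size=N)   # arbitrary response

# W is the scaled inverse (1-rho)(1+(n-1)rho) V^-1 ...
assert np.allclose(W, (1 - rho) * (1 + (n - 1) * rho) * np.linalg.inv(V))
# ... so the estimator (12) equals the GLSE.
beta_W = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
assert np.allclose(beta_W, beta_gls)
```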
5. BLUE When V is Singular
5.1. BLUE in the Simple Linear Regression with Singular V
Let us introduce the estimation with a singular V using the simple linear model y_ij = a + b x_ij + e_ij. If ρ = 1 and r_x < 1, we can estimate b in the model with zero variance with b̂ = (y_ij − y_ik)/(x_ij − x_ik), using two observations of the same group such that x_ij ≠ x_ik. As ρ = 1 implies that e_ij = e_ik for all observations of a group, a can be estimated by computing the arithmetic mean (which is the OLS estimator) of y_ij − b̂x_ij or of ȳ_i − b̂x̄_i, the latter computation implying the correct estimation variance σ²/K.
If r_x < 1, we can find at least two observations in the same group having different x values. If there are more observations deviating from the group means, there is an infinite number of estimating equations which all produce this same estimate with zero variance.
If ρ = 1 and r_x = 1, then all y and x values are equal to the group means in all groups (for y with probability one), and b needs to be estimated with OLS using the group means. This means that var(b̂) is n times larger than we would get for ρ = 0, as could be anticipated from Figure 1 for r_x = 1. The same estimates are produced using observation level OLS regression, which produces biased estimates for the variances of â and b̂.
If ρ = −1/(n − 1) and r_x > −1/(n − 1), then we can find at least two groups i and i′ so that x̄_i ≠ x̄_{i′}, and b can be estimated with b̂ = (ȳ_i − ȳ_{i′})/(x̄_i − x̄_{i′}) with zero variance, because the group means of the residual errors then have zero variance. If there are more than two differing group means, there is an infinite number of estimating equations producing the same estimate. After estimating b, a can be estimated with the arithmetic mean of ȳ_i − b̂x̄_i with zero variance. The same estimate is obtained with the arithmetic mean of y_ij − b̂x_ij, but standard OLS computations would produce a biased estimate for the estimation variance.
If ρ = −1/(n − 1) and r_x = −1/(n − 1), i.e., all group means of x and of y are equal (for y with probability one), then the BLUE of b and a can be obtained by dropping one observation from each group and doing OLS regression in the remaining data. This means that var(b̂) is equal to n/(n − 1) times the OLS variance obtained for ρ = 0. For r_x = −1/(n − 1), var_G(b̂) = σ²(1 − ρ)/Σ_i Σ_j x_ij², as could be anticipated from Figure 1.
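The zero-variance estimators of this subsection are easy to illustrate by simulation. The sketch below (my own construction) generates data with ρ = 1 as a pure group effect and recovers b exactly from a within-group difference:

```python
import numpy as np

rng = np.random.default_rng(2)
K, n, a, b, sigma = 4, 3, 1.0, 2.5, 1.0
g = sigma * rng.normal(size=(K, 1))   # rho = 1: error is a pure group effect
x = rng.normal(size=(K, n))           # r_x < 1: x varies within groups
y = a + b * x + g                     # the same error for all members of a group

# Slope from any two observations in one group with different x values:
# the common group error cancels, so b is recovered exactly.
b_hat = (y[0, 0] - y[0, 1]) / (x[0, 0] - x[0, 1])
assert np.isclose(b_hat, b)

# The intercept is then estimated from the K independent group means,
# with estimation variance sigma^2 / K.
a_hat = np.mean(y.mean(axis=1) - b_hat * x.mean(axis=1))
```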
5.2. BLUE in the Multiple Regression When ρ = 1
When ρ = 1, then V is singular. Also W is singular; all of its diagonal elements are equal to n − 1, and the nondiagonal elements are equal to −1. The estimator (12) cannot be applied directly, because the first diagonal element of X′WX is N(1 − ρ) = 0, thus producing a singular X′WX. According to (14), the diagonal elements are zero also for such predictors x having r_x = 1, i.e., no variation in the within-group deviations x_ij − x̄_i. If all predictors have r_x = 1, then OLSE is BLUE, because the column space of VX is contained in the column space of X, see [4].
A general BLUE can be obtained using the decomposition X = (X_1 : X_2), where X_1 contains 1_N and all columns having r_x = 1, and X_2 contains the other columns. It is now required that K ≥ 3. Let β = (β_1′, β_2′)′ decompose β similarly. If we weight the columns of X_1 as done in OLS, i.e., replace the columns WX_1 in Equation (12) with X_1, we get an estimator

β̃ = ((X_1 : WX_2)′X)⁻¹(X_1 : WX_2)′y (18)

I suggest that the estimator is called a GOLS estimator, as it combines the OLS and GLS estimation principles in the same estimator. Using the standard formula for computing the inverse of a partitioned matrix, the estimator can be presented as
(X_1′X_1)β̃_1 + (X_1′X_2)β̃_2 = X_1′y, (X_2′WX_2)β̃_2 = X_2′Wy (19)

Thus

β̃_2 = (X_2′WX_2)⁻¹X_2′Wy (20)

Thus, β̃_2 is the GLSE of β_2 as estimated solely using X_2, because X_2′WX_1 = 0 when ρ = 1. Noting that the second block row of (19) affects the first block row only through β̃_2, we obtain:

β̃_1 = (X_1′X_1)⁻¹(X_1′y − X_1′X_2β̃_2) (21)

β̃_1 = (X_1′X_1)⁻¹X_1′(y − X_2β̃_2) (22)

where (X_1′X_1)⁻¹X_1′y is the OLS estimator of β_1 obtained ignoring X_2.
Using (22), we get cov(β̃_1), i.e., the same variance that we get when we take the group means as the data, with uncorrelated errors of variance σ², and regress the group means of y on the group means of all variables in X_1. This is natural, because within each group all values of y and of the X_1 variables are equal (for y with probability one).
Denote that M = I − X(X′X)⁻¹X′. An estimator Gy is BLUE for Xβ if and only if G(X : VM) = (X : 0), see [5]. If X has full column rank (as we have), then β is estimable, and Fy is BLUE for β if and only if F(X : VM) = (I : 0). Then, that FX = I for the GOLS estimator can be seen directly from taking into account that F = ((X_1 : WX_2)′X)⁻¹(X_1 : WX_2)′. Noting that WV = 0 and X_1′V = nX_1′ when ρ = 1, we obtain that:

(X_1 : WX_2)′VM = n(X_1 : 0)′M (23)

Matrix X(X′X)⁻¹X′ is a projector to the space spanned by the columns of X. Now X_1′M = 0, thus FVM = 0, which completes the demonstration that the GOLS estimator is BLUE.
The GOLS estimator can be put into the general matrix theory context using [5]. Now the column spaces of X_1 and X_2 are disjoint, and so are the column spaces of X_1 and WX_2. In such a case, (20) provides the BLUE for β_2. The OLS estimator of β_1, obtained ignoring X_2, is an unbiased estimator. This unbiased estimator is updated into the BLUE in (22), being an application of Proposition 10.5 on p. 228 of [5].
5.3. BLUE When ρ = −1/(n − 1)
If ρ = −1/(n − 1), then all elements of each block W_i are equal to 1/(n − 1). Let us organize X into (X_1 : X_2), where X_1 contains all such predictors for which r_x = −1/(n − 1), and X_2 contains all other columns. Equation (18) provides again a BLUE with this decomposition, and the same arguments can be used to prove it also for this case. Now we obtain the same variance as if ρ = 0 and β is estimated from data where one redundant observation is dropped from each group.
It may be a useful exercise, in order to gain confidence in the multiple regressor derivations, to show that the estimators derived in Section 5.1 for the simple linear model are produced also with the matrix formulas. The matrix formulas use all observations in the computations, while the formulas in Section 5.1 utilize only the nonredundant information.
6. Discussion
When y is regressed on x, it is implicitly assumed that the regression of y on the group means x̄_i, and the regression of y on the within-group deviations x_ij − x̄_i, have the same slope. GLS estimation provides different weights to x̄_i and x_ij − x̄_i when attempting to utilize the correlation structure, although x̄_i and x_ij − x̄_i are put into the same column of X, as shown in (14). When developing mixed models for grouped data, it is often necessary to consider both x̄_i and x_ij − x̄_i as separate predictors, see [8]. It is natural to assume that y_ij is related to x_ij − μ_i and μ_i, where μ_i is the mean of x in the whole group. Reference [8] suggests a solution, based on a multivariate mixed model, which solves the measurement error bias problem that occurs when x̄_i, computed from a sample from group i, is used to “measure” μ_i.
The mean of the variable x_ij − x̄_i is zero in each group, thereby indicating a negative within-group correlation. If y is linearly related to x_ij − x̄_i, the negative within-group correlation of x_ij − x̄_i is also transmitted to y. Hence, the addition of the predictor x_ij − x̄_i to the model makes the estimated variance of the random intercept larger, as the “negative variance” is subtracted from the residual variance, see [8] and [9].
The mixed model formulation always leads to non-negative intra-class correlations. However, negative intra-class correlations are needed in situations where the group means have a smaller variance than that implied by the assumption of uncorrelated individuals. In the marginal interpretation of mixed model equations, a negative definite variance matrix of random effects that maximizes the likelihood is allowed, leading to negative correlations and negative variances, provided that the marginal variance matrix of the y-vector is positive semi-definite, see [10]. The significance of a negative intra-class coefficient is dependent on the group size.
When the x-variable is a random variable with a theoretical intra-class correlation ρ_x, the empirical intra-class correlation r_x is a random variable. The expected value of r_x approaches ρ_x when K increases. Simulations with normally distributed variables show that, for given values of n and K, E(r_x) can be well described with a simple function of ρ_x. Negatively correlated variables can be simulated using principal components.
Reference [2] discussed the estimation of the slope in a simple linear model assuming paired observations, i.e., n = 2 for all i, and an equal correlation within pairs. To compare the formulas presented above with the formulas in [2], denote d_i = x_i2 − x_i1; then Σ_i d_i² = (1 − r_x)Σ_i Σ_j x_ij², and r_x = −1 when all x̄_i's values are equal. It holds that −1 ≤ r_x ≤ 1. They compare the variance of the GLSE to the variance of the OLS estimate obtained by regressing the differences y_i2 − y_i1 on the differences d_i. The standard OLSE of the slope has, however, a smaller variance than their OLSE based on differences when ρ < 1/(2 − r_x). For ρ = 0, the ratio of the variance of the standard OLSE to the variance of their OLSE is (1 − r_x)/2. When all x̄_i's are equal, then r_x = −1, the variances coincide, and all three methods provide the same estimate.
If the group sizes n_i vary in a data set, it is not reasonable to assume the same negative intra-class coefficient ρ for all groups. Thus, ρ needs to be made group specific, with ρ_i ≥ −1/(n_i − 1). Reference [7] generalizes the intra-class correlation to non-equal group sizes, but this generalization does not provide a meaningful analysis for the interaction of the intra-class correlation of the residual and the values of the regressors. However, it is possible to define a measure for the between-group variation of x which gets the value of zero if there is no variation between groups, and to use this measure to analyze how the intra-class correlation of the residual errors and the between-group and within-group variation of the predictors interact. This analysis can be extended to auto-correlation structures, and is under development.
In experiments, (6) informs us that r_x should be −1/(n − 1) (all x̄_i should be equal) if it is anticipated that ρ > 0. If it is anticipated that ρ < 0, then the same treatment level should be applied to the whole group, i.e., r_x = 1.
The derivations of this paper are fully covered by the general matrix theory. The purpose of the paper is to make the general matrix formulas understandable using algebraic derivations which show how the between-group and within-group variations of the residual errors are connected to the between-group and within-group variations of the regressors in the estimation of a general linear model.
Acknowledgements
I thank Ronald Christensen for taking seriously the first version dealing with the simple linear model, Simo Puntanen for patient guidance to the conditions which prove that an estimator is BLUE, and Jarkko Isotalo for indicating how the GOLS estimator is related to the general BLUE theory. I take full responsibility if I have not understood their explanations correctly. They fly smoothly in linear spaces and subspaces where I need to walk watching each step.