The Use of Yanai's Generalized Coefficient of Determination to Reduce the Number of Variables in DEA Models

This paper proposes a new method to reduce the dimensionality of the input and output spaces in DEA models. The method is based on Yanai's Generalized Coefficient of Determination and on the concept of the pseudo-rank of a matrix. In addition, the paper suggests a rule to determine the cardinality of the subset of selected variables so as to maximize the gain in discriminatory power while minimizing the loss of information.


Introduction
Data Envelopment Analysis (DEA) is a nonparametric method for estimating production frontiers. DEA involves solving a set of linear programming (LP) problems to determine a production frontier against which the technical efficiency of the Decision Making Units (DMUs) is calculated. The basic DEA model was originally proposed by [1] and is nowadays known as "the CCR model".
The basic CCR model aims to maximize the ratio between a weighted sum of outputs and a weighted sum of inputs. The weights of these sums are chosen subject to feasibility conditions and under the hypothesis of constant returns to scale. Charnes, Cooper and Rhodes transformed the fractional CCR model into a linear model whose dual is commonly referred to in the literature as the DEA model (the details of this procedure can be found in [2], ch. 1 and 2).
The CCR model and its variants have been increasingly applied since the original description by [1]. Researchers worldwide have used the model as a tool to assess technical efficiency. DEA has become a widely chosen method for this type of study, especially in the absence of an explicit production function defining the relationship between inputs and outputs.
It is worth noting that if there is evidence that the inputs and outputs can be linked through a function, then the stochastic frontier model can be used as an alternative to DEA; see [3] for stochastic frontier production models.
One of the most frequent problems associated with the CCR model is the lack of discrimination among DMUs when the number of inputs and outputs is large relative to the number of DMUs. A large number of variables relative to the number of observations may produce a large number of efficient DMUs in the sample, thus reducing the model's ranking capability. This is a characteristic of the CCR model: the lower the number of DMUs, the less active the restrictions imposed on the maximum efficiency multipliers. In fact, according to [4], DEA models are subject to the curse of dimensionality.
In the present work, we propose a simple and objective method to address a situation in which the original inputs and outputs have been correctly selected but the low number of observations has translated into low discrimination power. The approach reduces the dimensions of the input and output spaces and therefore does not require additional information. Additionally, it does not require post-estimation procedures (as in the case of the super-efficiency and cross-efficiency approaches). The proposed approach relies on multivariate statistical techniques, namely Principal Components Analysis and the use of a correlation matrix. Again, it should be emphasized that this approach is not intended for the selection of inputs or outputs. Instead, it is a support tool used to increase the discriminatory power without a significant loss of information, i.e., the discriminatory power is increased regardless of knowledge of which variables are essential for the model. The method makes extensive use of linear algebra to measure the similarity between subspaces by exploiting the special structure of the space of positive semidefinite matrices.
Following this introduction, the remainder of the work is structured as follows: Section 2 briefly reviews multivariate variable selection in DEA models, Section 3 introduces the proposed methodology, Section 4 discusses applications of the proposed method to the CCR model, and finally, Section 5 presents conclusions and suggestions for future research.
A consequence of using a large number of variables relative to the number of observations is the loss of discrimination power due to the generation of a large number of efficient DMUs. There is no consensus on the optimal number of inputs and outputs to be used. However, [2] suggest the following rule for the number of DMUs $J$: $J > \max\{M \times N, 3(M + N)\}$, where $M$ and $N$ correspond to the number of outputs and inputs respectively. One of the most common ways of selecting variables in DEA is the use of correlation matrices for inputs and outputs. When two variables (inputs or outputs) are highly correlated, one is discarded, usually on the basis of ad hoc criteria.
However, eliminating one or the other variable could have a dramatic impact on estimated efficiency.²
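As a quick illustration of the rule of thumb attributed to [2], the sketch below checks whether a sample of J DMUs exceeds max{M × N, 3(M + N)}. The function name and its arguments are illustrative assumptions, not part of the original paper; the figures used (12 DMUs, 12 outputs, 1 input) are those of the application discussed later in the text.

```python
def meets_rule_of_thumb(num_dmus: int, num_outputs: int, num_inputs: int) -> bool:
    """Return True when J > max{M x N, 3(M + N)}.

    num_dmus is J, num_outputs is M and num_inputs is N (hypothetical names).
    """
    threshold = max(num_outputs * num_inputs, 3 * (num_outputs + num_inputs))
    return num_dmus > threshold


# 12 DMUs with 12 outputs and 1 input: the threshold is max(12, 39) = 39.
sample_is_large_enough = meets_rule_of_thumb(num_dmus=12, num_outputs=12, num_inputs=1)
print(sample_is_large_enough)
```

Under this rule the sample of the application is far too small for the full set of outputs, which is the situation the proposed reduction is meant to address.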
In recent years, the application of multivariate statistical methods, especially Principal Components Analysis (PCA), has appeared as a satisfactory alternative for variable reduction.³ The formulation of a DEA model in which inputs and outputs are summarized as principal components (PCs) is the focus of the work of [14] and [15]. One limitation of using PCs instead of inputs and outputs is that these "new" variables may take negative values and must therefore be transformed before they can be used in DEA. That transformation, however, may impact results. One way of overcoming the problem is to use the additive DEA model (see [16]), which is invariant to translation of inputs and outputs. In [17], the author also shows that an input-oriented BCC model (after [18]) is invariant to output translation.
There is also a practical difficulty. Even if the problem of negative PCs is resolved, the question remains of how to interpret the results in terms of the projection of inputs and outputs (that is, predicting quantities). The only satisfactory way of accomplishing this is back transformation to the original variables, which might require considerable computational effort. Some authors select variables based upon their contribution to the PCs. Specifically, the variables with the largest absolute linear combination coefficient are selected. Because it is common for the first few components to explain most of the data variance, the result is a considerably reduced subset of variables.
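The loading-based heuristic described above can be sketched as follows. This is a minimal illustration, assuming NumPy and a correlation-matrix PCA; the function name, the rule of taking one variable per retained component, and the way already chosen variables are skipped are assumptions made for the example, not a procedure taken from the cited works.

```python
import numpy as np


def select_by_loadings(data: np.ndarray, num_components: int) -> list[int]:
    """Pick, for each of the first PCs, the variable with the largest |loading|.

    data has one row per observation and one column per variable.
    """
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]          # decreasing eigenvalues
    loadings = eigvecs[:, order]
    selected: list[int] = []
    for comp in range(num_components):
        # Rank variables by absolute coefficient in this component.
        ranked = np.argsort(-np.abs(loadings[:, comp]))
        for idx in ranked:
            if int(idx) not in selected:       # avoid picking a variable twice
                selected.append(int(idx))
                break
    return selected


rng = np.random.default_rng(0)
toy_data = rng.normal(size=(30, 5))            # synthetic data, for illustration only
chosen = select_by_loadings(toy_data, num_components=2)
print(chosen)
```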
In [19] the authors employ the method of partial covariance analysis to identify the correlation between variables as well as the contribution of each variable to these correlations.With this technique, the authors demonstrate that the removal of variables with little contribution to the correlations does not significantly change the results.
The method we propose is based on the work of [20] and [21] and combines PCA with elements of the method of [19]. It relies on Yanai's Generalized Coefficient of Determination (GCD), proposed in [22]. The selected subset of PCs (which usually includes the PCs that explain eighty to ninety percent of the data variance) corresponds to the original variables that maximize the GCD.
³ In fact, the use of PCA to summarize a data set is widespread in areas ranging from image processing ([24]) to meteorology ([25]).
The number of PCs is determined before the correlation matrix is generated, by estimating the pseudo-rank of the matrix as proposed by [20], based on the specific geometric structure of the cone of positive semidefinite matrices.
In the following section, a brief discussion of the geometrical structure of the cone of positive semi-definite matrices and of PC analysis is presented and the GCD is subsequently defined.

The Pseudo-Rank of a Matrix and Yanai's Generalized Coefficient of Determination
This section briefly presents basic concepts described in detail in [20], [21] and [26].
Let $C_p$ be the cone of positive semidefinite matrices of dimension $p \times p$, equipped with the Frobenius inner product

$$\langle A, B \rangle_F = \operatorname{tr}(A^{\top} B). \tag{1}$$

The norm induced by (1) is denoted by $\|\cdot\|_F$. For any matrix $V \in C_p$, the set $\{\alpha V : \alpha \ge 0\}$ is called the ray associated with $V$. The ray associated with the identity matrix of dimension $p$ is called the central ray of $C_p$. With the definitions above, it is possible to find the angle between the rays associated with any two matrices $A$ and $B$. This angle is the arc whose cosine is

$$\cos(A, B) = \frac{\langle A, B \rangle_F}{\|A\|_F \, \|B\|_F}.$$

In [19], the author demonstrates that $C_p$ has a layered structure, consisting of several cones nested one inside another, as shown in Figure 1.
Based on this observation, the author argues that the region close to the central ray contains only matrices of full rank, or at least of rank $p - 1$ (which can occur at the boundary of such regions). The farther away from the central ray, the lower the rank of the matrices.
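To make the geometry concrete, the following sketch computes the angle between the rays of two matrices from the Frobenius inner product, and compares a nearly scalar full-rank matrix with a rank-one matrix. It assumes NumPy; the function name and the two test matrices are illustrative choices, not taken from the paper.

```python
import numpy as np


def ray_angle(a: np.ndarray, b: np.ndarray) -> float:
    """Angle (in radians) between the rays associated with matrices a and b."""
    cosine = np.trace(a.T @ b) / (np.linalg.norm(a, "fro") * np.linalg.norm(b, "fro"))
    # Clip guards against rounding slightly outside [-1, 1].
    return float(np.arccos(np.clip(cosine, -1.0, 1.0)))


identity = np.eye(4)                                # generates the central ray
near_center = np.diag([1.0, 0.9, 1.1, 1.0])         # full rank, balanced eigenvalues
rank_one = np.diag([1.0, 0.0, 0.0, 0.0])            # lowest nonzero rank

angle_near = ray_angle(identity, near_center)
angle_rank_one = ray_angle(identity, rank_one)
print(np.degrees(angle_near), np.degrees(angle_rank_one))
```

The rank-one matrix lies at a much wider angle from the central ray than the balanced full-rank matrix, in line with the layered picture described above.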
However, matrices of full rank are also found outside of the core. Because they may have eigenvalues close to zero, they behave as low-rank matrices. The question then is how far (into the core of the cone) one has to move to avoid such matrices. The answer to this question lies in the concept of the pseudo-rank of a matrix. According to [20], the pseudo-rank of a matrix [...]. The author shows that this value is given by [...], where $\lceil z \rceil$ denotes $z$ rounded up to the nearest integer. Let $V$ be the covariance or correlation matrix of a data matrix $A$ with $p$ variables and $n$ observations [...]. In this case, the pseudo-rank of $V$ corresponds to the number of components to be used as representative of all the accumulated variance associated with $A$.
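The expression for the pseudo-rank is not legible in this version of the text, so the sketch below does not reproduce it. As a stand-in, and purely as an assumption, it uses the classical bound rank(V) ≥ (tr V)² / ‖V‖²_F for positive semidefinite V, which equals p times the squared cosine of the angle between V and the central ray, rounded up. The function name is hypothetical, and the result should not be read as the value defined in [20].

```python
import math

import numpy as np


def trace_rank_bound(v: np.ndarray) -> int:
    """Rounded-up lower bound on the rank of a PSD matrix: (tr V)^2 / ||V||_F^2.

    Stand-in for the pseudo-rank of [20]; NOT necessarily the same quantity.
    """
    ratio = np.trace(v) ** 2 / np.linalg.norm(v, "fro") ** 2
    # Small tolerance so that exact integers are not pushed up by rounding noise.
    return math.ceil(ratio - 1e-12)


well_conditioned = np.eye(4)
nearly_low_rank = np.diag([5.0, 4.0, 0.1, 0.1])     # full rank, two tiny eigenvalues

bound_identity = trace_rank_bound(well_conditioned)
bound_nearly_low = trace_rank_bound(nearly_low_rank)
print(bound_identity, bound_nearly_low, np.linalg.matrix_rank(nearly_low_rank))
```

The second matrix has full algebraic rank, yet the bound is smaller than its dimension, which is the kind of behavior the discussion above attributes to full-rank matrices with eigenvalues close to zero.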

Yanai's Generalized Coefficient of Determination
Let $A$ represent a data matrix of dimension $(n \times p)$, where $p$ indicates the number of variables and $n$ the number of observations for each variable. In this context, $A$ may refer to a matrix of discretionary/nondiscretionary outputs/inputs. It is important to keep in mind that in DEA models the observations are arranged along the columns of the pertinent matrices. Thus, for the implementation of the proposed method, matrix $A$ should be taken as the transposed output/input matrix.
Given the covariance or correlation matrix $S$ of the data, let $\Lambda$ and $P$ be the diagonal matrix of eigenvalues (arranged in decreasing order) and the matrix of normalized eigenvectors of $S$ respectively. The PCs are the columns of the $(n \times p)$ matrix given by $C = AP$. Using the spectral decomposition of $S$, it is easy to show that the covariance matrix of $C$ is exactly $\Lambda$, so that the variables in $C$ are uncorrelated. For this reason the transformation $AP$ is sometimes called data "decorrelation".
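The decorrelation property can be checked numerically. The sketch below, which assumes NumPy and synthetic centered data, forms the components as the product of the data matrix and the eigenvector matrix and compares their covariance with the diagonal matrix of eigenvalues; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(size=(50, 3))                 # synthetic data: 50 observations, 3 variables
raw[:, 1] += 0.8 * raw[:, 0]                   # induce correlation between two variables
a = raw - raw.mean(axis=0)                     # centered data matrix A

s = np.cov(a, rowvar=False)                    # covariance matrix S
eigvals, eigvecs = np.linalg.eigh(s)
order = np.argsort(eigvals)[::-1]              # eigenvalues in decreasing order
lam = eigvals[order]
p = eigvecs[:, order]

c = a @ p                                      # principal components C = AP
cov_c = np.cov(c, rowvar=False)

# The covariance of C should coincide with diag(lam) up to rounding error.
print(np.round(cov_c, 6))
```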
Consider $K$ to be a subset of indices associated with $k \le p$ PCs arranged in decreasing order of eigenvalues (in general, the first $k$). Similarly, let $Q$ be a subset of indices associated with $q \le p$ original variables. Let $\mathcal{K}$ and $\mathcal{Q}$ denote the subspaces generated by the vectors with indices in $K$ and $Q$ respectively. The following matrices are then defined:
• $A_K$ is the submatrix of $A$ in which the columns with indices in $K$ are maintained;
• $S_K$ is the covariance matrix associated with $A_K$;
• $\Lambda_K$ is the matrix of the eigenvalues associated with $S_K$;
• $P_K$ is the matrix of eigenvectors associated with the eigenvalues in $\Lambda_K$.
Let $P_{\mathcal{K}}$ be the matrix of orthogonal projection onto the subspace $\mathcal{K}$ and, similarly, let $P_{\mathcal{Q}}$ be the matrix of orthogonal projection onto the subspace $\mathcal{Q}$; the latter is expressed in terms of $I_Q$, the submatrix of the identity matrix obtained by selecting the $q$ columns with indices in $Q$. Given the definitions above, Yanai's GCD between the subspaces $\mathcal{Q}$ and $\mathcal{K}$ is defined as the matrix correlation between the two projection matrices:

$$GCD(\mathcal{K}, \mathcal{Q}) = \frac{\langle P_{\mathcal{K}}, P_{\mathcal{Q}} \rangle_F}{\|P_{\mathcal{K}}\|_F \, \|P_{\mathcal{Q}}\|_F} = \frac{\operatorname{tr}(P_{\mathcal{K}} P_{\mathcal{Q}})}{\sqrt{kq}}.$$

Supposing that $k = k^*$ remains fixed, where $k^*$ is the pseudo-rank of the covariance matrix, the selection of variables exhibiting the greatest contribution to the principal components selected in $K$ is the set of $q$ variables that maximizes the GCD.
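A minimal numerical sketch of this selection step is given below. It assumes NumPy, builds both orthogonal projectors directly in observation space through a QR factorization, and takes the GCD as the trace of the product of the two projectors divided by the square root of kq. The function names, the exhaustive search over subsets and the synthetic data are assumptions for illustration and are practical only for a small number of variables.

```python
from itertools import combinations

import numpy as np


def projector(basis: np.ndarray) -> np.ndarray:
    """Orthogonal projector onto the column space of a full-column-rank matrix."""
    q, _ = np.linalg.qr(basis)
    return q @ q.T


def gcd(data: np.ndarray, pc_idx, var_idx) -> float:
    """GCD between the span of selected PCs and the span of selected variables."""
    a = data - data.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(a, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    scores = a @ eigvecs[:, order]                  # PCs, largest variance first
    p_k = projector(scores[:, list(pc_idx)])
    p_q = projector(a[:, list(var_idx)])
    return float(np.trace(p_k @ p_q) / np.sqrt(len(pc_idx) * len(var_idx)))


def best_subset(data: np.ndarray, k: int, q: int) -> tuple:
    """Subset of q variables maximizing the GCD with the first k PCs."""
    num_vars = data.shape[1]
    return max(combinations(range(num_vars), q),
               key=lambda idx: gcd(data, range(k), idx))


rng = np.random.default_rng(1)
toy = rng.normal(size=(30, 4))                      # synthetic data, 4 variables
print(best_subset(toy, k=2, q=2), gcd(toy, range(4), range(4)))
```

When all PCs and all variables are used, the two subspaces coincide and the coefficient equals one, which gives a simple sanity check for the implementation.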

Reduced CCR Model⁴
A practical example of the proposed method is provided in this section. To that end, the CCR model is presented with the original variables (hereafter called the "general model" and denoted by CCRg) and with the subset of selected variables (the "reduced model", denoted by CCRr). For the sake of simplicity, only output-oriented models are discussed below.
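Let us suppose that there are $J$ DMUs under study, each using a vector $x_j$ of inputs to produce a vector $y_j$ of outputs, $j = 1, 2, \ldots, J$. Let $X$ and $Y$ denote the input and output matrices and $\lambda$ the $J \times 1$ vector of intensities. The technical efficiency of DMU $j$ in the CCRg model, $ET_j^g$, is estimated by solving the linear programming problem

$$\max_{\theta, \lambda} \; \theta \quad \text{subject to} \quad X\lambda \le x_j, \quad Y\lambda \ge \theta y_j, \quad \lambda \ge 0. \tag{6}$$

Let us now suppose that a set of variables was selected following the procedure presented in Section 3. For simplicity's sake, let us assume that only outputs were selected. Let $y_j^q$ denote the output vector of DMU $j = 1, 2, \ldots, J$ restricted to a selection of $q < N$ outputs, and let $Y^q$ denote the respective reduced output matrix. The technology in the CCRr model is defined as

$$\{(x, y^q) : X\lambda \le x, \; Y^q\lambda \ge y^q, \; \lambda \ge 0\}.$$

Given a DMU $j$, its technical efficiency in the CCRr model, denoted by $ET_j^r$, is estimated by solving the following linear programming problem:

$$\max_{\theta, \lambda} \; \theta \quad \text{subject to} \quad X\lambda \le x_j, \quad Y^q\lambda \ge \theta y_j^q, \quad \lambda \ge 0. \tag{7}$$

The reduced model is obtained in three stages, as summarized in Figure 2.
One issue not covered by the procedure presented above is that of defining the cardinality of the subset of selected variables. One suggestion would be to combine the gain in discrimination with some measure that reveals the loss of information. The gain in discrimination would be obtained as the difference between the percentage of efficient DMUs in the general model and in the reduced model. The complexity of this issue lies in the choice of a measure of informational loss. Here the Kolmogorov-Smirnov statistic is used,

$$KS(q) = \sup_t \left| F(t) - F_q(t) \right|,$$

where $F$ and $F_q$ are the empirical cumulative distribution functions of technical efficiency estimated by the general and reduced models (the latter with a subset of $q$ selected variables) respectively.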
Let $K^*$ and $K_q^*$ be the number of efficient DMUs in the general and reduced models respectively, and consider, in proportional terms, the gain in discrimination power of the reduced model in relation to the general model [...]. Then the optimal cardinality $q^*$ would be given by the value of $q$ that maximizes this gain subject to

$$KS(q) \le 1.36\sqrt{2/K}, \tag{8}$$

a condition included so that the optimal cardinality depends on acceptance of the null hypothesis of the Kolmogorov-Smirnov test. The amount $1.36\sqrt{2/K}$ is the critical value of the Kolmogorov-Smirnov test at a significance level of 0.05, such that if $KS(q) > 1.36\sqrt{2/K}$ the null hypothesis of equality between the distributions is rejected. It should be noted that $1.36\sqrt{2/K}$ is generally valid for $K \ge 8$; otherwise it is necessary to consult tabulated values.⁵ If there are multiple solutions to (8), the lowest maximizer $\hat{q}^*$ is selected, that is, $\hat{q}^*$ is the minimum of the set of maximizers of the gain subject to $KS(q) \le 1.36\sqrt{2/K}$. (9)
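The cardinality rule can be sketched in a few lines of plain Python. The sketch assumes that efficiency scores equal to one mark efficient DMUs, measures the gain as the difference between the shares of efficient DMUs in the general and reduced models (as described verbally above), and uses the two-sample Kolmogorov-Smirnov distance with the 0.05 critical value. All function names, the tolerance and the toy scores are hypothetical.

```python
import math


def ks_statistic(sample_a, sample_b) -> float:
    """Two-sample KS distance: largest gap between the two empirical CDFs."""
    points = sorted(set(sample_a) | set(sample_b))
    n_a, n_b = len(sample_a), len(sample_b)
    return max(
        abs(sum(v <= t for v in sample_a) / n_a - sum(v <= t for v in sample_b) / n_b)
        for t in points
    )


def ks_critical_value(num_dmus: int, coefficient: float = 1.36) -> float:
    """Critical value 1.36 * sqrt(2 / K) for two samples of equal size K."""
    return coefficient * math.sqrt(2.0 / num_dmus)


def choose_cardinality(general_te, reduced_te_by_q, tol: float = 1e-9):
    """Smallest q maximizing the discrimination gain among q passing the KS check."""
    k = len(general_te)

    def efficient(scores):
        return sum(abs(v - 1.0) <= tol for v in scores)

    threshold = ks_critical_value(k)
    gains = {
        q: (efficient(general_te) - efficient(te)) / k
        for q, te in reduced_te_by_q.items()
        if ks_statistic(general_te, te) <= threshold
    }
    if not gains:
        return None
    best = max(gains.values())
    return min(q for q, g in gains.items() if g == best)


# Toy efficiency scores for 8 DMUs (illustrative only, not the paper's data).
general = [1.0, 1.0, 1.0, 0.9, 0.8, 1.0, 0.7, 1.0]
reduced = {
    1: [0.2, 0.3, 0.1, 0.2, 0.3, 0.1, 0.2, 1.0],
    2: [1.0, 0.9, 1.0, 0.9, 0.8, 0.95, 0.7, 1.0],
    3: [1.0, 1.0, 1.0, 0.9, 0.8, 0.95, 0.7, 1.0],
}
print(choose_cardinality(general, reduced), round(ks_critical_value(12), 4))
```

In the toy data the one-variable model is rejected by the KS check, and the two-variable model offers the largest gain among the remaining candidates.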

An Application to Real Data
This example employs real-world data previously described by [28], who examined the technical efficiency of the public health care system in Brazil. The database uses one input ($x$), represented by the annual per capita expenditure on health of the three levels of government, and 12 outputs ($y_i$, $i = 1, \ldots, 12$), representing health indicators available in the Ministry of Health's information system, DATASUS.⁶ For the present example, to reduce the power of discrimination, only 12 of the 27 states studied by [28] were selected. The data and some of the descriptive statistics are shown in Table 1 (Appendix Table A1 provides a description of the variables used in the example).
In this example, the Kolmogorov-Smirnov null hypothesis is accepted at a significance level of 0.05 for values of $KS(q)$ up to $1.36\sqrt{2/12} \approx 0.5552$, so that the condition for selecting the cardinality of the subset of selected outputs becomes $KS(q) \le 0.5552$, and therefore $q^* = 4$.
The last part of the example appears in Figure 3, which shows the estimated densities for the general and reduced models with cardinality from 1 to 8. The densities were estimated with a Gaussian kernel. The bandwidth was selected by minimizing the mean integrated squared error (see [29] for details).
The results shown in Figure 3 suggest that in the presence of four selected variables, the inclusion of an additional variable does not have a significant impact on the comparison between the general model and the reduced model.This observation confirms the conclusions obtained by applying the method of subset cardinality selection suggested by equation (9).
A rule is suggested to support the choice of the cardinality of the subset of selected variables. This rule seeks to combine maximum gain in discriminatory power with minimal loss of information.
Through an example that employs real-world data, it was found that the pseudo-rank of the output correlation matrix indicates that the first four PCs should be maintained.The GCD for output subsets with cardinality from 1 to 10 was then calculated.The cardinality rule indicated the subset with four of the 12 original outputs.Finally, the estimation of densities of the general and reduced models suggested that the cardinality decision rule can support decisions concerning the number of variables required in the model to obtain maximum discrimination with minimal loss of information.
Further research is suggested on the cardinality decision rule taking into consideration various measures of loss of information.Also warranted are studies comparing the proposed method with other methods of summarization and selection.
The proposed method consists of generating a correlation matrix between two data sets: the orthogonal projection of the data onto the subspace generated by a subset of PCs and the orthogonal projection of the data onto the subspace generated by a subset of original variables. This matrix measure of the closeness of the two subspaces is known as Yanai's Generalized Coefficient of Determination.
² In [23], the authors present an example that illustrates this point very well.

In the CCRg model, $X$, $Y$ and $\lambda$ ($J \times 1$) are the input matrix, the output matrix and the vector of intensities, respectively. Given a DMU $j$, its technical efficiency in the CCRg model, denoted by $ET_j^g$, is estimated by solving a linear programming problem (the minimization of slacks is omitted for simplicity).
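A linear program of this kind can be sketched with SciPy's `linprog`. The example below assumes the standard output-oriented envelopment form under constant returns to scale (maximize the output expansion factor subject to the intensity-weighted inputs not exceeding those of the evaluated DMU), with DMUs stored in columns as in the text. The function name and the toy data are illustrative, and how the expansion factor is converted into a technical efficiency score is left open because the paper's convention is not recoverable here.

```python
import numpy as np
from scipy.optimize import linprog


def ccr_output_expansion(inputs: np.ndarray, outputs: np.ndarray, j: int) -> float:
    """Largest theta with X @ lam <= x_j, Y @ lam >= theta * y_j, lam >= 0.

    inputs and outputs hold one column per DMU.
    """
    n_in, num_dmus = inputs.shape
    n_out = outputs.shape[0]
    # Decision vector is [theta, lam_1, ..., lam_J]; linprog minimizes, so use -theta.
    cost = np.concatenate(([-1.0], np.zeros(num_dmus)))
    # Input constraints: 0 * theta + X @ lam <= x_j.
    input_rows = np.hstack((np.zeros((n_in, 1)), inputs))
    # Output constraints: theta * y_j - Y @ lam <= 0.
    output_rows = np.hstack((outputs[:, [j]], -outputs))
    a_ub = np.vstack((input_rows, output_rows))
    b_ub = np.concatenate((inputs[:, j], np.zeros(n_out)))
    result = linprog(cost, A_ub=a_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    if not result.success:
        raise RuntimeError(result.message)
    return float(result.x[0])


# Three DMUs using the same amount of one input and producing one output.
toy_inputs = np.array([[1.0, 1.0, 1.0]])
toy_outputs = np.array([[1.0, 2.0, 4.0]])
thetas = [ccr_output_expansion(toy_inputs, toy_outputs, j) for j in range(3)]
print(thetas)
```

A reduced model is obtained by passing only the selected rows of the output matrix, for example `toy_outputs[selected_rows, :]`, where `selected_rows` is a hypothetical list of retained output indices.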

Figure 2. Steps for obtaining the reduced model.

Figure 3. Estimated densities for TE in the general and reduced models.

Table 1. Data of example.

Table 2. Results for general and reduced models.