A New Algorithm for Generalized Least Squares Factor Analysis with a Majorization Technique

Factor analysis (FA) is a time-honored multivariate analysis procedure for exploring the factors underlying observed variables. In this paper, we propose a new algorithm for the generalized least squares (GLS) estimation in FA. In the algorithm, a majorization step and diagonal steps are alternately iterated until convergence is reached, where Kiers and ten Berge's (1992) majorization technique is used for the former step, and the latter ones are formulated as minimizing simple quadratic functions of diagonal matrices. This procedure is named the majorizing-diagonal (MD) algorithm. In contrast to the existing gradient approaches, differential calculus is not used and only elementary matrix computations are required in the MD algorithm. A simulation study shows that the proposed MD algorithm recovers parameters better than the existing algorithms.


Introduction
Using y for a p × 1 observation vector whose expectation E(y) equals the p × 1 zero vector 0_p, the factor analysis (FA) model is expressed as

y = Λx + e, (1)

with Λ the p-variables × m-factors loading matrix, x an m × 1 latent factor score vector, e a p × 1 error vector, and p > m. The expectations for x and e are assumed to satisfy

E(x) = 0_m, E(e) = 0_p, E(xx′) = I_m, E(ee′) = Ψ, E(xe′) = O. (2)

Here, O is the m × p matrix of zeros, I_m is the m × m identity matrix, and Ψ is the p × p diagonal matrix whose diagonal elements are called unique variances. The FA model (1) with the assumptions in (2) implies that the covariance matrix of y is

Σ = E(yy′) = ΛΛ′ + Ψ (3)

[1] [2]. A main purpose of FA is to estimate the parameter matrices Λ and Ψ from the p × p inter-variable sample covariance matrix S corresponding to (3). Some authors classify FA as exploratory (EFA) or confirmatory (CFA) [2], where Λ is unconstrained in EFA, while some elements of Λ are constrained in CFA. In this paper, we refer to EFA simply as FA.
Three major approaches for the parameter estimation are the least squares (LS), generalized least squares (GLS), and maximum likelihood (ML) procedures [3]. They differ in the definition of the loss function f(Θ) to be minimized over the parameters Θ = {Λ, Ψ}. The functions for the LS and GLS estimation procedures are defined as

f_LS(Θ) = tr(S − ΛΛ′ − Ψ)²  and  f_GLS(Θ) = tr[(S − ΛΛ′ − Ψ)S⁻¹]², (4)

respectively, while f(Θ) is defined as the negative of the log-likelihood derived under the normality assumption for x and e in the ML estimation [3] [4].
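The two loss functions above can be written out directly. The following sketch (a minimal illustration assuming NumPy; the function names are ours, not from any FA library) evaluates the LS and GLS losses for a candidate pair (Λ, Ψ):

```python
import numpy as np

def ls_loss(S, Lam, Psi):
    # LS: unweighted squared Frobenius distance between S and the
    # model covariance Lam Lam' + Psi.
    R = S - Lam @ Lam.T - Psi
    return np.trace(R @ R)

def gls_loss(S, Lam, Psi):
    # GLS: the residual matrix is weighted by S^{-1} before squaring,
    # i.e. tr[((S - Lam Lam' - Psi) S^{-1})^2].
    R = (S - Lam @ Lam.T - Psi) @ np.linalg.inv(S)
    return np.trace(R @ R)
```

When S exactly equals ΛΛ′ + Ψ, both losses vanish; otherwise the S⁻¹ weighting makes GLS down-weight residuals in directions of large sample variance.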
In all estimation procedures, iterative algorithms are needed for minimizing the loss function f(Θ). They can be roughly classified into gradient and inequality-based algorithms. Here, the gradient ones refer to the algorithms using Newton and related methods [5], in which the partial derivatives of f(Θ) with respect to Θ are used for updating it. On the other hand, the term "inequality-based algorithms" is not a popular one. We use the term for the algorithms in which differentiation is not used and an inequality underlies the update rule, guaranteeing the weakly monotone decrease in the loss function value when Θ is updated to a new Θ. A similar dichotomization of minimization methodology is also found in [6].
For all of the LS, GLS, and ML estimation, gradient algorithms have been developed: those with the Fletcher-Powell and Newton-Raphson methods have been proposed for the ML estimation [7] [8], while the algorithms using the Newton-Raphson and Gauss-Newton methods have been developed for GLS [9] [10], with the gradient algorithms for GLS also used for LS. On the other hand, inequality-based algorithms have been developed for the LS and ML estimation, but not for GLS. Such an algorithm for LS is MINRES [11], in which Θ is partitioned into the subsets of parameters with Θ = {θ₁, …, θ_q} and the minimization of f(Θ) over each subset is iterated. The inequality-based one for the ML estimation is the EM algorithm for FA [12], in which f(Θ) decreases monotonically with the alternate iteration of the so-called E- and M-steps [13]. A feature of MINRES and the EM algorithm is that only simple matrix computations such as the inversion of matrices are required, so that their computer programs are easily formed. In contrast, the gradient algorithms require more complicated computations such as obtaining or numerically approximating the second derivatives of f(Θ). As found in the above discussion, an inequality-based algorithm has not been developed for the GLS estimation in which (4) is minimized over Θ. To propose such an algorithm is the purpose of this paper. The algorithm to be proposed is also computationally simple, as are the existing inequality-based ones: only elementary matrix computations are required, such as the inversion and singular value decomposition (SVD) of matrices. A feature of the proposed algorithm to be addressed is the use of majorization in one of its steps. Majorization generally refers to a class of techniques in which a majorizing function h(Φ, Ξ), satisfying g(Φ) ≤ h(Φ, Ξ) for all Φ and Ξ with equality g(Φ) = h(Φ, Φ), is utilized for minimizing a function g(Φ). Here, let Φ* be the minimizer of the majorizing function h(·, Φ) for its latter argument matrix Φ kept fixed [14]. The chain g(Φ*) ≤ h(Φ*, Φ) ≤ h(Φ, Φ) = g(Φ) shows that g(Φ) decreases with the update of Φ into Φ*. As described in the next section, the step with a majorization technique and the steps for minimizing functions of diagonal matrices form the algorithm to be presented. It is thus called the majorizing-diagonal (MD) algorithm in this paper.
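The majorization guarantee described above can be seen in a one-dimensional toy problem (our own illustration, unrelated to the FA updates derived later): minimizing g(x) = cos x + x²/10. Since the second derivative of cos is bounded by 1, the quadratic h(x, y) = cos y − sin(y)(x − y) + (x − y)²/2 + x²/10 majorizes g with h(y, y) = g(y), and its minimizer is available in closed form:

```python
import math

def g(x):
    # Objective to be minimized by majorization.
    return math.cos(x) + x**2 / 10

def mm_minimize(x, n_iter=100):
    """Iterate the closed-form minimizer of the quadratic majorizer:
    dh/dx = -sin(y) + (x - y) + x/5 = 0  gives  x = (y + sin(y)) / 1.2."""
    history = [g(x)]
    for _ in range(n_iter):
        x = (x + math.sin(x)) / 1.2
        history.append(g(x))
    return x, history

x_star, hist = mm_minimize(1.0)
# The majorization chain guarantees the loss never increases.
assert all(b <= a + 1e-12 for a, b in zip(hist, hist[1:]))
```

No derivative of g is ever evaluated during the iterations; the descent property follows purely from the inequality chain, exactly as exploited by the MD algorithm.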
The MD algorithm is not the first one with majorization in FA. Indeed, the above EM algorithm [12] can be regarded as a majorization procedure with its majorizing function being the full log-likelihood derived by supposing that the latent factor scores in x were observed. [15] has also proposed an FA algorithm with a majorization technique. However, in that algorithm, an estimation of a new type [16] [17] is considered, which is different from the LS, GLS, and ML estimation treated as the major procedures in this paper: [15] is beyond the scope of this paper.
The remaining parts of this paper are organized as follows: the MD algorithm is detailed in the next section, and it is illustrated with a real data set in Section 3. A simulation study for assessing the algorithm is reported in Section 4, which is followed by discussions.

Proposed Algorithm
We propose the MD algorithm for minimizing the GLS loss function (4) over the loadings in Λ and the unique variances in the diagonal matrix Ψ. Here, it is supposed that the sample covariance matrix S is positive-definite and Λ is of full column rank, i.e., its rank is m with p > m. This supposition and the covariance matrix being modeled as (3) imply that, without loss of generality, we can reparameterize Λ as

Λ = LΔ, (5)

where L is a p × m matrix satisfying

L′L = I_m (6)

and Δ is an m × m positive-definite diagonal matrix. By substituting (5) into the GLS loss function (4), it is rewritten as

f(L, Δ, Ψ) = tr[(S − LΔ²L′ − Ψ)S⁻¹]². (7)

This function is minimized over L, Δ, and Ψ subject to (6) and the latter two matrices being diagonal ones, by alternately iterating the majorizing and diagonal steps described in the next subsections.
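The reparameterization (5)-(6) can be checked numerically: taking the SVD Λ = UΔV′ and absorbing the orthonormal V by the rotational freedom of the FA model gives L = U and the diagonal Δ. A minimal sketch (assuming NumPy; L′L = I_m is our reading of constraint (6)):

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 6, 2
Lam = rng.normal(size=(p, m))   # a full-column-rank loading matrix

# SVD: Lam = U diag(d) V'.  Since V is orthonormal, Lam V and Lam
# reproduce the same common part: (Lam V)(Lam V)' = Lam Lam'.
# We may therefore take L = U (column-orthonormal) and Delta = diag(d).
U, d, Vt = np.linalg.svd(Lam, full_matrices=False)
L, Delta = U, np.diag(d)

assert np.allclose(L.T @ L, np.eye(m))                # constraint (6)
assert np.allclose(L @ Delta**2 @ L.T, Lam @ Lam.T)   # same common part
```

The second assertion confirms that substituting (5) into (3) leaves the modeled covariance ΛΛ′ + Ψ unchanged, so the loss (7) indeed equals (4).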

Majorization Step
Let us consider minimizing (7) over L subject to (6) while Δ and Ψ are kept fixed. Summarizing the parts irrelevant to L in (7) into a constant term, the function to be minimized can be expressed as a matrix trace function φ(L); we refer to this expression as (8). Though the optimal L that minimizes (8) under (6) is not given explicitly, the solution can be obtained using Kiers and ten Berge's [18] majorization technique, whose earlier version is also found in [19]. This technique serves to minimize functions expressed as the sum of a linear trace term tr(A′L) and q quadratic trace terms in L over column-orthonormal L. Comparing this form with (8), we can find (8) to be a special case of the above φ(L) with A being the zero matrix, the quadratic terms involving Δ, and q = 2. Therefore, the update formula in [18] (pp. 374-375) can be straightforwardly used for (8).
According to the formula, the update of L by

L = PQ′ (9)

decreases the value of (8). Here, L_OLD stands for the matrix L before the update; P and Q are the column-orthonormal matrices that are obtained from the SVD in (10), with Θ the diagonal matrix including the singular values of the matrix in the left-hand side, and λ_a, λ_b, and λ_c the largest eigenvalues of the matrices appearing in the quadratic terms of (8), as specified in [18].

Diagonal Steps
In this section, we describe updating each of the diagonal matrices Δ and Ψ. First, let us consider minimizing the loss function (7) over Δ while keeping L and Ψ fixed. Since the terms relevant to Δ in the loss function (7) are the same as those relevant to L, the expression (8), into which (7) is rewritten, is to be noted again. By taking account of the fact that Δ is a diagonal matrix, (8) can be rewritten as the function in (11), with diag(·) denoting the diagonal matrix whose diagonal elements are those of the parenthesized matrix. Further, we can rewrite (11) as the expression in (12), with ||·|| denoting the Frobenius norm. It shows that the function (11) is minimized for the Δ given in (12) for fixed L and Ψ. Next, we consider minimizing (7) over Ψ with L and Δ fixed. Summarizing the parts irrelevant to Ψ in (7) into a constant term yields the function (13). We can find that (13) is minimized for the Ψ given in (14) for fixed L and Δ.
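The pattern behind both diagonal steps, a simple quadratic function of a diagonal matrix minimized in closed form, can be illustrated generically. The sketch below is our own example, not the paper's exact expressions (12) and (14): minimizing ||B − W diag(ψ)||² over the diagonal entries splits into independent one-dimensional least squares problems, one per diagonal element.

```python
import numpy as np

def best_diagonal(B, W):
    """Minimize ||B - W @ diag(psi)||_F^2 over the diagonal entries psi.
    Column j of W @ diag(psi) is psi_j * w_j, so each psi_j solves an
    independent 1-D least squares problem: psi_j = w_j'b_j / w_j'w_j."""
    return np.einsum('ij,ij->j', W, B) / np.einsum('ij,ij->j', W, W)

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 4))
psi_true = np.array([0.5, 1.0, 2.0, 3.0])
B = W @ np.diag(psi_true)
assert np.allclose(best_diagonal(B, W), psi_true)
```

Only elementwise products and sums are needed, which is why the diagonal steps of the MD algorithm require no differentiation or line search.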

Whole Algorithm
The results in the last two subsections show that the proposed MD algorithm can be listed as follows:
Step 1. Initialize L, Δ, and Ψ.
Step 2. Update L with (9).
Step 3. Update Δ with the minimizer in (12).
Step 4. Update Ψ with the minimizer in (14).
Step 5. Finish with Λ set to (5) if convergence is reached; otherwise, return to Step 2.
It should be noted in Step 2 that the update of L by (9) does not minimize (7) but only decreases its value, which implies that this update can be replicated (l times) for further decreasing the value of (7). In this paper, we set l = 5. In Step 1, the initialization is performed using the principal component analysis of the sample covariance matrix S. That is, the initial L and Δ are given by V and Ω^{1/2}, respectively, with Ω the m × m diagonal matrix whose diagonal elements are the m largest eigenvalues of S, and the columns of V (p × m) the corresponding eigenvectors. In Step 5, we define convergence as the decrease in the value of (7), or equivalently (4), from the previous round being less than 0.1p × 10⁻².
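The principal component initialization of Step 1 can be sketched directly from the description above (assuming NumPy; the choice of the initial Ψ below is ours, since the source only specifies the initial L and Δ):

```python
import numpy as np

def md_init(S, m):
    """Step 1: principal component initialization.  L holds the
    eigenvectors for the m largest eigenvalues of S, and
    Delta = Omega^(1/2) with Omega those eigenvalues."""
    evals, evecs = np.linalg.eigh(S)       # ascending eigenvalues
    top = np.argsort(evals)[::-1][:m]
    L = evecs[:, top]
    Delta = np.diag(np.sqrt(evals[top]))
    # Initial Psi: residual diagonal of S (our choice; the source does
    # not state how Psi is initialized), kept strictly positive.
    Psi = np.diag(np.maximum(np.diag(S - L @ Delta**2 @ L.T), 1e-6))
    return L, Delta, Psi
```

The returned L satisfies constraint (6) by construction, so the majorization step of Step 2 can be applied immediately.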

Illustration
In this section, we illustrate the performance of the MD algorithm with a 190-person × 25-item data matrix, which was collected by the author and is publicly available at http://bm.osaka-u.ac.jp/data/big5/. This data set contains the self-ratings of the persons (university students) for the extent to which they are characterized by the personalities described by the 25 items. According to a theory in personality psychology [20], the items can be classified into the five groups shown in the first column of Table 1. The 25 × 25 matrix of the correlation coefficients among those items was obtained from the data set. We carried out the MD algorithm for the correlation matrix with the number of factors m set to five. Figure 1 shows the change in the value of the loss function (4) over the first ten iterations of the steps in Section 2.3 and the change after the tenth iteration. There, we can find that the function value decreased monotonically with iteration, finally reaching convergence at the 542nd iteration. As the resulting loading matrix has rotational freedom, that is, the Λ post-multiplied by an arbitrary orthonormal matrix satisfies (1) and (2), the loading matrix was rotated by the varimax method [21]. The solution is presented in Table 1. There, bold font is used for the loadings whose absolute values are greater than 0.35. They show that the 25 items are clearly classified into the five groups predicted by the theory in personality psychology [20], which demonstrates that the MD algorithm provided a reasonable solution.

Simulation Study
A simulation study was performed in order to assess how well parameter matrices are recovered by the proposed MD algorithm and compare it with the existing algorithms for the GLS estimation in the goodness of the recovery.We first describe the procedure for synthesizing the data to be analyzed, which is followed by results.
An n-observations × p-variables data matrix Y was synthesized according to the matrix versions of the FA model (1) and the assumptions in (2):

Y = XΛ′ + ZΨ^{1/2} (15)

and

[X, Z]′1_n = 0_{m+p}, n⁻¹[X, Z]′[X, Z] = I_{m+p}. (16)

Here, Λ and Ψ^{1/2} are the loading matrix and the diagonal matrix of the square roots of unique variances, and the error matrix is E = ZΨ^{1/2}. It should be noticed that each row of Y, X, and E corresponds to y′, x′, and e′ in (1), and that the five equations in (2) can be summarized into the two matrix expressions in (16). The data synthesis procedure follows the next steps:
Step 1. Draw m from DU(I, J), with DU(I, J) denoting the discrete uniform distribution defined for the integers within the range [I, J].
Step 2. Draw each loading in Λ from U(α, β), with U(α, β) denoting the uniform distribution over the range [α, β].
Step 3. Draw each element of Z in (15) from a uniform distribution, which is followed by centering Z and post-multiplying it by the matrix that allows the resulting Z to satisfy (16).
Step 4. Form Y with (15) and obtain the covariance matrix S = n⁻¹Y′Y.
In Step 3, we have used a uniform distribution for Z, rather than the normal distribution typically used for such a matrix, as a feature of the GLS estimation is that it does not need the normality assumption required in the ML estimation. We replicated the above steps to obtain 2000 sets of S. For them, the MD and the existing algorithms were carried out, where the latter are the two gradient algorithms [9] [10] described in Section 1. We refer to the ones in [9] and [10] as the Newton-Raphson (NR) and Gauss-Newton (GN) algorithms, respectively. In the NR one, we obtained the update vector in [9], Equation (32), by pre-multiplying the vector of first derivatives by the Moore-Penrose inverse of the corresponding Hessian matrix. Also, in the NR and GN algorithms, we used the same initialization and definition of convergence as in Section 2. Let Λ# and Ψ# denote the solution given by the NR, GN, or MD algorithm. For assessing the recovery of the loading matrix, the averaged absolute difference (AAD) of the elements in Λ to the corresponding estimates, i.e.,

AAD(Λ, Λ#) = (pm)⁻¹ ||Λ − Λ#||_{ℓ1}, (17)

can be used, with ||·||_{ℓ1} denoting the ℓ1 norm, i.e., the sum of the absolute values of the elements. Here, it should be noted that Λ# has rotational freedom and must be rotated so that the resulting Λ# is optimally matched to Λ. Such a rotated Λ# can be obtained by the orthogonal Procrustes method [22] with Λ the target matrix. The loading matrix Λ# in (17) thus stands for the one rotated by the Procrustes method. The recovery of unique variances can also be assessed with the AAD index AAD(Ψ, Ψ#); the unique variances are uniquely determined, thus the additional rotation procedure used for Λ# is unnecessary. Smaller values of those AAD indices stand for better recovery. The statistics of AAD values over the 2000 data sets are presented in Table 2.
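The AAD index with the Procrustes rotation can be sketched as follows (assuming NumPy; the function name is ours). The estimate is first rotated toward the true loading matrix by the orthogonal Procrustes solution T = UV′, obtained from the SVD of Λ#′Λ, and the mean absolute difference is then computed:

```python
import numpy as np

def aad_loadings(Lam_true, Lam_est):
    """Averaged absolute difference between true loadings and estimates,
    after rotating the estimate to the target by orthogonal Procrustes:
    T = U V' with U, V from the SVD of Lam_est' Lam_true."""
    U, _, Vt = np.linalg.svd(Lam_est.T @ Lam_true)
    T = U @ Vt
    p, m = Lam_true.shape
    return np.abs(Lam_true - Lam_est @ T).sum() / (p * m)

rng = np.random.default_rng(2)
Lam = rng.normal(size=(10, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # a random orthonormal rotation
assert aad_loadings(Lam, Lam @ Q) < 1e-8       # the rotation is undone exactly
```

The assertion illustrates why the rotation is necessary: a solution differing from the truth only by an orthonormal rotation is a perfect recovery and should score an AAD of zero.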
There, the averages show that the recovery by the MD algorithm is the best and that by the NR one is the worst. It should be noted that the 50th and 75th percentiles for the NR algorithm are zero, while the maximum and the 99th percentile are very large. That is, the recovery by the NR algorithm was perfect for more than 75 percent of the 2000 data sets, but for a few percent of them the recovery was considerably poor, which increased the averages for the NR one. In contrast, the maximum AAD values of loadings and unique variances for the MD algorithm are 0.0041 and 0.0013, respectively, which are small enough to be ignored. That is, the proposed MD algorithm recovered the true parameter values well for all of the 2000 data sets. We can thus conclude that the MD algorithm is superior to the existing ones in the goodness of recovery.


Figure 1. Change in the GLS loss function value as a function of the number of iterations.


Table 1. Loadings and unique variances for personality rating data.

Table 2. Statistics for the differences between the true parameter values and their estimated counterparts.