Ridge Penalized Logistical and Ordinal Partial Least Squares Regression for Predicting Stroke Deficit from Infarct Topography

JBiSE). ABSTRACT Improving the ability to assess potential stroke deficit may aid the selection of patients most likely to benefit from acute stroke therapies. Methods based only on 'at risk' volumes or initial neurological condition do predict eventual outcome but not perfectly. Given the close relationship between anatomy and function in the brain, we propose the use of a modified version of partial least squares (PLS) regression to examine how well stroke outcome covary with infarct location. The modified version of PLS incorporates penalized regression and can handle either binary or ordinal data. This version is known as partial least squares with penalized logistic regression (PLS-PLR) and has been adapted from its original use for high-dimensional microarray data. We have adapted this algorithm for use in imaging data and demonstrate the use of this algorithm in a set of patients with aphasia (high level language disorder) following stroke.


INTRODUCTION
Correlations between brain lesions and clinical symptoms have yielded valuable insights into brain function in the past.In individual patient care, these clinico-lesion correlations may play a role in predicting neurological deficits following stroke.More recently, attempts have been made to utilize the information obtained from brain imaging studies to aid prediction of neurological outcome.Initial approaches depended upon measurement of infarct volume but volumetric approaches proved to be inaccurate predictors of neurological outcome.The correlation between infarct volume and the National Institutes of Health Stroke Scale (NIHSS) is moderate at best [1,2].One factor ignored in volumetric approaches is the information on stroke location available in the images.We have recently demonstrated that the relationship between tissue damage assessed at the voxel level and neurological disability can be predicted using a new method: Ridge Penalized Logistic Partial Least Squares (RPL-PLS).This method allows both stroke extent and location to be incorporated into the predictive model for neurological deficit.
Previously, voxel-based statistical techniques have concentrated on the relationship between involvement of individual voxels or clusters of voxels and neurological deficit [3][4][5].However, strokes often involve large ensembles of voxels and the task involved in prediction is to establish the relationship between involvement of ensemble as a whole and neurological deficit.The functional inter-correlation between different groups of voxels is likely to mean that their contribution to outcome is not independent.One method of dealing with this kind of issue is principal components regression (PCR), which uses orthogonal linear combinations of the original predictor variables as predictors in a multiple linear regression [6].In PCR orthogonal linear combinations of the original predictor variables are first constructed as principle components (PCs) to maximize the variance of data.These PCs are then used as predictors in a multiple linear regression.Thus dimension reduction in PCR is achieved without regard the response variable.Partial least squares regression (PLS), as an alternative method, has the advantage over PCR in that it takes into account the response variable when performing the dimension reduction step [7,8].
Bookstein 1994 [9], McIntosh, et al. [10] and Leibovitch, et al. [11] introduced a variant of the partial least squares (PLS) approach to the brain imaging community.Here singular-value decomposition (SVD) is applied to the cross-correlation matrix between dependent and independent variables to yield latent variables which are linear combinations of the original variables, and which maximise the explained covariance.This characteristic of PLS makes it more suited to the purpose of prediction on the basis of involvement of functionally related ensembles of voxels.The reduction in dimensionality achieved may reflect the functional relationships between brain regions.
There are several challenges in using the PLS technique to build a prediction model.First, there is a high degree of correlation among neighbouring voxels due to shared function and shared vascular blood supply.This leads to collinearity thus preventing stable estimates of regression coefficients.Second, the outcome variables are binary or ordinal and are correctly dealt with using logistic regression with the dependent variable being transformed into a logit variable describing the odds of a specific outcome.Thirdly, estimates of model coefficients using generalized least squares may still fail to converge.The solution of the first issue is the introduction of a ridge estimator to PLS and such analysis has recently been shown to provide stable estimate in microarray data analysis [12][13][14].The solution of the second issue is achieved by embedding the usual PLS steps within the iterative re-weighted least square (IRLS) [15].In this setting, the binary variables were transformed to the continuous-valued pseudo-response variable by logit conversion.Variables from logistic regression are further constrained to be finite by penalizing with a ridge estimator for overcoming the convergence issue before feeding to the PLS.Finally, standard PLS method has been extended to weighted partial least squares (WPLS) to further reduce noise effects and to improve the convergence of the PLS.WPLS penalizes or regularizes PLS model by giving samples different weights (based on their relevance to the study).This additional weight determines how much each observation in the data set influences the final parameter estimates and the 'dispersion matrix', from logistical regression, can be severed as weights for the WPLS (detailed in methods section).
We have successful demonstrated RPL-PLS in stroke deficit prediction study [16].In this study we describe this modification of PLS method to take into account binary as well as ordinal outcome variables.To illustrate the use of this technique we describe its use in predicting stroke outcome using only knowledge of the location and extent of infarction.In Section 2 we describe the theory of the algorithm and its implementation; in Section 3 we describe an application of the method to stroke data, in Section 4 is result and in Section 5 is discussion.

Partial Least Squares Regression (PLS) and
Weighted PLS PLS [7] is a dimension reduction technique, which ad-dresses the issue of multiple regression when the number of variables are greater than the number of observations.The n observations described by p dependent variables are stored in a n × p matrix denoted Y, and the values of m predictors collected on these observations are in a n × m matrix X. PLS regression searches k number, with k <= m, of principle component scores and loadings (latent variables) by performing an iterative simultaneous decomposition of independent data X and dependent data Y.
In matrix form, PLS decomposes X and Y into the form: where the T and U are (n × k) score matrices, the (m × k) P and the (p × k) Q are matrices of loadings.E and F are matrices of residuals.The regression model is then step up between the scores: These matrixes are column centered and normalized (the symbol means "to normalize the result of operation").The PLS regression method described here is based in the nonlinear iterative partial least squares (NIPLALS) algorithm [7], Iterative decomposition starts with random initialization the Y-score vector u, with initial E = X and initial F = Y, and iteratively go though the following steps until a stopping criterion is met or E becomes a null matrix.
Step 1. w ∝ E T u (estimate E weights) Step 2. t ∝ Ew (estimate E scores) Step 3. q ∝ F T t (estimate F weights) Step 4. u ∝ Fq (estimate F scores) Step 5. Check covariance, if t has not converged, goes to Step1, else go to Step 6.
Step 6. b ∝ t T u (compute regression coefficient) Step 7. E = E -tp T (residual matrix of E) Step 8. F = F -btq T (residual matrix of F) By regressing E on t and F on u, the loading vectors p = (t T t) -1 E T t and q = (u T u) -1 F T u can be computed.In this way it finds the weight vectors, w, q such that 2 2 where the sample covariance between two variables are kept maximized through maximizing the sample covariance between the two scores (components) at each decomposition step.In such a way, it minimizes the norm of Y while keeping the correlation between X and Y by the inner relation Eq.3.Once the relationship has been built, the dependent variables are predictable using multivariate regression formula, the outer relation, as: Copyright © 2010 SciRes.JBiSE Y = TBQ T = XB PLS (5) with B PLS = (P T+ )BQ T (where P T+ is Moore-Penrose pseudo-inverse of P T ).
Least squares solution of linear regression is only appropriate when the variances of the predictor variable are uniform [17].When there are unreliable data or errors in the data measurement, unequal diagonal elements in the variance of the error matrix will lead to instability of parameter estimate for the least squares formula.Weighted partial least squares (WPLS) generalize PLS with an empirical weighted squared error in the same way that weighted least square regression generalized least squares regression.The main idea is to penalize or regularize the coefficients of WPLS model and to facilitate model interpretation and further reduce noise effects of the samples: instead of weighting all samples equally, they are weighted such that samples with great weight contribute more to fit.WPLS defines k number of V weighted orthogonal scores t k , linear combination of X such that for all k, and performs a V weighted least squares regression of Y through U on T. V is a symmetric positive definite matrix with v ii is the weight assigned to each sample, is induced with the belief that observations with small variances provide more reliable information about the regression function than those with large variances.PLS is a special case of WPLS with V as identical matrix.In this study we will use WPLS to compensate the problem of possible unequal variance in the error matrix.The element v ii of V is a probability of occurrence of sample i obtained from logistic regression step (detailed in following).

Ridge Penalized Logistic Regression
PLS was originally designed for normal random response variables.In the presence of binary response variable, linear regression can result in regression coefficients, which cannot guarantee that response values only have two possible predicted values, 0 and 1. Logistic regression is one of the approaches to this issue.Let variable y i indicates the class of sample i for response variable y and π i the probability that y i = 1.Consequently, the probability of sample represents a class 0 is then 1π i .Let x ij indicate the jth independent variable in the ith sample.The logistic regression model is: where η i is called the linear predictor in the jargon of generalized linear model.It is connected to π i by so-called link function f with In vector format β is unknown parameter and could be estimated by the maximum likelihood estimator (MLE), T i x β .The log-likelihood of the observations for the value β of the parameters L(β) is given by 1 1 ( ) log ( 1)log( 1) x is full column-rank and the configuration of n samples in the observation space is overlap, the solution exists and is unique.This solution could be computed by the literately reweighted least squares (IRLS) [18].Let V T be the n × n diagonal matrix with T T T [1 ] ii i i at iteration t and β = β T .Each iteration divides into two steps, where g is the calculated new response variables (detailed in Appendix).Multicollinearity can still exist even after dimension reduction in the setting of our study: many voxels will show nearly identical patterns across the samples and they may supply no additional information to the model.This issue can be further addressed by introducing the ridge estimator, the regularization on sum of the squares of regression coefficients [19], into the logistic regression [20].
The ridge estimator, ˆR β , is defined as the (unique) maximum of the penalized log-likelihood where λ > 0 is the shrinkage parameter, the stronger its influence and the smaller the 2 j  's are forced to be.ˆR β , always existing, is unique.Ridge-IRLS (RIRLS) replaces the weighted regression (Eq.10) in IRLS by a weighted ridge regression 1 ( ) where R is a diagonal matrix with entries 1,1 0 R  and ., T , 1 ( g T in Eq.12 is built as in Eq.9.λ can be chosen as the minimum, over a given range, of the Bayesian information criterion (BIC) which gives the best balance between model complexity and the best fit to the data [21], Copyright © 2010 SciRes.JBiSE

Ridge Penalized Logistic Partial Least Squares Regression
Embedding ridge penalized logistic regression into PLS procedure forms RPL-PLS.This method involves two steps.The first step, ridge penalty logistic regression (RIRLS), builds a continuous response variable g ∞ and 'dispersion matrix' [V ∞ ] -1 for the input of the second step.Second step is weighted PLS (WPLS) [12]. 1) ( , ) ( , , ) 2) ˆ( , , , ) There are two parameters, shrinkage parameter  and number of component k, to be determined in RPL-PLS.
, as stated early, is determined by BIC in the first step.
The optimal number k is empirically chosen by selecting the minimal number of components that give the minimum leave-one-out cross-validation (LOOCV) error rate for the training data.RPL-PLS provides unique an estimate ˆPLS β for given Y, X,  and k.
Binary logistic regression can be easily extended to ordinal response variables by creating a sequence of binary response variables, one for each response category [18]:

Otherwise
The same technique can be applied to RPL-PLS to form more generalized multi-ordinal RPL-PLS.

Choosing the Model
The maximum number of components from RPL-PLS is equal to the number of samples in the dataset.Since these components are sorted in a descending order according to the proportion of variance they explained, only the first of few components were needed and the rest were considered as noise.The number of components could be made up to number of samples and optimal number of components was determined by Leaveone-out cross-validation step and when the error rate became stable.These models were illustrated up to 6 components which have already comprised most of variance of the data.The optimal number of components for each model was selected by choosing the value of k minimizing LOOCV error rate in cross-validation of the training dataset.

MATERIALS
Patients were recruited if they had an ischemic stroke in the anterior circulation.38 patients were used for development of the model (training dataset) and 22 patients were used for the model validation (validation dataset).Neurological deficit from stroke was measured on an ordinal scale of the NIHSS and assessment was performed immediately prior to MR imaging.The domain of interests for this demonstration was aphasia (higher language disorder).The NIHSS language component is rated 0 (normal), 1 (mild to moderate), 2 (severe) and 3 (mute and global aphasia).In our ordinal model, a score of 1 correspond to NIHSS language score of 0, a score of 1 correspond to NIHSS language score of 1-2 and a score of 3 correspond to NIHSS score of 3.
MR scans were acquired within three months after stroke onset.Fast spin echo T2-weighted images were acquired on 1.5T scanner (GE, Milwaukee, WI) with thickness 6 mm/1.7 mm, matrix 256 × 256, and TR/TE/ ETL 2000 ms/102 ms/12.Images from different subjects were aligned to a standard brain template registration [22] by manual registration using 9-parameter linear transformation [23].Infarcts were manually segmented on standard space images using interactive mouse driven software.Due to memory limitations of the PC, binary images were resampled to 4 mm × 4 mm × 4 mm as the input of RPL-PLS.The computation scripts were implemented in MATLAB (Mathworks, Inc., MA).

RESULTS
RPL-PLS is a robust method and has convergence for all three models.In LOOCV, the optimal number of the components, k, was 2 for aphasia (binary), and 3 for aphasia (ordinal).The algorithm correctly identified 37 of 38 samples for aphasia (binary) using two components and 37 of 38 samples for aphasia (ordinal) using 3 components.In a model, the coefficients of each voxel in the components indicate the relative importance of that voxel (anatomical locations) to the associated neurological deficit.The cross-validation results of models consisted different number of components were illustrated in Table 1.
In external validation with new data set consist of 22 samples.Binary aphasia model produced 4 errors (81.8% correct) and ordinal aphasia model produced 5 errors (77.3% correct).
Figure 1 corresponds to the binary aphasia model.Since the optimal number of components for this model was 2, left column is the first component of the model wiliest the second column is the second component of the model.The brighter the voxel is, the higher the weighting of the voxel with respect to aphasia deficit.
In Figure 2 we presented coefficient images of three Copyright © 2010 SciRes.In this study, we developed novel approach of generalized regression method, RPL-PLS, for predicting neurological deficit from MRI image data.The method incorporates dimension reduction techniques and ridge penalized logistic regression for addressing the problem of large collinearity dataset with binary and ordinal response variables.The PLS algorithm described in this paper is known as the 'standard' PLS algorithm and has been presented in detail elsewhere [7,8,[24][25][26].no requirement from human expert observer.This novel approach of using infarct topography to describe neurological deficit is an improvement of cruder volumetric methods.This study provides support of the concept that information presented in image can be used to predict the outcome of stroke.This concept paves way for the development of similar model for understanding the neuroanatomy of neurological deficits and determ ining the outcome of rehabilitation and acute stroke trial.

Number of components
For this proof-of-concept study we examined patients with well-defined infarcts on MRI scans acquired 3 months after infarction to predict outcome at 3 month.In this aspects, the model described here does not conform to a typical definition of a prediction model which is to use early MRI scans (< 1 week) to predict long term outcome (at 3 months).Nevertheless, the concept developed here can be used to obtain the "holy grail" of prediction.We would anticipate that with the appropriate training set, the method would also perform well at other time points after infarction, for example in the acute stage (less than 1 week).To increase the homogeneity of the group for this proof of concept study, we restricted the analysis to patients with infarcts in the anterior circulation.Future studies involving other infarct territories will be required to assess whether this method of correlating infarct extent and location will perform as well for other brain regions.

Maximum likelihood (ML) estimate of logistic regression and Iterative reweighted least squares (IRLS)
When response variable y 1 , y 2 , …, y n are binary, taking on the values 0 and 1 with probabilities and 1π, respectively, with expect value E{y} = π, covariates x ij {i = 1, 2, …, n; j = 1, 2,…, m} are also available, the logistic regression model would construct by a canonical link function Since the y i observations are independent, their joint probability function is It is often more convenient to work with the logarithm of the joint probability function to find the maximum likelihood estimate If we substitute (A.2) to (A.5) and consider equation a1, we have ( ) where x i is shorthand of [x i1 , x i2 , … x ip ] and L(α, β) replaces h(y 1 , y 2 ,…y n ) to show explicitly that we now view this function as the likelihood function of the parameters to be estimated.Denote This suggests that we can start with an initial γ 0 and iteratively apply (A.10) until the algorithm reaches convergence, at which point 0 ''( ) 0  γ  and (A.10) does not change.This is what called Newton optimization and in the linear modeling setting is a vector.Newton's method has a generalization (Newton-Raphson) using the multivariate Taylor's series.
where V is a diagonal matrix with element W(i, i) equal to (1 ) i i    .We can plug these results into (A.11)

Component 2 Figure 1 .Figure 2 .
Figure 1.Image representation of first 2 components of aphasia binary model (left side image corresponds to right side of the patient, radiological conversion).
γ 0 is an estimated initial value of γ 0 .To maximize ( ) γ  we can differentiate with respect to γ and solve for γ

Table 1 .
Number of errors in LOOCV of 38 training data samples.