Adaptive Sparse Group Variable Selection for a Robust Mixture Regression Model Based on Laplace Distribution

The traditional estimation of the Gaussian mixture regression model is sensitive to heavy-tailed errors; in this article we therefore propose a robust mixture regression model in which the error terms follow a Laplace distribution. For the variable selection problem in the new robust mixture regression model, we introduce the adaptive sparse group Lasso penalty, which achieves sparsity at both the group level and the within-group level. Numerical experiments show that, compared with alternative methods, our method performs better in variable selection and parameter estimation. Finally, we apply the proposed method to analyze NBA salary data for the 2018-2019 season.


Introduction
The mixture regression model is a powerful tool for explaining the relationship between a response variable and covariates when the population is heterogeneous and consists of several homogeneous components; early research traces back to [1]. In 1977, the EM algorithm was first proposed by [2], which greatly simplified the fitting of mixture regression models. The mixture regression model subsequently attracted a lot of interest from statisticians and has been widely applied in many fields, such as business, marketing and the social sciences.
Recently, research on the mixture regression model has become increasingly detailed. On the one hand, statisticians have paid attention to improving the robustness of the estimation procedure; on the other, to variable selection in mixture regression models. The rest of this article is organized as follows. In Section 2, we introduce the robust mixture regression model based on the Laplace distribution and adopt the adaptive sparse group Lasso for variable selection. In Section 3, we prove some asymptotic properties of the proposed method. In Section 4, we address the selection of the tuning parameters and the number of components. Section 5 reports a numerical simulation that evaluates the performance of our method. In Section 6, we apply the proposed method to NBA salary data. Finally, the conclusion of this paper is given in Section 7.

Model Overview
Suppose the population consists of $g$ homogeneous components. In the classical mixture linear regression model, the conditional density of the response $y$ given the covariates $\boldsymbol{x}$ is
$$f(y \mid \boldsymbol{x}; \Psi) = \sum_{j=1}^{g} \pi_j\, \phi\!\left(y;\ \alpha_j + \boldsymbol{x}^{\top}\boldsymbol{\beta}_j,\ \sigma_j^{2}\right),$$
where the mixing probabilities satisfy $\pi_j > 0$ and $\sum_{j=1}^{g}\pi_j = 1$, and $\phi(\cdot;\mu,\sigma^2)$ denotes the normal density. It is known that the mixture linear regression model is sensitive to outliers or heavy-tailed error distributions, and outliers affect the mixture linear regression model more heavily than the usual linear regression model, since they not only distort the estimation of the regression parameters but may also completely blur the mixture structure. In order to improve the robustness of the estimation procedure, we introduce a robust mixture regression model with Laplace errors,
$$f(y \mid \boldsymbol{x}; \Psi) = \sum_{j=1}^{g} \pi_j\, \frac{1}{2\sigma_j}\exp\!\left(-\frac{\lvert y - \alpha_j - \boldsymbol{x}^{\top}\boldsymbol{\beta}_j\rvert}{\sigma_j}\right),$$
where $\Psi$ collects the mixing probabilities, intercepts, regression coefficients and scale parameters. Given a sample $(\boldsymbol{x}_i, y_i)$, $i = 1, \ldots, n$, we can estimate the unknown parameter $\Psi$ by maximizing the log-likelihood function
$$\ell_n(\Psi) = \sum_{i=1}^{n}\log\left\{\sum_{j=1}^{g}\pi_j\,\frac{1}{2\sigma_j}\exp\!\left(-\frac{\lvert y_i - \alpha_j - \boldsymbol{x}_i^{\top}\boldsymbol{\beta}_j\rvert}{\sigma_j}\right)\right\}.$$
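To make the likelihood concrete, the following minimal Python sketch evaluates this log-likelihood for a candidate parameter set; the function and variable names are illustrative and not part of the paper.

```python
import numpy as np

def laplace_mixture_loglik(y, X, pi, alpha, beta, sigma):
    """Log-likelihood of the Laplace mixture regression model.

    y: (n,) responses; X: (n, p) covariates;
    pi: (g,) mixing probabilities; alpha: (g,) intercepts;
    beta: (g, p) regression coefficients; sigma: (g,) scale parameters.
    """
    # Absolute residuals of every observation under every component: (n, g)
    resid = np.abs(y[:, None] - alpha[None, :] - X @ beta.T)
    # Laplace density of each observation under each component
    dens = np.exp(-resid / sigma[None, :]) / (2.0 * sigma[None, :])
    # Mixture density per observation, then the summed log-likelihood
    return np.sum(np.log(dens @ pi))
```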

Adaptive Sparse Group Lasso for Variable Selection
Now, we consider a situation where the covariates have a natural grouping structure and can be divided into $K$ groups, so that the $k$th group contains $p_k$ variables and $\sum_{k=1}^{K} p_k = p$; the regression coefficient vector in each component is partitioned accordingly as $\boldsymbol{\beta}_j = (\boldsymbol{\beta}_{j(1)}^{\top}, \ldots, \boldsymbol{\beta}_{j(K)}^{\top})^{\top}$, and the log-likelihood function can be written in terms of these grouped coefficients. In order to exploit the grouping structure of the covariates, we apply the adaptive sparse group Lasso (adaSGL) to the robust mixture regression model and maximize the penalized log-likelihood function
$$\tilde\ell_n(\Psi) = \ell_n(\Psi) - n\lambda_1\sum_{j=1}^{g}\sum_{k=1}^{K} w_{jk}\,\lVert\boldsymbol{\beta}_{j(k)}\rVert - n\lambda_2\sum_{j=1}^{g}\sum_{t=1}^{p} v_{jt}\,\lvert\beta_{jt}\rvert,$$
where $\lVert\cdot\rVert$ represents the Euclidean norm. The adaptive weights $w_{jk}$ and $v_{jt}$ are defined from an initial maximum penalized log-likelihood estimator $\tilde\Psi$ (for example, inverse norms of the corresponding initial estimates). Since the penalty is not differentiable at zero, we next follow the approach of Hunter and Li [13] and consider maximizing an $\varepsilon$-approximate penalized log-likelihood, in which each penalty term is replaced by a smooth surrogate for some small $\varepsilon > 0$, with the weights adjusted accordingly. Following Hunter and Li [13], we can similarly show that the $\varepsilon$-approximate penalized log-likelihood converges to the penalized log-likelihood uniformly as $\varepsilon \to 0$ over any compact subset of the parameter space.
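As a rough illustration of the penalty structure (not the paper's exact notation), the sketch below evaluates an adaptive sparse group Lasso penalty for the coefficient matrix of all components; the inverse-norm adaptive weights are one common choice and are an assumption here.

```python
import numpy as np

def adasgl_penalty(beta, beta_init, groups, lam1, lam2, eps=1e-8):
    """Adaptive sparse group Lasso penalty for mixture regression coefficients.

    beta, beta_init: (g, p) current and initial coefficient matrices;
    groups: list of index arrays, one per covariate group;
    lam1, lam2: group-level and within-group tuning parameters.
    """
    penalty = 0.0
    for j in range(beta.shape[0]):                      # mixture components
        for idx in groups:                              # group-level term
            w = 1.0 / (np.linalg.norm(beta_init[j, idx]) + eps)
            penalty += lam1 * w * np.linalg.norm(beta[j, idx])
        v = 1.0 / (np.abs(beta_init[j]) + eps)          # within-group term
        penalty += lam2 * np.sum(v * np.abs(beta[j]))
    return penalty
```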

EM Algorithm for Robust Mixture Regression
However, the above penalized log-likelihood does not have an explicit maximizer, so we introduce an EM algorithm to simplify the computation. Let $Z_{ij}$ denote a latent Bernoulli variable such that $Z_{ij} = 1$ if the $i$th observation comes from the $j$th component and $Z_{ij} = 0$ otherwise. According to Andrews and Mallows [14], a Laplace distribution can be expressed as a scale mixture of a normal distribution with a mixing distribution related to the exponential distribution. To be specific, there are latent scale variables $V_{ij}$ such that, given the complete data $\{(\boldsymbol{x}_i, y_i, Z_{ij}, V_{ij})\}$, we can write down the complete-data log-likelihood. Let $\Psi^{(r)}$ be the parameter estimate at the $r$th iteration. In the E step of the EM algorithm, we compute the component responsibilities
$$\tau_{ij}^{(r)} = \frac{\pi_j^{(r)}\, f_j\bigl(y_i \mid \boldsymbol{x}_i; \Psi^{(r)}\bigr)}{\sum_{l=1}^{g} \pi_l^{(r)}\, f_l\bigl(y_i \mid \boldsymbol{x}_i; \Psi^{(r)}\bigr)}$$
and the conditional expectations $\delta_{ij}^{(r)}$ of the latent scale variables, which are inversely proportional to the absolute residuals $\lvert y_i - \alpha_j^{(r)} - \boldsymbol{x}_i^{\top}\boldsymbol{\beta}_j^{(r)}\rvert$. The calculation of $\delta_{ij}^{(r)}$ follows the same argument as in Phillips [15].
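A minimal sketch of this E-step is given below; the exact expression for $\delta$ as a scale over the absolute residual follows the Laplace EM literature and should be read as an assumption about the paper's formula, with a cap anticipating the hard threshold discussed later.

```python
import numpy as np

def e_step(y, X, pi, alpha, beta, sigma, delta_cap=1e6):
    """E-step: component responsibilities tau and latent scale weights delta."""
    resid = y[:, None] - alpha[None, :] - X @ beta.T          # (n, g) residuals
    dens = np.exp(-np.abs(resid) / sigma[None, :]) / (2.0 * sigma[None, :])
    num = dens * pi[None, :]
    tau = num / num.sum(axis=1, keepdims=True)                # responsibilities
    # Conditional expectation of the latent scale variable: inversely
    # proportional to the absolute residual, capped so that a perfect LAD fit
    # cannot drive delta to infinity.
    delta = sigma[None, :] / np.maximum(np.abs(resid), sigma[None, :] / delta_cap)
    return tau, delta
```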
In the M step, we maximize the penalized Q-function. Since the penalty is not differentiable at zero, we follow the tactic of [16] and construct a local quadratic approximation of the penalty in a neighborhood of an initial value $\psi_0$: for each penalized coefficient,
$$p_\lambda(\lvert\psi\rvert) \approx p_\lambda(\lvert\psi_0\rvert) + \frac{1}{2}\,\frac{p_\lambda'(\lvert\psi_0\rvert)}{\lvert\psi_0\rvert}\,\bigl(\psi^2 - \psi_0^2\bigr),$$
so the penalty function can be replaced by this quadratic surrogate. Similarly, following Lange [17], the likelihood part can be approximated by a quadratic surrogate in a neighborhood of the current iterate, where $p$ is the dimensionality of $\psi$.
We apply this quadratic approximation in a neighborhood of $\Psi^{(r)}$. The resulting surrogate objective can be maximized block-wise in the coordinates of the parameter components $\pi$, $\alpha$, $\beta$ and $\sigma$, where $\nabla$ denotes the gradient operator, $\zeta$ is a positive scalar and $\mathbf{0}$ is a zero vector. Setting the block-wise gradients to zero then yields a set of simultaneous equations for each $j$.
In view of the constraint $\sum_{j=1}^{g}\pi_j = 1$, we adopt a Lagrangian multiplier to update $\pi$, which gives $\pi_j^{(r+1)} = \frac{1}{n}\sum_{i=1}^{n}\tau_{ij}^{(r)}$; the updates for $\alpha_j$ and $\sigma_j$ follow from the corresponding stationarity equations. Similarly, for each coefficient $\beta_{jt}$ belonging to the $k$th group, we obtain the updating formula by solving the associated block-wise equation.
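The following sketch illustrates the flavor of these block-wise updates: the $\pi$ update from the Lagrangian argument and a weighted least-squares-style update for the regression coefficients with weights $\tau \cdot \delta$. For brevity it omits the quadratic penalty surrogate (which would add a ridge-type diagonal to the normal equations), so it is an unpenalized illustration, not the paper's exact update; the scale update shown is the responsibility-weighted Laplace estimate and is likewise an assumption.

```python
import numpy as np

def m_step_unpenalized(y, X, tau, delta):
    """Illustrative M-step: update pi and (alpha, beta, sigma) per component."""
    n, g = tau.shape
    Xd = np.column_stack([np.ones(n), X])            # design with intercept
    pi = tau.mean(axis=0)                            # Lagrangian update for pi
    alpha = np.zeros(g)
    beta = np.zeros((g, X.shape[1]))
    sigma = np.zeros(g)
    for j in range(g):
        w = tau[:, j] * delta[:, j]                  # observation weights
        WX = Xd * w[:, None]
        coef = np.linalg.solve(Xd.T @ WX, WX.T @ y)  # weighted least squares
        alpha[j], beta[j] = coef[0], coef[1:]
        resid = np.abs(y - Xd @ coef)
        sigma[j] = np.sum(tau[:, j] * resid) / np.sum(tau[:, j])  # scale update
    return pi, alpha, beta, sigma
```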
Based on the above, we propose the following EM algorithm.
1) Initialization: choose a starting value $\Psi^{(0)}$.
2) E-Step: at the $(r+1)$th iteration, compute $\tau_{ij}^{(r)}$ and $\delta_{ij}^{(r)}$ as described above.
3) M-Step: at the $(r+1)$th iteration, update $\pi$, $\alpha$, $\beta$ and $\sigma$ by the block-wise formulas above. Note that a perfect least absolute deviation (LAD) fit may occur, i.e. $y_i = \alpha_j^{(r)} + \boldsymbol{x}_i^{\top}\boldsymbol{\beta}_j^{(r)}$ for some $i$, $j$ and $r$. In that case $\delta_{ij}^{(r+1)}$ becomes very large and causes numerical instability. In this article, we simply introduce a hard threshold to control the extremely small LAD residuals: $\delta_{ij}^{(r+1)}$ is assigned the value $10^6$ when a perfect LAD fit occurs.
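In code, this hard threshold amounts to a small guard; a sketch with illustrative names, assuming $\delta$ is computed as a scale over the absolute residual:

```python
import numpy as np

def apply_lad_guard(resid, sigma_j, tol=1e-12, cap=1e6):
    """Hard threshold: a (near-)zero LAD residual maps delta to the value 1e6."""
    resid = np.abs(resid)
    return np.where(resid < tol, cap, sigma_j / resid)
```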

Convergence Analysis
The EM algorithm is iterated until a convergence criterion is met. Let tol be a small tolerance constant and $M$ the maximum number of iterations for the proposed algorithm. We regard the algorithm as having converged when the change in the penalized log-likelihood between successive iterations falls below tol, i.e. $\lvert \tilde\ell_n(\Psi^{(r+1)}) - \tilde\ell_n(\Psi^{(r)})\rvert < \text{tol}$, or when the number of iterations exceeds the maximum $M$. See [17] for details regarding the relative merits of different convergence criteria.
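A sketch of this stopping rule, assuming the objective-change criterion described above and generic update and objective functions:

```python
def run_em(update_fn, objective_fn, psi0, tol=1e-6, max_iter=500):
    """Iterate an EM update until the objective change is below tol or
    max_iter iterations are reached. update_fn performs one E/M cycle and
    objective_fn evaluates the penalized log-likelihood (both assumed given)."""
    psi, obj = psi0, objective_fn(psi0)
    for r in range(max_iter):
        psi_new = update_fn(psi)
        obj_new = objective_fn(psi_new)
        if abs(obj_new - obj) < tol:          # converged
            return psi_new, obj_new, r + 1
        psi, obj = psi_new, obj_new
    return psi, obj, max_iter                 # stopped at the iteration cap
```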
According to Dempster et al. [2], each iteration of the E step and M step of the EM algorithm monotonically non-decreases the objective function (8), i.e. $\tilde\ell_n(\Psi^{(r+1)}) \geq \tilde\ell_n(\Psi^{(r)})$ for all $r \geq 0$. Moreover, Wu [18] proved that, under some general conditions, any limit point of the EM sequence $\{\Psi^{(r)}\}$ is a stationary point of (8). Given the facts above, in this article we run the algorithm multiple times from different initializations $\Psi^{(0)}$ in order to obtain an appropriate limit point.

Asymptotic Properties
For the regression coefficient vector $\boldsymbol{\beta}_j$ in the $j$th component, we can separate it into $\boldsymbol{\beta}_j = (\boldsymbol{\beta}_{j1}^{\top}, \boldsymbol{\beta}_{j2}^{\top})^{\top}$, where $\boldsymbol{\beta}_{j1}$ is the set of non-zero effects and $\boldsymbol{\beta}_{j2}$ is the set of zero effects. Naturally, we decompose the parameter vector as $\Psi = (\Psi_1^{\top}, \Psi_2^{\top})^{\top}$, where $\Psi_2$ contains all zero effects, namely $\boldsymbol{\beta}_{j2}$, $j = 1, \ldots, g$. The true parameter is denoted by $\Psi_0$, and the elements of $\Psi_0$ carry a zero subscript, such as $\beta_{0jt}$.
For the purpose of easy discussion, we define functions $M(z)$ (possibly depending on $\Psi_0$), with finite expectation, that bound the third-order partial derivatives of $\log f(z; \Psi)$ for $\Psi$ in a neighborhood of $\Psi_0$, as required in condition A3.
A4. The Fisher information matrix is finite and positive definite for each $\Psi \in \Omega$.
Theorem 1 states that, under conditions A1-A4 and suitable conditions on the tuning parameters, there exists a local maximizer $\hat\Psi_n$ of the penalized log-likelihood that is $\sqrt{n}$-consistent for $\Psi_0$.
Proof. Consider parameter values of the form $\Psi = \Psi_0 + n^{-1/2}\boldsymbol{u}$ with $\lVert\boldsymbol{u}\rVert = M_\varepsilon$. We show that, with large probability, the penalized log-likelihood attains a local maximum inside the ball $\{\Psi_0 + n^{-1/2}\boldsymbol{u} : \lVert\boldsymbol{u}\rVert \leq M_\varepsilon\}$, and this local maximizer $\hat\Psi_n$ therefore satisfies $\lVert\hat\Psi_n - \Psi_0\rVert = O_p(n^{-1/2})$. Without loss of generality, we assume that the first $d_j$ coefficients of $\boldsymbol{\beta}_{0j}$ are non-zero and the first $K_j$ groups contain all non-zero effects of $\boldsymbol{\beta}_{0j}$, where $\boldsymbol{\beta}_{0j}$ is the true regression coefficient vector in the $j$th component of the mixture regression model. Since the penalty is a sum of non-negative terms, removing the terms corresponding to zero effects makes it smaller. By Taylor's expansion, the triangle inequality and the arithmetic-geometric mean inequality, the difference between the penalized log-likelihood at $\Psi_0 + n^{-1/2}\boldsymbol{u}$ and at $\Psi_0$ is dominated, for large $M_\varepsilon$, by a negative quadratic term in $\boldsymbol{u}$; the regularity conditions guarantee that the Fisher information matrix is positive definite, so the sign of this difference is negative on the boundary of the ball. Therefore, for any given $\varepsilon > 0$, there is a sufficiently large $M_\varepsilon$ such that the penalized log-likelihood has a local maximum inside the ball with probability at least $1 - \varepsilon$, which completes the proof.
Theorem 2. Suppose the conditions given in Theorem 1 hold, $g$ is known, and the tuning parameters satisfy appropriate rate conditions as $n \to \infty$. Then, for any $\sqrt{n}$-consistent maximum penalized log-likelihood estimator $\hat\Psi_n$, we have the following:
1) Sparsity: as $n \to \infty$, $P(\hat{\boldsymbol{\beta}}_{j2} = \mathbf{0}) \to 1$ for $j = 1, \ldots, g$.
2) Asymptotic normality: $\sqrt{n}\,(\hat\Psi_{n1} - \Psi_{01}) \xrightarrow{d} N\bigl(\mathbf{0},\, I_1(\Psi_{01})^{-1}\bigr)$, where $I_1(\Psi_{01})$ is the Fisher information when all zero effects are removed.
Proof. In order to prove the sparsity part of Theorem 2, we consider the partition $\Psi = (\Psi_1^{\top}, \Psi_2^{\top})^{\top}$ and let $\hat\Psi_1$ be the maximizer of the penalized log-likelihood regarded as a function of $\Psi_1$ with $\Psi_2 = \mathbf{0}$. It suffices to show that, in an $n^{-1/2}$-neighborhood of $\Psi_0$, the partial derivative of the penalized log-likelihood with respect to each zero-effect coefficient has the opposite sign to that coefficient with probability tending to one, so that the maximum must be attained at zero. By the mean value theorem, the corresponding partial derivative of the log-likelihood can be expanded around the true parameter, and by the mean value theorem together with regularity condition A3 the remainder term is bounded.
Here $\Psi_{01}$ is the subvector of $\Psi_0$ with all zero regression coefficients removed. The regularity conditions imply that the partial derivative of the log-likelihood with respect to a zero-effect coefficient is of order $O_p(\sqrt{n})$. In this case, for large $n$ the derivative of the penalty term dominates, because the adaptive weights attached to zero effects diverge. Hence the sign of the partial derivative of the penalized log-likelihood is determined by the penalty, and the corresponding coefficient estimates equal zero with probability tending to one as $n \to \infty$. This completes the proof of the sparsity.
For the asymptotic normality part of Theorem 2, we use the same argument as in Theorem 1 and consider the penalized log-likelihood as a function of $\Psi_1$ with the zero effects removed. By Taylor's expansion of its gradient at $\Psi_{01}$, the penalized estimator admits the usual asymptotically linear representation, and the asymptotic normality then follows from the central limit theorem together with Slutsky's theorem. This completes the proof of the asymptotic normality. Consequently, as long as the stated conditions on the tuning parameters hold as $n \to \infty$, the conclusions of Theorem 1 and Theorem 2 remain valid. Since the initial estimator $\tilde{\boldsymbol{\beta}}$ is based on the Lasso penalty, it can be made $\sqrt{n}$-consistent, so the adaptive weights used in the penalty are well defined.

Tuning Parameters and Components Selection
In this section, we address two problems: one concerns the number of components $g$, and the other is the selection of the tuning parameters of the penalty. So far there is little theoretical support for the selection of these hyperparameters. In the earlier literature, cross validation [19] and generalized cross validation [20] provide some effective guidance for these problems. Grün and Leisch [21] and Nguyen and McLachlan [22] indicated that the Bayesian information criterion (BIC) performs well for these problems. In this paper, we also use the BIC,
$$\mathrm{BIC} = -2\,\ell_n(\hat\Psi_n) + \log(n)\,\mathrm{df},$$
where df counts the free parameters of the fitted model and $d_j$ is the number of non-zero regression coefficients in the $j$th regression model.
Suppose that there is a set of candidate parameter combinations $\theta_s$, $s = 1, \ldots, S$, each consisting of a number of components and a set of tuning parameters. For each parameter combination $\theta_s$, we obtain the parameter estimate $\hat\Psi_{n,s}$ by the proposed algorithm, together with the corresponding value $\mathrm{BIC}_s$. Finally, we take $\theta_{s^*}$ as the parameter combination for our robust mixture model, where $s^* = \arg\min_s \mathrm{BIC}_s$.
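A sketch of this BIC-based grid search follows; the degrees-of-freedom count (non-zero coefficients plus mixing, intercept and scale parameters) is an assumption, since the exact formula is not reproduced above.

```python
import numpy as np

def bic(loglik, n_nonzero_coef, g, n):
    """BIC with an assumed parameter count: non-zero coefficients plus
    (g - 1) mixing probabilities, g intercepts and g scale parameters."""
    df = n_nonzero_coef + 3 * g - 1
    return -2.0 * loglik + np.log(n) * df

def select_by_bic(candidates, fit_fn, n):
    """candidates: list of (g, lam1, lam2); fit_fn is assumed to return
    (estimate, loglik, number of non-zero coefficients) for one combination."""
    best = None
    for g, lam1, lam2 in candidates:
        est, loglik, n_nonzero = fit_fn(g, lam1, lam2)
        score = bic(loglik, n_nonzero, g, n)
        if best is None or score < best[0]:
            best = (score, g, lam1, lam2, est)
    return best
```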

Numerical Simulation
To quantify the performance of the proposed robust mixture regression model based on the adaptive sparse group Lasso (adaSGL-RMR), we design a numerical simulation and generate sample data $(\boldsymbol{x}_i, y_i, Z_i)$, $i = 1, \ldots, n$, where $Z_i$ is a component indicator. There are $K = 6$ groups and each group consists of 5 covariates; covariates within the same group are correlated, whereas those in different groups are uncorrelated. The covariates $x_{kt}$, $1 \le k \le 6$, $1 \le t \le 5$, are generated group by group according to this structure.
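The exact within-group correlation structure is not reproduced above, so the sketch below simply assumes an equicorrelated (compound-symmetry) block for each group and a two-component mixture with Laplace errors; the particular coefficient values are illustrative assumptions, not the paper's simulation design.

```python
import numpy as np

def simulate(n, n_groups=6, group_size=5, rho=0.5, seed=0):
    """Generate grouped covariates (correlated within groups, independent
    across groups) and responses from a two-component Laplace-error mixture."""
    rng = np.random.default_rng(seed)
    p = n_groups * group_size
    # Block-diagonal covariance: equicorrelation rho inside each group.
    block = (1 - rho) * np.eye(group_size) + rho * np.ones((group_size, group_size))
    cov = np.kron(np.eye(n_groups), block)
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    Z = rng.binomial(1, 0.5, size=n)                 # component indicator
    beta1, beta2 = np.zeros(p), np.zeros(p)          # illustrative sparse effects
    beta1[:5], beta2[5:10] = 1.0, -1.0
    mean = np.where(Z == 1, X @ beta1, X @ beta2)
    y = mean + rng.laplace(scale=1.0, size=n)        # Laplace errors
    return X, y, Z
```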
As a baseline, we fit a linear model with stepwise variable selection based on BIC; the predicted logged salaries from this stepwise-BIC linear model show a mean square error (MSE) of 0.60 and an adjusted $R^2$ of 0.42. These poor results motivate us to investigate the problem further. The histogram of the logged salaries in Figure 1 shows multi-modality, so it is reasonable to use a mixture regression model to predict the logged salaries.
For comparison, we run multiple analyses with three sets of starting parameters for each of the $g = 2$, $3$ and $4$ adaSGL-RMR models.

Conclusion
In this paper, we propose a robust mixture regression model based on a Laplace distribution and consider the adaptive sparse group Lasso for variable selection.
The oracle properties of the resulting estimator are proved in Section 3. In addition, the numerical simulation and the real data application show that our method performs better in parameter estimation and variable selection than the alternative methods. A limitation of this study is that we only consider the mixture linear regression model.

4) Repeat the E-Step and M-Step until convergence is obtained. Note that the case of a perfect least absolute deviation (LAD) fit in the EM algorithm is handled by the hard threshold on $\delta_{ij}^{(r+1)}$ described above.
Let $\Omega$ denote an open parameter space. In order to prove the asymptotic properties of the proposed method, some regularity conditions on the joint distribution of $z$ are required.
A1. The density $f(z; \Psi)$ has common support in $z$ for all $\Psi \in \Omega$, and $f(z; \Psi)$ is identifiable in $\Psi$ up to a permutation of the components of the mixture.
A2. For each $\Psi \in \Omega$, the density $f(z; \Psi)$ admits third partial derivatives with respect to $\Psi$ for almost all $z$.
A3. For each $\Psi_0 \in \Omega$, there are functions $M(z)$ with finite expectation that bound the third-order partial derivatives of $\log f(z; \Psi)$ for $\Psi$ in a neighborhood of $\Psi_0$.

Let $z_i$, $i = 1, \ldots, n$, be a random sample from the joint density function $f(z; \Psi)$ satisfying the regularity conditions A1-A4; $\xrightarrow{p}$ represents convergence in probability.
The predicted results from the $g = 2$ adaSGL-RMR model (BIC = 625) have an MSE of 0.11 and an adjusted $R^2$ of 0.90. The predicted results from the $g = 3$ adaSGL-RMR model (BIC = 598) have an MSE of 0.05 and an adjusted $R^2$ of 0.95. The predicted results from the $g = 4$ adaSGL-RMR model (BIC = 517) have an MSE of 0.04 and an adjusted $R^2$ of 0.96. See Table 2 for more details. These results suggest that the $g = 4$ adaSGL-RMR model has the smallest MSE and explains the largest proportion of variance in the logged salaries from the 2018/19 NBA regular season. Moreover, Figure 2 shows that the predicted densities from the adaSGL-RMR models characterize the multi-modality of the logged salaries well, which the stepwise-BIC linear model is unable to capture.
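For reference, the fit measures quoted above can be computed as follows (a generic sketch, not tied to the paper's data or model objects):

```python
import numpy as np

def fit_measures(y, y_pred, n_params):
    """Mean squared error and adjusted R-squared for predicted values."""
    n = len(y)
    mse = np.mean((y - y_pred) ** 2)
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_params - 1)
    return mse, adj_r2
```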

Figure 1. Histogram and density estimate for logged salaries.

Figure 2. Summary of densities for predicted and observed logged salaries.

2.1. Robust Mixture Regression with Laplace Distribution
Let $y$ be the response variable, which depends on the corresponding covariate vector $\boldsymbol{x}$. Furthermore, for $g$ mixture components, the conditional density of $y$ given $\boldsymbol{x}$ is the mixture model given above, based on a sample of $n$ observations.