
The traditional estimation of the Gaussian mixture model is sensitive to heavy-tailed errors; in this article we therefore propose a robust mixture regression model that assumes the error terms follow a Laplace distribution. For variable selection in this new robust mixture regression model, we introduce the adaptive sparse group Lasso penalty, which achieves sparsity at both the group level and the within-group level. Numerical experiments show that, compared with alternative methods, our method performs better in variable selection and parameter estimation. Finally, we apply the proposed method to analyze NBA salary data for the period from 2018 to 2019.

The mixture regression model is a powerful tool for explaining the relationship between the response variable and the covariates when the population is heterogeneous and consists of several homogeneous components; early research traces back to [

Recently, research on the mixture regression model has become more and more detailed. On the one hand, statisticians have paid attention to improving the robustness of the mixture regression model, [ the ℓ 1 -norm penalty, the SCAD penalty and the Hard penalty. Furthermore, the adaptive Lasso was introduced as a penalty function in [

However, the above regularization methods focus on individual variable selection. They all ignore the grouping structures that describe the inherent interconnections among predictors, which may lead to inefficient models. In order to achieve both robustness of the mixture regression model and correct identification of group structures, we assume the random errors follow a Laplace distribution and consider the situation where covariates have natural grouping structures, with those in the same group being correlated. In this case, variable selection should be conducted at both the group level and the within-group level; thus we use the adaptive sparse group Lasso [

The rest of this article is organized as follows. In Section 2, we introduce the robust mixture regression model based on Laplace distribution and adopt the adaptive sparse group Lasso for variable selection. In Section 3, we prove some asymptotic properties for our proposed method. In Section 4, we solve the problem of tuning parameters and components selection. Section 5 conducts a numerical simulation to evaluate the performance of our method. In Section 6, we apply our proposed method to NBA salary data. Finally, the conclusion of this paper is given in Section 7.

Let { ( x 1 , y 1 ) , ⋯ , ( x n , y n ) } be a random sample of observations from the population ( x , y ) , where x = ( x 1 , ⋯ , x p ) T ∈ ℝ p is a p-dimensional covariate vector, and y ∈ ℝ is the response variable which is dependent on corresponding x . Furthermore, for g mixture components, we can say that ( x , y ) follows a mixture linear regression model based on a normal distribution if the conditional density of y given x is

f ( y | x , Ψ ) = ∑ j = 1 g π j 2 π σ j 2 exp ( − ( y − α j − x T β j ) 2 2 σ j 2 ) , (1)

where the mixing probabilities satisfy ∑ j = 1 g π j = 1 , π j > 0 and the parameter Ψ = ( π 1 , α 1 , β 1 , σ 1 2 , ⋯ , π g , α g , β g , σ g 2 ) . For the jth mixture component, there are intercept α j ∈ ℝ , regression coefficients β j = ( β j 1 , ⋯ , β j p ) T and variance σ j 2 ( σ j > 0 ) .

It is known that the mixture linear regression model is sensitive to outliers and heavy-tailed error distributions. Outliers impact the mixture linear regression model more heavily than the usual linear regression model, since they not only affect the estimation of the regression parameters but may also completely blur the mixture structure. To improve the robustness of the estimation procedure, we introduce a robust mixture regression model with a Laplace distribution

f ( y | x , Ψ ) = ∑ j = 1 g π j 2 σ j 2 exp ( − 2 | y − α j − x T β j | σ j ) . (2)

Then we can estimate the unknown parameter Ψ by maximizing the log-likelihood function

L n ( Ψ ) = ∑ i = 1 n log ∑ j = 1 g π j 2 σ j 2 exp ( − 2 | y i − α j − x i T β j | σ j ) . (3)
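The density (2) and the log-likelihood (3) are straightforward to evaluate numerically. The following is a minimal numpy sketch (the function name and array layout are ours, not from the paper); note that under (2) each component has Laplace scale σ_j/√2, so its variance is σ_j².

```python
import numpy as np

def laplace_mixture_loglik(y, X, pi, alpha, beta, sigma):
    """Log-likelihood (3) of the Laplace mixture regression model.

    y: (n,) responses; X: (n, p) covariates; beta: (p, g) coefficients;
    pi, alpha, sigma: length-g component parameters.
    """
    resid = y[:, None] - alpha[None, :] - X @ beta            # (n, g) residuals
    dens = pi / np.sqrt(2.0 * sigma**2) \
        * np.exp(-np.sqrt(2.0) * np.abs(resid) / sigma)       # component densities (2)
    return float(np.sum(np.log(dens.sum(axis=1))))
```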

Now, we consider a situation that covariates have natural grouping structures and can be divided into K groups as x = ( x [ 1 ] T , ⋯ , x [ k ] T , ⋯ , x [ K ] T ) T ∈ ℝ p by some known rules, where x [ k ] = ( x 1 [ k ] , ⋯ , x p k [ k ] ) T is a group which contains p k variables and ∑ k = 1 K p k = p . Then, the log-likelihood function can be written as

L n ( Ψ ) = ∑ i = 1 n log ∑ j = 1 g π j 2 σ j 2 exp ( − 2 | y i − α j − ∑ k = 1 K x i [ k ] T β j [ k ] | σ j ) , (4)

where β j [ k ] = ( β j 1 [ k ] , ⋯ , β j p k [ k ] ) T .

In order to exploit the grouping structures of covariates, we apply the adaptive sparse group Lasso (adaSGL) to the robust mixture regression model; the penalized log-likelihood function is

F n ( Ψ ) = L n ( Ψ ) − P n ( Ψ ) . (5)

Here

P n ( Ψ ) = n ∑ j = 1 g π j ∑ t = 1 p λ j t 1 | β j t | + n ∑ j = 1 g π j ∑ k = 1 K λ j k 2 ‖ β j [ k ] ‖ , (6)

where ‖ ⋅ ‖ denotes the Euclidean norm, λ j t 1 = λ j 1 ω j t and λ j k 2 = λ j 2 ξ j k . Moreover, the weights are defined based on the maximum penalized log-likelihood estimator Ψ ˜ obtained when P n ( Ψ ) is a Lasso penalty,

ω j t = | β ˜ j t | − 1 , ξ j k = ‖ β ˜ j [ k ] ‖ − 1 . (7)

Next, we follow the approach of Hunter and Li [

F n , ε ( Ψ ) = L n ( Ψ ) − P n , ε ( Ψ ) . (8)

Here

P n , ε ( Ψ ) = n ∑ j = 1 g π j ∑ t = 1 p λ j t 1 β j t 2 + ε 2 + n ∑ j = 1 g π j ∑ k = 1 K λ j k 2 ∑ t = 1 p k β j t 2 [ k ] + ε 2 (9)

for some small ε > 0 , and the weights are

ω j t = ( β ˜ j t 2 + ε 2 ) − 1 / 2 , ξ j k = ( ∑ t = 1 p k β ˜ j t 2 [ k ] + ε 2 ) − 1 / 2 . (10)
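As a concrete illustration, the perturbed penalty (9) with the adaptive weights (10) can be computed as follows. This is a sketch under our own naming conventions (per-observation version, i.e. without the leading factor n), with `groups` a list of covariate index arrays:

```python
import numpy as np

def adasgl_penalty(beta, pi, groups, lam1, lam2, beta_tilde, eps=1e-4):
    """Perturbed adaSGL penalty (9) divided by n, with weights (10).

    beta, beta_tilde: (g, p) coefficient matrices; pi, lam1, lam2: length-g;
    groups: list of index arrays partitioning {0, ..., p-1}.
    """
    g, p = beta.shape
    total = 0.0
    for j in range(g):
        # individual adaptive weights (10): (beta_tilde_jt^2 + eps^2)^(-1/2)
        w = (beta_tilde[j] ** 2 + eps ** 2) ** -0.5
        total += pi[j] * lam1[j] * np.sum(w * np.sqrt(beta[j] ** 2 + eps ** 2))
        for idx in groups:
            # group adaptive weights (10)
            xi = (np.sum(beta_tilde[j, idx] ** 2) + eps ** 2) ** -0.5
            total += pi[j] * lam2[j] * xi * np.sqrt(np.sum(beta[j, idx] ** 2) + eps ** 2)
    return total
```

When `beta == beta_tilde`, each weighted term collapses to 1, which makes the value easy to check by hand.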

Following Hunter and Li [

However, the above penalized log-likelihood does not have an explicit maximizer. We introduce an EM algorithm to simplify the computation, and denote by Z i j the latent Bernoulli variable such that

Z i j = { 1 , if the i th observation ( x i , y i ) is from the j th component ; 0 , otherwise . (11)

If the complete data set T = ( X , Y , Z ) is observable, the complete log-likelihood function is

L n ( Ψ ; T ) = ∑ i = 1 n ∑ j = 1 g Z i j log [ π j 2 σ j 2 exp ( − 2 | y i − α j − ∑ k = 1 K x i [ k ] T β j [ k ] | σ j ) ] , (12)

where X = ( x 1 , ⋯ , x n ) T , Y = ( y 1 , ⋯ , y n ) and Z = ( Z 11 , ⋯ , Z n g ) .

According to Andrews and Mallows [

L n ( Ψ ; D ) = ∑ i = 1 n ∑ j = 1 g Z i j log { π j V i π σ j 2 exp [ − V i 2 ( y i − α j − ∑ k = 1 K x i [ k ] T β j [ k ] ) 2 σ j 2 ] } + ∑ i = 1 n ∑ j = 1 g Z i j log [ 1 V i 3 exp ( − 1 2 V i 2 ) ] , (13)

where D = ( T , V ) . Naturally, we can obtain the penalized complete log-likelihood F n , ε ( Ψ ; D ) = L n ( Ψ ; D ) − P n , ε ( Ψ ) .

Suppose that Ψ ( r ) is the parameter estimate at the rth iteration. In the E-step of the EM algorithm, we obtain E [ F n , ε ( Ψ ; D ) | S , Ψ ( r ) ] by calculating

τ i j ( r ) = E [ Z i j | S , Ψ ( r ) ] , δ i j ( r ) = E [ V i 2 | S , Ψ ( r ) , Z i j = 1 ] , (14)

where S = ( X , Y ) . We can show that

τ i j ( r ) = π j ( r ) σ j − 1 ( r ) exp ( − 2 | y i − α j ( r ) − ∑ k = 1 K x i [ k ] T β j [ k ] ( r ) | σ j − 1 ( r ) ) ∑ j = 1 g π j ( r ) σ j − 1 ( r ) exp ( − 2 | y i − α j ( r ) − ∑ k = 1 K x i [ k ] T β j [ k ] ( r ) | σ j − 1 ( r ) ) (15)

and

δ i j ( r ) = σ j ( r ) 2 | y i − α j ( r ) − ∑ k = 1 K x i [ k ] T β j [ k ] ( r ) | . (16)

The calculation for δ i j ( r ) follows the same argument as in Phillips [
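The E-step quantities (15) and (16) can be vectorized as below; this is a hedged numpy sketch (names and vectorization ours). The responsibilities are computed on the log scale for numerical stability, and δ is capped at 10^{6} to guard against perfect LAD fits, a safeguard the text introduces later:

```python
import numpy as np

def e_step(y, X, pi, alpha, beta, sigma):
    """Responsibilities tau_ij from (15) and latent-scale expectations
    delta_ij from (16). beta has shape (p, g)."""
    resid = y[:, None] - alpha[None, :] - X @ beta            # (n, g)
    abs_r = np.abs(resid)
    # component weight proportional to pi_j * sigma_j^{-1} * exp(-sqrt(2)|r|/sigma_j)
    logw = np.log(pi) - np.log(sigma) - np.sqrt(2.0) * abs_r / sigma
    w = np.exp(logw - logw.max(axis=1, keepdims=True))        # stabilised
    tau = w / w.sum(axis=1, keepdims=True)                    # (15)
    # (16): delta = sigma_j / (sqrt(2) |r|), capped against perfect LAD fits
    delta = np.minimum(sigma / (np.sqrt(2.0) * abs_r + 1e-300), 1e6)
    return tau, delta
```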

In the M-step, we maximize E [ F n , ε ( Ψ ; D ) | S , Ψ ( r ) ] to update Ψ . Now, we follow the tactic of [

ψ 2 + ε 2 ≃ ψ 0 2 + ε 2 + ψ 2 − ψ 0 2 2 ψ 0 2 + ε 2 (17)

in a neighborhood of ψ 0 . Then, we can replace the penalty function P n , ε ( Ψ ) at the ( r + 1 ) th iteration by

P ˜ n , ε ( Ψ ; Ψ ( r ) ) = n ∑ j = 1 g π j ∑ t = 1 p λ j t 1 [ η j t ( r ) + β j t 2 − β j t 2 ( r ) 2 η j t ( r ) ] + n ∑ j = 1 g π j ∑ k = 1 K λ j k 2 [ γ j k ( r ) + ∑ t = 1 p k ( β j t 2 [ k ] − β j t 2 [ k ] ( r ) ) 2 γ j k ( r ) ] , (18)

where η j t ( r ) = β j t 2 ( r ) + ε 2 and γ j k ( r ) = ∑ t = 1 p k β j t 2 [ k ] ( r ) + ε 2 . Similarly, from Lange [

( y − α − x T ψ ) 2 ≃ 1 p ∑ t = 1 p [ y − α − x T ψ 0 − p x t ( ψ t − ψ 0 t ) ] 2 (19)

in a neighborhood of ψ 0 = ( ψ 01 , ⋯ , ψ 0 p ) T , where p is the dimensionality of ψ . Applying (19) to E [ L n , ε ( Ψ ; D ) | S , Ψ ( r ) ] , we obtain a local approximation Q ( Ψ ; Ψ ( r ) ) of E [ F n , ε ( Ψ ; D ) | S , Ψ ( r ) ]

Q ( Ψ ; Ψ ( r ) ) = − ∑ i = 1 n ∑ j = 1 g τ i j ( r ) δ i j ( r ) p σ j 2 ∑ t = 1 p [ y i − α j − x i T β j ( r ) − p x i t ( β j t − β j t ( r ) ) ] 2 − n 2 ∑ j = 1 g π j ( ∑ t = 1 p λ j t 1 η j t ( r ) β j t 2 + ∑ k = 1 K λ j k 2 γ j k ( r ) ∑ t = 1 p k β j t 2 [ k ] ) + ∑ i = 1 n ∑ j = 1 g τ i j ( r ) ( log π j − 1 2 log σ j 2 ) − n ∑ j = 1 g π j [ ∑ t = 1 p λ j t 1 ( η j t ( r ) − β j t 2 ( r ) 2 η j t ( r ) ) + ∑ k = 1 K λ j k 2 ( γ j k ( r ) − ∑ t = 1 p k β j t 2 [ k ] ( r ) 2 γ j k ( r ) ) ] − ∑ i = 1 n ∑ j = 1 g τ i j ( r ) ( 1 2 log π + log δ i j ( r ) + 1 2 δ i j ( r ) ) (20)

in a neighborhood of Ψ ( r ) . Note that (20) can be block-wise maximized in the coordinates of the parameter components π , α , β and σ 2 . Here, π = ( π 1 , ⋯ , π g ) , α = ( α 1 , ⋯ , α g ) , β = ( β 11 , ⋯ , β g p ) and σ 2 = ( σ 1 2 , ⋯ , σ g 2 ) .
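A remark on the surrogate (17): since u ↦ √(u + ε²) is concave, the right-hand side of (17) is not merely a local expansion but a tangent-line upper bound, exact at ψ = ±ψ₀:

```latex
\sqrt{\psi^{2}+\varepsilon^{2}}
  \;\le\; \sqrt{\psi_{0}^{2}+\varepsilon^{2}}
        + \frac{\psi^{2}-\psi_{0}^{2}}{2\sqrt{\psi_{0}^{2}+\varepsilon^{2}}},
\qquad\text{since } h(u)=\sqrt{u+\varepsilon^{2}} \text{ is concave and }
h'(u_{0})=\frac{1}{2\sqrt{u_{0}+\varepsilon^{2}}}.
```

Consequently the surrogate penalty P ˜ n , ε majorizes P n , ε , so maximizing Q ( Ψ ; Ψ ( r ) ) enjoys the ascent property typical of MM algorithms.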

Under the constraints that ∑ j = 1 g π j = 1 and π j > 0 , we adopt a Lagrange multiplier ζ to update π by solving

∇ [ Q ( π , α ( r ) , β ( r ) , σ 2 ( r ) ; Ψ ( r ) ) − ζ ( ∑ j = 1 g π j − 1 ) ] = 0 , (21)

where ∇ is the gradient operator, ζ is a positive scalar and 0 is a zero vector. Then we have the set of simultaneous equations

a j / π j − b j − ζ = 0 , (22)

where a j = ∑ i = 1 n τ i j ( r ) and b j = n ∑ t = 1 p λ j t 1 η j t ( r ) + n ∑ k = 1 K λ j k 2 γ j k ( r ) , for each j.

According to π j = a j / ( b j + ζ ) and ∑ j = 1 g π j = 1 , we can obtain the unique root ζ * by solving the equation

∑ j = 1 g a j b j + ζ − 1 = 0. (23)

Therefore, the ( r + 1 ) th iterate

π j ( r + 1 ) = a j b j + ζ * . (24)
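Equation (23) has a unique root ζ* because its left-hand side is strictly decreasing in ζ on ( − min j b j , ∞ ). A small sketch (our own bisection implementation; names assumed) that solves (23) and applies (24):

```python
import numpy as np

def update_pi(a, b, tol=1e-12):
    """Find the root zeta* of (23), sum_j a_j/(b_j + zeta) = 1, by bisection,
    then return pi_j = a_j/(b_j + zeta*) as in (24). Assumes a_j > 0, b_j >= 0."""
    f = lambda z: np.sum(a / (b + z)) - 1.0
    lo = -b.min() + 1e-10           # f(z) -> +inf as z -> -min(b) from above
    hi = a.sum() - b.min() + 1.0    # here f(hi) < 0 since sum a_j/(b_j+hi) < 1
    while hi - lo > tol:            # bisection on the monotone function f
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    zeta = 0.5 * (lo + hi)
    return a / (b + zeta)
```

In the unpenalized case b = 0, the root is ζ* = Σ_j a_j, recovering the familiar update π_j = a_j / n.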

Furthermore, by solving ∇ Q ( π ( r + 1 ) , α , β ( r ) , σ 2 ; Ψ ( r ) ) = 0 , we have the updates

α j ( r + 1 ) = ∑ i = 1 n τ i j ( r ) δ i j ( r ) ( y i − x i T β j ( r ) ) ∑ i = 1 n τ i j ( r ) δ i j ( r ) (25)

and

σ j 2 ( r + 1 ) = 2 ∑ i = 1 n τ i j ( r ) δ i j ( r ) ( y i − α j ( r + 1 ) − x i T β j ( r ) ) 2 ∑ i = 1 n τ i j ( r ) . (26)

Similarly, for the parameter β j t in the k th group, we obtain the update formula

β j t ( r + 1 ) = ∑ i = 1 n 2 τ i j ( r ) δ i j ( r ) x i t ( y i − α j ( r + 1 ) − x i T β j ( r ) + p x i t β j t ( r ) ) n π j ( r + 1 ) σ j 2 ( r + 1 ) ( λ j t 1 / η j t ( r ) + λ j k 2 / γ j k ( r ) ) + 2 p ∑ i = 1 n τ i j ( r ) δ i j ( r ) x i t 2 (27)

by solving ∇ Q ( π ( r + 1 ) , α ( r + 1 ) , β , σ 2 ( r + 1 ) ; Ψ ( r ) ) = 0 .

Based on the above, we propose the following EM algorithm.

1) Choose an initial value Ψ ( 0 ) = ( π 1 ( 0 ) , α 1 ( 0 ) , β 1 ( 0 ) , σ 1 2 ( 0 ) , ⋯ , π g ( 0 ) , α g ( 0 ) , β g ( 0 ) , σ g 2 ( 0 ) ) .

2) E-Step: at the ( r + 1 ) th iteration, calculate τ i j ( r ) and δ i j ( r ) by (15) and (16).

3) M-Step: at the ( r + 1 ) th iteration, update π j , α j , σ j 2 and β j t by (24), (25), (26) and (27).

4) Repeat E-Step and M-Step until convergence is obtained.
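Steps 1)-4) can be sketched end to end. The following is a simplified, unpenalized instance (all λ set to 0, so b j = 0 and (24) reduces to π j = a j / n); it is our own illustrative implementation of the updates (15), (16) and (25)-(27), not the authors' code:

```python
import numpy as np

def em_laplace_mixture(y, X, g, iters=50, seed=0):
    """EM for the Laplace mixture regression model, unpenalized case."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pi = np.full(g, 1.0 / g)
    alpha = rng.normal(size=g)
    beta = np.zeros((p, g))
    sigma = np.ones(g)
    for _ in range(iters):
        # E-step: responsibilities (15) and latent scales (16), delta capped at 1e6
        resid = y[:, None] - alpha[None, :] - X @ beta
        abs_r = np.abs(resid)
        logw = np.log(pi) - np.log(sigma) - np.sqrt(2.0) * abs_r / sigma
        w = np.exp(logw - logw.max(axis=1, keepdims=True))
        tau = w / w.sum(axis=1, keepdims=True)
        delta = np.minimum(sigma / (np.sqrt(2.0) * abs_r + 1e-300), 1e6)
        td = tau * delta
        # M-step: (24) with b_j = 0, then (25), (26), (27)
        pi = tau.sum(axis=0) / n
        alpha = (td * (y[:, None] - X @ beta)).sum(axis=0) / td.sum(axis=0)
        resid2 = (y[:, None] - alpha[None, :] - X @ beta) ** 2
        sigma = np.sqrt(2.0 * (td * resid2).sum(axis=0) / tau.sum(axis=0))
        new_beta = beta.copy()
        for j in range(g):
            r_j = y - alpha[j] - X @ beta[:, j]
            num = 2.0 * (td[:, j, None] * X * (r_j[:, None] + p * X * beta[:, j])).sum(axis=0)
            den = 2.0 * p * (td[:, j, None] * X ** 2).sum(axis=0)
            new_beta[:, j] = num / den                       # coordinate update (27)
        beta = new_beta
    return pi, alpha, beta, sigma
```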

Note that a perfect least absolute deviation (LAD) fit may occur in the EM algorithm, i.e. y i − α j ( r ) − x i T β j ( r ) ≈ 0 for some i, j and r. In that case, δ i j ( r + 1 ) becomes extremely large and causes numerical instability. In this article, we simply introduce a hard threshold to control extremely small LAD residuals: δ i j ( r + 1 ) is assigned the value 10^{6} whenever a perfect LAD fit occurs.

The EM algorithm is iterated until a convergence criterion is met. Let tol be a small tolerance constant and M the maximum number of iterations; we regard the algorithm as converged when

| F n , ε ( Ψ ( r + 1 ) ) − F n , ε ( Ψ ( r ) ) | < t o l , (28)

or the number of iterations exceeds the maximum M. See [

According to Dempster et al. [

For the regression coefficient vector β j in the jth component, we can separate it into β j = ( β 1 j T , β 2 j T ) T , where β 1 j collects the non-zero effects and β 2 j the zero effects. Accordingly, we decompose the parameter Ψ = ( Ψ 1 , Ψ 2 ) such that Ψ 2 contains all zero effects, namely β 2 j , j = 1 , ⋯ , g . The true parameter is denoted by Ψ 0 and its elements carry a subscript 0, such as β 0 j t .

For ease of discussion, we define a n = max j , t { λ j t 1 : β 0 j t ≠ 0 } , a n ∗ = max j , k { λ j k 2 : β 0 j [ k ] ≠ 0 } , b n = min j , t { λ j t 1 : β 0 j t = 0 } , b n ∗ = min j , k { λ j k 2 : β 0 j [ k ] = 0 } . Furthermore, we let f ( z ; Ψ ) be the joint density function of z = ( x , y ) and Ω be an open parameter space. To prove the asymptotic properties of the proposed method, some regularity conditions on the joint distribution of z are required.

A1. The density f ( z ; Ψ ) has common support in z for all Ψ ∈ Ω and f ( z ; Ψ ) is identifiable in Ψ up to a permutation of the components of the mixture.

A2. For each Ψ ∈ Ω , the density f ( z ; Ψ ) admits third partial derivatives with respect to Ψ for almost all z .

A3. For each Ψ 0 ∈ Ω , there exist functions M 1 ( z ) and M 2 ( z ) (possibly depending on Ψ 0 ) such that, for Ψ in a neighborhood N ( Ψ 0 ) ,

| ∂ f ( z ; Ψ ) ∂ Ψ u | ≤ M 1 ( z ) , | ∂ 2 f ( z ; Ψ ) ∂ Ψ u ∂ Ψ v | ≤ M 1 ( z ) , | ∂ 3 log f ( z ; Ψ ) ∂ Ψ u ∂ Ψ v ∂ Ψ w | ≤ M 2 ( z )

such that ∫ M 1 ( z ) d z < ∞ and ∫ M 2 ( z ) f ( z ; Ψ ) d z < ∞ .

A4. The Fisher information matrix

I ( Ψ ) = E { [ ∂ ∂ Ψ log f ( z ; Ψ ) ] [ ∂ ∂ Ψ log f ( z ; Ψ ) ] T }

is finite and positive definite for each Ψ ∈ Ω .

Theorem 1. Let z i = ( x i , y i ) , i = 1 , ⋯ , n , be a random sample from the joint density function f ( z ; Ψ ) satisfying the regularity conditions A1-A4. Suppose that √ n a n → p 0 and √ n a n ∗ → p 0 as n → ∞ ; then there is a local maximizer Ψ ^ n of the model (5) for which

‖ Ψ ^ n − Ψ 0 ‖ = O p ( n − 1 / 2 ) ,

where → p represents convergence in probability.

Proof. Let r n = n − 1 / 2 . It suffices to show that, for any given ε > 0 , there is a constant M ε such that

lim n → ∞ P { sup ‖ μ ‖ = M ε F n ( Ψ 0 + r n μ ) < F n ( Ψ 0 ) } ≥ 1 − ε . (29)

This implies that there is a local maximum in { Ψ 0 + r n μ : ‖ μ ‖ ≤ M ε } with large probability, and this local maximizer Ψ ^ n satisfies ‖ Ψ ^ n − Ψ 0 ‖ = O p ( r n ) . Then we let

Δ n ( μ ) = F n ( Ψ 0 + r n μ ) − F n ( Ψ 0 ) = [ L n ( Ψ 0 + r n μ ) − L n ( Ψ 0 ) ] − [ P n ( Ψ 0 + r n μ ) − P n ( Ψ 0 ) ] . (30)

Without loss of generality, we assume that the first d j coefficients of β 0 j are non-zero and the first K j groups contain all non-zero effects of β 0 j , where β 0 j is the true regression coefficient vector in the jth component of the mixture regression model. Hence, we have

P n ( Ψ 0 ) = n ∑ j = 1 g π j ∑ t = 1 d j λ j t 1 | β 0 j t | + n ∑ j = 1 g π j ∑ k = 1 K j λ j k 2 ‖ β 0 j [ k ] ‖ . (31)

Since P n ( Ψ 0 + r n μ ) is a sum of non-negative terms, removing terms corresponding to zero effects makes it smaller,

Δ n ( μ ) ≤ L n ( Ψ 0 + r n μ ) − L n ( Ψ 0 ) − n ∑ j = 1 g π j ∑ t = 1 d j λ j t 1 [ | β 0 j t + r n μ j t | − | β 0 j t | ] − n ∑ j = 1 g π j ∑ k = 1 K j λ j k 2 [ ‖ β 0 j [ k ] + r n μ j [ k ] ‖ − ‖ β 0 j [ k ] ‖ ] . (32)

By Taylor’s expansion, triangular inequality and arithmetic-geometric mean inequality,

L n ( Ψ 0 + r n μ ) − L n ( Ψ 0 ) = n − 1 / 2 L ′ n ( Ψ 0 ) T μ − 1 2 [ μ T I ( Ψ 0 ) μ ] { 1 + o p ( 1 ) } , | n ∑ j = 1 g π j ∑ t = 1 d j λ j t 1 [ | β 0 j t + r n μ j t | − | β 0 j t | ] | ≤ √ n a n ∑ j = 1 g d j ‖ μ ‖ , | n ∑ j = 1 g π j ∑ k = 1 K j λ j k 2 [ ‖ β 0 j [ k ] + r n μ j [ k ] ‖ − ‖ β 0 j [ k ] ‖ ] | ≤ √ n a n ∗ ∑ j = 1 g K j ‖ μ ‖ . (33)

Regularity conditions imply that L ′ n ( Ψ 0 ) = O p ( n 1 / 2 ) and I ( Ψ 0 ) is positive definite. Since √ n a n → p 0 and √ n a n ∗ → p 0 , the sign of Δ n ( μ ) is asymptotically determined by − 1 2 [ μ T I ( Ψ 0 ) μ ] { 1 + o p ( 1 ) } . Therefore, for any given ε > 0 , there is a sufficiently large M ε such that

lim n → ∞ P { sup ‖ μ ‖ = M ε Δ n ( μ ) < 0 } > 1 − ε , (34)

which implies (29). This completes the proof.

Theorem 2. Suppose that the conditions of Theorem 1 hold, g is known, and √ n a n → p 0 , √ n a n ∗ → p 0 , √ n b n → p ∞ and √ n b n ∗ → p ∞ , as n → ∞ . Then, for any √ n -consistent maximum penalized log-likelihood estimator Ψ ^ n , we have the following:

1) Sparsity: As n → ∞ , P ( β ^ 2 j = 0 ) → 1 , j = 1 , ⋯ , g .

2) Asymptotic normality:

√ n { [ P ″ n ( Ψ 01 ) / n + I 1 ( Ψ 01 ) ] ( Ψ ^ 1 − Ψ 01 ) + P ′ n ( Ψ 01 ) / n } → d N ( 0 , I 1 ( Ψ 01 ) ) ,

where → d denotes convergence in distribution and I 1 ( Ψ 01 ) is the Fisher information computed with all zero effects removed.

Proof. To prove the sparsity part of Theorem 2, we consider the partition Ψ = ( Ψ 1 , Ψ 2 ) and let ( Ψ ^ 1 , 0 ) be the maximizer of the penalized log-likelihood function F n ( Ψ 1 , 0 ) , regarded as a function of Ψ 1 . It suffices to show that, in the neighborhood ‖ Ψ − Ψ 0 ‖ = O ( n − 1 / 2 ) , we have P ( F n ( Ψ 1 , Ψ 2 ) − F n ( Ψ ^ 1 , 0 ) < 0 ) → 1 as n → ∞ . Then we have

F n ( Ψ 1 , Ψ 2 ) − F n ( Ψ ^ 1 , 0 ) = [ F n ( Ψ 1 , Ψ 2 ) − F n ( Ψ 1 , 0 ) ] + [ F n ( Ψ 1 , 0 ) − F n ( Ψ ^ 1 , 0 ) ] ≤ [ F n ( Ψ 1 , Ψ 2 ) − F n ( Ψ 1 , 0 ) ] . (35)

On the other hand,

F n ( Ψ 1 , Ψ 2 ) − F n ( Ψ 1 , 0 ) = [ L n ( Ψ 1 , Ψ 2 ) − L n ( Ψ 1 , 0 ) ] − [ P n ( Ψ 1 , Ψ 2 ) − P n ( Ψ 1 , 0 ) ] . (36)

By the mean value theorem,

L n ( Ψ 1 , Ψ 2 ) − L n ( Ψ 1 , 0 ) = [ ∂ L n ( Ψ 1 , ξ ) ∂ Ψ 2 ] T Ψ 2 (37)

for some ‖ ξ ‖ ≤ ‖ Ψ 2 ‖ = O ( n − 1 / 2 ) . By the mean value theorem and regularity condition A3, we can get

‖ ∂ L n ( Ψ 1 , ξ ) ∂ Ψ 2 − ∂ L n ( Ψ 01 , 0 ) ∂ Ψ 2 ‖ ≤ ‖ ∂ L n ( Ψ 1 , ξ ) ∂ Ψ 2 − ∂ L n ( Ψ 1 , 0 ) ∂ Ψ 2 ‖ + ‖ ∂ L n ( Ψ 1 , 0 ) ∂ Ψ 2 − ∂ L n ( Ψ 01 , 0 ) ∂ Ψ 2 ‖ ≤ [ ∑ i = 1 n M 1 ( z i ) ] ‖ ξ ‖ + [ ∑ i = 1 n M 1 ( z i ) ] ‖ Ψ 1 − Ψ 01 ‖ = { ‖ ξ ‖ + ‖ Ψ 1 − Ψ 01 ‖ } O p ( n ) = O p ( n 1 / 2 ) . (38)

Here Ψ 01 is a subvector of Ψ 0 with all zero regression coefficients removed. Regularity conditions imply that ∂ L n ( Ψ 01 , 0 ) / ∂ Ψ 2 = O p ( n 1 / 2 ) , therefore ∂ L n ( Ψ 1 , ξ ) / ∂ Ψ 2 = O p ( n 1 / 2 ) . In this case, we have

L n ( Ψ 1 , Ψ 2 ) − L n ( Ψ 1 , 0 ) = O p ( n 1 / 2 ) ∑ j = 1 g ∑ t = d j + 1 p | β j t | (39)

for large n. And for the penalized function P n ( Ψ ) ,

P n ( Ψ 1 , Ψ 2 ) − P n ( Ψ 1 , 0 ) ≥ ∑ j = 1 g π j ∑ t = d j + 1 p √ n b n ⋅ √ n | β j t | + ∑ j = 1 g π j ∑ k = K j + 1 K √ n b n ∗ ⋅ √ n ‖ β j [ k ] ‖ . (40)

Since √ n b n → p ∞ and √ n b n ∗ → p ∞ as n → ∞ , we have

[ L n ( Ψ 1 , Ψ 2 ) − L n ( Ψ 1 , 0 ) ] − [ P n ( Ψ 1 , Ψ 2 ) − P n ( Ψ 1 , 0 ) ] < 0 (41)

with probability tending to one as n → ∞ . This completes the proof of the sparsity part.

For the asymptotic normality part of Theorem 2, we use the same argument as in Theorem 1 and consider F n ( Ψ 1 , 0 ) as a function of Ψ 1 ; there is a

By Taylor's expansion,

Substituting into (42), we have

In addition, regularity conditions imply that

Finally, we can get

by Slutsky's theorem. This completes the proof of the asymptotic normality.

Now, we know that, as long as the conditions

In this section, we address two problems: one concerns the number of components g, and the other is the selection of the tuning parameters

where

Suppose that there is a set of parameter combinations

To quantify the performance of our proposed robust mixture regression model based on adaptive sparse group Lasso (adaSGL-RMR), we design a numerical simulation and generate sample data

where Z is a component indicator. There are

The model parameters include

We compare three methods: the Gaussian mixture model (GMM) with the Lasso penalty (Lasso-GMM), the GMM with the adaptive Lasso penalty (adaL-GMM), and the adaSGL-RMR model. The fmr package of the R programming language is used to compute the parameter estimates of Lasso-GMM and adaL-GMM.

In this article, the algorithm is terminated when the change in the penalized complete log-likelihood function is less than 10^{−5} or the number of iterations reaches the maximum of 10^{5}. Furthermore, we adopt a threshold value for

and

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives for each fitted model.
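The displayed formulas for these rates were lost in extraction; the standard definitions, which we assume the paper uses, are FNR = FN/(FN + TP) and FPR = FP/(FP + TN):

```python
def selection_rates(tp, tn, fp, fn):
    """Standard false negative rate and false positive rate from confusion
    counts (assumed to match the paper's lost display formulas)."""
    fnr = fn / (fn + tp) if fn + tp else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return fnr, fpr
```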

As shown in

| | Model | nvars | ngroups | cgroups | FNR | FPR | RMSE |
|---|---|---|---|---|---|---|---|
| | True model | 12 | 4 | 100 | 0 | 0 | 0 |
| N (0, 1) | Lasso-GMM | 49.71 | 11.97 | 2.08 | 0.00 | 0.79 | 0.09 |
| | adaL-GMM | 20.38 | 8.66 | 51.00 | 0.01 | 0.17 | 0.06 |
| | adaSGL-RMR | 13.66 | 4.00 | 88.00 | 0.00 | 0.03 | 0.06 |
| Laplace (1) | Lasso-GMM | 46.19 | 11.97 | 3.00 | 0.00 | 0.71 | 0.12 |
| | adaL-GMM | 26.15 | 9.95 | 34.83 | 0.01 | 0.29 | 0.10 |
| | adaSGL-RMR | 13.47 | 4.05 | 89.25 | 0.00 | 0.03 | 0.09 |
| t_{3} | Lasso-GMM | 40.84 | 11.81 | 6.83 | 0.00 | 0.63 | 0.15 |
| | adaL-GMM | 29.34 | 10.65 | 24.67 | 0.01 | 0.36 | 0.12 |
| | adaSGL-RMR | 13.67 | 4.09 | 88.42 | 0.00 | 0.04 | 0.11 |
| | Lasso-GMM | 50.26 | 11.99 | 1.83 | 0.00 | 0.79 | 0.09 |
| | adaL-GMM | 21.02 | 8.93 | 46.75 | 0.01 | 0.19 | 0.07 |
| | adaSGL-RMR | 13.69 | 4.00 | 88.00 | 0.00 | 0.04 | 0.06 |

In this section, we analyze how NBA players' regular-season performances affect their salaries. We gather salaries for all NBA players from the website https://hoopshype.com/salaries/players/2018-2019 for the period from 2018 to 2019. Performance measures for individual players in the 2018-2019 regular season are gathered from the website https://www.foxsports.com/nba/stats?season=2018 and include scoring, rebounding, assist and defense statistics. After eliminating missing data, we obtain a complete dataset containing salaries for 248 NBA players and 27 measures of performance.

These performance measures are divided into five groups and covariates in the same group are correlated. Score: points per game (PPG), points per 48 minutes (PTS/48), field goals made per game (FGM/G), field goal attempts per game (FGA/G), 3 point FG made per game (3FGM/G), 3 point FG attempts per game (3FGA/G), free throws made per game (FTM/G), free throw attempts per game (FTA/G), high point total in a single game (HIGH), points per shot (PPS). Rebound: offensive rebounds per game (ORPG), defensive rebounds per game (DRPG), rebounds per 48 minutes (RPG/48), offensive rebound% (OFF REB%), defensive rebound% (DEF REB%), rebound% (REB%). Assist: assists per game (APG), assists per 48 minutes (AST/48), assist% (AST%), turnovers per game (TPG), turnover% (TO%). Steal: steals per game (SPG), steals per 48 minutes (STL/48), steal% (STL%). Block: blocks per game (BPG), blocks per 48 minutes (BLK/48), block% (BLK%).
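For the grouped penalty, these five groups must be encoded as index sets over the 27 covariates. A possible encoding (the dictionary name is ours; the keys and abbreviations follow the text above):

```python
# Grouping of the 27 NBA performance measures into K = 5 groups.
GROUPS = {
    "Score": ["PPG", "PTS/48", "FGM/G", "FGA/G", "3FGM/G", "3FGA/G",
              "FTM/G", "FTA/G", "HIGH", "PPS"],
    "Rebound": ["ORPG", "DRPG", "RPG/48", "OFF REB%", "DEF REB%", "REB%"],
    "Assist": ["APG", "AST/48", "AST%", "TPG", "TO%"],
    "Steal": ["SPG", "STL/48", "STL%"],
    "Block": ["BPG", "BLK/48", "BLK%"],
}
```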

The matrix R^{2} of 0.42; these poor results motivate us to conduct further research on this problem. The histogram of logged salaries shows multi-modality from

For comparison, we run multiple analyses that include three sets of starting parameters for each of R^{2} of 0.90. The predicted results from the R^{2} of 0.95. The predicted results from the R^{2} of 0.96. See

| Covariates | g = 2, comp.1 | g = 2, comp.2 | g = 3, comp.1 | g = 3, comp.2 | g = 3, comp.3 | g = 4, comp.1 | g = 4, comp.2 | g = 4, comp.3 | g = 4, comp.4 |
|---|---|---|---|---|---|---|---|---|---|
| π | 0.56 | 0.44 | 0.17 | 0.41 | 0.42 | 0.19 | 0.36 | 0.41 | 0.04 |
| Intercept | 14.93 | 16.39 | 14.39 | 15.18 | 16.43 | 14.39 | 15.21 | 16.43 | 17.12 |
| PPG | 0.44 | 0.19 | 0.50 | 0.23 | 0.13 | | | | |
| PTS/48 | | | | | | | | | |
| FGM/G | 0.23 | 0.18 | | | | | | | |
| FGA/G | 0.11 | 0.49 | | | | | | | |
| 3FGM/G | −0.03 | | | | | | | | |
| 3FGA/G | | | | | | | | | |
| FTM/G | 0.09 | | | | | | | | |
| FTA/G | 0.09 | | | | | | | | |
| HIGH | | | | | | | | | |
| PPS | | | | | | | | | |
| ORPG | 0.06 | | | | | | | | |
| DRPG | | | | | | | | | |
| RPG/48 | | | | | | | | | |
| OFF REB% | −0.20 | −0.07 | | | | | | | |
| DEF REB% | 0.07 | | | | | | | | |
| REB% | 0.31 | 0.06 | 0.23 | | | | | | |
| APG | 0.11 | | | | | | | | |
| AST/48 | 0.08 | 0.06 | | | | | | | |
| AST% | | | | | | | | | |
| TPG | | | | | | | | | |
| TO% | | | | | | | | | |
| SPG | | | | | | | | | |
| STL/48 | | | | | | | | | |
| STL% | | | | | | | | | |
| BPG | | | | | | | | | |
| BLK/48 | | | | | | | | | |
| BLK% | 0.06 | | | | | | | | |

In this paper, we propose a robust mixture regression model based on a Laplace distribution and adopt the adaptive sparse group Lasso for variable selection. Its oracle properties are proved in Section 3. In addition, the numerical simulation and the real data application show that our method performs better in parameter estimation and variable selection than the alternative methods. A limitation of this study is that we only consider the mixture regression model with K non-overlapping groups and ignore the case where different groups overlap. We will pay more attention to this problem in future work.

The authors declare no conflicts of interest regarding the publication of this paper.

Wang, J.T. and Ye, W.Z. (2020) Adaptive Sparse Group Variable Selection for a Robust Mixture Regression Model Based on Laplace Distribution. Advances in Pure Mathematics, 10, 39-55. https://doi.org/10.4236/apm.2020.101004