Variable Selection for Robust Mixture Regression Model with Skew Scale Mixtures of Normal Distributions

Abstract

In this paper, we propose a robust mixture regression model based on skew scale mixtures of normal distributions (RMR-SSMN), which better accommodates asymmetric, heavy-tailed, and contaminated data. For the variable selection problem, we adopt a penalized likelihood approach with a new combined penalty function that balances the SCAD and $l_2$ penalties. An adjusted EM algorithm is presented to obtain parameter estimates of RMR-SSMN models at a faster convergence rate. As the simulations show, our mixture models are more robust than general FMR models, and the new combined penalty function outperforms SCAD for variable selection. Finally, the proposed methodology and algorithm are applied to a real data set and achieve reasonable results.


1. Introduction

In applied statistics, the arc-sine laws for the Wiener process and the skew Brownian motion [1] are widely used in finance when the market is homogeneous. However, the problem of modeling heterogeneous data has been studied extensively in recent years, and the Finite Mixture of Regression (FMR) model is an important tool for the heterogeneous case. A large number of applications associate a random response variable $Y$ with covariates $x$ through FMR models, under the assumption that the regression coefficients for the observations $(x_1, Y_1), \ldots, (x_n, Y_n)$ differ across latent subgroups. More details about the FMR model can be found in [2].

The Gaussian FMR model is the most common FMR model; it assumes that the random error of each subgroup follows a normal distribution. It is well known that the normal distribution is unsuitable for modeling data with asymmetric and heavy-tailed behaviors, and its parameter estimates are sensitive to outliers. To overcome these potential shortcomings of Gaussian mixture models, McLachlan and Peel [3] proposed replacing mixtures of normals with mixtures of t-distributions, which results in a more robust mixture model. Basso et al. [4] studied a finite mixture model based on scale mixtures of skew-normal distributions, and Franczak et al. [5] proposed a mixture model using shifted asymmetric Laplace distributions, which parameterize the skewness as well as the location and the scale.

The problem of variable selection in FMR models has been widely discussed recently. There are generally two types of variable selection methods. The first comprises optimal subset selection and discontinuous penalty methods based on information criteria, including stepwise regression, best subset regression, and the BIC and AIC criteria. The second is the continuous penalty method: by imposing penalties on the parameters of the objective function, one can select significant variables and obtain parameter estimates simultaneously. The Least Absolute Shrinkage and Selection Operator (LASSO), elastic net regularization [6], the MCP penalty [7], and the SCAD penalty [8] are penalty functions of this kind. We use the SCAD penalty and a new penalty function proposed in this paper, which balances the SCAD and $l_2$ penalties, to perform variable selection in a robust mixture regression model based on Skew Scale Mixtures of Normal (SSMN) distributions [9]; this robust model accommodates asymmetric and heavy-tailed data better.

The paper is organized as follows. In Section 2, a robust mixture regression model using skew scale mixtures of normal distributions (RMR-SSMN) is introduced. Variable selection methods with the SCAD penalty function and a newly proposed penalty function are presented in Section 3. Section 4 outlines the adjusted EM algorithm for estimation and a BIC method for selecting tuning parameters and the number of components. In Section 5, we carry out simulation studies to compare the performance of FMR models and RMR-SSMN models and to show the effect of variable selection with the penalty functions. An application of the method to a real data set is discussed in Section 6, and conclusions are given in Section 7.

2. Robust Mixture Regression Model with SSMN Distributions

It is known that the FMR model can accommodate heterogeneous data and that the Skew Scale Mixtures of Normal (SSMN) distributions [9] cover both asymmetric and heavy-tailed distributions. We therefore propose a robust mixture regression model whose component regression errors follow SSMN distributions. Unsurprisingly, this model is more robust than general FMR models in heterogeneous cases.

2.1. Skew Scale Mixtures of Normal Distributions

If a random variable $Y$ follows a skew-normal (SN) distribution with location parameter $\mu$, scale parameter $\sigma^2$, and skewness parameter $\lambda$, denoted $Y \sim SN(\mu, \sigma^2, \lambda)$, then its density function is given as follows:

$$f(y) = 2\,\phi(y; \mu, \sigma^2)\,\Phi\left(\lambda \frac{y - \mu}{\sigma}\right), \qquad (1)$$

where $\phi(y; \mu, \sigma^2)$ is the probability density function of $N(\mu, \sigma^2)$ evaluated at $y$, and $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution.

Note that when $\lambda = 0$, $SN(\mu, \sigma^2, \lambda)$ reduces to $N(\mu, \sigma^2)$. As given in [9], the SN distribution has the marginal stochastic representation:

$$Y = \mu + \sigma\left(\delta |T_0| + (1 - \delta^2)^{1/2} T_1\right), \qquad (2)$$

where $\delta = \lambda / (1 + \lambda^2)^{1/2}$, and $T_0 \sim N(0, 1)$ and $T_1 \sim N(0, 1)$ are independent.
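For illustration, below is a minimal NumPy/SciPy sketch (ours, not the authors' code; the function names dsn and rsn are hypothetical) that evaluates the density (1) and draws samples via the representation (2):

```python
import numpy as np
from scipy.stats import norm

def dsn(y, mu=0.0, sigma2=1.0, lam=0.0):
    """SN density, Equation (1): 2*phi(y; mu, sigma^2)*Phi(lam*(y-mu)/sigma)."""
    sigma = np.sqrt(sigma2)
    return 2.0 * norm.pdf(y, loc=mu, scale=sigma) * norm.cdf(lam * (y - mu) / sigma)

def rsn(n, mu=0.0, sigma2=1.0, lam=0.0, rng=None):
    """Sample SN(mu, sigma^2, lam) via the representation (2):
    Y = mu + sigma * (delta*|T0| + sqrt(1 - delta^2)*T1)."""
    rng = np.random.default_rng(rng)
    sigma = np.sqrt(sigma2)
    delta = lam / np.sqrt(1.0 + lam**2)
    t0 = np.abs(rng.standard_normal(n))   # |T0|, half-normal part
    t1 = rng.standard_normal(n)           # T1, independent of T0
    return mu + sigma * (delta * t0 + np.sqrt(1.0 - delta**2) * t1)
```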

Furthermore, if a random variable $Y$ follows an SSMN distribution [9] with location parameter $\mu$, scale parameter $\sigma^2$, and skewness parameter $\lambda$, then its probability density function is given by:

$$f(y) = 2 \int_0^{\infty} \phi\left(y; \mu, l(u)\sigma^2\right) dH(u; \tau)\, \Phi\left(\lambda \frac{y - \mu}{\sigma}\right), \qquad (3)$$

where $H(u; \tau)$ is the cumulative distribution function of a positive random variable $U$ indexed by the parameter vector $\tau$, and $l(u)$ is a strictly positive function. When the probability density function of $Y$ has the form of Equation (3), we write $Y \sim SSMN(\mu, \sigma^2, \lambda, H; l)$.

For $Y \sim SSMN(\mu, \sigma^2, \lambda, H; l)$, the hierarchical representation is as follows:

$$Y \mid U = u \sim SN\left(\mu,\ l(u)\sigma^2,\ \lambda\, l^{1/2}(u)\right), \qquad U \sim H(\tau). \qquad (4)$$

This paper will consider the following distributions in the SSMN distributions family:

· The skew Student-t-normal (STN) distribution [10], with $U \sim Gamma(v/2, v/2)$, $v > 0$, and $l(u) = 1/u$, which has probability density:

$$f(y) = \frac{2}{\sigma \sqrt{v\pi}} \frac{\Gamma((v+1)/2)}{\Gamma(v/2)} \left(1 + \frac{d}{v}\right)^{-\frac{v+1}{2}} \Phi\left(\lambda \frac{y - \mu}{\sigma}\right), \qquad (5)$$

where $d = (y - \mu)^2 / \sigma^2$ and $\Gamma(\cdot)$ is the gamma function. It can be shown that $U \mid Y = y \sim Gamma\left((v+1)/2, (v+d)/2\right)$.

· The skew contaminated-normal (SCN) distribution, where $U$ is a discrete random variable taking one of two values and $l(u) = 1/u$. Given the parameter vector $\tau = (v, \gamma)^T$, $0 < v < 1$, $0 < \gamma < 1$, the density function of $U$ is $h(u; \tau) = v\, I(u = \gamma) + (1 - v)\, I(u = 1)$. The density of $Y$ then follows as:

$$f(y) = 2\left\{ v\,\phi(y; \mu, \sigma^2/\gamma) + (1 - v)\,\phi(y; \mu, \sigma^2) \right\} \Phi\left(\lambda \frac{y - \mu}{\sigma}\right). \qquad (6)$$

Therefore, the conditional distribution U | Y = y can be obtained as:

$$f(u \mid Y = y) = \frac{1}{f_0(y)} \left\{ v\,\phi(y; \mu, \sigma^2/\gamma)\, I(u = \gamma) + (1 - v)\,\phi(y; \mu, \sigma^2)\, I(u = 1) \right\}, \qquad (7)$$

where $f_0(y) = v\,\phi(y; \mu, \sigma^2/\gamma) + (1 - v)\,\phi(y; \mu, \sigma^2)$ is the normalizing marginal density.

· The skew power-exponential (SPE) distribution, which has the following probability density:

$$f(y) = \frac{2v}{2^{1/(2v)}\, \sigma\, \Gamma(1/(2v))}\, e^{-d^v/2}\, \Phi\left(\lambda \frac{y - \mu}{\sigma}\right), \qquad 0.5 < v \le 1. \qquad (8)$$

Ferreira et al. [9] proved that $E[l^{-1}(U) \mid Y = y] = v\, d^{\,v-1}$.
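As a companion sketch (again illustrative, not the authors' implementation; dstn, dscn, and dspe are hypothetical names), the three component densities (5), (6), and (8) can be evaluated with SciPy's standard t and normal distributions:

```python
import numpy as np
from scipy.stats import norm, t as student_t
from scipy.special import gamma as gamma_fn

def dstn(y, mu, sigma2, lam, v):
    """Skew Student-t-normal density, Equation (5): the symmetric part is
    the t(v) density of (y-mu)/sigma, scaled by 1/sigma."""
    sigma = np.sqrt(sigma2)
    z = (y - mu) / sigma
    return 2.0 * student_t.pdf(z, df=v) / sigma * norm.cdf(lam * z)

def dscn(y, mu, sigma2, lam, v, gam):
    """Skew contaminated-normal density, Equation (6)."""
    sigma = np.sqrt(sigma2)
    sym = v * norm.pdf(y, mu, np.sqrt(sigma2 / gam)) + (1 - v) * norm.pdf(y, mu, sigma)
    return 2.0 * sym * norm.cdf(lam * (y - mu) / sigma)

def dspe(y, mu, sigma2, lam, v):
    """Skew power-exponential density, Equation (8), for 0.5 < v <= 1."""
    sigma = np.sqrt(sigma2)
    d = (y - mu) ** 2 / sigma2
    c = v / (2.0 ** (1.0 / (2.0 * v)) * sigma * gamma_fn(1.0 / (2.0 * v)))
    return 2.0 * c * np.exp(-0.5 * d ** v) * norm.cdf(lam * (y - mu) / sigma)
```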

2.2. Robust Mixture Regression Model with SSMN Distributions

Suppose we have $n$ independent random variables $y_1, y_2, \ldots, y_n$ drawn from a mixture of SSMN distributions. The conditional density function of the robust mixture regression model with SSMN distributions (RMR-SSMN) with $K$ components is given by:

$$f(y_i; x_i, \Psi) = \sum_{k=1}^{K} \omega_k\, SSMN\left(y_i;\ \alpha_k + x_i^T \beta_k,\ \sigma_k^2,\ \lambda_k,\ \tau_k;\ l\right), \qquad (9)$$

with covariate vector $x_i \in \mathbb{R}^q$ and $q$-dimensional unknown regression coefficient vectors $\beta_k$, $k = 1, \ldots, K$. The $\omega_k$, $k = 1, \ldots, K$, denote the mixing proportions, satisfying $\omega_k \ge 0$ and $\sum_{k=1}^{K} \omega_k = 1$.

$\Psi = (\omega_1, \ldots, \omega_{K-1}, \alpha_1, \ldots, \alpha_K, \beta_1^T, \ldots, \beta_K^T, \sigma_1^2, \ldots, \sigma_K^2, \lambda_1, \ldots, \lambda_K, \tau_1^T, \ldots, \tau_K^T)^T$ is the parameter vector of the model. For convenience, let $\omega = (\omega_1, \ldots, \omega_{K-1})^T$, $\alpha = (\alpha_1, \ldots, \alpha_K)^T$, $\beta = (\beta_1^T, \ldots, \beta_K^T)^T$, $\sigma^2 = (\sigma_1^2, \ldots, \sigma_K^2)^T$, $\lambda = (\lambda_1, \ldots, \lambda_K)^T$, and $\tau = (\tau_1^T, \ldots, \tau_K^T)^T$. In this paper, RMR-SSMN models include the robust mixture regression models with the STN distribution (RMR-STN), SCN distribution (RMR-SCN), SPE distribution (RMR-SPE), and SN distribution (RMR-SN).
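A sketch of the mixture density (9) follows, parameterized by a generic component density such as dstn above (illustrative code; the component_pdf callable and the extra list of per-component $\tau_k$ parameters are our own conventions):

```python
import numpy as np

def rmr_ssmn_density(y, X, omega, alpha, beta, sigma2, lam, extra, component_pdf):
    """Mixture density, Equation (9): for each i,
    f(y_i) = sum_k omega_k * SSMN(y_i; alpha_k + x_i'beta_k, sigma_k^2, lam_k, tau_k)."""
    dens = np.zeros_like(y, dtype=float)
    for k in range(len(omega)):
        mu_k = alpha[k] + X @ beta[k]   # component-k regression mean for all i
        dens += omega[k] * component_pdf(y, mu_k, sigma2[k], lam[k], *extra[k])
    return dens
```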

3. Variable Selection Method

If a component of the $q$-dimensional explanatory variable $x$ has no significant effect on the response variable $y$, the regression coefficient of that component estimated by maximum likelihood will be close to zero rather than exactly zero. Such covariates are therefore not excluded from the model, which makes the model unstable. To avoid this problem, we use a penalized likelihood approach [11] to select variables and estimate parameters simultaneously. Let $\{(y_i, x_i); i = 1, \ldots, n\}$ be sample observations from an RMR-SSMN model. The log-likelihood function of $\Psi$ is given by:

$$l(\Psi) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \omega_k\, SSMN\left(y_i;\ \alpha_k + x_i^T \beta_k,\ \sigma_k^2,\ \lambda_k,\ \tau_k;\ l\right). \qquad (10)$$

Following the idea in [11], we obtain the estimates of $\Psi$ by maximizing the penalized log-likelihood function, defined as:

$$L(\Psi) = l(\Psi) - p(\Psi), \qquad (11)$$

where the penalty function is given by:

$$p(\Psi) = n \sum_{k=1}^{K} \omega_k \sum_{j=1}^{q} p_{a_k}(|\beta_{kj}|), \qquad (12)$$

where $p_{a_k}(\cdot)$ is a nonnegative, non-decreasing function of $|\beta_{kj}|$ with tuning parameter $a_k$, $k = 1, \ldots, K$. The tuning parameter controls the intensity of the penalty on the regression coefficients.

The SCAD penalty has an oracle property, as discussed in [8]. In this work, we carry out the variable selection procedure using the following SCAD penalty function:

$$p_{a_k}(|\beta_{kj}|) = \begin{cases} a_k |\beta_{kj}|, & |\beta_{kj}| \le a_k, \\[4pt] -\dfrac{\beta_{kj}^2 - 2 c a_k |\beta_{kj}| + a_k^2}{2(c-1)}, & a_k < |\beta_{kj}| \le c a_k, \\[4pt] \dfrac{a_k^2 (c+1)}{2}, & |\beta_{kj}| > c a_k. \end{cases} \qquad (13)$$

Meanwhile, inspired by [12], we propose a combined penalty function that balances the SCAD and $l_2$ penalties. By introducing a connection parameter $b$, this penalty is more effective for variable selection than directly mixing SCAD and $l_2$; its specific form is given by:

$$p_{a_k}(|\beta_{kj}|) = \begin{cases} a_k \left[ b |\beta_{kj}| + (1-b) \beta_{kj}^2 \right], & |\beta_{kj}| \le a_k, \\[4pt] -\dfrac{b\left(\beta_{kj}^2 - 2 c a_k |\beta_{kj}| + a_k^2\right)}{2(c-1)} + a_k (1-b) \beta_{kj}^2, & a_k < |\beta_{kj}| \le c a_k, \\[4pt] \dfrac{a_k^2\, b\, (c+1)}{2} + a_k (1-b) \beta_{kj}^2, & |\beta_{kj}| > c a_k. \end{cases} \qquad (14)$$

We call this new penalty function MIXL2-SCAD. Some asymptotic properties of the penalty function are shown in [12]; the constants satisfy $a_k > 0$ and $c > 2$, and following [8] we set $c = 3.7$. The constant $b$, $0 \le b \le 1$, and $a_k$ jointly control the speed of shrinkage of $\beta_{kj}$; when $b = 1$, the MIXL2-SCAD penalty reduces to the SCAD penalty.
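Both penalties are simple to evaluate elementwise. The following sketch (illustrative, not the authors' code) implements Equations (13) and (14); note that in every branch MIXL2-SCAD decomposes as $b$ times the SCAD term plus an $a_k(1-b)\beta_{kj}^2$ ridge term:

```python
import numpy as np

def scad(beta, a, c=3.7):
    """SCAD penalty p_a(|beta|), Equation (13), evaluated elementwise."""
    ab = np.abs(beta)
    return np.where(
        ab <= a,
        a * ab,
        np.where(
            ab <= c * a,
            -(ab**2 - 2 * c * a * ab + a**2) / (2 * (c - 1)),
            a**2 * (c + 1) / 2,
        ),
    )

def mixl2_scad(beta, a, b, c=3.7):
    """MIXL2-SCAD penalty, Equation (14): b*SCAD + a*(1-b)*beta^2.
    b = 1 recovers SCAD; b = 0 gives a pure (scaled) l2 penalty."""
    return b * scad(beta, a, c) + a * (1.0 - b) * beta**2
```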

4. Numeric Solutions

The expectation-maximization (EM) algorithm can be applied to mixture regression models based on SSMN distributions to maximize the penalized log-likelihood function. When the M-step is analytically intractable for SSMN distributions, it can be replaced with a sequence of conditional maximization (CM) steps, as in the ECM algorithm [13]. Furthermore, for simplicity we also maximize the constrained actual marginal log-likelihood function in CML steps [14].

4.1. Maximization of the Penalized Log-Likelihood Function

Let us introduce the latent vector $Z_i = (z_{i1}, \ldots, z_{iK})^T$ with component indicator variables $z_{ik}$ of the following form:

$$z_{ik} = \begin{cases} 1, & \text{the } i\text{th sample comes from the latent } k\text{th component}, \\ 0, & \text{otherwise}. \end{cases} \qquad (15)$$

Using Equations (2) and (4), we obtain the following hierarchical representation for the mixture of SSMN distributions:

$$\begin{aligned} Y_i \mid (T_i = t_i, U_i = u_i, z_{ik} = 1) &\sim N\left( \alpha_k + x_i^T \beta_k + \frac{l(u_i)\, \sigma_k \lambda_k}{\left(1 + \lambda_k^2 l(u_i)\right)^{1/2}}\, t_i,\ \frac{\sigma_k^2\, l(u_i)}{1 + \lambda_k^2 l(u_i)} \right), \\ U_i \mid z_{ik} = 1 &\sim H(\tau_k), \\ T_i \mid z_{ik} = 1 &\sim TN(0, 1; (0, \infty)), \\ Z_i &\sim M(1; \omega_1, \ldots, \omega_K). \end{aligned} \qquad (16)$$

Here $TN(0, 1; (0, \infty))$ denotes the standard normal distribution truncated to $(0, \infty)$, and $M(1; \omega_1, \ldots, \omega_K)$ is the multinomial distribution. Let $t = (t_1, \ldots, t_n)^T$, $u = (u_1, \ldots, u_n)^T$, $Y = (y_1, \ldots, y_n)^T$, and $Z = (Z_1, \ldots, Z_n)^T$, where $t$ and $u$ are also regarded as latent vectors. Then the complete log-likelihood function with complete data $Y_c = (Y^T, u^T, t^T, Z^T)^T$ is given by:

$$l_c(\Psi) = l_c(\omega) + l_c(\alpha, \beta, \sigma^2, \lambda, \tau), \qquad (17)$$

with:

$$l_c(\omega) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \log \omega_k, \qquad (18)$$

$$\begin{aligned} l_c(\alpha, \beta, \sigma^2, \lambda, \tau) = {} & \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \left[ C - \log \sigma_k^2 - \frac{t_i^2}{2\sigma_k^2} + \frac{t_i \lambda_k}{\sigma_k^2} \left( y_i - \alpha_k - x_i^T \beta_k \right) \right] \\ & - \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{z_{ik}}{2\sigma_k^2} \left\{ \left[ l^{-1}(u_i) + \lambda_k^2 \right] \left( y_i - \alpha_k - x_i^T \beta_k \right)^2 \right\} + \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \log h(u_i; \tau_k). \end{aligned} \qquad (19)$$

Here $C$ is a constant that does not depend on any unknown parameter, and $h(u_i; \tau_k)$ is the density function of the latent variable $u_i$.
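To make the hierarchy (16) concrete, the following sketch draws observations for the RMR-STN case, where $l(u) = 1/u$ and $U \sim Gamma(v/2, v/2)$ (our hypothetical sample_rmr_stn; all inputs are arbitrary per-component parameter arrays):

```python
import numpy as np

def sample_rmr_stn(omega, alpha, beta, sigma2, lam, v, X, rng=None):
    """Draw y_i from the hierarchical representation (16) for RMR-STN.
    omega, alpha, sigma2, lam, v: length-K arrays; beta: (K, q); X: (n, q)."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    z = rng.choice(len(omega), size=n, p=omega)    # component labels Z_i
    u = rng.gamma(v[z] / 2.0, 2.0 / v[z])          # U_i ~ Gamma(v/2, rate v/2)
    t = np.abs(rng.standard_normal(n))             # T_i ~ TN(0, 1; (0, inf))
    lu = 1.0 / u                                   # l(u_i) = 1/u_i
    sig = np.sqrt(sigma2[z])
    mean = (alpha[z] + np.einsum('ij,ij->i', X, beta[z])
            + lu * sig * lam[z] / np.sqrt(1.0 + lam[z]**2 * lu) * t)
    var = sigma2[z] * lu / (1.0 + lam[z]**2 * lu)
    return mean + np.sqrt(var) * rng.standard_normal(n), z
```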

Replacing l ( Ψ ) with l c ( Ψ ) in the penalized log-likelihood function, the complete penalized log-likelihood function is given by:

$$L_c(\Psi) = l_c(\Psi) - p(\Psi). \qquad (20)$$

Following the method of Fan and Li [8], given an initial parameter value $\Psi^{(0)}$, $p(\Psi)$ can be replaced by the following local quadratic approximation:

$$p(\Psi) \approx n \sum_{k=1}^{K} \omega_k \sum_{j=1}^{q} \left[ p_{a_k}(|\beta_{kj}^{(0)}|) + \frac{p'_{a_k}(|\beta_{kj}^{(0)}|)}{2 |\beta_{kj}^{(0)}|} \left( \beta_{kj}^2 - \beta_{kj}^{(0)2} \right) \right]. \qquad (21)$$

This approximation is applied in the CM-step of the algorithm at each iteration. The adjusted EM algorithm proceeds with the following three steps: the E-step calculates the conditional expectation of the complete penalized log-likelihood function, while the CM-step and CML-step obtain closed-form parameter updates.
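Before detailing the steps, note that the approximation (21) only requires the penalty derivative $p'_{a_k}(\cdot)$. A sketch follows (illustrative; scad_deriv, mixl2_scad_deriv, and lqa_weights are hypothetical helpers, and the eps floor guarding near-zero coefficients is our own assumption):

```python
import numpy as np

def scad_deriv(beta, a, c=3.7):
    """Derivative p'_a(|beta|) of the SCAD penalty with respect to |beta|."""
    ab = np.abs(beta)
    return np.where(ab <= a, a, np.maximum(c * a - ab, 0.0) / (c - 1))

def mixl2_scad_deriv(beta, a, b, c=3.7):
    """Derivative of MIXL2-SCAD w.r.t. |beta|: b*SCAD' + 2a(1-b)|beta|."""
    return b * scad_deriv(beta, a, c) + 2.0 * a * (1.0 - b) * np.abs(beta)

def lqa_weights(beta_prev, a, b, c=3.7, eps=1e-8):
    """Diagonal of Delta_a in Equation (34): p'_a(|beta^(0)|) / |beta^(0)|,
    from the local quadratic approximation (21)."""
    ab = np.maximum(np.abs(beta_prev), eps)   # guard against division by zero
    return mixl2_scad_deriv(beta_prev, a, b, c) / ab
```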

· The E-step. Given the current estimates $\hat{\Psi}^{(m)}$, calculate the Q-function $Q(\Psi \mid \hat{\Psi}^{(m)}) = E[L_c(\Psi) \mid Y, \hat{\Psi}^{(m)}]$, obtained as:

$$Q(\Psi \mid \hat{\Psi}^{(m)}) = Q_1(\omega \mid \hat{\Psi}^{(m)}) + Q_2(\alpha, \beta, \sigma^2, \lambda \mid \hat{\Psi}^{(m)}) + Q_3(\tau \mid \hat{\Psi}^{(m)}) - p(\Psi \mid \hat{\Psi}^{(m)}), \qquad (22)$$

with:

$$Q_1(\omega \mid \hat{\Psi}^{(m)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \hat{z}_{ik}^{(m)} \log \omega_k, \qquad (23)$$

$$\begin{aligned} Q_2(\alpha, \beta, \sigma^2, \lambda \mid \hat{\Psi}^{(m)}) = {} & \sum_{i=1}^{n} \sum_{k=1}^{K} \hat{z}_{ik}^{(m)} \left[ -\log \sigma_k^2 - \frac{\hat{t}_{ik}^{2(m)}}{2\sigma_k^2} + \frac{\hat{t}_{ik}^{(m)} \lambda_k}{\sigma_k^2} \left( y_i - \alpha_k - x_i^T \beta_k \right) \right] \\ & - \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{\hat{z}_{ik}^{(m)}}{2\sigma_k^2} \left[ \left( \hat{l}_{ik}^{-1(m)} + \lambda_k^2 \right) \left( y_i - \alpha_k - x_i^T \beta_k \right)^2 \right], \end{aligned} \qquad (24)$$

$$Q_3(\tau \mid \hat{\Psi}^{(m)}) = E\left[ \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \log h(u_i; \tau_k) \,\Big|\, Y, \hat{\Psi}^{(m)} \right]. \qquad (25)$$

The required conditional expectations are $\hat{z}_{ik}^{(m)}$, $\hat{t}_{ik}^{2(m)}$, $\hat{t}_{ik}^{(m)}$, and $\hat{l}_{ik}^{-1(m)}$.

First, the conditional expectation $\hat{z}_{ik}^{(m)} = E[z_{ik} \mid y_i, \hat{\Psi}^{(m)}]$ is given by:

$$\hat{z}_{ik}^{(m)} = \frac{\hat{\omega}_k^{(m)}\, SSMN\left(y_i;\ \hat{\alpha}_k^{(m)} + x_i^T \hat{\beta}_k^{(m)},\ \hat{\sigma}_k^{2(m)},\ \hat{\lambda}_k^{(m)},\ \hat{\tau}_k^{(m)};\ l\right)}{\sum_{k'=1}^{K} \hat{\omega}_{k'}^{(m)}\, SSMN\left(y_i;\ \hat{\alpha}_{k'}^{(m)} + x_i^T \hat{\beta}_{k'}^{(m)},\ \hat{\sigma}_{k'}^{2(m)},\ \hat{\lambda}_{k'}^{(m)},\ \hat{\tau}_{k'}^{(m)};\ l\right)}. \qquad (26)$$
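In code, Equation (26) is a row-normalization of the weighted component densities (an illustrative sketch, reusing the hypothetical component_pdf convention introduced above):

```python
import numpy as np

def e_step_responsibilities(y, X, omega, alpha, beta, sigma2, lam, extra, component_pdf):
    """Posterior component probabilities z_ik, Equation (26)."""
    n, K = len(y), len(omega)
    wdens = np.empty((n, K))
    for k in range(K):
        mu_k = alpha[k] + X @ beta[k]
        wdens[:, k] = omega[k] * component_pdf(y, mu_k, sigma2[k], lam[k], *extra[k])
    return wdens / wdens.sum(axis=1, keepdims=True)   # normalize over components
```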

Then, referring to [15], $\hat{t}_{ik}^{(m)} = E[t_i \mid y_i, \hat{\Psi}^{(m)}, z_{ik} = 1]$ and $\hat{t}_{ik}^{2(m)} = E[t_i^2 \mid y_i, \hat{\Psi}^{(m)}, z_{ik} = 1]$ can be evaluated as:

$$\hat{t}_{ik}^{(m)} = \hat{\lambda}_k^{(m)} \hat{e}_{ik}^{(m)} + \hat{\sigma}_k^{(m)} W_{\Phi}\left( \frac{\hat{\lambda}_k^{(m)} \hat{e}_{ik}^{(m)}}{\hat{\sigma}_k^{(m)}} \right), \qquad (27)$$

$$\hat{t}_{ik}^{2(m)} = \left( \hat{\lambda}_k^{(m)} \hat{e}_{ik}^{(m)} \right)^2 + \hat{\sigma}_k^{2(m)} + \hat{\lambda}_k^{(m)} \hat{\sigma}_k^{(m)} \hat{e}_{ik}^{(m)} W_{\Phi}\left( \frac{\hat{\lambda}_k^{(m)} \hat{e}_{ik}^{(m)}}{\hat{\sigma}_k^{(m)}} \right), \qquad (28)$$

with $W_{\Phi}(u) = \phi(u)/\Phi(u)$ and $\hat{e}_{ik}^{(m)} = y_i - \hat{\alpha}_k^{(m)} - x_i^T \hat{\beta}_k^{(m)}$.
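Equations (27) and (28) share the factor $W_{\Phi}$ and can be computed jointly (illustrative sketch; norm is scipy.stats.norm and t_moments is a hypothetical name):

```python
from scipy.stats import norm

def t_moments(e, sigma, lam):
    """Conditional moments in Equations (27)-(28), with
    W_Phi(u) = phi(u)/Phi(u) and e = y - alpha_k - x'beta_k."""
    u = lam * e / sigma
    w = norm.pdf(u) / norm.cdf(u)                          # W_Phi(u)
    t1 = lam * e + sigma * w                               # Equation (27)
    t2 = (lam * e) ** 2 + sigma**2 + lam * sigma * e * w   # Equation (28)
    return t1, t2
```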

Further, $\hat{l}_{ik}^{-1(m)}$ has a different expression for each distribution in the SSMN family:

$$\hat{l}_{ik}^{-1(m)} = \begin{cases} 1, & \text{for the RMR-SN model}, \\[6pt] \dfrac{\hat{v}_k^{(m)} + 1}{\hat{v}_k^{(m)} + d_{ik}}, & \text{for the RMR-STN model}, \\[6pt] \dfrac{1 - \hat{v}_k^{(m)} + \hat{v}_k^{(m)} \hat{\gamma}_k^{(m)3/2} \exp\left[(1 - \hat{\gamma}_k^{(m)}) d_{ik}/2\right]}{1 - \hat{v}_k^{(m)} + \hat{v}_k^{(m)} \hat{\gamma}_k^{(m)1/2} \exp\left[(1 - \hat{\gamma}_k^{(m)}) d_{ik}/2\right]}, & \text{for the RMR-SCN model}, \\[6pt] \hat{v}_k^{(m)} d_{ik}^{\hat{v}_k^{(m)} - 1}, & \text{for the RMR-SPE model}, \end{cases} \qquad (29)$$

with $d_{ik} = \left( y_i - \hat{\alpha}_k^{(m)} - x_i^T \hat{\beta}_k^{(m)} \right)^2 / \hat{\sigma}_k^{2(m)}$.

· The CM-step. Maximize $Q(\Psi \mid \hat{\Psi}^{(m)})$ with respect to $\Psi$ at the $(m+1)$th iteration. As in [11], the mixing proportions are updated by:

$$\hat{\omega}_k^{(m+1)} = \frac{1}{n} \sum_{i=1}^{n} \hat{z}_{ik}^{(m)}, \qquad (30)$$

which are approximate iterated values: maximizing $Q_1(\omega \mid \hat{\Psi}^{(m)})$ with respect to $\omega$ instead of maximizing $Q(\Psi \mid \hat{\Psi}^{(m)})$ simplifies the computation of $\hat{\omega}_k^{(m+1)}$, and this updating scheme works well in our simulations.

We now hold $\omega$ fixed and maximize $Q(\Psi \mid \hat{\Psi}^{(m)})$ with respect to the remaining parameters in $\Psi$. The updates of $(\alpha_k, \sigma_k^2, \lambda_k, \beta_k)^T$ are given by:

$$\hat{\alpha}_k^{(m+1)} = \frac{\sum_{i=1}^{n} \hat{z}_{ik}^{(m)} \left[ \left( \hat{l}_{ik}^{-1(m)} + \hat{\lambda}_k^{(m)2} \right) \left( y_i - x_i^T \hat{\beta}_k^{(m)} \right) - \hat{t}_{ik}^{(m)} \hat{\lambda}_k^{(m)} \right]}{\sum_{i=1}^{n} \hat{z}_{ik}^{(m)} \left( \hat{l}_{ik}^{-1(m)} + \hat{\lambda}_k^{(m)2} \right)}, \qquad (31)$$

$$\hat{\sigma}_k^{2(m+1)} = \frac{\sum_{i=1}^{n} \hat{z}_{ik}^{(m)} \left[ \hat{t}_{ik}^{2(m)} - 2 \hat{t}_{ik}^{(m)} \hat{\lambda}_k^{(m)} \hat{e}_{ik}^{(m)} + \left( \hat{l}_{ik}^{-1(m)} + \hat{\lambda}_k^{(m)2} \right) \hat{e}_{ik}^{(m)2} \right]}{2 \sum_{i=1}^{n} \hat{z}_{ik}^{(m)}}, \qquad (32)$$

$$\hat{\lambda}_k^{(m+1)} = \sum_{i=1}^{n} \hat{z}_{ik}^{(m)} \hat{t}_{ik}^{(m)} \hat{e}_{ik}^{(m)} \Big/ \sum_{i=1}^{n} \hat{z}_{ik}^{(m)} \hat{e}_{ik}^{(m)2}, \qquad (33)$$

$$\hat{\beta}_k^{(m+1)} = \left[ X^T A_k X + n \hat{\sigma}_k^{2(m)} \hat{\omega}_k^{(m)} \Delta_a(\hat{\beta}_k^{(m)}) \right]^{-1} X^T A_k B_k, \qquad (34)$$

with:

$$A_k = \left[ \mathrm{diag}\left( \hat{l}_{1k}^{-1(m)}, \ldots, \hat{l}_{nk}^{-1(m)} \right) + \hat{\lambda}_k^{(m)2} I_n \right] \mathrm{diag}\left( \hat{z}_{1k}^{(m)}, \ldots, \hat{z}_{nk}^{(m)} \right),$$

$$B_k = \left( \hat{b}_{1k}^{(m)}, \ldots, \hat{b}_{nk}^{(m)} \right)^T, \qquad \hat{b}_{ik}^{(m)} = y_i - \hat{\alpha}_k^{(m)} - \frac{\hat{t}_{ik}^{(m)} \hat{\lambda}_k^{(m)}}{\hat{l}_{ik}^{-1(m)} + \hat{\lambda}_k^{(m)2}},$$

$$\Delta_a(\hat{\beta}_k^{(m)}) = \mathrm{diag}\left( \frac{p'_{a_k}(|\hat{\beta}_{k1}^{(m)}|)}{|\hat{\beta}_{k1}^{(m)}|}, \frac{p'_{a_k}(|\hat{\beta}_{k2}^{(m)}|)}{|\hat{\beta}_{k2}^{(m)}|}, \ldots, \frac{p'_{a_k}(|\hat{\beta}_{kq}^{(m)}|)}{|\hat{\beta}_{kq}^{(m)}|} \right),$$

where $I_n$ is the identity matrix of order $n$ and $X = (x_1, \ldots, x_n)^T$ is the $n \times q$ design matrix.
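The update (34) is a ridge-type weighted least-squares solve. A sketch (illustrative; lqa_weights is the hypothetical helper defined earlier, and since $A_k$ is diagonal we store only its diagonal):

```python
import numpy as np

def update_beta_k(X, y, z_k, linv_k, lam_k, alpha_k, t_k, sigma2_k, omega_k,
                  beta_prev_k, a_k, b, c=3.7):
    """CM-step update of beta_k, Equation (34), under the LQA (21)."""
    n = len(y)
    w = z_k * (linv_k + lam_k**2)                            # diagonal of A_k
    bvec = y - alpha_k - t_k * lam_k / (linv_k + lam_k**2)   # entries of B_k
    delta = np.diag(lqa_weights(beta_prev_k, a_k, b, c))     # Delta_a(beta_k^(m))
    lhs = (X.T * w) @ X + n * sigma2_k * omega_k * delta     # X'A_kX + penalty
    return np.linalg.solve(lhs, (X.T * w) @ bvec)            # solve for beta_k
```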

· The CML-step. Fixing $\hat{\Psi}_p^{(m+1)} = (\hat{\alpha}_k^{(m+1)}, \hat{\beta}_k^{(m+1)}, \hat{\sigma}_k^{2(m+1)}, \hat{\lambda}_k^{(m+1)})^T$ and $\hat{\omega}_k^{(m+1)}$, update $\tau$ to $\hat{\tau}^{(m+1)} = (\hat{\tau}_1^{(m+1)}, \ldots, \hat{\tau}_K^{(m+1)})^T$ by optimizing the constrained log-likelihood function:

$$\hat{\tau}^{(m+1)} = \underset{\tau_1, \ldots, \tau_K}{\mathrm{argmax}} \sum_{i=1}^{n} \log \sum_{k=1}^{K} \hat{\omega}_k^{(m+1)}\, SSMN\left(y_i;\ \hat{\Psi}_p^{(m+1)}, \tau_k;\ l\right). \qquad (35)$$

The above steps are iterated until the maximum number of iterations is reached or a suitable stopping rule is met. In this work, the iterations terminate when $\lVert \hat{\Psi}^{(m+1)} - \hat{\Psi}^{(m)} \rVert$ is sufficiently small, e.g., below $10^{-5}$.

4.2. Selection of Tuning Parameters and Components

When using the methods proposed in this paper, we must also determine the number of components $K$ and the size of the tuning parameters in the penalty function. Cross-Validation (CV), Generalized Cross-Validation (GCV), AIC, and BIC are commonly used criteria for selecting tuning parameters.

As shown in [12], the final selected model tends to overfit when the tuning parameter is selected by GCV, so the BIC is used there instead. In this paper, we likewise propose a suitable BIC criterion for RMR-SSMN models to select the tuning parameters $a = (a_1, \ldots, a_K)^T$, the constant $b$, and the number of components $K$.

Let $\theta = (a, b, K)^T$. For each candidate $\theta$ over a suitable range, we run the proposed adjusted EM algorithm to obtain the corresponding parameter estimates $\hat{\Psi}$. The optimal $\theta$ is selected by minimizing the following BIC criterion:

$$BIC(\theta) = -2\, l(\hat{\Psi}) + \left( \tilde{p} K - 1 + \sum_{k=1}^{K} \eta_k \right) \log(n), \qquad (36)$$

where $\eta_k$ is the number of non-zero regression coefficients in $\beta_k$ and $\tilde{p}$ equals 4 (RMR-SN model), 5 (RMR-STN and RMR-SPE models), or 6 (RMR-SCN model).
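Under these conventions, the criterion (36) is straightforward to compute from a fitted model (illustrative sketch; bic_rmr_ssmn is a hypothetical name):

```python
import numpy as np

def bic_rmr_ssmn(loglik, beta_hat, K, n, p_tilde):
    """BIC, Equation (36): -2*l(Psi_hat) + (p_tilde*K - 1 + sum_k eta_k)*log(n).
    p_tilde is the per-component parameter count: 4 (SN), 5 (STN/SPE), 6 (SCN)."""
    eta = sum(int(np.count_nonzero(bk)) for bk in beta_hat)  # non-zero coefficients
    return -2.0 * loglik + (p_tilde * K - 1 + eta) * np.log(n)
```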

5. Simulation Studies

We perform Monte Carlo simulations to evaluate the performance of the proposed robust mixture models and the adjusted EM algorithm. To assess variable selection and the accuracy of parameter estimation, we report the proportion of correctly estimated zero coefficients (S1), the proportion of correctly estimated non-zero coefficients (S2), the mean estimate over all falsely identified non-zero predictors ($MNZ$) [16] of $\beta$, and the mean squared error of the regression coefficients, $\mathrm{MSE}(\hat{\beta})$,

$$\mathrm{MSE}(\hat{\beta}) = E\left[ (\hat{\beta}_k - \beta_k)^T (\hat{\beta}_k - \beta_k) \right].$$

5.1. Simulation 1

The first simulation uses the SCAD penalty function to select significant variables for the RMR-STN, RMR-SPE, and RMR-SCN models, and compares the results with the Gaussian FMR model and the RMR-SN model.

We set $K = 2$, so the sample data set $\{(y_i, x_i); i = 1, \ldots, n\}$ for the mixture regression model is generated from the following model:

$$y = \begin{cases} \alpha_1 + x^T \beta_1 + \varepsilon_1, & Z = 1, \\ \alpha_2 + x^T \beta_2 + \varepsilon_2, & Z = 2, \end{cases} \qquad (37)$$

where $Z$ identifies the subgroup to which the sample belongs, $\alpha_1 = 2$, $\beta_1 = (4, 1, 2, 0, 0, 0, 0, 0, 0, 0)^T$, $\alpha_2 = 2$, $\beta_2 = (1, 3, 0, 2, 3, 0, 0, 0, 0, 0)^T$, $\omega_1 = 0.6$, and $\omega_2 = 0.4$.

The covariate vector $x$ is generated from a multivariate normal distribution with mean 0, variance 1, and correlation structure $\rho_{ij} = 0.5^{|i-j|}$, $1 \le i, j \le q$. The simulation considers three error distributions: 1) the random errors $\varepsilon_1$ and $\varepsilon_2$ follow the t-distribution with 3 degrees of freedom ($t(3)$); 2) they follow the chi-square distribution with 3 degrees of freedom ($\chi^2(3)$); 3) they follow the normal mixture $0.9 N(0, 1) + 0.1 N(0, 5^2)$. This yields 15 combinations of model and error case, and for each combination we performed 100 replications with $n = 300$.
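A data-generating sketch for case (1) is given below (illustrative code; the coefficient values are taken exactly as printed above, and component labels are drawn with probabilities $\omega_1 = 0.6$, $\omega_2 = 0.4$):

```python
import numpy as np

def simulate_mixture_data(n=300, rng=None):
    """Generate {(y_i, x_i)} from the two-component model (37) with t(3)
    errors; covariates are N(0, Sigma) with Sigma_ij = 0.5^{|i-j|}."""
    rng = np.random.default_rng(rng)
    q = 10
    Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(q), np.arange(q)))
    X = rng.multivariate_normal(np.zeros(q), Sigma, size=n)
    alpha = np.array([2.0, 2.0])                 # intercepts as printed
    beta = np.array([[4, 1, 2, 0, 0, 0, 0, 0, 0, 0],
                     [1, 3, 0, 2, 3, 0, 0, 0, 0, 0]], dtype=float)
    z = rng.choice(2, size=n, p=[0.6, 0.4])      # subgroup labels Z_i
    eps = rng.standard_t(df=3, size=n)           # case (1): t(3) errors
    y = alpha[z] + np.einsum('ij,ij->i', X, beta[z]) + eps
    return y, X, z
```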

From Table 1, the values of S1 in Com1 and Com2 for RMR-STN, RMR-SPE, and RMR-SCN are all larger than those for the FMR model in all three cases. In case (1), the S1 in Com2 is largest for RMR-SPE (S1 = 0.9533) and smallest for FMR (S1 = 0.8533). In case (2), the RMR-SCN model has the largest S1 in Com2 (S1 = 0.9033), while the S1 in Com2 for FMR is 0.8167. In case (3), the S2 values in Com1 and Com2 for FMR are 0.9933 and 0.9950, respectively, whereas the S2 values in both components of RMR-STN are 1.00.

Furthermore, the values of $\mathrm{MSE}(\hat{\beta})$ in Com1 and Com2 for RMR-STN, RMR-SPE, and RMR-SCN are much smaller than those for the FMR model. When the errors follow the $\chi^2(3)$ distribution, RMR-SN performs well, with the smallest $\mathrm{MSE}(\hat{\beta})$ and S2 = 1.00 in both Com1 and Com2, indicating that all non-zero coefficients are identified correctly. Overall, RMR-SSMN models are more robust than FMR for variable selection when the data are asymmetric ($\chi^2(3)$), heavy-tailed ($t(3)$), or contaminated ($0.9 N(0, 1) + 0.1 N(0, 5^2)$).

5.2. Simulation 2

Simulation 2 uses the MIXL2-SCAD penalty function to select significant variables for the RMR-STN, RMR-SPE, and RMR-SCN models. By comparing the results of Simulations 1 and 2, we analyze the effects of the SCAD and MIXL2-SCAD penalty functions on variable selection. The generation of the sample data and the error distributions are the same as in Simulation 1, and both $n = 300$ and $n = 500$ are considered.

Table 1. Results of FMR, RMR-SN, RMR-STN, RMR-SPE and RMR-SCN using SCAD penalty function on 100 replicates.

From Table 2, we see that as the sample size $n$ increases, the values of S1 and S2 in Com1 and Com2 approach 1 and $\mathrm{MSE}(\hat{\beta})$ decreases, reflecting the asymptotic behavior of the parameter estimates. When $n = 500$ and the errors follow the $t(3)$ distribution, the values of S1 and S2 in Com1 for the RMR-SPE model both equal 1.00, indicating that the MIXL2-SCAD penalty identifies the non-zero and zero coefficients completely.

Table 2. Results of RMR-STN, RMR-SPE and RMR-SCN using MIXL2-SCAD penalty function on 100 replicates.

When $n = 500$ and the errors follow $0.9 N(0, 1) + 0.1 N(0, 5^2)$, the same result appears in Com2 for the RMR-SPE and RMR-SCN models. The absolute values of the mean estimate over all falsely identified non-zero predictors ($MNZ$) under MIXL2-SCAD are smaller than 0.01 when $n = 500$.

Comparing Table 1 and Table 2, the values of S1 and S2 in Com1 and Com2 under MIXL2-SCAD are all greater than or equal to those under the SCAD penalty in all cases with $n = 300$. It is worth noting that in case (3) with $n = 300$, the S2 values in Com1 and Com2 for RMR-STN, RMR-SPE, and RMR-SCN under MIXL2-SCAD are all 1.00, whereas the S2 values in Com1 for RMR-SPE and RMR-SCN under SCAD are 0.9933. These comparisons show that MIXL2-SCAD performs better than the SCAD penalty for variable selection.

6. Real Data Analysis

In this section, we consider the Seoul bike sharing demand data set from http://archive.ics.uci.edu/ml/datasets.php. From this data set, we extract the total number of bikes rented between 10:00 am and 11:00 am on each functioning day of the Seoul bike rental system from December 1, 2017 to November 30, 2018, together with 12 features that may affect the demand for rental bikes. There are 353 observations in total. The 12 features are: temperature ($x_1$), humidity ($x_2$), wind speed ($x_3$), visibility ($x_4$), dew point temperature ($x_5$), solar radiation ($x_6$), rainfall ($x_7$), snowfall ($x_8$), holiday (holiday = 1, else = 0; $x_9$), spring (spring = 1, else = 0; $x_{10}$), summer (summer = 1, else = 0; $x_{11}$), and autumn (autumn = 1, else = 0; $x_{12}$). Here $x_9$-$x_{12}$ are dummy variables, with $x_{10}$-$x_{12}$ indicating the season. Since there may be further differential effects between seasons and holidays, we also introduce three interaction terms between the dummy variables, namely $x_9 x_{10}$, $x_9 x_{11}$, and $x_9 x_{12}$. This yields a set of 15 potential covariates affecting the rented bike count (RBC) from 10:00 am to 11:00 am.

Let $Y = \mathrm{RBC} / \mathrm{sd}(\mathrm{RBC})$ be the response variable, where $\mathrm{sd}(\mathrm{RBC})$ is the standard deviation of RBC. Figure 1 shows the histogram and density estimate of $Y$; the data exhibit obvious heterogeneity, so the RMR-STN model is applicable. We also applied the RMR-SPE and RMR-SCN models to this data set, but their outcomes were worse than the RMR-STN results, so we do not report them here.

The parameter estimates under FMR, RMR-STN ($K = 2$), and RMR-STN ($K = 3$) with the BIC method and the MIXL2-SCAD penalty function are given in Table 3. The $K = 3$ RMR-STN model has the lowest BIC (542.5) and the $K = 2$ RMR-STN model ranks second (BIC = 544.7), while the FMR model has the largest BIC (562.8). Furthermore, the predicted rented bike count from the $K = 3$ RMR-STN model has the smallest MSE (0.09) and the largest regression $\tilde{R}^2$ (0.90).

From Table 3, under the $K = 3$ RMR-STN model, bike rental demand between 10:00 am and 11:00 am can be divided into three categories: "low", "medium", and "high".

Figure 1. Histogram and density estimate for $Y = \mathrm{RBC} / \mathrm{sd}(\mathrm{RBC})$.

Table 3. Summary of the FMR, RMR-STN ($K = 2$), and RMR-STN ($K = 3$) models with the BIC method and MIXL2-SCAD penalty for the Seoul bike sharing demand data set.

Humidity is a negative factor for all three types of demand. When bike rental demand is "medium", warmer temperatures and increased solar radiation help increase demand, while rainfall, snowfall, and holidays reduce it. In contrast, when demand is "high", the positive effect of dew point temperature on demand is greatest, while the negative effects of holidays and snowfall disappear. In addition, the rented bike count shows strong seasonality: rentals are higher in the other seasons than in winter.

7. Conclusion

In this paper, we propose a robust mixture regression model based on skew scale mixtures of normal distributions (RMR-SSMN), which avoids the potential limitations of normal mixtures. A new penalty function (MIXL2-SCAD), combining the SCAD and $l_2$ penalties, is presented for variable selection. Through simulations, we find that RMR-SSMN models are more robust than general FMR models for heterogeneous data with asymmetry, heavy tails, and outliers. Furthermore, the capability of MIXL2-SCAD to select the most parsimonious model is clearly better than that of SCAD. The proposed methodology is applied to a real data set and achieves reasonable results. However, this paper focuses only on mixtures of simple linear models; further research could address mixtures of semiparametric or nonparametric models.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

References

[1] Krykun, I. (2018) The Arc-Sine Laws for the Skew Brownian Motion and Their Interpretation. Journal of Applied Mathematics and Physics, 6, 347-357.
https://doi.org/10.4236/jamp.2018.62033
[2] McLachlan, G.J. and Peel, D. (2000) Finite Mixture Models. Wiley, New York.
https://doi.org/10.1002/0471721182
[3] McLachlan, G.J. and Peel, D. (1998) Robust Cluster Analysis via Mixtures of Multivariate T-Distributions. Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Sydney, 11-13 August 1998, 658-666.
https://doi.org/10.1007/BFb0033290
[4] Basso, R.M., Lachos, V.H., Cabral, C.R.B. and Ghosh, P. (2011) Robust Mixture Modeling Based on Scale Mixtures of Skew-Normal Distributions. Computational Statistics & Data Analysis, 54, 2926-2941.
https://doi.org/10.1016/j.csda.2009.09.031
[5] Franczak, B.C., Browne, R.P. and McNicholas, P.D. (2014) Mixtures of Shifted Asymmetric Laplace Distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36, 1149-1157.
https://doi.org/10.1109/TPAMI.2013.216
[6] Zou, H. and Hastie, T. (2005) Regularization and Variable Selection via Elastic Net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67, 301-320.
https://doi.org/10.1111/j.1467-9868.2005.00503.x
[7] Zhang, C.H. (2010) Nearly Unbiased Variable Selection under Minimax Concave Penalty. The Annals of Statistics, 38, 894-942.
https://doi.org/10.1214/09-AOS729
[8] Fan, J. and Li, R. (2001) Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association, 96, 1348-1360.
https://doi.org/10.1198/016214501753382273
[9] Ferreira, C.S., Bolfarine, H. and Lachos, V.H. (2011) Skew Scale Mixtures of Normal Distributions: Properties and Estimation. Statistical Methodology, 8, 154-171.
https://doi.org/10.1016/j.stamet.2010.09.001
[10] Gomez, H.W., Venegas, O. and Bolfarine, H. (2007) Skew-Symmetric Distributions Generated by the Normal Distribution Function. Environmetrics, 18, 395-407.
https://doi.org/10.1002/env.817
[11] Khalili, A. and Chen, J. (2007) Variable Selection in Finite Mixture of Regression Models. Journal of the American Statistical Association, 102, 1025-1038.
https://doi.org/10.1198/016214507000000590
[12] Khalili, A. (2010) New Estimation and Feature Selection Methods in Mixture-of-Experts Models. Canadian Journal of Statistics, 38, 519-539.
https://doi.org/10.1002/cjs.10083
[13] Meng, X.L. and Rubin, D.B. (1993) Maximum Likelihood Estimation via the ECM Algorithm: A General Framework. Biometrika, 80, 267-278.
https://doi.org/10.1093/biomet/80.2.267
[14] Liu, C. and Rubin, D.B. (1994) A Simple Extension of EM and ECM with Faster Monotone Convergence. Biometrika, 81, 633-648.
https://doi.org/10.1093/biomet/81.4.633
[15] Ferreira, C.S. and Lachos, V.H. (2016) Nonlinear Regression Models under Skew Scale Mixtures of Normal Distributions. Statistical Methodology, 33, 131-146.
https://doi.org/10.1016/j.stamet.2016.08.004
[16] Lloyd-Jones, L.R., Nguyen, H.D. and McLachlan, G.J. (2018) A Globally Convergent Algorithm for Lasso-Penalized Mixture of Linear Regression Models. Computational Statistics & Data Analysis, 119, 19-38.
https://doi.org/10.1016/j.csda.2017.09.003
