Selecting the Quantity of Models in Mixture Regression

Mixture regression is a regression problem with mixed data: some observations come from one model, while others come from other models. EM and related algorithms can solve this problem only once the quantity of models is assumed to be given. In this paper, we propose an information criterion for the mixture regression model. Data simulations comparing it with the ordinary information criteria show that our criterion performs better at choosing the correct quantity of models.


Introduction
Mixture regression is a special case of the regression problem. Rather than coming from a single distribution, the data in mixture regression come from multiple distributions, and which distribution each observation comes from is unknown; this degrades parameter estimation. The mixture regression problem can be described as follows [1]:

$$y_i = x_i^\top \beta_k + \varepsilon_{ik} \quad \text{with probability } \pi_{ik}, \qquad i = 1, 2, \ldots, n, \quad k = 1, 2, \ldots, g. \tag{1}$$

$X_{n \times p}$ is the matrix of independent observations, with n observations of p variables. $x_i$, $i = 1, 2, \ldots, n$, is the ith observation vector; its length is p. $Y_{n \times 1}$ is the response variable, of length n. $\beta_k$ and $\sigma_k$ are the unknown parameters (weights) of the variables and the scale parameter in the different models. $\varepsilon_{ik}$ is a random error, with scale $\sigma_k$, independent of $x_i$. $\pi_{ik}$ is the probability that the ith observation comes from the kth distribution $\Omega_k$ (i.e., $\pi_{ik} = P\big((x_i, y_i) \in \Omega_k\big)$). Solving the mixture regression problem involves two parts: first, determining which model every sample comes from; second, estimating the parameters of each model. This is why the mixture regression model is also called model-based clustering [2] [3].
In every mixture regression problem, $\pi_{ik}$ is unknown and satisfies:

$$\sum_{k=1}^{g} \pi_{ik} = 1, \qquad 0 \le \pi_{ik} \le 1. \tag{2}$$

Furthermore, $Z_{n \times g}$ is defined as the classification matrix of the mixture regression. Every element of the classification matrix is an estimator of $\pi$: $z_{ik} = \hat\pi_{ik}$. It indicates whether the ith observation comes from the kth distribution ($z_{ik} = 1$ means that it does). The classification matrix is one of the most important results in a mixture regression problem: if we knew the true Z, we could simply split the data into separate linear regressions and estimate the parameters of each.
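The last point can be illustrated with a minimal sketch, assuming a single predictor with an intercept and hard 0/1 labels. The helper names `fit_line` and `fit_by_label` and the toy data are hypothetical, introduced here only for illustration:

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def fit_by_label(xs, ys, labels):
    """Fit one regression line per group when the true labels are known."""
    fits = {}
    for k in sorted(set(labels)):
        gx = [x for x, lab in zip(xs, labels) if lab == k]
        gy = [y for y, lab in zip(ys, labels) if lab == k]
        fits[k] = fit_line(gx, gy)
    return fits

# Toy noise-free data (assumed): group 0 follows y = 1 + 2x, group 1 follows y = 10 - x.
xs = [0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
ys = [1 + 2 * x if lab == 0 else 10 - x for x, lab in zip(xs, labels)]
fits = fit_by_label(xs, ys, labels)  # maps each label to (intercept, slope)
```

With the labels known, each group is an ordinary regression problem; the difficulty of mixture regression lies entirely in not knowing Z.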
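Parameter estimates can be obtained with the EM algorithm. Fraley et al. [4] [5] describe the EM algorithm for the ordinary mixture regression model, in which every component model is an ordinary linear regression. The EM algorithm for ordinary mixture regression is as follows. The vector $z_i$ (the ith row of the classification matrix Z) can be considered as a draw from a multinomial distribution with probabilities $\tau_1, \tau_2, \ldots, \tau_g$. The complete-data log-likelihood is:

$$\ln L(\theta, \tau, z \mid x, y) = \sum_{i=1}^{n} \sum_{k=1}^{g} z_{ik} \ln\big[\tau_k f_k(y_i \mid x_i, \theta_k)\big]. \tag{3}$$

The E-step in the mixture regression model is:

$$\hat z_{ik} = \frac{\hat\tau_k f_k(y_i \mid x_i, \hat\theta_k)}{\sum_{j=1}^{g} \hat\tau_j f_j(y_i \mid x_i, \hat\theta_j)}.$$

With $z_{ik}$ fixed, the M-step maximizes Formula (3) over $\tau_k$ and $\theta_k$. For a normal mixture regression problem, $f_k$ in the E-step can be replaced by the normal probability density function $\phi_k$:

$$\phi_k(y_i \mid x_i, \beta_k, \sigma_k^2) = \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\left\{-\frac{(y_i - x_i^\top \beta_k)^2}{2\sigma_k^2}\right\}.$$

Because the observations are independent, the covariance matrix can be taken as $\Sigma_k = \lambda I$ (that is, $\sigma_k^2 = \lambda$), and the parameters needed in the E-step can be computed quickly, in closed form, in the M-step.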
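The E-step and M-step can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of any cited paper or package: it handles a single predictor with an intercept, uses one variance `lam` shared by all components (following $\Sigma_k = \lambda I$), takes weighted least squares as the M-step update for the coefficients, and starts from a simple median split of the responses. The names `weighted_line` and `em_mixreg` and the simulated data are hypothetical.

```python
import math
import random

def weighted_line(xs, ys, ws):
    """Weighted least squares for y = a + b*x; returns (a, b)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def em_mixreg(xs, ys, resp, n_iter=50):
    """EM for a g-component mixture of simple linear regressions.

    resp[i][k] is the initial responsibility of component k for observation i.
    A single variance lam is shared by all components (assumption).
    """
    n, g = len(xs), len(resp[0])
    history = []
    for _ in range(n_iter):
        # M-step: mixing proportions, weighted least squares, shared variance.
        cols = [[r[k] for r in resp] for k in range(g)]
        tau = [sum(c) / n for c in cols]
        coef = [weighted_line(xs, ys, c) for c in cols]
        lam = sum(
            resp[i][k] * (ys[i] - coef[k][0] - coef[k][1] * xs[i]) ** 2
            for i in range(n) for k in range(g)
        ) / n
        # E-step: z_ik proportional to tau_k * phi_k(y_i), computed in log space.
        loglik, resp = 0.0, []
        for x, y in zip(xs, ys):
            logs = [
                math.log(tau[k])
                - 0.5 * math.log(2 * math.pi * lam)
                - (y - coef[k][0] - coef[k][1] * x) ** 2 / (2 * lam)
                for k in range(g)
            ]
            top = max(logs)
            terms = [math.exp(v - top) for v in logs]
            total = sum(terms)
            resp.append([t / total for t in terms])
            loglik += top + math.log(total)
        history.append(loglik)  # observed-data log-likelihood at this iteration
    return tau, coef, lam, resp, history

# Assumed toy data: 50 points from y = 1 + 2x and 50 from y = 10 - x, noise sd 0.3.
random.seed(0)
xs = [2 * i / 49 for i in range(50)] * 2
ys = [1 + 2 * x + random.gauss(0, 0.3) for x in xs[:50]] + \
     [10 - x + random.gauss(0, 0.3) for x in xs[50:]]
cut = sorted(ys)[len(ys) // 2]  # median split as a crude initial classification
init = [[1.0, 0.0] if y < cut else [0.0, 1.0] for y in ys]
tau, coef, lam, resp, history = em_mixreg(xs, ys, init)
```

The E-step works in log space so that very small densities do not underflow. The log-likelihood recorded in `history` should not decrease from one iteration to the next, which is a convenient sanity check on any EM implementation.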
Song et al. [6] developed an EM algorithm for robust mixture regression. Q. Wu et al. [7] proposed an EM algorithm for quantile regression. Furthermore, D. Lang et al. [8] described a fast iterative method that can solve mixture regression when the random errors follow different distributions.
However, all the algorithms mentioned above assume that the quantity of models g is known, which does not hold in every situation: g has to be chosen before the algorithm is run. When X is a low-dimensional matrix, a scatter plot can be drawn to choose g, but inspecting a scatter plot is not suitable in a high-dimensional situation. It is therefore meaningful to develop a proper method for choosing the right quantity of models in a mixture regression problem.
The rest of the paper is organized as follows. Section 2 discusses the equivalence between mixture regression and ordinary regression when the classification matrix is fixed. Section 3 extends a method based on the information criterion. Section 4 presents data simulations for the different information criteria. The proof of the theorem is in the Appendix.

Equivalence of Linear Regression
Unsupervised learning has its own methods for choosing the quantity of clusters, such as the gap statistic for K-means [9].
Mixture regression can be regarded as model-based clustering that includes both deciding which cluster every observation belongs to and estimating the parameters.
To find a proper method for choosing the quantity of models, we need to relate mixture regression to other algorithms. Under some conditions, namely when the classification matrix Z is fixed and the random errors have the same variance, mixture regression can be written as a linear regression.
When the random errors in every model are independent and identically distributed from a normal distribution ($\varepsilon_{ik} \sim N(0, \sigma^2)$), the random error in the mixture regression also follows a normal distribution ($\tilde\varepsilon \sim N(0, \sigma^2 I)$). The proof can be found in the Appendix.
With this theorem established, evaluation methods from regression can be used to choose the quantity of models in mixture regression.
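As an illustration of the equivalence (a sketch in assumed notation, not taken from the original derivation), suppose $g = 2$ and $n = 3$, with the first two observations assigned to model 1 and the third to model 2. Stacking the coefficient vectors and placing each $x_i^\top$ in the block of its assigned model gives a single linear regression:

```latex
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix}
=
\begin{pmatrix}
x_1^\top & 0 \\
x_2^\top & 0 \\
0 & x_3^\top
\end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \end{pmatrix}
```

Each row of the stacked design matrix is nonzero only in the block of the model to which the observation is assigned, so ordinary least squares on the stacked system reproduces the separate per-model fits.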

Information Criterion
For a regression problem, the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) [10] is commonly used to evaluate a regression model [11]. An information criterion is based on information theory and measures the information lost by a specified model. It trades off goodness of fit against model complexity:

$$\mathrm{AIC} = 2k - 2\ln(L), \qquad \mathrm{BIC} = k\ln(n) - 2\ln(L).$$

The best model is the one with the minimum AIC (BIC). L is the likelihood function, which measures the goodness of fit (expression (3)). k, the number of unknown parameters in the model, sets the penalty of the information criterion. In linear regression, k is the number of independent variables. For BIC the penalty is larger: the weight on k is $\ln(n)$ rather than 2, which exceeds 2 whenever $n > e^2 \approx 7.4$.
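The two criteria can be computed directly from the residuals of a fitted regression. The sketch below assumes Gaussian errors with the variance estimated as the residual sum of squares divided by n; the function names and the residual values are hypothetical.

```python
import math

def gaussian_loglik(residuals):
    """Maximized Gaussian log-likelihood given residuals (variance = RSS / n)."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

def aic(loglik, k):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * loglik

# Hypothetical residuals from some fitted model with k = 2 parameters.
residuals = [0.5, -0.5, 0.5, -0.5]
loglik = gaussian_loglik(residuals)
aic_value = aic(loglik, k=2)
bic_value = bic(loglik, k=2, n=len(residuals))
```

For the same fit, the two criteria differ only in the penalty term, by $k(\ln(n) - 2)$, so BIC favors smaller models than AIC once n is moderately large.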

Information Criterion in Mixture Regression
In mixture regression, the parameters in the classification matrix should be counted among the estimated variables. If these variables are ignored, the criterion tends to choose a larger quantity of models, which is an overfitting problem.
For every observation, $g - 1$ binary (0-1) variables suffice to determine the classification among g models. For example, if $g = 2$, $z_{i1}$ alone completely determines which cluster (model) the ith observation comes from; if $g = 3$, $z_{i1}$ and $z_{i2}$ are required. The value of k (the number of unknown parameters in the model) in the information criterion for mixture regression should therefore include these $n(g - 1)$ classification variables in addition to the regression parameters. The Akaike information criterion for mixture regression (AICM) and the Bayesian information criterion for mixture regression (BICM) are the AIC and BIC evaluated with this k:

$$\mathrm{AICM} = 2k - 2\ln(L), \qquad \mathrm{BICM} = k\ln(n) - 2\ln(L).$$

AICM and BICM can be used to select the quantity of models in a mixture regression problem. However, the penalty weight for g in BICM is $n\ln(n)$, rather than $2n$ as in AICM, which leads to underfitting when g is larger. The details of this point are shown in the next section.
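To make the difference in penalty weights concrete, the sketch below computes both criteria under an assumed parameter count of $k = n(g - 1) + gp$: the $n(g - 1)$ classification indicators plus p regression coefficients for each of the g models. The regression part of this count is an assumption made here for illustration, and the function names are hypothetical.

```python
import math

def n_params(n, g, p):
    """Assumed count: n*(g-1) classification indicators plus g*p coefficients."""
    return n * (g - 1) + g * p

def aicm(loglik, n, g, p):
    """AIC with the mixture-regression parameter count."""
    return 2 * n_params(n, g, p) - 2 * loglik

def bicm(loglik, n, g, p):
    """BIC with the mixture-regression parameter count."""
    return n_params(n, g, p) * math.log(n) - 2 * loglik

# Extra penalty paid for moving from g = 2 to g = 3 with n = 100, p = 2,
# holding the log-likelihood fixed at 0 so that only the penalty changes.
extra_aicm = aicm(0.0, 100, 3, 2) - aicm(0.0, 100, 2, 2)  # 2 * (n + p) = 204
extra_bicm = bicm(0.0, 100, 3, 2) - bicm(0.0, 100, 2, 2)  # (n + p) * ln(n)
```

Under these assumptions, each additional model costs 204 units in AICM but about 470 in BICM when n = 100, so a new model must improve the log-likelihood more than twice as much to be accepted by BICM.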

Data Simulation
To validate the rationality of the model, we designed numerical simulations and generated sample data.
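Sample data of this kind can be generated along the following lines. This is a sketch only: the coefficients, the noise level, the predictor range, and the function name `simulate_mixture` are hypothetical and are not the settings used in the simulations reported here.

```python
import random

def simulate_mixture(models, n_per_model, sigma=1.0, seed=0):
    """Draw n_per_model observations from each (intercept, slope) model."""
    rng = random.Random(seed)
    xs, ys, labels = [], [], []
    for k, (a, b) in enumerate(models):
        for _ in range(n_per_model):
            x = rng.uniform(0, 10)
            xs.append(x)
            ys.append(a + b * x + rng.gauss(0, sigma))
            labels.append(k)  # true model index, hidden from the fitting step
    return xs, ys, labels

# Two assumed models with 50 observations each.
xs, ys, labels = simulate_mixture([(1.0, 2.0), (10.0, -1.0)], n_per_model=50)
```

Keeping the true labels alongside the data makes it possible to check afterwards how often a criterion recovers the true quantity of models.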

Simulation I
Simulation I has two distributions with 50 observations in each distribution. See Figure 1 for the results when $g = 1, 2, 3, 4$. We repeated the simulation 100 times, using the mixreg package in R [12], to obtain the results in Table 1.
Table 1. Simulation I of selecting quantity of models.

Simulation II
The models in Simulation II are the same as in Simulation I, but Simulation II has 100 samples in each distribution.
Figure 2 for Simulation II can be found in the Appendix. Table 2 below gives the results of repeating the simulation 100 times.

Simulation III
Simulation III has three distributions with 50 samples in each distribution.
See Figure 3 in the Appendix for Simulation III; the results are shown in Table 3.
Table 2. Simulation II of selecting quantity of models.
Table 3. Simulation III of selecting quantity of models.

Conclusion
According to the results of the three simulations, AICM and BICM perform well for small g ($g = 2$), choosing the true quantity of models at a rate over 98%, whereas the ordinary AIC and BIC did not identify the right quantity even once. With large samples (Simulation II), AICM and BICM both perform well. With small samples (Simulation I), AICM tends to overfit the quantity and BICM tends to underfit it, with a low probability of 2%. Simulation III shows an interesting result when $g = 3$: BICM underfits too strongly, which means the penalty weight is too large for selecting the quantity, while AICM chooses correctly 97 times out of 100. This validates the point made in Section 3.

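Appendix
Theorem 1 (Equivalence between Mixture Regression and Linear Regression) If the estimator of $\pi$, the classification matrix $\hat Z = \hat\pi$, is fixed, mixture regression can be written as $\tilde Y = \tilde X \tilde\beta + \tilde\varepsilon$.
Proof. In the mixture regression problem, with $\hat Z$ fixed, the ith observation $(x_i, y_i)$, $i = 1, 2, \ldots, n$, can be written as:

$$y_i = \sum_{k=1}^{g} z_{ik}\, x_i^\top \beta_k + \varepsilon_i.$$

We have:

$$\tilde x_i = (z_{i1} x_i^\top, z_{i2} x_i^\top, \ldots, z_{ig} x_i^\top), \qquad \tilde\beta = (\beta_1^\top, \beta_2^\top, \ldots, \beta_g^\top)^\top, \qquad y_i = \tilde x_i \tilde\beta + \varepsilon_i,$$

so $(\tilde x_i, y_i)$ is the same as the ith single observation above. In this way, a mixture regression can be written as $\tilde Y = \tilde X \tilde\beta + \tilde\varepsilon$. As for the distribution of the random error, every $\varepsilon_i$ is independent and identically distributed as $N(0, \sigma^2)$, so $\tilde\varepsilon \sim N(0, \sigma^2 I)$.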