Comparison of Different Regularized and Shrinkage Regression Methods to Predict Daily Tropospheric Ozone Concentration in the Grand Casablanca Area

Tropospheric ozone (O3) is a pollutant with a significant impact on human health: it can increase the rate of asthma attacks and cause permanent lung damage and death. Predicting its concentration levels is therefore important for planning atmospheric protection strategies. The aim of this study is to predict the daily mean O3 concentration one day ahead in the Grand Casablanca area of Morocco using primary pollutants and meteorological variables. Since the available explanatory variables are multicollinear, multiple linear regression is likely to lead to unstable models. To counteract the multicollinearity problem, we compared several alternative regression methods: 1) Continuum Regression; 2) Ridge and Lasso regressions; 3) Principal Component Regression (PCR); 4) Partial Least Squares regression (PLS) and sparse PLS; and 5) Biased Power Regression. These models are fitted on a training data set (years 2013 and 2014), tested on observed 2015 data and validated on forecasted 2015 data. The Lasso model showed the best performance for the prediction of ozone concentrations compared to multiple linear regression and the other alternative methods.


Introduction
Tropospheric ozone (O3) is a dangerous air pollutant that threatens human health [1]. Indeed, epidemiological studies have shown that current ambient exposures are associated with reduced baseline lung function, exacerbation of asthma and premature mortality [2]. It is a secondary trace gas in the atmosphere, not directly emitted from any natural or anthropogenic source, but formed through a complex set of chemical reactions in the presence of sunlight [3].
Like all large cities in the world, Casablanca has a serious photochemical tropospheric ozone (O3) air pollution problem. The urban pattern of O3 formation is driven by meteorological factors (exposure to sunshine, temperature and wind speed) and by a series of atmospheric reactions involving precursor pollutants emitted by cars and industry.
Various statistical methods are available to predict daily O3 [4] [5] [6] [7]. Multiple linear regression (MLR) is frequently used by environmental protection agencies involved in air quality monitoring (e.g. [8] [9] [10]). The prediction ability of this type of model is generally satisfactory, notwithstanding the fact that, very often, the predictor variables are highly collinear. In the following, MLR stands as the standard method to which alternative methods are compared. In order to tackle the multicollinearity issue, various methods have been proposed in the literature [11]. Ridge Regression [12] was certainly the first method proposed in this context. It relies on a regularization strategy which constrains the length (as measured by the L2 norm) of the vector of regression coefficients to be relatively small. Lasso regression [13] follows the same principle, but this time the length of the vector of regression coefficients is measured by the L1 norm. Other alternatives to MLR include Principal Component Regression [14] and Partial Least Squares regression [15] [16] [17]. These latter techniques were combined into a single approach, Continuum Regression (CR), proposed by Stone and Brooks [18]. Sundberg [19] shows that CR is also related to Ridge regression. Recently, a new biased regression strategy, consisting in gradually shedding off the correlations among the independent variables, was proposed by Qannari and El Ghaziri [20].
In this study, we compare different regression models to predict the daily mean O3 concentration in the Grand Casablanca area using O3 persistence and meteorological variables. We follow two successive stages. In the first stage, we fit statistical models using two years (2013-2014, calibration sets) of pollutant and observed meteorological data. In the second stage, in order to choose the best predictive model, we compare the prediction abilities of the fitted models using the observed and forecasted meteorological data of 2015.

Data
The three-year dataset used in this study was provided by the National Meteorological Office of Morocco (DMN). The data were collected from January 2013 to December 2015. A detailed description of the 25 available variables is given in Appendix A. The data consist of daily O3 concentrations observed at the "Jahid" monitoring site, located in the western center of Casablanca, the city's most important industrial area (Figure 1). Following the DMN recommendation, we use 23 meteorological variables (temperature, humidity, duration of sunshine, wind direction, wind speed, precipitation, pressure, etc.), also measured at the center of Casablanca. After a pre-processing step, which involved in particular the imputation of missing values, we dispose of a dataset containing, for each day, the observed meteorological data, the forecasted meteorological data obtained from the numerical model ALADIN-Maroc, and the measured O3 concentrations. The period of the study is limited to the hot and sunny season (April-September), when ozone concentrations are at their maximum [21].

Exploratory Data Analysis
Inevitably, the collected data contain missing values, and it is important to tackle this problem before further analyses are performed. The reasons for which a value can be missing are numerous; for instance, in air quality applications, data can be missing due to a dysfunction of the equipment or an insufficient resolution of a sensor device. Once the data were imputed, we applied a standardized Principal Components Analysis (PCA) to investigate the relationships between the variables and assess the degree of collinearity among the predictor variables [28] [29] [30].

Linear Modelling Approach
Several statistical models are available to predict the tropospheric ozone concentration. Since the available explanatory variables are potentially highly correlated, we investigate alternatives to classical Multiple Linear Regression (MLR) that circumvent the problem of multicollinearity, which is likely to lead to unstable models. The emphasis is put on: 1) classical regularized regression methods: Principal Component Regression [14], Partial Least Squares regression [16], Sparse PLS [31] [32] and Continuum Regression, introduced by Stone and Brooks in 1990 [18]; 2) penalized regression methods: Ridge [12] and Lasso, developed by Tibshirani [13]; and 3) Biased Power Regression, recently introduced by Qannari and El Ghaziri [20].

Multiple Linear Regression (MLR)
We assume the MLR model given by Equation (1):

O3_i = β_0 + β_1 x_i1 + … + β_p x_ip + ε_i,  (1)

where O3_i is the ozone concentration at day i, x_ij is the value of predictor variable j observed on day i, and ε_i is an error term. After centring of the response variable, Equation (1) can be written in the usual matrix format:

y = Xβ + ε,  (2)

where y is an (n × 1) vector of responses, X is an (n × p) matrix of predictors and β is the (p × 1) vector of regression coefficients. The Ordinary Least Squares (OLS) estimator is given by (3):

β̂_OLS = (X'X)⁻¹X'y,  (3)

and the prediction of y using OLS is ŷ_OLS = Xβ̂_OLS. It is well known that this estimator is likely to lead to an unstable model and poor predictions in presence of quasi-collinearity among the predictors, or in the small-sample, high-dimensional setting.
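This instability can be seen in a small numerical sketch (in Python with NumPy here; the study itself was carried out in R, and the data below are synthetic, not the Casablanca data). With two nearly collinear predictors, OLS splits the effect between them almost arbitrarily, while their sum remains well determined:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Two nearly collinear predictors: x2 is x1 plus a tiny perturbation.
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=n)  # true model uses only x1

# OLS via the normal equations: beta_hat = (X'X)^(-1) X'y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# The split between the two coefficients is essentially arbitrary;
# only their sum (the combined effect, close to 1) is well determined.
print(beta_ols, beta_ols.sum())
```

Rerunning with a different noise draw changes the individual coefficients drastically but leaves their sum nearly unchanged, which is exactly the instability the regularized methods below address.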

Principal Component Regression (PCR)
The Principal Component Regression (PCR) approach involves running PCA on the predictor variables and, thereafter, using the first m principal components as predictors in a linear regression model [14] [34].
The appropriate number m of principal components to be introduced in the model can be determined in practice by a validation technique such as leave-one-out (LOO) cross-validation. The PCR model can be written as in Equation (4):

y = F_m γ + ε,  (4)

where F_m is a matrix with m columns containing the m first principal components. The Ordinary Least Squares estimator of γ is γ̂ = (F_m'F_m)⁻¹F_m'y, and the model can be expressed in terms of the original variables as β̂_PCR = V_m γ̂, where V_m contains the first m PCA loading vectors. PCR gives a biased estimate of the regression coefficients. If all the PCs are included in the model, we retrieve the usual OLS estimator β̂_OLS.
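The selection of m by leave-one-out cross-validation can be sketched as follows (a Python/scikit-learn illustration on synthetic data; the pipeline and grid are assumptions for the example, not the paper's implementation):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=60)   # induce collinearity
y = X[:, 0] - X[:, 2] + 0.1 * rng.normal(size=60)

# Choose the number m of principal components by leave-one-out CV:
# for each m, fit scaler -> PCA(m) -> OLS and record the CV mean squared error.
scores = {}
for m in range(1, 9):
    pcr = make_pipeline(StandardScaler(), PCA(n_components=m), LinearRegression())
    mse = -cross_val_score(pcr, X, y, cv=LeaveOneOut(),
                           scoring="neg_mean_squared_error").mean()
    scores[m] = mse

m_best = min(scores, key=scores.get)   # m with smallest LOO error
print(m_best, scores[m_best])
```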

Partial Least Squares Regression (PLS)
As with PCR, PLS regression, introduced by [35], consists in regressing y on components, also called latent variables, which are linear combinations of the p predictor variables. The major difference between PCR and PLS regression is that, whereas PCR uses only X to construct the components to be used as regressors, PLS regression uses both X and y to determine them. More precisely, the PLS components are determined sequentially: at each step, a new component, constrained to be orthogonal to the components determined at the previous steps, is sought so as to maximize its covariance with the dependent variable.
Suppose that m PLS components are determined. Again, the number m of latent variables to be introduced in the model can be selected in practice by LOO cross-validation. These latent variables can be stacked into a matrix T_m = X W_m, where W_m is the X-weight matrix. Equation (5) gives the vector of fitted values obtained by regressing y on T_m:

ŷ_PLS = T_m (T_m'T_m)⁻¹ T_m'y,  (5)

and Equation (6) gives the expression of the PLS regression coefficients in terms of the original variables:

β̂_PLS = W_m (T_m'T_m)⁻¹ T_m'y.  (6)

PLS regression is often helpful to reduce the number of predictors to a small number of latent variables constructed as linear combinations of the columns of the original predictors. It yields a biased estimate of the regression coefficients.

Sparse PLS Regression (SPLS)
The Sparse PLS (SPLS) method defined by [32] is a direct adaptation of the PLS regression method. It performs dimensionality reduction and variable selection within the PLS framework.
In SPLS regression, the first vector of loadings w is sought as the optimal solution of the PLS criterion penalized by ‖w‖₁, the L1 norm of the vector w, where η > 0 is a scalar which controls the degree of sparsity.
The regression coefficients of y on X are then estimated in the following way: the coefficients of the non-selected variables are set to 0, and the coefficients of the selected variables are those obtained by means of the "standard" PLS regression. The SPLS regression coefficients can thus be expressed as in (7) [32]:

β̂_SPLS,j = β̂_PLS,j if variable j is selected (j = 1, …, p), and β̂_SPLS,j = 0 otherwise.  (7)

The interest of SPLS is twofold. On the one hand, thanks to sparsity, it yields an easy-to-interpret model; on the other hand, it prevents the problem of multicollinearity by working within the PLS framework. The SPLS estimator is biased compared to the OLS estimator. Moreover, SPLS is computationally efficient, with a tunable sparsity parameter to select the important variables.
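The core sparsity ingredient can be illustrated with a deliberately simplified sketch: the exact SPLS criterion of [32] is solved iteratively, but its key step amounts to soft-thresholding a weight vector. Below we apply that step to the first PLS weight only, on synthetic data; this is an assumption-laden illustration, not the full algorithm:

```python
import numpy as np

def soft_threshold(a, eta):
    """Soft-thresholding operator: shrinks entries toward 0 and zeroes small ones."""
    return np.sign(a) * np.maximum(np.abs(a) - eta, 0.0)

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 6))
y = 2 * X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=80)

# First PLS weight vector: the direction maximizing covariance with y.
w = X.T @ y
w /= np.linalg.norm(w)

# Sparsity step (simplified): soft-threshold the weights, then renormalize.
eta = 0.3
w_sparse = soft_threshold(w, eta)
if np.linalg.norm(w_sparse) > 0:
    w_sparse /= np.linalg.norm(w_sparse)

# Indices of the variables retained by the sparse weight vector.
print(np.nonzero(w_sparse)[0])
```

Variables whose weights fall below the threshold η are dropped from the component, which is what makes the resulting model easy to interpret.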

Continuum Regression (CR)
The CR prediction model is chosen from a continuum of candidates encompassing methods related to OLS estimation, PCR and PLS regression. [19] gives a general overview of the continuum regression approach and shows how different methods relate to "least squares ridge regression". As with PCR and PLS regression, CR consists in a regression upon latent variables, i.e. optimal linear combinations of the independent variables. More precisely, these latent variables are determined in a sequential way: at each stage, a latent variable is defined so as to realize a balance between stability (as assessed by the variance of the latent variable) and prediction ability (as assessed by the correlation between the latent variable and the dependent variable y). The prediction of y using CR is given by ŷ_CR = Xβ̂_CR. CR achieves a reduction of the variance of the estimator at the cost of introducing a small bias [18].
CR thus transforms the explanatory variables into new latent predictors which are orthogonal to each other and constructed as linear combinations of the original predictors. This makes it possible to circumvent the problem of multicollinearity between predictors. However, CR does not specifically aim at selecting a subset of variables [18].

Penalized Regression
Another general strategy to circumvent the problem of multicollinearity consists in imposing a constraint on the vector of regression coefficients.The two most popular methods in this context are Ridge and Lasso regressions.

 Ridge regression
Ridge regression is the first regularization procedure proposed to cope with the multicollinearity problem [12]. The Ridge estimator is given by (8):

β̂_Ridge = (X'X + kI)⁻¹X'y,  (8)

where k ≥ 0 is a constant to be selected. Note that if k = 0, the Ridge estimator amounts to the least-squares estimator.
The Ridge estimator can equivalently be obtained as the solution of the constrained least squares problem (9):

β̂_Ridge = argmin_β ‖y − Xβ‖²  subject to  ‖β‖² ≤ δ,  δ ≥ 0.  (9)

There is a one-to-one correspondence between the Ridge parameter k and the upper bound δ imposed on the vector of regression coefficients β. From a practical point of view, these parameters can be selected by means of a cross-validation technique.
Ridge regression shrinks the OLS estimator towards 0. It yields a biased estimator, but with a smaller variance than that of the OLS estimator.
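The shrinkage effect can be checked numerically (a Python/scikit-learn sketch on synthetic collinear data; the alpha grid is an assumption for the example):

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n = 80
x1 = rng.normal(size=n)
# Three predictors, two of them nearly collinear.
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=n), rng.normal(size=n)])
y = x1 + 0.1 * rng.normal(size=n)

Xs = StandardScaler().fit_transform(X)

# The Ridge constant is selected by cross-validation over a grid.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(Xs, y)
ols = LinearRegression().fit(Xs, y)

# The L2 norm of the Ridge coefficient vector is smaller than the OLS one.
print(np.linalg.norm(ridge.coef_), np.linalg.norm(ols.coef_))
```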

 Lasso regression
The Least Absolute Shrinkage and Selection Operator, or Lasso [13], is another penalized regression in which the L2 penalty of Ridge regression is replaced by an L1 penalty, ‖β‖₁ = Σ_j |β_j|. This is a subtle change with important consequences: the L1 constraint entails that some of the regression coefficients are shrunk exactly to zero. The penalty thus has a double effect: it shrinks the β coefficients, decreasing their variance as with Ridge regression, and, more importantly, it operates an automatic selection of variables by cancelling out some β_j coefficients, so that the unimportant variables are discarded. Formally, the Lasso estimator is given as the solution of the optimization problem (10):

β̂_Lasso = argmin_β ‖y − Xβ‖²  subject to  ‖β‖₁ ≤ δ,  δ ≥ 0.  (10)

The parameter δ controls the degree of sparsity and, in practice, is determined by a leave-one-out (LOO) cross-validation procedure. The smaller this parameter, the larger the number of discarded variables; conversely, if δ is larger than ‖β̂_OLS‖₁, the constraint is inactive and the Lasso estimator coincides with the OLS estimator.
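The automatic variable selection can be observed in a short sketch (Python/scikit-learn, synthetic data; here the sparsity parameter is chosen by 10-fold rather than LOO cross-validation, an assumption made for speed):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))
# Only the first two predictors carry signal.
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.3 * rng.normal(size=100)

Xs = StandardScaler().fit_transform(X)

# The sparsity parameter is chosen by cross-validation.
lasso = LassoCV(cv=10, random_state=0).fit(Xs, y)

# Irrelevant predictors tend to receive exactly zero coefficients,
# so the nonzero set identifies the selected variables.
selected = np.nonzero(lasso.coef_)[0]
print(selected)
```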

Biased Power Regression (BPR)
Recently, a new biased regression strategy called Biased Power Regression (BPR) was proposed [20]. It consists in gradually shedding off the correlations among the independent variables by means of a tuning parameter α. More precisely, the BPR estimator of β is given by Equation (11) in [20], where α is a tuning parameter which ranges between 0 and 1.
In practice, α is selected using a cross validation procedure.
Clearly, when α = 0, we retrieve the OLS estimator and, as α increases, the variance-covariance matrix of the predictor variables is shrunk towards the identity matrix. The prediction of y using BPR is given by ŷ_BP = Xβ̂_BP. BP regression shares the same properties as Ridge regression (see Section 2.3.4) and can thus highlight those variables whose coefficients become very small. However, it was not designed to select a subset of variables [20].
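The idea can be sketched numerically. The exact estimator is given in [20]; the code below only illustrates one plausible reading of a "power" shrinkage, in which the eigenvalues of the (standardized) predictor covariance are raised to the power 1 − α, so that α = 0 recovers OLS and α = 1 replaces X'X by the identity. This is an assumption for illustration, not the authors' exact formula:

```python
import numpy as np

def biased_power_estimator(X, y, alpha):
    """BPR-style sketch: shrink the eigenvalues of the predictor covariance
    toward 1 via the power 1 - alpha. alpha = 0 gives OLS; alpha = 1 uses
    the identity in place of X'X. (Illustrative reading of [20], not exact.)"""
    S = X.T @ X / len(X)
    vals, vecs = np.linalg.eigh(S)
    S_alpha = vecs @ np.diag(vals ** (1.0 - alpha)) @ vecs.T
    return np.linalg.solve(S_alpha, X.T @ y / len(X))

rng = np.random.default_rng(6)
n = 80
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=n)])
X = (X - X.mean(0)) / X.std(0)          # standardize predictors
y = x1 + 0.1 * rng.normal(size=n)
y = y - y.mean()                         # center the response

b0 = biased_power_estimator(X, y, 0.0)   # coincides with OLS
b1 = biased_power_estimator(X, y, 0.9)   # heavily decorrelated design
print(b0, b1)
```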

Evaluation of the Methods
To assess the prediction ability of the various models listed above on the Grand Casablanca O3 data, we performed cross-validation on a training set to determine the appropriate parameters (number of components, Ridge or Lasso constant, etc.) to be used in the prediction models. Using these parameters, the performance of the different models is assessed on a fresh data set.
More precisely, we partitioned the available data into two complementary datasets: 1) the summer periods of 2013 and 2014 (the training set), used to fit the models; and 2) the summer period of 2015 (the validation or testing set), used to test the models obtained in the training phase. The models fitted on the training set are used to predict the ozone responses for a) the observed meteorological data of 2015 (obstest) and b) the forecasted meteorological data of 2015 (prevtest). The performance of the models is measured with standard indicators, defined by Equations (12)-(14), generally used to compare statistical models [36].
In a first stage, an internal validation (2013 and 2014 datasets) is performed to assess the quality of the model fit. The multiple correlation coefficient R² assesses the quality of the fit on the training set:

R² = 1 − Σ_{i=1}^{n_train} (y_i − ŷ_i)² / Σ_{i=1}^{n_train} (y_i − ȳ)²,  (12)

where n_train is the size of the training sample.
The Root Mean Squared Error (RMSE) is computed according to the following expression:

RMSE = sqrt( (1/n_train) Σ_{i=1}^{n_train} (y_i − ŷ_i)² ).  (13)

The smallest value of this criterion corresponds to the best fit of the model.
For the external validation (on the summer 2015 observed dataset), the following criterion is used to assess the prediction ability of the models [37]: the Root Mean Squared Error of Prediction (RMSEP). This criterion is similar to the RMSE but is computed on the validation data set instead of the training data set.
RMSEP_obs = sqrt( (1/n_obstest) Σ_{i=1}^{n_obstest} (y_i − ŷ_i)² ),  (14)

where n_obstest is the size of the observed validation set (obstest). Obviously, the best predictive model corresponds to the smallest RMSEP. The same criterion is used to assess the performance of the models with observed meteorological data (obstest) and with real meteorological forecast data (prevtest) for the summer period of 2015.
In the same way, we define the RMSEP of prevision, RMSEP_prev, based on the forecasted dataset, where n_prevtest is the sample size of the forecasted data (prevtest).
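The three criteria are straightforward to compute; a minimal sketch (Python, with illustrative numbers rather than the study's data):

```python
import numpy as np

def r2(y, y_hat):
    """Multiple correlation coefficient: 1 - SS_res / SS_tot on the training set."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y, y_hat):
    """Root mean squared error; computed on a held-out set it is the RMSEP."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

# Toy observed vs predicted ozone values (illustrative only).
y = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 39.0])
print(r2(y, y_hat), rmse(y, y_hat))
```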

Results and Discussion
Experiments were run on an Intel(R) Core(TM) i7-6600U CPU computer at 2.60 GHz, with 8 GB of RAM, under Windows 10 Professional 64 bits.
All the statistical analyses were performed using the free software R (http://www.rproject.org/).

Data Description
In this study, the dataset is composed of 25 explanatory variables. Appendix A gives the abbreviations of these variables. Minimum, maximum, mean and standard deviation statistics are provided to describe the characteristics of the data set.
The 2013 and 2014 studied periods are characterized by high temperatures.
We can notice that, in the Grand Casablanca area, the maximal temperature (TMPMAX) can go up to 37.5˚C and the minimal temperature is 16.2˚C. The maximal daily total sunshine duration is 13.3 hours. There is almost no rain in these periods (RRQUOT). The average wind strength at 18 h is relatively high (FFVM18h = 7 m/s). The O3 concentrations range between 10 and 130 µg/m³.
There are in total 90 missing values over the 366 recorded days, distributed over 14 variables. This represents around 2% of values to be imputed before the prediction models are fitted.

Missing Values Imputation
As mentioned above, a strategy based on the K-nearest neighbours was performed. Different values of K were used in the literature, and the choice K = 10 led to the best results [38].
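A minimal sketch of this imputation step, using scikit-learn's `KNNImputer` on synthetic data (the real variables and missingness pattern are not reproduced here):

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 5))
X_missing = X.copy()
# Randomly blank out roughly 2% of the entries, as in the studied data.
X_missing[rng.random(X.shape) < 0.02] = np.nan

# K = 10 nearest neighbours, the value retained in the study.
imputer = KNNImputer(n_neighbors=10)
X_imputed = imputer.fit_transform(X_missing)

print(np.isnan(X_imputed).sum())   # no missing values remain
```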

Multivariate Analysis
Figure 2 shows the scatter plots associated with pairs of the available variables, highlighting their pairwise correlations. Figure 2 also displays the histograms associated with each variable and the correlation coefficients between each pair of variables. For instance, there is a high correlation between O3veilleJahid and O3 Jahid (the two last columns), with a correlation coefficient equal to 0.92. We can also observe large correlations (around 0.94) among the first three explanatory variables (TMPMAX and TMPMOY, TMPMIN and TMPMOY). We also performed a PCA on the imputed dataset: PCA is run on the complete data (2013 and 2014) after imputation and standardization of the variables.
The data is composed of 366 days (from April to September of 2013 and 2014) and 24 variables.The first five principal components recover up to 65% of the total variance (Table 2).In the following, only the results related to the first two principal components which recover around 40% of the total inertia are shown.
Figure 3 shows the correlations of the explanatory variables with the first two principal components. The variable O3 Jahid is superimposed as an illustrative variable, with a blue arrow, to depict its relationships with the explanatory variables. This figure highlights the strong correlations among the variables, which may be harmful for the prediction models.
The first PC is linked to wind direction (Vx06, Vx12, Vy12 and Vx18) and pressure variables (PRESTN at 06 h, 12 h and 18 h). We can notice, for example, that the variables TMPMAX, TMPMIN and TMPMOY, as well as PRESTN06h, PRESTN12h and PRESTN18h, are strongly correlated. A strong correlation also exists between the variables Vx06, Vy06, Vx18 and Vx12. The O3veilleJahid variable is highly correlated to O3 Jahid but is not very well represented in the (PC1, PC2) plane.

Prediction Models
In this section, we compare the results obtained with the different regression models described in Section 2. The most important finding (Table 3, Figure 4) is that the Lasso regression model has the best performance in predicting O3 concentrations at Jahid compared to the other models. Moreover, it clearly gives stable regression coefficients compared to the reduced MLR model. Furthermore, besides yielding a more stable model than multiple linear regression, Lasso is based on a relatively small number of explanatory variables. This feature is a significant advantage for the daily prediction of the ozone concentration in the Grand Casablanca area.

Conclusions

Figure 1. Map of the Grand Casablanca area in the western center of Morocco. The El-Jahid measurement station is an urban station located in the center of Casablanca.




Figure 2. Scatter plots highlighting the correlations between pairs of variables.

Figure 3. PCA correlation circle for the Grand Casablanca data. The illustrative variable (O3) is represented with a blue dashed arrow.

Figure 4 shows a good correlation (around 0.723) between the observed O3 concentrations and the one-day-ahead O3 forecasts obtained with the Lasso regression model for 2015.
Starting with a multiple linear regression model, which is plagued by multicollinearity among the predictor variables, we have considered nine more or less recent alternative methods to relate meteorological and pollution variables. The emphasis was put on the ability of these models to predict the daily tropospheric ozone in the Grand Casablanca area, in the first comparative study of its type in this region. Based on a comparison of several linear forecasting methods designed to mitigate the multicollinearity problem, we selected the Lasso model. The results obtained over two years of training data (2013 and 2014), verified on observed data (2015) and validated on forecast data (2015), show that the Lasso model has the best capacity to predict O3 at the Jahid station in the Grand Casablanca area. The Lasso model also presents the interest of being relatively simple and easily interpretable. The choice of this model is explained by the fact that it yields the best criteria in comparison to the alternative models discussed in this paper; these criteria include R², RMSE, RMSEP_obs and RMSEP_prev.

Table 1 provides descriptive statistics of the meteorological variables and tropospheric ozone concentrations, calculated on the data with missing values.

Table 1. Statistics of the measured variables in the Grand Casablanca area from 01 April 2013 to 30 September 2014.

Table 2. Percentage of total variance recovered by the principal components.

As for the criterion RMSEP_obs in external validation (Table 3), Lasso, Ridge, PLS and BP

Table 3. Comparison of the different models according to the RMSE, R² and RMSEP criteria.
Most important are the results for RMSEP_prev, based on the forecasted meteorological data for 2015. We recall that forecasted meteorological data, as produced by the Aladin-Maroc numerical model, are the data that will be used on a daily basis to predict the O3 concentration. It turns out that Lasso has by far the best RMSEP_prev (12.74), a value close to its RMSEP_obs (11.58), followed by the PLS and BP regression models. However, these last two models keep all the predictor variables, unlike the Lasso model, which keeps fewer variables and thus yields a model that is simple and easy to interpret.
Table A1 of the explanatory variables and Table B1 of the coefficients estimated by the models used in this study show that the explanatory variables most often retained by the models are TMPMAX, TMPMIN, DRINSQ, PRESTN06h, Vx06 and O3veilleJahid. Indeed, the formation of ozone in the Grand Casablanca area is related more particularly to: 1) the maximum and minimum daily temperatures; 2) the period of intense sunshine; 3) weak winds, which allow high ozone concentrations to accumulate; and 4) the previous day's concentration, which, to a large extent, determines the next day's concentration.