Comparison of Different Regularized and Shrinkage Regression Methods to Predict Daily Tropospheric Ozone Concentration in the Grand Casablanca Area ()
1. Introduction
Tropospheric ozone (O3) is a dangerous air pollutant that threatens the human health [1]. Indeed, epidemiologic studies have shown that current ambient exposures are associated with reduced baseline lung function, exacerbation of asthma and premature mortality [2]. It is a secondary trace gas in the atmosphere, not directly emitted from any natural or anthropogenic source, but rather formed through a complex set of several chemical reactions in presence of sunlight [3].
As all the large cities in the world, Casablanca has a serious photochemical tropospheric ozone (O3) air pollution problem. The urban emission pattern of O3-forming pollutants is caused by meteorological factors: exposure to sunshine, temperature and wind speed and also by a series of atmospheric reactions involving precursor pollutants caused by car and industry emissions.
Various statistical methods are available to predict daily O3 [4] [5] [6] [7] Multiple linear regression (MLR) is frequently used by several environmental protection agencies involved in air quality monitoring (e.g. [8] [9] [10] etc.). The prediction ability of this type of models is generally satisfactory, notwithstanding the fact that, very often, the predictor variables are highly collinear. In the following, MLR will stand as the standard method to which alternative methods will be compared. In order to tackle the multicollinearity issue, various methods are proposed in the literature [11]. Ridge Regression [12] was certainly the first method proposed in this context. This method of analysis is based on a regularization strategy which aims at constraining the length (as measured by the L2 norm) of the vector of regression coefficients to be relatively small. Similarly, Lasso regression [13] follows the same principle as Ridge Regression, but, this time, the length of the regression coefficients is measured by L1 norm. Other alternative methods to MLR encompass Principal Component Regression [14] and Partial Least Squares regression [15] [16] [17]. These latter techniques were combined into a single approach, Continuum Regression (CR), proposed by Stone and Brooks [18]. Sundberg [19] shows that CR is also related to Ridge regression. Recently, a new biased regression strategy consisting in gradually shedding off the correlations among the independent variables was proposed by Qannari and El Ghaziri [20].
In this study, we compare different regression models to predict the daily mean O3 concentration in the Grand Casablanca area using O3 persistence and meteorological variables. We follow two successive stages. In the first stage, we fit statistical models using two years (2013-2014, calibration sets) of pollutants and observed meteorological data. In the second stage, in order to choose the best predictive model, we compare the prediction abilities using the observed dataset of 2015 (test set) and the meteorological forecasting dataset of 2015 (prediction dataset test). The aim of the study is to select the best model in terms of prediction ability.
2. Material and Methods
2.1. Data
The three years datasets used in this study are provided by the National Meteorological Office of Morocco (DMN). Their collection ranged from January 2013 to December 2015. A detailed description of the 25 available variables is given in Appendix A. The data consist of daily O3 pollutant concentrations, observed at “Jahid” monitoring site, located in the western center of the Casablanca city which is the most important industrial area (Figure 1). Following the DMN recommendation, we use in this study 23 meteorological variables such as temperature, humidity, duration of sunshine, wind direction, wind speed, precipitation, pressure, etc. also measured at the center of Casablanca. After a step of pre-processing of these data which involved in particular the imputation of missing values, we dispose of a dataset containing for each day the observed meteorological data, forecasted meteorological data acquired from the numeric model ALADIN-Maroc and the measured O3 concentrations. The period of the study is limited to the hot and sunny season (April-September) when ozone concentrations are at their maximum [21].
2.2. Exploratory Data Analysis
Inevitably, the collected data contain missing values and it is important to tackle
Figure 1. Map of Grand Casablanca area in the western center of Morocco. The measurement station El-Jahid is an urban one located in the center of Casablanca.
this problem before further analyses are performed. The reasons for which a value can be missing are numerous. For instance, in air quality applications, data can be missing due to a dysfunction of the equipment or an insufficient resolution of a sensor device. Therefore, it is necessary to identify missing values and choose an appropriate imputation technique in order to keep as much data as possible e.g. [22]. Various imputation procedures are used in practice. We can cite for instance regression imputation, nearest-neighbour imputation, random hot-deck imputation [23] [24] [25]. We choose the K-nearest neighbors (KNN) strategy because it is simple and efficient. With this technique, the imputation is based on the neighboring observations to each missing value [26] [27]. More precisely, missing values are replaced by values extracted from cases that are similar to the recipient with respect to the observed (i.e. non missing) characteristics.
Once the data were imputed, we applied a standardized Principal Components Analysis (PCA) to investigate the relationships between the variables and assess the degree of collinearity among the predictor variables [28] [29] [30].
2.3. Linear Modelling Approach
Several statistical models are available to predict tropospheric ozone concentration. Since the available explanatory variables are potentially highly correlated, we investigate alternative methods to the classical Multiple Linear Regression (MLR) that circumvent the problem of multicollinearity which is likely to lead to unstable models. The emphasis is put on: 1) classical regularized regression methods: Principal Components Regression [14] , Partial Least Squares Regression [16] , Sparse PLS [31] [32] and Continuum Regression introduced by Stone and Brooks in 1990; 2) Penalized regression methods: Ridge [12] and Lasso developed by Tibshirani [13] and finally; 3) Biased Power Regression recently introduced by Qannari and El Ghaziri [20].
2.3.1. Multiple Linear Regression (MLR)
We assume the MLR model using Equation (1):
(1)
where
: ozone concentration at day i;
: ozone concentration at day i − 1 (i.e. the persistence);
: Meteorological variable j observed on day i.
Equation (1) can be written using usual matrix format after centring of the response variable:
(2)
where
is an
vector of centered dependant variable (O3 concentrations at day i),
is a
matrix of standardized predictors (Observed meteorological variables and O3 concentrations at day i − 1),
is an
vector of unknown regression coefficients and
is an
vector of random errors. Classically, the distribution of
is assumed to be normal with mean equal to 0 and a variance covariance matrix equal to
; where
is the identity matrix.
The usual unbiased Ordinary Least Squares (OLS) estimator is expressed by (3) [33] :
(3)
The prediction of
using OLS,
, is given by
. It is well known that this estimator is likely to lead to an unstable model and poor predictions in presence of quasi-collinearity among the predictors or in the case of the small sample and high dimensional setting.
2.3.2. Principal Component Regression (PCR)
The principal components regression (PCR) approach involves running PCA on the predictor variables and, thereafter, using the first m principal components (PC) with
,
, as the predictors in a linear regression model [14] [34].
The appropriate number, m, of first principal components to be introduced in the model can be determined in practice by a validation technique such as Leave One Out cross validation (LOO). The PCR model can be written as in Equation (4):
(4)
where
is a matrix with m columns containing the m first PCs. The Ordinary Least Squares (OLS) estimator of
is given by:
.
It is easy to express this model in terms of the original variables by remarking that
, where
is the matrix containing the m dominant normalised eigenvectors of
. It follows that
.
PCR gives a biased estimate of the regression coefficients. If all the PCs are included in the model, we retrieve the usual MLR estimator,
.
2.3.3. Partial Least Squares Regression (PLS)
As with PCR, PLS regression, introduced by [35] , consists in regressing y on components, also called latent variables, which are linear combinations of the p predictor variables.
The major difference between PCR and PLS regression is that, whereas PCR uses only
to construct the components to be used as regressors, PLS regression uses both
and
to determine such components. More precisely, the PLS components are determined sequentially and, at each step, we seek to determine a new component, constrained to be orthogonal to the components determined at the previous stages, so as to maximize the covariance between this component and the independent variable.
Suppose that m PLS components are determined. Again, the number m of latent variables to be introduced in the model can be selected by means of LOO cross validation technique in practice. These latent variables can be stacked into a matrix
,
where
is the X-weight matrix. Equation (5) gives the vector of fitted values obtained by regressing
on
, namely:
(5)
We can show in Equation (6) the expression of PLS regression coefficients as:
(6)
where
and
, where
is the L2 norm.
PLS regression is often helpful to reduce the number of predictors to a small number of latent variables constructed by linear combinations of the columns of original predictors. It yields a biased estimate of the regression coefficients.
2.3.4. Sparse PLS Regression (SPLS)
The Sparse PLS method defined by [32] is a direct adaptation of the PLS regression method. It allows us to operate a dimensionality reduction using regression PLS.
In SPLS regression,
the first vector of loadings is sought as an optimal solution to:
subject to
,
,
where
,
is the L1-norm of vector
and
is a scalar which controls the degree of sparsity.
The regression coefficients estimation of
on
is calculated in the following way: The coefficients of the non-selected variables are set to 0, and the coefficients of the selected variables are those obtained by means of the “standard” PLS regression. We can also give an expression of the SPLS regression coefficients defined by (7) [32]
(7)
The interest of the SPLS is two folds. On the one hand, thanks to the sparsity, it yields an easy to interpret model and, on the other hand, it prevents the problem of multicollinearity by using the PLS framework. SPLS estimator is biased comparing to OLS estimator. Moreover, SPLS is computationally efficient with a tunable sparsity parameter to select the important variables.
2.3.5. Continuum Regression (CR)
The CR prediction model is chosen from a continuum of candidates among which we find methods of analysis related to OLS estimation, PCR, PLSR. [19] gives a general overview of the continuum approach regression and shows how different methods relate to “least squares ridge regression”. As with PCR and PLS regression, CR consists in a regression upon latent variables (i.e. optimal linear combinations of the independent variables). More precisely, these latent variables are determined in a sequential way where at each stage, a latent variable is defined so as to realize a balance between stability (as assessed by the variance of the latent variable) and prediction ability (as assessed by the correlation between the latent variable and the dependent variable
). The prediction of
using CR,
, is given by
. The CR achieves a reduction of the variance of estimator at the cost of introducing a small bias [18].
CR aims at transforming the explanatory variables into new latent predictors which are orthogonal to each other and constructed as linear combinations of the original predictors. It makes it possible to circumvent the problem of multicollinearity between predictors. However, the CR regression does not specifically aim at selecting a subset of variables [18].
2.3.6. Penalized Regression
Another general strategy to circumvent the problem of multicollinearity consists in imposing a constraint on the vector of regression coefficients. The two most popular methods in this context are Ridge and Lasso regressions.
Ø Ridge regression
Ridge regression is the first regularization procedure that was proposed to cope with the multicollinearity problem [12]. The Ridge estimator is given by (8):
(8)
where
is a constant to be selected. Note that if
, the Ridge estimator amounts to the least-squares estimator.
Ridge estimator is obtained as a solution to the following least squares problem defined by (9):
(9)
There is a one to one correspondence between the Ridge parameter k and the upper bound, δ, imposed on the vector of regression coefficients,
. From a practical point of view, these parameters can be selected by means of a cross-validation technique.
The Ridge regression shrinks the OLS estimators towards 0. It yields a biased estimator, but with a smaller variance than that of OLS estimator.
Ø Lasso regression
The Least Absolute Shrinkage and Selection Operator, or Lasso [13] is another penalized regression where L2 penalty of ridge regression is replaced by an L1 penalty:
. This is a subtle change that has important consequences. Indeed, this constraint entails that some of the regression coefficients are shrunk exactly to zero. This means that this regression strategy operates de facto a selection of variables since the unimportant variables are discarded, their regression coefficients being equal to zero. Formally, the lasso estimator is given as a solution to the following optimization problem by (10):
, (10)
where
.
The parameter δ controls the degree of sparsity and, in practice; it is determined by Leave One Out (LOO) cross validation procedure. The smaller this parameter is, the larger is the number of discarded variable. Contrariwise, if δ is larger than
(where
are the OLS estimators) then
. Lasso regression has the double effect of shrinking the β coefficients, allowing to decrease the variance of the regression coefficients as with Ridge regression, and, more importantly, to operate an automatic selection of the variables, by cancelling out some
coefficients.
2.3.7. Biased Power Regression (BPR)
Recently, a new biased regression called Biased Power regression (BPR) strategy was proposed [20]. It consists in gradually shedding off the correlation among the independent variables by means of a tuning parameter α. More precisely, the BPR estimator of
is given by (11):
(11)
where α is a tuning parameter which ranges between 0 and 1.
In practice, α is selected using a cross validation procedure.
Clearly, when
, we retrieve the OLS estimator and as α increases, the variance-covariance matrix of the predictor variables is shrunk to the identity matrix. The prediction of y using BPR,
is given by
.
BP-regression shares the same properties as Ridge regression (see Section 2.3.4) and thus can highlight those variables whose coefficients become very small. However, it was not designed to select a subset of variables [20].
2.4. Evaluation of the Methods
To assess the prediction ability of the various models listed above on the Grand Casablanca O3 data, we performed a cross validation technique on a training set to determine the appropriate parameters (number of components, Ridge or lasso constant…) to be used in the prediction models. Using these parameters, the performance of the different models is assessed on the basis of a fresh data set. More precisely, we partitioned the available data into two complementary datasets: 1) summer period of 2013 and 2014 (called the training set) used to adjust the models; and 2) summer period of 2015 (called the validation set or testing set) used to “test” the models obtained in the training phase. The models are fitted on the training set used to predict the ozone responses for a) the observed meteorological data on 2015 of the validation set (obstest) and b) the forecasted meteorological data on 2015 for real validation (prevtest).
The performance of the models is measured with standard indicators defined by Equations (12)-(14) generally used to compare statistical models [36].
In a first stage, an internal validation (2013 and 2014 datasets) is performed on the basis of the following criteria in order to assess the quality of the model adjustment:
The multiple correlation coefficient R2 allows us to assess the quality of the
adjustment based on the training set:
, where
is the
size of training sample.
The Root Mean Squared Errors (RMSE): This is computed according to the following expression:
(12)
The smallest value of this criterion corresponds to the best adjustment of the model.
For the external validation (on summer 2015 observed dataset), the following criterion is used to assess the prediction ability of the models [37] :
The Root Mean Squared Errors of Prediction (RMSEP). This criterion is similar to RMSE but, this time, the validation data set is used instead of the training data set.
(13)
where
is the size of the observed the validation set (obstest).
Obviously, the best predictive model corresponds to the smallest RMSEP.
The following criterion is used to assess the performance of the models with observed meteorological data (obstest) and real meteorological forecast data (prevtest) for summer period of 2015.
In the same way, we define the RMSEP of prevision based on the forecasted dataset as:
(14)
where
is the size of the sample size of the forecasted data (prevtest).
3. Results and Discussion
Experiments were run on an Intel(R) Core(TM) i7-6600U CPU computer with 2.60 GHz, 8 Go in RAM, Windows 10 Professional 64 bits.
All the statistical analyses were performed using the free software R. (http://www.rproject.org/).
3.1. Data Description
In this study, the dataset is composed of 25 explanatory variables. Appendix A gives the abbreviation of these variables.
Table 1 provides descriptive statistics of the meteorological variables and tropospheric ozone concentrations calculated on the data with missing values.
Table 1. Statistics of measured variables at Grand Casablanca area from 01 April 2013 to 30 September 2014.
Minimum, maximum, mean and standard deviation statistics are provided to describe the characteristics of the data set.
The 2013 and 2014 studied periods are characterized by high temperatures. We can notice that, in the Grand Casablanca Area, the maximal temperature (TMPMAX) can go up to 37.5˚C and the minimal temperature is 16.2˚C. The maximal daily total sunshine duration is of 13.3 hours. We can also notice that there is almost no rain is these periods (RRQUOT). The Wind strength average is relatively high at 18 hours (FFVM18h = 7 m/s). The O3 concentrations are between 10 and 130 µg/m3.
There are in total 90 missing values for the 366 recording days, distributed on 14 variables. This represents around 2% of missing values to be imputed before the prediction models are performed.
3.1.1. Missing Values Imputation
As mentioned above, a strategy based on the K-nearest neighbors was performed. Different K values were used in the literature and the choice of K = 10 led to the best results [38].
3.1.2. Multivariate Analysis
Figure 2 shows the scatter plots associated with pairs of the available variables. It highlights the pairwise correlations between these variables. We also indicate in Figure 2, the histograms associated with each variable and the correlation coefficients between each pair of variables. For instance, we can see that there is a high correlation between O3veilleJahid and O3 Jahid (the two last columns) with a correlation coefficient equal to 0.92. We can also observe large correlations (around 0.94) between the first three explanatory variables (TMPMAX and TMPMOY, TMPMIN and TMPMOY).
Figure 2. Scatter plots highlighting the correlations between pairs of variables.
The diagonal entries show the histograms associated with the various variables and the upper entries indicate the coefficients of correlations between pairs of variables.
We also performed a PCA on the imputed dataset. PCA is run on the complete data (2013 and 2014) after imputation and standardization of the variables. The data is composed of 366 days (from April to September of 2013 and 2014) and 24 variables. The first five principal components recover up to 65% of the total variance (Table 2). In the following, only the results related to the first two principal components which recover around 40% of the total inertia are shown.
Figure 3 shows the correlations of the explanatory variables with the first two principal components. The variable O3 Jahid is superimposed as an illustrative variable with a blue arrow to depict its relationships with the explanatory variables. This figure highlights the strong correlations among the variables, which may be harmful for the prediction models.
The first PC is linked to wind direction (Vx06, Vx12, Vy12 and Vx18) and pressure variables (PRESTN at 06 h, 12 h and 18 h). We can notice, for example, that variables TMPMAX, TMPMIN and TMPMOY as well as PRESTN06h, PRESTN12h and PRESTN18h are strongly correlated. A strong correlation also exists between variables Vx06, Vy06, Vx18 and Vx12. O3veilleJahid variable is very correlated to O3 Jahid but it is not very well represented in the plan (PC1-PC2).
3.2. Prediction Models
In this section, we compare the results obtained from the different regression models described in section 2.3, namely: 1) the Multiple Linear Regression (MLR) model applied to all the variables of the dataset (24 variables), 2) the Reduced MLR with seven variables selected by means of Akaike Information Criterion (AIC) [39] [40] , 3) The Principal Component Regression (PCR) model, 4) PLS and sparse PLS models, 5) Continuum Regression (CR), 6) Ridge and Lasso
Table 2. Percentage of total variance recovered by the principal components.
Figure 3. PCA correlation circle for Grand Casablanca data. Illustrative variable (O3) is represented with blue dashed arrow.
regressions, 7) Biased Power regression (BP-regression).
A cross validation procedure (LOO) is applied on the data collected during the period extending from April 1st to September 30th in 2013 and 2014 (training data) to determine for each model the parameters (number of components, Ridge and Lasso parameters…) leading to the minimum of the Root Mean Squared Error (RMSE). Then, the prediction ability according to the Root Mean Squared Error Predicted (RMSEP) of the various models is assessed on the basis of: 1) observed data (test data), RMSEPobs, and 2) forecasted data, RMSEPprev from the summer period of 2015.
Table 3 shows the results of the various methods according to several criteria (RMSE, R2, RMSEPobs and RMSEPprev).
Concerning the adjustment of the model on training data (internal validation), not surprisingly, the MLR model leads to the lowest RMSE (9.503), but the other models lead to close values and take into account multicollinearity problem. However, if the goal is to get the best predictive model, the RMSE alone is unsufficient and we need to analyze the RMSEP to assess the predictive quality of each model.
As for the criterion RMSEPobs in external validation, Lasso, Ridge, PLS, BP
Table 3. Comparison of different models according RMSE, R2 and RMSEP criteria.
regressions and CR outperform the other methods. Lasso shows the best predictive ability since it has the smallest RMSEPobs. Moreover, this method of analysis has yet another advantage since the model is based on fewer predictive variables (11 variables such as TMPMAX, TMPMIN, DRINSQ, HUMREL12h, PRESTN06h, FFVM06h, FFVM18h, DDVM12h, Vx06h, Vy06h and O3veilleJahid) than the other models with the exception of the reduced model (7 predictive variables such as TMPMIN, TMPMOY, DRINSQ, PRESTN06, Vx06, Vx12, O3veilleJahid). Among these significant variables, TMPMIN and TMPMOY are strongly correlated so the reduced model does not solve the multicollinearity problem by comparing it to the Lasso model.
Most important are the results of the RMSEPprev based on the forecasted meteorological data for 2015. We recall that the forecasted meteorological data will be the data used on daily basis to predict the O3 concentration as obtained by Aladin-Maroc numerical forecasted model. It turns out that Lasso has by far the best RMSEP_prev (equal to 12.74), a value close to its RMSEP_obs (11.58) followed by the PLS and BP regression model. However, these last two models keep all the predictive variables, unlike the Lasso model, which keeps fewer variables, thus obtaining a model that is simple and easy to interpret.
Figure 4 shows a good correlation (around 0.723) between observed O3 and forecasted O3 data one day ahead obtained with the Lasso regression model only in 2015.
The most important finding (Table 3, Figure 4) is that the Lasso regression model has the best performance in predicting O3 concentrations in Jahid compared to the other models. Moreover, it clearly gives stable regression coefficients compared to Reduced MLR model. Table A1 of explanatory variables and Table B1 of the coefficients estimated by the models used in this study shows that the explanatory variables most retained by the models are: TMPMAX, TMPMIN, DRINSQ, PRESTNO6h, Vx06 and O3veilleJahid. Indeed, the formation of ozone in the Grand Casablanca area is related more particularly to: 1) the maximum and minimum daily temperature; 2) the period of intense sunshine; 3) the weak wind that accumulates the massive concentration of ozone and; 4) the previous day’s concentration, which, to a large extent, determines the next day’s
Figure 4. Predicted O3 using the Lasso regression model versus observed O3.
ozone concentration.
4. Conclusions
Starting with a multiple linear regression model, which is plagued by multicollinearity among the predictor variables, we have considered nine more or less recent alternative methods to relate meteorological and pollution variables. The emphasis was put on the prediction ability of the daily tropospheric ozone of these models in the Grand Casablanca area as the first comparative study of its type in such region.
We proposed the selected Lasso model based on a comparison of several linear forecasting methods to reduce the multicollinearity problem. The results obtained over two years of training data (2013 and 2014), verified on observed data (2015) and validated on forecast data (2015) show that the Lasso model has the best predictive capacity O3 for the Jahid station located in Grand Casablanca area. Moreover, using the dataset of 2015, Lasso model still gives the best predictive ability for O3 in Jahid station. The Lasso model presents the interest of being relatively simple and easily interpretable. The choice of this model is explained by the fact that it yields the best criteria in comparison to the alternative models discussed in this paper. These criteria include R², RMSE, RMSEPobs and RMSEPprev. Furthermore, besides yielding a more stable model than multiple linear regression, Lasso is based on a relatively small number of explanatory variables. This feature presents a significant advantage for the daily prediction of the ozone concentration in the Grand Casablanca.
This contribution proposes the first linear model of daily O3 concentration forecast in Morocco and more particularly in the Grand Casablanca area.
In perspective, we plan to widen our study by comparing the performances of the Lasso model with those of other non-parametric models and we will add more data (2017-2018) to ensure model validation. The most appropriate forecast model will be routinely implemented by the National Meteorological Office of Morocco (DMN).
Acknowledgements
The National Meteorological Office of Morocco (Direction de la Météorologie Nationale DMN) is gratefully acknowledged for providing the necessary data for undertaking the present study.
Appendix A
Table A1. Variables abbreviation and units of measurement.
Appendix B
Table B1. Comparison of regression coefficients estimated by the different models.