Time Series Forecasting of Hourly PM10 Using Localized Linear Models

The present paper discusses the application of localized linear models for the prediction of hourly PM10 concentration values. The advantage of the proposed approach lies in the clustering of the data based on a common property and the utilization of the target variable during this process, which enables the development of more coherent models. Two alternative localized linear modelling approaches are developed and compared against benchmark models: one in which data are clustered based on their spatial proximity in the embedding space, and one novel approach in which grouped data are described by the same linear model. Since the target variable is unknown during the prediction stage, a complementary pattern recognition approach is developed to account for this lack of information. The application of the developed approach to several PM10 data sets from the Greater Athens Area, Helsinki and London monitoring networks returned a significant reduction of the prediction error under all examined metrics against conventional forecasting schemes such as linear regression and neural networks.


Introduction
Environmental health research has demonstrated that Particulate Matter (PM) is a top priority pollutant when considering public health. Studies of long-term exposure to air pollution, mainly to PM, suggest adverse long- and short-term health effects, increased mortality (e.g. [1,2]), increased risk of respiratory and cardiovascular related diseases (e.g. [3]), as well as increased risk of developing various types of cancer [4]. Hence, the development and use of accurate and fast models for forecasting PM values reliably is of immense interest in the process of decision making and modern air quality management systems.
In order to evaluate the ambient air concentrations of particulate matter, a deterministic urban air quality model should include modelling of turbulent diffusion, deposition, re-suspension, chemical reactions and aerosol processes. In recent years, an emerging trend is the application of Machine Learning Algorithms (MLA), and particularly that of Artificial Neural Networks (ANN), as a means to generate predictions from observations in a location of interest. The strength of these methodologies lies in their ability to capture the underlying characteristics of the governing process in a non-linear manner, without making any predefined assumptions about its properties and distributions. Once the final models have been determined, it is then a straightforward and exceedingly fast process to generate predictions. However, ANN also have inherent limitations. The main one is the extension of models in terms of time period and location; this always requires training with locally measured data. Moreover, these models are not capable of predicting spatial concentration distributions.
Owing to the importance and significant concentrations of PM in major European cities, there is an increasing amount of literature concerned with the application of statistical models for the prediction of point PM values. For the purposes of the EU-funded project APPETISE, an inter-comparison of different air pollution forecasting methods was carried out in Helsinki [5]. Neural networks demonstrated a better forecasting accuracy than other approaches such as linear regression and deterministic models.
In [6], Perez et al. compared predictions produced by three different methods: a multilayer neural network, linear regression and persistence. The three methods were applied to hourly averaged PM2.5 data for the years 1994 and 1995, measured at one location in the downtown area of Santiago, Chile. The prediction errors for the hourly PM2.5 data were found to range from 30% to 60% for the neural network, from 30% to 70% for the persistence approach, and from 30% to 60% for the linear regression. The authors concluded, however, that the neural network gave overall the best results in the prediction of the hourly concentrations of PM2.5.
In [7], Gardner undertook a model inter-comparison using Linear Regression, feed-forward ANN and Classification and Regression Tree (CART) approaches, in application to hourly PM10 modelling in Christchurch, New Zealand (data period: 1989-1992). The ANN method outperformed CART and Linear Regression across the range of performance measures employed. The most important predictor variables in the ANN approach appeared to be the time of day, temperature, vertical temperature gradient and wind speed.
In [8], Hooyberghs et al. presented an ANN for forecasting the daily average PM10 concentrations in Belgium one day ahead. The study was based upon measurements from ten monitoring sites during the period 1997-2001 and upon the ECMWF (European Centre for Medium-Range Weather Forecasts) simulations of meteorological parameters. The most important input variable identified was the boundary layer height. The extension of this model with further parameters showed only a minor improvement of the model performance. Day-to-day fluctuations of PM10 concentrations in Belgian urban areas were to a larger extent driven by meteorological conditions and to a lesser extent by changes in anthropogenic sources.
In [9], Ordieres et al. analyzed several neural-network methods for the prediction of daily averages of PM2.5 concentrations. Results from three different neural networks (feed-forward, Radial Basis Function (RBF) and Square Multilayer Perceptron) were compared to two classical models. The results clearly demonstrated that the neural approach not only outperformed the classical models but also showed fairly similar values among different topologies. The RBF proved to be the network with the shortest training times, combined with a greater stability during the prediction stage, thus characterizing this topology as an ideal solution for use in environmental applications instead of the widely used and less effective ANN.
The problem of the prediction of PM10 was addressed in [10], using several statistical approaches such as feed-forward neural networks, pruned neural networks (PNNs) and Lazy Learning (LL). The models were designed to return at 9 a.m. the concentration estimated for the current day. The forecast accuracy of the different models was comparable. Nevertheless, LL exhibited the best performance on indicators related to the average goodness of the prediction, while PNNs were superior to the other approaches in detecting the exceedances of alarm and attention thresholds.
In view of the recent developments in PM forecasting, the present paper introduces an innovative approach based on localized linear modelling. Specifically, two alternative localized linear modelling approaches are developed and compared against benchmark models such as linear regression and artificial neural networks. The advantage of the proposed approach is the identification of the finer characteristics and underlying properties of the examined data set through the use of suitable clustering algorithms and the subsequent application of a customized linear model to each cluster. Furthermore, the use of the target variable in the clustering stage enhances the coherence of the localized models. The developed approach is applied to several data sets from the monitoring networks of the Greater Athens Area and Helsinki, during different seasons.

Modelling Approaches
Time series analysis is used for the examination of a data set organised in sequential order so that its predominant characteristics are uncovered. Very often, time series analysis results in the description of the process through a number of equations that in principle relate the current value of the series, $y_t$, to lagged values, $y_{t-k}$, modelling errors, $e_{t-m}$, exogenous variables, $x_{t-j}$, and special indicators such as the time of the day. Thus, the generalized form of this process could be written as follows:

$$y_t = f(y_{t-k},\, x_{t-j},\, e_{t-m}) \qquad (1)$$

Linear Regression
This approach uses linear regression models to determine whether a variable of interest, $y_t$, is linearly related to one or more exogenous variables, $x_{t-j}$, and lagged values of the series, $y_{t-k}$. The expression that governs this model is the following:

$$y_t = c + \sum_k \beta_k\, y_{t-k} + \sum_j \gamma_j\, x_{t-j} + e_t \qquad (2)$$

The coefficients $c$, $\beta$, $\gamma$ are usually estimated by a least squares algorithm. The inputs should be a set of statistically significant variables, defined under Student's t-test, estimated from the examination of the correlation coefficients or using a backward elimination selection procedure from a larger initial set.
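As a minimal sketch of the least squares step, the following pure-Python example fits a constant, one lagged value and one exogenous variable to a noise-free toy series. The helper names (`solve_linear_system`, `fit_least_squares`) and the toy data are illustrative assumptions and are not taken from the study.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_least_squares(rows, targets):
    """Return [c, coef_1, ...] minimising the squared error of c + rows . coef."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    # Normal equations: (X^T X) coef = X^T y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Toy data (hypothetical): y[t] = 2 + 0.5*y[t-1] + 0.3*x[t-1], without noise.
x = [0.0, 3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0]
y = [1.0]
for t in range(1, len(x)):
    y.append(2.0 + 0.5 * y[t - 1] + 0.3 * x[t - 1])

# Each input row holds the lagged value and the lagged exogenous variable.
rows = [[y[t - 1], x[t - 1]] for t in range(1, len(y))]
c, beta, gamma = fit_least_squares(rows, y[1:])
```

Because the toy series is generated without noise, the fit should recover the generating coefficients up to rounding. For real data a numerically more robust solver (for example a QR-based routine from a numerical library) would normally be preferred over the normal equations used here.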

Artificial Neural Networks (ANN)
The multi-layer perceptron or feed-forward ANN [11] has a large number of processing elements called neurons, which are interconnected through weights ($w_{iq}$, $v_{qj}$). The neurons are arranged in three different layer types: the input, the output, and one or more hidden layers. The signal flows from the input layer towards the output. Each neuron in the hidden and output layers is activated by a nonlinear function that relies on a weighted sum of its inputs and a neuron-specific parameter, called bias, $b$. The response of a neuron in the output layer as a function of its inputs $x_i$ is given by Equation (3), where $f_1$ and $f_2$ can be sigmoid, linear or threshold activation functions.

$$y_j = f_2\Big(\sum_q v_{qj}\, f_1\Big(\sum_i w_{iq}\, x_i + b_q\Big) + b_j\Big) \qquad (3)$$

The strength of neural networks lies in their ability to simulate any given problem from the presented examples, which is achieved through the modification of the network parameters by learning algorithms. In this study, the Levenberg-Marquardt [12] algorithm is applied because of its speed and robustness compared with conventional back-propagation.
The most important issue concerning the introduction of ANN in time series forecasting is "generalization", which refers to their ability to produce reasonable forecasts on data sets other than those used for the estimation of the model parameters. This problem has two important aspects that should be accounted for. The first is data preparation, which involves pre-processing and the selection of the most significant variables. The second is the determination of the optimum model structure, which is closely related to the estimation of the model parameters. Although there is no systematic approach that can be followed for the first aspect [13], some useful insight can be found using statistical methods such as the correlation coefficients.
The second aspect can be jointly tackled under the cross-validation training scheme. The data set is split into three smaller sets: the training (TS), the evaluation or validation (ES) and the prediction or testing (PS) sets. The model is initialized with a few parameters. The next step is to train the model using data from the training set and, when the error of the evaluation set is minimized, the model parameters and configuration are stored. The number of parameters is then increased and a new network is trained from the beginning. If the ES error is lower than the previously found minimum, then the parameters of this new model are stored. This iterative process is terminated when a predefined number of iterations is reached (Figure 1).
In this study, ES was formed by withholding a fixed percentage (here 25%) of the TS data, namely those patterns that lie nearest, under a Euclidean metric, to other data. The strength of this approach lies in the fact that TS then covers more distinct characteristics of the process, thus allowing for the development of a model with better generalization capabilities.

Nearest Neighbours
This class of hybrid models includes a local modelling and a function approximation to capture recent dynamics of the process. The underlying premise of these predictors is that segments of the series that are neighbours under some distance measure may correspond to similar future values. This claim was endorsed by the work of Farmer and Sidorowich [14], which showed that chaotic time-series prediction is several orders of magnitude better using local approximation techniques rather than universal approximators. The tricky part in these models is the selection of the embedding dimension, which effectively determines segments of the series, and the number of neighbours. Initially, it is required to estimate the embedding dimension $d$ and time delay $\tau$ of the attractor as follows:

$$Y(t) = [\,y_t,\; y_{t-\tau},\; \ldots,\; y_{t-(d-1)\tau}\,]$$

In this study, a value of $\tau = 1$ was used and $Y(t)$ had the same parameters as the linear regression model. The number of neighbours was not pre-determined but was set to vary between predefined limits. A small number of neighbours increases the variance of the results, whereas a large number can compromise the local validity of a model and increase the bias of the results. Once the nearest neighbours to $Y(t)$ have been identified, an averaging procedure is followed in the present study to generate predictions.
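The nearest-neighbour averaging described above can be sketched as follows. This is an illustrative simplification, assuming a plain delay embedding, a Euclidean distance and an unweighted average of the neighbours' successors; the function names `embed` and `knn_forecast` are hypothetical.

```python
import math


def embed(series, d, tau=1):
    """Return (t, [y[t], y[t-tau], ..., y[t-(d-1)*tau]]) for every valid t."""
    first = (d - 1) * tau
    return [(t, [series[t - i * tau] for i in range(d)])
            for t in range(first, len(series))]


def knn_forecast(series, d, n_neighbours, tau=1):
    """One-step-ahead forecast: average the successors of the nearest delay vectors."""
    vectors = embed(series, d, tau)
    _, query = vectors[-1]          # most recent delay vector
    # Only past vectors whose successor y[t+1] is already observed can be used.
    candidates = vectors[:-1]
    candidates.sort(key=lambda item: math.dist(item[1], query))
    nearest = candidates[:n_neighbours]
    return sum(series[t + 1] for t, _ in nearest) / len(nearest)


# Toy periodic series (hypothetical): the pattern (2, 1) is always followed by 3.
toy_series = [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2]
forecast = knn_forecast(toy_series, d=2, n_neighbours=2)
```

In practice the number of neighbours would be varied between limits and selected on a validation set, in line with the bias-variance trade-off noted in the text.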

Local Models with Clustering Algorithms (LMCA)
The idea behind the application of clustering algorithms in time series analysis is to identify groups of data that share some common characteristics. On each of these groups, the relationships amongst the members are modelled through a single equation model. Consequently, each of the developed models has a different set of parameters. The process is described in the following steps:
1) Selection of the input data for the clustering algorithm. This can contain lagged and/or future characteristics of the series, as well as other relevant information: $C(t) = [y_t, y_{t-k}, x_{t-j}]$. Empirical evidence suggests that the use of the target variable $y_t$ is very useful for discovering unique relationships between input-output features. Additionally, higher quality modelling is ensured with the function approximation since the targets have similar properties and characteristics. However, this comes at the expense of an additional process needed to account for the target being unknown in the prediction stage.
2) Application of a clustering algorithm combined with a validity index or with user-defined parameters, so that $n_{cl}$ clusters will be estimated.
3) Assignment of all patterns from the training set to the $n_{cl}$ clusters. For each of the clusters, a function approximation model is applied, so that $n_{cl}$ forecasts are generated.
Successful application of this method has been reported on the prediction of locational electricity marginal prices [15], the Mackey-Glass and daily electric load peak series [16], the A and D series of the Santa Fe forecasting competition [17] and hourly electric load [18].
In this study, the k-means clustering algorithm was selected [19]. It is a partitioning algorithm that attempts to directly decompose the data set into a set of groups through the iterative optimization of a certain criterion. More specifically, it re-estimates the cluster centres through the minimization of a distance-related function between the data and the cluster centres. The algorithm terminates when the cluster centres stop changing.
The optimal number of clusters is determined using a modified cluster validity index, CVI [20], which is directly related to the determination of the user-defined parameters of the clustering algorithm (here the number of clusters). Two indices are used to estimate the under-partitioning ($U_u$) and over-partitioning ($U_o$) of the data set. $MD_i$ is the mean intra-cluster distance of the i-th cluster, and $d_{min}$ is the minimum distance between cluster centres, which is a measure of inter-cluster separation. The optimum number is found from the minimization of a normalized combinatory expression of these two indices.
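The exact expressions of the modified index are not reproduced here, so the sketch below rests on assumptions: $U_u$ is taken as the average of the $MD_i$ values, $U_o$ as the number of clusters divided by $d_{min}$, and the "normalized combinatory expression" as the sum of the two indices after min-max scaling over the candidate cluster numbers. The k-means routine uses a deterministic farthest-point initialisation purely to keep the example reproducible; all function names are hypothetical.

```python
import math


def kmeans(points, k, max_iter=100):
    """Plain k-means with a deterministic farthest-point initialisation."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(math.dist(p, c) for c in centres)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centres[j])) for p in points]
        if new_labels == labels:   # assignments (hence centres) stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return centres, labels


def partition_indices(points, centres, labels):
    """Assumed forms: U_u = mean of MD_i, U_o = k / d_min (requires k >= 2)."""
    k = len(centres)
    mean_dists = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        total = sum(math.dist(p, centres[j]) for p in members)
        mean_dists.append(total / max(len(members), 1))
    d_min = min(math.dist(centres[a], centres[b])
                for a in range(k) for b in range(a + 1, k))
    return sum(mean_dists) / k, k / d_min


def choose_n_clusters(points, candidates):
    """Pick the k that minimises the sum of the min-max normalised indices."""
    ks = list(candidates)
    raw = [partition_indices(points, *kmeans(points, k)) for k in ks]

    def normalise(values):
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

    under = normalise([u for u, _ in raw])
    over = normalise([o for _, o in raw])
    totals = {k: u + o for k, u, o in zip(ks, under, over)}
    return min(totals, key=totals.get)
```

With three compact, well-separated groups of points, this criterion would be expected to select three clusters: fewer clusters inflate the under-partition term, while more clusters split a compact group and shrink $d_{min}$.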

Hybrid Clustering Algorithm (HCA)
The hybrid clustering algorithm is an iterative procedure that groups data based on their distance from the hyper-plane that best describes their relationship. It is implemented through a series of steps, which are presented below:
1) Determine the most important variables.
2) Form the set of patterns $H(t) = [y_t, y_{t-k}, x_{t-j}]$.
3) Select the number of clusters $n_h$.
4) Initialize the clustering algorithm so that $n_h$ clusters are generated and assign patterns.
5) For each new cluster, apply a linear regression model to $y_t$ using as explanatory variables the remaining elements of $H(t)$.
6) Assign each pattern to a cluster based on its distance from the fitted hyper-planes.
7) Go to 5) unless one of the termination criteria is met.
The following termination criteria are considered: a) the maximum pre-defined number of iterations is reached, and b) all patterns are assigned in 6) to the same cluster as in the previous iteration. The selection of the most important lagged variables in 1) is based on the examination of the correlation coefficients of the data.
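Steps 5) to 7) can be illustrated with the following sketch. It is a simplification under stated assumptions: each pattern has a single explanatory variable (so the hyper-plane reduces to a line), the distance in step 6) is taken as the absolute residual, and the initial assignment of step 4) is supplied by the caller. The names `fit_line` and `hybrid_clustering` are hypothetical.

```python
def fit_line(points):
    """Least-squares (intercept, slope) for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx > 0 else 0.0   # degenerate cluster: flat line
    return mean_y - slope * mean_x, slope


def hybrid_clustering(points, initial_labels, n_clusters, max_iter=50):
    """Alternate between fitting one line per cluster and reassigning patterns."""
    labels = list(initial_labels)
    models = [(0.0, 0.0)] * n_clusters
    for _ in range(max_iter):                      # termination criterion a)
        # Step 5: fit a linear model on the members of each cluster.
        for j in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                models[j] = fit_line(members)
        # Step 6: assign every pattern to the line with the smallest residual.
        new_labels = [
            min(range(n_clusters),
                key=lambda j: abs(y - (models[j][0] + models[j][1] * x)))
            for x, y in points
        ]
        if new_labels == labels:                   # termination criterion b)
            break
        labels = new_labels
    return labels, models


# Toy patterns (hypothetical) lying on two lines: y = 1 + 2x and y = 20 - x.
line_a = [(x, 1 + 2 * x) for x in range(6)]
line_b = [(x, 20 - x) for x in range(6)]
patterns = line_a + line_b
# Imperfect initial assignment: one pattern of each line starts in the wrong cluster.
start = [0, 0, 0, 0, 0, 1] + [0, 1, 1, 1, 1, 1]
final_labels, final_models = hybrid_clustering(patterns, start, n_clusters=2)
```

On this toy set the procedure corrects the two misassigned patterns and ends with one line per cluster. Like k-means, the outcome generally depends on the initial assignment, so several initialisations may be worth trying on real data.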
The proposed clustering algorithm is a complete time series analysis scheme with a dual output. The algorithm generates clusters of data, the common characteristic of which is that they "belong" to the same hyper-plane, and simultaneously estimates a linear model that describes the relationship amongst the members of a cluster. Therefore, a set of $n_h$ linear equations is derived (Equation (6)).
Like any other hybrid model that uses the target variable in the development stage, the model requires a secondary scheme to account for this lack of information in the forecasting phase. For HCA and LMCA, the only requirement is the determination of the cluster, out of the $n_h$ and $n_{cl}$ available respectively, to which each new pattern belongs, which is equivalent to the estimation of the final forecast.
The optimum number of HCA clusters is found from a modified cluster validity criterion. An estimate of the under-partitioning ($U_u$) of the data was formed using the inverse of the average value of the coefficient of determination ($R_i^2$) over all regression models. $U_o$ indicates the over-partitioning of the data set, and $d_{min}$ is the minimum distance between linear models (Equation (7)). The optimum number is found from the minimization of a normalized combinatory expression of these two indices.

Pattern Recognition
A pattern recognition scheme with three alternative approaches was then applied to convert the LMCA and HCA output to the final predictions. The first step was to employ a conventional clustering (k-means) algorithm to identify similar historical patterns in the time series. The second step was to determine $n_{cl}$ / $n_h$ at each time step, using information contained in the data of the respective cluster.
(p1) Select a second data vector, $P_t$, using only historical data.
Apply a k-means clustering algorithm on $P_t$.
(p4) Assign data vectors to each cluster, so that each of the $n_k$ clusters contains $k_m$, $m = 1, \ldots, n_k$, data.
To obtain the final forecasts, the following three alternatives were examined:
(M1) From the members of the k-th cluster, find the most frequent LMCA/HCA cluster, i.e. the $n_{cl}$ / $n_h$ number.
(M2) From the members of the k-th cluster, estimate the final forecast as a weighted average of the LMCA/HCA clusters:

$$\hat{y}_t = \sum_i p_i\, \hat{y}_{t,i} \qquad (9)$$

Here $p_i$ is the percentage of appearances of the i-th LMCA/HCA cluster in the k-th cluster data and $\hat{y}_{t,i}$ is the forecast of the corresponding local model.
The optimal number of clusters for the pattern recognition stage was determined using the modified compactness and separation criterion for the k-means algorithm discussed previously in section "Local Models with Clustering Algorithms".
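The M1 and M2 rules can be sketched as below, assuming that the LMCA/HCA cluster label of every member of the k-th pattern cluster is known and that each local model has already produced its forecast for the current time step. The function names and the toy numbers are hypothetical.

```python
from collections import Counter


def m1_forecast(member_labels, local_forecasts):
    """M1: use the forecast of the most frequent LMCA/HCA cluster."""
    label, _ = Counter(member_labels).most_common(1)[0]
    return local_forecasts[label]


def m2_forecast(member_labels, local_forecasts):
    """M2: weight each local forecast by its share p_i among the members."""
    counts = Counter(member_labels)
    total = sum(counts.values())
    return sum(n / total * local_forecasts[label] for label, n in counts.items())


# Toy case: three of the four members belong to local cluster 0, one to cluster 1.
labels_in_pattern_cluster = [0, 0, 0, 1]
forecasts_per_local_model = {0: 40.0, 1: 60.0}

m1 = m1_forecast(labels_in_pattern_cluster, forecasts_per_local_model)  # 40.0
m2 = m2_forecast(labels_in_pattern_cluster, forecasts_per_local_model)  # 45.0
```

M1 commits to a single local model, whereas M2 hedges across all local models that have appeared in similar historical patterns.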

Data Description and Results
The previously described forecasting methodologies were applied to eight different data sets, both univariate and multivariate. The data sets were hourly PM10 concentration values from the monitoring network in the Greater Athens Area and in the cities of Helsinki and London, spanning different seasons. It should be clarified that meteorological data were available only from the Helsinki station. The results returned by the applied algorithms for each station are discussed separately in the following sections.
In addition to the combined LMCA/HCA-PR methodology, the ideal case of a perfect knowledge of the $n_{cl}$ / $n_h$ parameter is also presented. This indicates the predictive potential, or the least error that the respective methodology could achieve. Also, the base-case persistent approach ($y_t = y_{t-1}$) is presented as a relative criterion for model inter-comparison amongst different data sets. The ability of the models to produce accurate forecasts was judged against the following statistical performance metrics: Root Mean Square Error (RMS), Normalized RMS error (NRMS), Mean Absolute Percentage Error (MAPE) and Index of Agreement (IA).
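Since the metric definitions are not reproduced here, the sketch below assumes the standard textbook forms of the RMS error, the MAPE and Willmott's index of agreement, together with the persistent base case. The NRMS error is left out because its normalisation is not stated; the function names and the toy numbers are illustrative.

```python
import math


def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mape(obs, pred):
    """Mean absolute percentage error, in percent (observations must be non-zero)."""
    return 100.0 * sum(abs(o - p) / abs(o) for o, p in zip(obs, pred)) / len(obs)


def index_of_agreement(obs, pred):
    """Willmott's index of agreement: 1 means perfect agreement."""
    mean_obs = sum(obs) / len(obs)
    num = sum((o - p) ** 2 for o, p in zip(obs, pred))
    den = sum((abs(p - mean_obs) + abs(o - mean_obs)) ** 2 for o, p in zip(obs, pred))
    return 1.0 - num / den


def persistence_forecast(series):
    """Base case y[t] = y[t-1]: returns the forecasts for series[1:]."""
    return series[:-1]


# Toy values (hypothetical), only to show how the functions are called.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
scores = (rmse(observed, predicted),
          mape(observed, predicted),
          index_of_agreement(observed, predicted))
baseline_rmse = rmse(observed[1:], persistence_forecast(observed))
```

Reporting each model's error relative to `baseline_rmse` mirrors the way the persistent approach is used in the text as a common reference across data sets.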

Greater Athens Area - Aristotelous Str.
The selected station from the Greater Athens Area monitoring network was Aristotelous Str. The analysis revealed that the most influential variables were $PM_{t-1}$, $PM_{t-2}$, $PM_{t-24}$, $PM_{t-25}$ and an indicator for the time of the day. This data set was used for the development of all methodologies and as the input set for the pattern recognition scheme. The results in Table 1 indicate that, with the exception of NN, all other conventional approaches demonstrate a reduction of the prediction error by approximately 6% on the basis of the RMS error compared to the base-case persistent method. The difference between LR and ANN was not found to be statistically significant, although the latter was marginally better under all criteria.
The application of the local linear models was able to reduce the predictive error by an order of magnitude, depending on the pattern recognition scheme that was applied. Both LMCA and HCA are capable of reaching considerably lower prediction error, with IA above 0.98, if all $n_{cl}$ / $n_h$ clusters are predicted correctly at each time step. Figure 2 presents a graphical description of the prediction error of the HCA-perfect cluster forecast. The HCA coupled with the M3 scheme returned the overall best prediction error, which was approximately 8% lower than that of the persistent approach.

Greater Helsinki Area - Kallio
The data from the Helsinki monitoring network were from the suburban station of Kallio, with co-ordinates 25°52΄92΄΄ W and 66°75΄47΄΄ N and elevation height of 21 m above sea level. The developed models for the prediction of PM10 values from Helsinki contained meteorological parameters that were identified using a combination of statistical correlation properties and stepwise linear regression, discarding all those that were judged statistically as not significant under Student's t-test. The finally selected parameters and their estimation from the least squares fit are shown in Table 2.
The prediction results (Table 3) demonstrate that the forecasting ability of the conventional models is somewhat similar to that of the base-case persistent approach. The large prediction error of the ANN can be partly explained by the linear nature of the governing process that relates PM10 values to lagged values and by the over-fitting of the applied training scheme. The introduction of the LMCA and HCA localized models coupled with the M3 pattern recognition scheme returned the least overall prediction error, which was approximately 5.5% and 7% lower respectively on the RMS criterion and approximately double that under NRMS. Figure 3 shows the values of the prediction error of the LMCA-M3 modelling approach.

Greater London Area - Bloomsbury
The data from the Greater London Area were from the Bloomsbury station. The stepwise regression with a threshold value for the t-statistic of 1.96, corresponding to the 95% confidence level, revealed as the most significant variables $PM_{t-1}$, $PM_{t-2}$ and $PM_{t-24}$. Additionally, an indicator for the time of the day was utilized. That data set was used for the development of all methodologies and as the input set for the pattern recognition scheme. The analysis of the results (Table 4) indicated that none of the conventional forecasting approaches managed to return consistently lower prediction errors than the base-case persistent approach. The least prediction error was returned by the ANN and was 3.6% lower than that of the persistent approach on the basis of the RMS error.
The developed localized linear model (HCA) has significant forecasting potential, as can be observed in Figure 4, under the assumption of a perfect knowledge of the future cluster in the pattern recognition stage. The percentage improvement over the benchmark persistent approach ranged from 40% to 70%. Similar results were found for the other two data sets.

Discussion
The development and application of accurate models for forecasting PM concentration values in a fast and efficient manner is of primary concern in modern air quality management systems. The applied LR and ANN are nowadays mature approaches that have been integrated in many operational systems and could be used for the benchmarking of novel methodologies. The results of this work showed that, for the majority of the examined data sets, the linear approach marginally outperforms ANN. This indicates that the underlying process could possess predominantly linear characteristics.
The main focus of this work was the development and application of novel localized linear models. These were based on clustering algorithms as a means of identifying similar properties of the time series. The LMCA identified clusters based on their proximity in the embedding space, whereas HCA identified grouped points that were described by the same linear model. As both approaches included the target variable in the model development stage, a pattern recognition scheme was needed to account for this lack of information in the prediction stage. The final prediction model was reached with the use of the modified CVI coupled with a pattern recognition scheme. The results suggested M3 as the most effective choice, because it consistently produced the least prediction error under all metrics. For the RMS and MAPE errors, the improvement over the persistent approach ranged from 3.5% (London) to 7.7% (Athens and Helsinki). This value was almost doubled for NRMS and IA for the respective data sets. The HCA produced the least prediction error on every single examined data set, compared both to conventional approaches and to the LMCA.

Conclusions
This paper introduced the application of localized linear models for forecasting hourly PM10 concentration values using data from the monitoring networks of the cities of Athens, Helsinki and London. The strength of this innovative approach is the use of a clustering algorithm that identifies the finer characteristics and the underlying relationships between the most influential parameters of the examined data set and, subsequently, the development of a customized linear model. The calculated clusters incorporated the target variable in the model development phase, which was beneficial for the development of more coherent localized models. However, in order to overcome this lack of information in the prediction stage, a complementary scheme was required. For the purposes of this study, a pattern recognition scheme based on the concept of weighted average distance (M3) was developed that consistently returned the least error under all examined metrics. The calculated results show that the proposed approach is capable of generating significantly lower prediction error against conventional approaches such as linear regression and neural networks, by at least one order of magnitude.

Figure 2. HCA-perfect cluster forecast for the Aristotelous station (Athens)

Figure 3. Prediction and error with the LMCA-M3 approach

Figure 4. Index of agreement for the HCA-perfect cluster forecast