Feed-Forward Artificial Neural Network Model for Air Pollutant Index Prediction in the Southern Region of Peninsular Malaysia

This paper describes the application of principal component analysis (PCA) and an artificial neural network (ANN) to predict the air pollutant index (API) at seven selected air monitoring stations in the southern region of Peninsular Malaysia, based on a seven-year database (2005-2011). A feed-forward ANN was used as the prediction method. The feed-forward ANN analysis demonstrated that the rotated principal component scores (RPCs) were the best input parameters to predict API. From the four RPCs, only 10 of the 12 prediction variables (CO, O3, PM10, NO2, CH4, NmHC, THC, wind direction, humidity and ambient temperature) were the most significant parameters for predicting API. The results show that the ANN method can be applied successfully as a tool for decision making and problem solving for better atmospheric management.


Introduction
Air pollution has become a major environmental issue throughout the world. In urban areas, air pollution episodes are sudden occurrences of high concentrations of vehicular and industrial exhaust emissions [1]. With rapid economic growth, air pollution has been adversely affecting human health, agricultural crops, animals and ecosystems. It can also damage buildings, monuments and statues. Moreover, it not only reduces visibility but also interferes with aviation.
Rapid industrial development and urbanization in the southern region of Peninsular Malaysia have contributed to high levels of atmospheric pollutants in the environment. Severe air quality problems exist in highly urbanized areas [2]. Mobile, stationary and transboundary sources are the major sources of air pollution in Malaysia [3,4]. Mobile sources, which include motor vehicles, are the main sources of air pollutants in Malaysia [4,5]. The stationary sources within the study area are emissions from urban construction works, quarries, petrochemical plants and power plants [6]. Uncontrolled wildfires, earthquakes and volcanic eruptions in neighbouring countries are examples of transboundary sources affecting the study area [4,7].
Symptoms such as eye and skin irritation, nose and throat irritation, headache, fatigue, dizziness and difficulty in breathing are general health effects experienced by humans due to poor air quality [8]. Worldwide, there are more deaths from poor air quality than from automobile accidents [9]. Particulate matter under 10 microns (PM10) and ground-level ozone (O3) are the pollutants that most influence human health [10,11]. The air pollutant index (API) has been used as an indicator of air quality in Malaysia since 1989 [4,12]. Five criteria air pollutants, namely ground-level ozone (O3), carbon monoxide (CO), nitrogen dioxide (NO2), sulphur dioxide (SO2) and particulate matter under 10 microns (PM10), are used in the API calculation. The highest value of the individual sub-indices is taken as the API value [5,12]. The API values are used to advise or caution the public about health effects [13].
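The rule that the API equals the highest individual sub-index can be illustrated with a minimal sketch. The function name and the sub-index numbers below are hypothetical and serve only as an illustration; the conversion from pollutant concentrations to sub-indices is not covered here.

```python
def api_from_sub_indices(sub_indices):
    """Return the API and the pollutant that determines it.

    sub_indices maps a pollutant name to its sub-index value.
    The API is taken as the highest individual sub-index.
    """
    dominant = max(sub_indices, key=sub_indices.get)
    return sub_indices[dominant], dominant


# Hypothetical sub-index values for one station on one day (not measured data).
example = {"O3": 42, "CO": 18, "NO2": 25, "SO2": 9, "PM10": 63}
api, dominant = api_from_sub_indices(example)
print(api, dominant)
```

In this made-up case PM10 has the highest sub-index, so it alone sets the reported API for the day.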
Air-quality prediction is important for planning, taking proper action and devising control strategies. For this reason, artificial neural networks (ANNs) have been applied for prediction purposes, especially for air quality [14,15]. The ANN can be used to evaluate predictive performance [16][17][18][19][20], and gives better performance compared with other models [21]. Unlike other techniques, the ANN is capable of recognizing non-linear relationships between variables and complex patterns in data sets that are not well described by simple mathematical formulae [22]. The ANN can also be trained accurately when presented with a new data set [23][24][25].
This study aims to classify the predictor variables using the PCA method. It also aims to predict API in the southern region of Peninsular Malaysia using the varimax factor scores generated by the PCA method as input variables in ANN models.

Study Area
In this study, seven air monitoring stations were selected because they are located in areas with industry and high population density that are known as the most developed in Malaysia. Figure 1 and Table 1 show the air-quality monitoring area and describe the sampling stations. The stations were identified based on the availability of data for the seven-year period (2005-2011). The daily traffic density is classified as moderate to high, with peak periods in the morning and evening. No major natural disasters (such as typhoons or volcanic eruptions) occurred in these areas, which keeps air-quality monitoring in the southern region of Peninsular Malaysia under control. Less than 3% of the overall data were missing, and the nearest neighbour method was applied to estimate the missing values [26] based on the endpoints of the gaps using Equation (1):

$$ y = \begin{cases} y_1, & x \le x_1 + \dfrac{x_2 - x_1}{2} \\ y_2, & x > x_1 + \dfrac{x_2 - x_1}{2} \end{cases} \tag{1} $$

where y is the interpolant, x is the time point of the interpolant, y1 and x1 are the coordinates of the starting point of the gap, and y2 and x2 are the coordinates of the endpoint of the gap.
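The nearest neighbour estimation of missing values can be sketched as follows, assuming an equally spaced series in which `None` marks a missing value. The function name is hypothetical, and two choices are assumptions made for this sketch only: a point exactly at the midpoint of a gap takes the starting value, and a gap touching either end of the series copies its single available neighbour.

```python
def nearest_neighbour_fill(series):
    """Fill None gaps with the value of the nearer gap endpoint."""
    filled = list(series)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        x1 = i - 1                      # index of the last observed value before the gap
        x2 = i
        while x2 < n and filled[x2] is None:
            x2 += 1                     # index of the first observed value after the gap
        if x1 < 0 and x2 >= n:
            raise ValueError("series contains no observed values")
        for x in range(i, x2):
            if x1 < 0:                  # gap at the start: only the end value exists
                filled[x] = series[x2]
            elif x2 >= n:               # gap at the end: only the start value exists
                filled[x] = series[x1]
            else:
                midpoint = x1 + (x2 - x1) / 2
                filled[x] = series[x1] if x <= midpoint else series[x2]
        i = x2
    return filled


# Hypothetical PM10 readings with two gaps (not measured data).
pm10 = [50.0, None, None, None, 70.0, None, 64.0]
print(nearest_neighbour_fill(pm10))
```

Because the rule copies an observed value instead of interpolating between the endpoints, it never creates values that were not measured, which suits a data set with only a small share of missing records.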

Principal Component Analysis (PCA)
In this study, PCA was performed to generate the principal components (PCs), which were used as input variables in the API prediction model based on the ANN approach. The PCs can be expressed as Equation (2):

$$ z_{ij} = a_{i1}x_{1j} + a_{i2}x_{2j} + \cdots + a_{im}x_{mj} \tag{2} $$

where z is the component score, a is the component loading, x is the measured value of the variable, i is the component number, j is the sample number, and m is the total number of variables. Because the PCs generated by PCA are not readily interpreted, it is advisable to rotate them using varimax rotation [4,12]. Only the PCs with eigenvalues greater than 1 are considered significant in the varimax rotation analysis [27], which yields new groups of variables (varimax factors, VFs). The number of VFs obtained from varimax rotation is equal to the number of variables in accordance with common features, and the VFs can include unobservable, hypothetical and latent variables [28]. VF coefficients with absolute values greater than 0.75 are selected as having significant factor loadings [29]. The PCA was implemented using the XLSTAT 2013 add-in software.
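The two computational steps described above, retaining components with eigenvalues greater than 1 and forming a component score as a loading-weighted sum of the measured values, can be sketched in a few lines. The function names, eigenvalues, loadings and sample values are hypothetical and chosen only for illustration; the varimax rotation itself is not reproduced in this sketch.

```python
def select_components(eigenvalues, threshold=1.0):
    """Keep components whose eigenvalue exceeds the threshold.

    Returns the kept component indices and the cumulative share of the
    total variance they explain, in percent.
    """
    total = sum(eigenvalues)
    kept = [i for i, ev in enumerate(eigenvalues) if ev > threshold]
    explained = 100.0 * sum(eigenvalues[i] for i in kept) / total
    return kept, explained


def component_score(loadings_i, sample_j):
    """z_ij = a_i1 * x_1j + a_i2 * x_2j + ... + a_im * x_mj."""
    return sum(a * x for a, x in zip(loadings_i, sample_j))


# Hypothetical eigenvalues for four standardized variables (not from Table 2).
kept, explained = select_components([2.0, 1.2, 0.5, 0.3])
print(kept, round(explained, 1))

# Hypothetical loadings of one component and one standardized sample.
z = component_score([0.5, 0.5, -0.5, 0.5], [1.0, 2.0, 1.0, 0.0])
print(z)
```

For standardized variables the eigenvalues sum to the number of variables, so an eigenvalue above 1 means the component explains more variance than any single original variable.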

Artificial Neural Network (ANN)-API Prediction Model
An artificial neural network (ANN) is an information processing unit analogous to the neuron network of a biological system [30]. An ANN is able to learn complex patterns of information and generalize them for prediction, classification and clustering [31]. The ANN is widely known as a method that provides better predictions, with results that depend on the use of a large number of inputs [32]. An ANN can also learn to predict future events based on patterns observed in historical data, classify unseen data into pre-defined groups based on observed characteristics, and cluster data into natural groups based on similarity of characteristics [31].
In this study, a feed-forward ANN (a supervised model) was used for prediction and to determine the most significant parameters affecting API values. This technique transfers information forward only, with no feedback [33]. The model consists of three layers: the input layer, the hidden layer and the output layer. The numbers of nodes in the input layer (independent test set) and the hidden layer were determined by the nature of the research problem and varied depending on the prediction horizon, whereas the output layer (dependent test set) has a single node [34]. A total of 17,885 data sets were used in this analysis. To develop the ANN model, the data were divided into three sets: 60% for training (10,731 data sets), 20% for testing (3577 data sets) and 20% for validation (3577 data sets) [8].
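The sizes of the three subsets follow directly from the 60/20/20 proportions applied to the 17,885 data sets. The sketch below reproduces that arithmetic; the function name is hypothetical, and the random shuffling with a fixed seed is an assumption, since the way records were assigned to each subset is not stated.

```python
import random


def split_indices(n_samples, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle record indices and split them into training, testing and validation sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # assumption: random assignment of records
    n_train = round(n_samples * train_frac)
    n_test = round(n_samples * test_frac)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]  # the remainder goes to validation
    return train, test, validation


train, test, validation = split_indices(17885)
print(len(train), len(test), len(validation))  # 10731 3577 3577
```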
Three feed-forward ANN models were developed with different input variables: Model A (based on the original raw data, twelve parameters), Model B (based on the twelve PCs without varimax rotation) and Model C (based on the factor scores of the varimax-rotated PCs with eigenvalues greater than 1). For Model C, prediction of the API was performed separately using two, three and four rotated principal components (RPCs). The network structure of the feed-forward ANN model is presented in Figure 2.
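The forward-only flow of information through a three-layer network with a single output node can be sketched as below. This is an illustration under stated assumptions, not the network fitted in this study: the tanh hidden activation, the linear output node, the random untrained weights and all function names are assumptions, and the input values are hypothetical. The sizes (four inputs, seven hidden nodes) correspond to the largest Model C sub-model.

```python
import math
import random


def init_network(n_inputs, n_hidden, seed=0):
    """Create untrained weights for an input-hidden-output network."""
    rng = random.Random(seed)
    return {
        "w_hidden": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                     for _ in range(n_hidden)],
        "b_hidden": [0.0] * n_hidden,
        "w_out": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b_out": 0.0,
    }


def forward(network, inputs):
    """Propagate one input vector forward; there are no feedback connections."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)  # assumed activation
        for weights, bias in zip(network["w_hidden"], network["b_hidden"])
    ]
    # Single linear output node (the predicted API in this setting).
    return sum(w * h for w, h in zip(network["w_out"], hidden)) + network["b_out"]


# Four RPC scores as inputs and seven hidden nodes; input values are hypothetical.
net = init_network(n_inputs=4, n_hidden=7)
prediction = forward(net, [0.3, -1.2, 0.8, 0.1])
print(prediction)
```

Training would adjust the weights to minimise the error between predicted and observed API; only the structure of the computation is shown here.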
A trial-and-error procedure with one to twelve nodes in the hidden layer was used to search for the best model for predicting API values, since such a network can approximate any nonlinear function to any level of accuracy. Based on theoretical studies, a network with too few nodes will probably fail to learn the data, while one with too many nodes will overfit the training patterns and generalize poorly, especially when dealing with noisy data in prediction problems [32].
Two criteria were used to evaluate the effectiveness of each network and its ability to make precise predictions [35]: the coefficient of determination (R²) and the root mean square error (RMSE). The R² efficiency criterion, given here in its standard form, is expressed as Equation (3):

$$ R^2 = 1 - \frac{\sum_{i=1}^{n} (x_i - y_i)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{3} $$

while the RMSE is calculated using Equation (4):

$$ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2} \tag{4} $$

where x_i denotes the observed data, y_i the predicted data, \bar{x} the mean of the observed data and n the number of observations. R² represents the proportion of the initial uncertainty that is explained by the model.
The lower the RMSE (ideally RMSE = 0) and the higher the R² (ideally R² = 1), the more accurate the prediction [34]. The predicted values of the ANN models were then compared with each other to obtain a parsimonious model (a model that depends on as few variables as necessary) for API prediction. The ANN models were built using JMP10 software, which is flexible and easy to apply.
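The two evaluation criteria can be computed as in the following sketch. It assumes the standard textbook definitions, with R² taken as one minus the ratio of the residual sum of squares to the total sum of squares about the observed mean; the function names and the observed and predicted values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination (assumed standard form: 1 - SS_res / SS_tot)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((x - y) ** 2 for x, y in zip(observed, predicted))
    ss_tot = sum((x - mean_obs) ** 2 for x in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted API values (not results from this study).
observed = [50.0, 60.0, 70.0, 80.0]
predicted = [52.0, 58.0, 71.0, 79.0]
print(round(rmse(observed, predicted), 3), round(r_squared(observed, predicted), 3))
```

A perfect prediction gives RMSE = 0 and R² = 1, which matches the interpretation of the two criteria given above.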

Predicting of the API Using Feed-Forward ANN Model
From the PCA result, out of the twelve principal components (PCs) generated, only the four PCs with eigenvalues greater than 1 were selected as input parameters for the feed-forward ANN, representing 69.8% of the total variance (Table 2). The loadings of the four rotated PCs (RPCs) are given in Table 3. Ten variables with strong loadings (noted in bold) were included in the four selected RPCs. Table 4 and Figure 3 show the prediction performance of the feed-forward ANN models for forecasting API using different combinations of PC scores as input variables. In Model A (original raw parameters as inputs), the optimum number of neurons in the hidden layer was eight. The R² values for training, testing and validation are 0.694, 0.695 and 0.724, respectively. The RMSE values for training, testing and validation are 7.915, 7.943 and 7.941, respectively.
In Model B (twelve principal component scores as inputs), a three-layer network was used with twelve neurons in the input layer, eleven neurons in the hidden layer and one neuron in the output layer, with the inputs explaining 100% of the variation. The R² values for training, testing and validation are 0.714, 0.749 and 0.736, respectively, and the corresponding RMSE values are 7.574, 7.151 and 7.562.
In Model C, three ANN sub-models were developed. For each sub-model, the optimum number of neurons in the hidden layer was seven. The feed-forward ANN model using the first two RPCs (RPC1 and RPC2) as input neurons does not perform well in the training, testing and validation phases, with the inputs cumulatively explaining only 47.9% of the variation. The R² values for training, testing and validation are 0.270, 0.347 and 0.360, respectively, and the RMSE values are 11.960, 11.637 and 11.553. The second sub-model in Model C uses three RPCs (RPC1, RPC2 and RPC3) as input parameters. The cumulative variance explained by the three RPCs is 61.1%, with R² values of 0.317, 0.399 and 0.372 for training, testing and validation, respectively. The corresponding RMSE values are 11.587, 11.119 and 11.896. Among the three sub-models, the highest accuracy in predicting API is given by the third, which contains four RPCs (69.8% of the variation), with R² values of 0.357, 0.394 and 0.404 and RMSE values of 11.269, 11.163 and 11.494 for training, testing and validation, respectively. Based on the three sub-models, the API prediction performance generally increases with the number of input variables.
From these observations, the prediction performance of the feed-forward ANN model using four RPCs differs significantly from that of the models using the original raw parameters and the twelve PCs. However, the model with four RPCs is a more economical choice because it uses fewer variables (ten parameters) than Model A and Model B (twelve parameters). Model C is therefore far less complex than Model A, which is its advantage over that model. This shows that the feed-forward ANN architecture is able to predict API values from the available inputs with acceptable precision.

Conclusions
In this study, a combination of PCA and ANN methods was used to predict API based on 12 historical air quality parameters. The original raw data were used as the reference predictors. Two further approaches were used to obtain latent variables as feed-forward ANN inputs: un-rotated PCs (the twelve original PCs) and varimax-rotated PCs. The findings show that the feed-forward ANN model using the twelve original raw parameters (Model A) as inputs gives a high value of R². However, Model B (twelve un-rotated PCs) gives better predictions than Model A in terms of R². Using four RPCs, the variables with significant loadings in this study are CH4, NmHC, THC, CO, O3, PM10, NO2, humidity, ambient temperature and wind direction. Although the prediction performance of Model C (the model based on the RPC scores carrying these 10 variables) is lower than that of Model A and Model B, the model can predict the API within acceptable accuracy. This means that models based on rotated PC scores are more efficient because they reduce the number of predictor variables without losing important information. These RPC-ANN models are therefore useful tools for decision making and problem solving for better atmospheric management of the local environment.

Figure 1. Location of the continuous air quality monitoring stations in the southern region of Peninsular Malaysia.

Figure 2. Example of the feed-forward ANN network structure used in this study (input labels: 12 selected raw air quality parameters; principal component scores after varimax rotation; output label: API).

Figure 4. Measured and predicted API for the feed-forward ANN models using 12 original raw parameters, 12 PCs and 4 RPCs: (a) training; (b) testing; (c) validation phase.

Table 1. Detailed description of the air quality monitoring stations in the study area.
vision, Department of Environment Malaysia. The air quality and meteorological parameters used in this study are carbon monoxide (CO), nitrogen dioxide (NO2), methane (CH4), ozone (O3), sulphur dioxide (SO2), non-methane hydrocarbon (NmHC), total hydrocarbon (THC), particulate matter under 10 microns (PM10), wind direction, wind speed, ambient temperature and humidity.