Optimization of Air Quality Monitoring Network Using GIS Based Interpolation Techniques

This paper proposes a simple method of optimizing Air Quality Monitoring Network (AQMN) using Geographical Information System (GIS), interpolation techniques and historical data. Existing air quality stations are systematically eliminated and the missing data are filled in using the most appropriate interpolation technique. The interpolated data are then compared with the observed data. Pre-defined performance measures root mean square error (RMSE), mean absolute percentage error (MAPE) and correlation coefficient (r) were used to check the accuracy of the interpolated data. An algorithm was developed in GIS environment and the process was simulated for several sets of measurements conducted in different locations in Riyadh, Saudi Arabia. This methodology proves to be useful to the decision makers to find optimal numbers of stations that are needed without compromising the coverage of the concentrations across the study area.


Introduction
It is well known that the air pollution causes adverse effects on human health in addition to the impact on environment.Due to rapid urbanization and industrialization, air pollution assumes high significance particularly in large cities.Continuous monitoring of the air pollution with a well-designed air quality monitoring network (AQMN) is the first step in addressing this issue.Obtaining the continuously monitored data to ensure the safe levels of air quality is one of the primary objectives of AQMN, in addition to evaluating exposure hazards and implementing effective control strategies.Environmental protection agencies would be looking for an optimal design of AQMN meeting these objectives with an obvious focus on minimizing cost.
The methodology to design a new AQMN or evaluate an existing AQMN attracted the attention of several researchers.Maximum sensitivity of the collected data [1] [2] and maximum coverage factors such as intensity of emissions, source distance and meteorology [3] were one of the first techniques to design an AQMN.Statistical measure of information content [4] and Fisher's information measure [5] was used to determine the optimum number and location of monitors in a network.The design of an AQMN in the greater London area was conducted by Handscombe and Elson [6] based on the concept of a spatial correlation analysis.The same concept combining with potential of violation was used by Arbeloa et al. [7] to design an optimal air quality monitoring networks.Noll and Mitsutome [8] developed a method to design AQMN based on expected ambient pollutant dosage.This method ranks potential locations by calculating the ratio of a station's expected dosage over the study area's total dosage.Another methodology developed by Nakamori and Sawaragi [9] determines the representative areas of monitor stations in urban areas.A different perspective, based on the use of Shannon information, was initiated with the results of Caselton and Zidek [10], applied in a univariate setup by Sampson and Guttorp [11] and Guttorp et al. [12] and later extended to a multivariate context by Perez-Abreu and Rodriguez [13].The concept of sphere of influence and figure of merit was applied by McElroy et al. [14] to calculate the minimum number of air quality monitoring sites.A simple methodology for siting ambient air monitors at the neighborhood scale was developed by Richard et al. [15] and applied as a case study in Hudson County.
Linear programming approach was also used by many researchers to site optimum AQMN.A multi-attribute utility function method was used for siting the air quality network by Kainuma et al. [16].Trujillo-Ventura and Ellis [17] applied multiple objectives, including spatial coverage, violation detection, data validity, and a weighting method to determine the most suitable network for Tarragona, Spain.Chen et al. [18] developed a multiple objective optimization model for air quality monitoring networks in Taoyuan County, Taiwan.A multiple objective optimization model and procedure for sustainable air quality monitoring networks planning were developed.A holistic approach was adapted for optimal design of air quality monitoring network expansion in an urban area by Mofarrah and Husain [19].In this approach multiple-criterion method with the spatial correlation technique was implemented to design an expanded air quality monitoring network of Riyadh city in Saudi Arabia.Tseng and Ni-Bin [20] proposed a Genetic Algorithm (GA) based compromise programming technique for assessing the relocation of strategy of urban air quality monitoring network with respect to the multi-objective and multi-pollutant design criteria.Silva and Quiroz [21] used Shannon information index to optimize atmospheric monitoring network by excluding the least informative stations with respect to different air pollutants.An optimal design of AQMN was done around a refinery using a mathematical model based on the multiple cell approach with simultaneous multiple pollutants by Elkamel et al. [22].Lu et al. [23] used the principal component analysis (PCA) and cluster analysis to optimize the air quality monitoring network in Hong Kong.PCA and fuzzy c-means clustering was applied for assessment of air quality monitoring in Turkey by Dogruparmak et al. [24].The authors showed that a number of monitoring stations can be decreased using the methodology.Ferradas et al. [25] developed a methodology based on self-organizing map (SOM) artificial neural networks for integrating data about multiple measured pollutants to group monitoring stations according to their similar air quality.The proposed method considered the subsequent geographical mapping of the clusters of stations observed with the SOM, which made it possible to detect geographically different areas that share similar air pollution problems.
The strides that the field Geographical Information System (GIS) and its components (such as interpolation methods) are making as an application in almost every field are incredible.GIS and spatial interpolation techniques were also used in AQMN.Bayraktar et al. [26] used a Kriging-based approach to locate sampling site for assessing the air quality.Long years of the smoke data measurements were used in the determination of the region for the representation by drawing contours through the Kriging method.Then, the selection of the sampling site in this region was done based on the EPA criteria.External drift Kriging of NOx concentrations with dispersion model output was used by Kassteele et al. [27] to reduce the uncertainties in parameter estimation due to the reduction of air quality network.GIS ancillary variables were used to predict volatile organic compound and nitrogen dioxide levels at unmonitored locations [28].The predictive equations were developed by regressing the passive monitor measurements at the 22 monitored schools on land-use variables derived from GIS. Univer-sal Kriging interpolation method was found to perform better than other methods, when a comparison was made between various interpolation methods by developing EU-wide high resolution air pollution maps [29].The best method to model concentrations was selected on the basis of predefined performance measures; correlation coefficient and RMSE.A generalized monotonic regression based on B-splines with an application to air pollution data was proposed by Leitenstorfer and Tutz [30] and spline method was used to estimate the varying health risks from air pollution across Scotland by Duncan [31].
The methods summarized above are very useful, well established and has been implemented widely; however it appears a simple GIS based methodology would further reduce the complexities of AQMN design.The basic advantage of using GIS is that it organizes geographic data in such a way that the decision making process becomes easy.In addition to this, it provides several advanced functionalities to manage statistical and spatial data, interpolate the data to create smooth surface, extract data from the interpolated surface, and create algorithms to automate the process.Furthermore, it creates the results that can be visualized in interactive maps which will further simplify the decision making process.Taking the cue on these advantages, this paper proposes a simple and innovative process to optimize AQMN by using GIS, interpolation methods and the historical data.The existing stations are systematically eliminated by creating several interpolated maps and comparing it with the observed values.The number of stations that can be eliminated is governed by the pre-defined performance measures criteria.In recent times an increasing trend of air pollution has been observed in Riyadh city of Saudi Arabia and there is an emphasis on frequent air pollution measurements [32], hence Riyadh is used as a case study to test the proposed methodology.

Interpolation Methods
Interpolation predicts values for cells in a raster using a limited number of sample data points, which helps in predicting unknown values for any geographic station.Five interpolation methods were selected to estimate the concentrations of air pollution at the unknown stations.The selected methods were 1) Inverse Distance Weighted (IDW); 2) Spline (SPL); 3) Ordinary Kriging (OK); 4) Universal Kriging (UK) and 5) Natural Neighbor (NN).These methods have been widely used in estimating the air pollution concentrations.
The IDW uses a method of interpolation that estimates cell values by averaging the values of sample data points in the neighborhood of each processing cell [33].The closer a point is to the center of the cell being estimated, the more influence or weight it has in the averaging process and has been used in several instances to interpolate the air pollutant concentrations [34]- [36].
The SPL uses an interpolation method that estimates values using a mathematical function that minimizes overall surface curvature, resulting in a smooth surface that passes exactly through the input points.This method was found superior in varying health risk from air pollution [31] and also was recommended by Leitenstorfer and Tutz [30].
Kriging is an advanced geo-statistical procedure that generates an estimated surface from a scattered set of points with z-values.While Kriging is a weighted combination of monitor values, this method also uses spatial auto correlation among data to determine the weights.Generally Kriging has two different forms i.e. ordinary and universal Kriging [37].In the ordinary Kriging the mean value is assumed constant and determined during interpolation and universal Kriging assumes that data follows a known trend.Ordinary and universal Kriging have previously been used with success to model ozone [38] and particles [39] at the local scale, and to model broad scale variations in background air pollution [40].Kriging has also been implemented in several other studies [29] [41] [42].
Natural Neighbor interpolation finds the closest subset of input samples to a query point and applies weights to them based on proportionate areas to interpolate a value.It is also known as Sibson or "area-stealing" interpolation.

Performance Measures
A statistical error is the amount by which an observation differs from its expected value.The statistical indices selected to measure performance are Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE), Nash-Sutcliffe equation (NSE) [43], and Accuracy Factor (ACFT) [44] given by Equations ( 1) to ( 4) respectively.Pearson correlation coefficient (r) was also measured to check the strength and direction of linear relationship between the observed and interpolated values.
( ) where Intr i = Interpolated value; Obs i = Observed value; n = number of observations.RMSE is the frequently used measure of the differences between values predicted by a model or an estimator and the values actually observed.It basically represents the sample standard deviation of the differences between predicted and observed values.RMSE gives important information in predicting the magnitude of a pollutant concentration, a measure close to zero represents good predictions.The absolute mean percentage error denoted by MAPE is calculated by dividing sum of percentage error by number of observations, and a value equal or close to zero is considered ideal.Coefficient of correlation is a measure of linear dependence between two variables, and it was chosen to get an indication of the correspondence of timing and evolution between observed and interpolated concentration values.The coefficient of efficiency (NSE) indicates the normalized fit of the model, the value ranges from −∞ to 1.It compares the mean square error generated by a particular model simulation to the variance of the output sequence; a value of 1 indicates a perfect fit [43].Ross [44] proposed a simple multiplicative factor called accuracy factor (ACFT) representing the spread of the results of the modelled data.A value of one indicates that there is perfect agreement between all the predicted and the modelled values and values larger than one indicates the less accurate average estimate.
Several studies have used error statistics in comparing observed and predicted meteorology and air quality data.RMSE was used by Shad et al. [42] in comparing the observed data with the predicted air pollution using fuzzy genetic linear Kriging.Monteiro et al. [45] used it in bias correction techniques to improve air quality ensemble predictions and Son et al. [46] used in the study of individual exposure to air pollution and lung function in Korea.EU-wide maps were prepared based on the predictor variables and the modelled air pollution concentrations were selected on the predefined performance measures r 2 and RMSE.A value of RMSE < 10 and r 2 > 0.5 were considered a good performance measure [29].Singh et al. [47] used RMSE, ACFT and NSE as performance measures of different linear and nonlinear modeling approaches for predicting urban air quality.Solar radiation and air pollution index were estimated based on linear, exponential and logarithmic models and the similar indices were used to check performance index in China [48].In the current study, RMSE, MAPE and r 2 were primarily used as performance measure; RMSE < 8 MAPE < 25% and r 2 > 0.5 were considered a good measure.

Station Elimination Process
The main objective of this optimization process is to eliminate as many stations as possible and filling in the missing information through the interpolated values.The steps involved in this process are illustrated in Figure 1 and elaborated as follows.
Step 1: Selection of stations As a first step, a single station (P) or set of stations (P 1 , P 2 ...) are selected to be eliminated from the vector dataset used for creating the raster.A particular station or set of stations can be chosen or set of all possible stations tested in a loop.
Step 2: Storing the observed values The observed concentrations at the selected station (P ()) or stations (P 1 (), P 2 () …) are stored as arrays.These values will later be compared with the interpolated values.Step 3: Creating vector layer A vector layer is a coordinate-based data model that represents geographic features, such as points.Each point is the station represented by the geographical coordinates.The layer has a column of "z" values used for interpolation.Several columns of "z" values i.e. the concentration of the pollutants are added to the vector for simulation.In this step the vector layer is created without the selected stations.
Step 4: Creating raster from the vector Raster is defined as a spatial data model that defines space as an array of equally sized cells arranged in rows and columns, and composed of single or multiple bands.Each cell contains an attribute value and location coordinates.Rasters are created using the vector data and applying appropriate interpolation techniques.In this step five rasters are created using IDW, Spline, Ordinary Kriging, Universal Kriging and Natural neighbor methods.
Step 6: Performance measure The process of interpolation and value extraction generates arrays of the interpolated values.In this step, the performance measures are applied to observed and interpolated values.RMSE, Bias, and correlation coefficient are calculated for the selected stations and the five interpolation methods.The interpolation method that generates the minimum performance measure is then chosen and the others are discarded.These measures are compared with the pre-defined threshold limits and if the measures are within the limit, they are stored in the possible station combinations array C ().
Step 7: Repeat for another station combination The process is repeated for another station combination until all the combinations are exhausted.
Step 8: Finding the best possible station combination Best possible station combinations can be chosen from the list of possible station combination array C ().The decision maker can then choose from the list of possible station combinations which can be eliminated from the AQMN.

Study Location and Field Measurements
The proposed methodology is applied to the city of Riyadh, Saudi Arabia.The city is divided into sixteen cells that are identical in area and each cell is 12 km × 12 km.The measurements were carried out intermittently from September 2011 to September 2012.Most of the measurements have been conducted approximately in the center of each cell (Figure 2) with two equipped mobile air quality monitoring stations capable of monitoring meteorological variables, as well as CO, O 3 , NO x , CH 4 , OC, EC and PM2.5.However, the methodology suggested in this study is implemented for the following four criteria pollutants i.e.SO 2 , NO x , O 3 and CO.The type of sensors used with their respective method of monitoring for the pollutants utilized in this study are: NO, NO 2 and NO x -Chemiluminescence; CO-Dual Beam NDIR; O 3 -UV Photometer; and SO 2 -UV Fluorescence.
As the measurements are staggered, in order to get a continuous dataset, 24 datasets were prepared from the available measurements by averaging the hourly measurements for the entire study period for all the 16 stations.These 24 datasets were used for simulation to create the raster with different interpolation techniques and compared with the observed values.ESRI's ArcGIS (ESRI, Redlands) exposes several functions to run the interpolations and extract the necessary data.These functions were customized to run the simulation process through Microsoft Visual Basic for application (VBA) environment in ArcGIS.

Results and Discussion
The pollutant concentration data were collected from 16 stations as shown in Figure 2. It is assumed that a minimum of 8 stations are needed to produce reliable interpolated maps, therefore up to 8 possible elimination scenarios were simulated.Table 1 outlines the simulation details such as the number of stations to be eliminated, possible number of combinations, number of simulations and the approximate time of simulation (on a high end PC).As seen in the table, the station combinations and the computing power required increase with the increase in number of stations.The methodology proposed in this study is tested on 4 pollutants namely O 3 , NO x and   ACFT decreased from 0.944 to 0.775 and 0.960 to 1.138, respectively.In order to limit MAPE within 25%, maximum number of station that could be eliminated were 6 (4, 8, 11, 12, 13, and 15), this also generated a satisfactory RMSE (6.113) and r 2 (0.831).Accuracy factor i.e.ACFT was 1.015 which is considered to be very good performance.The remaining 10 stations that are required to produce satisfactory maps of ozone are shown in Figure 6; these stations could essentially be considered as hotspots of O 3 emissions.The primary sources of ground level O 3 are automobiles, cement and power plants, construction activities and biogenic or natural sources.Small industries such as paint shops, dry cleaners and bakeries are also known to contribute O 3 .These 10 stations fall within these areas.The stations 10 and 7 are located in dense residential area, and the stations 2 and 3 are in industrial area.The agricultural areas are found near station 9, and construction activities are reported near station 14.Station 1 and 16 are the city outskirts where small scale industries are located.

NO x
The simulation results for NO x concentrations are shown in Table 3.For the one station elimination scenario, it      was found that station 11 was the best one with a RMSE = 5.999; MAPE = 3.792 and r 2 = 0.864.These values increased to 14.684, 27.468 and 0.312, respectively, for 8 stations eliminations (Figures 3-5).For MAPE scenario of ≤25, 7 a maximum of 7 stations could be eliminated (3, 5, 6, 7, 11, 14, and 16), however RMSE and r 2 values were not within the predefined limits.Taking the MAPE as priority, a minimum of 9 stations is needed to produce satisfactory NO x concentration maps as shown in Figure 7.In addition to the industrial emissions, the primary source of NO x is from the automobile emissions, these are reasonably evenly distributed over the study area, and hence less number of stations is required to produce the concentration maps.Stations 4 and 8 signify the presence of industries and the other stations are spread over the densely populated areas in the city where the automobiles move predominantly.The ACFT = 0.966 and NSE = 0.341 supports the elimination of the 7 stations.

SO 2
Station 1, which is located outskirt of the City, was the best one station to be eliminated with RMSE = 2.155,  4. To limit the MAPE value to <25, a maximum of 5 stations could be eliminated.
Figure 8 shows the rest of 11 stations required to produce the SO 2 concentration maps.The higher number of stations is needed for this, as the sources of SO 2 are distributed with a wide variation in the concentration.The stations in the south i.e. 2, 3, 7, 8 are predominantly located in the industrial area where power plants and refineries are located.Other stations are in the residential area, where exhaust from automobile could be the source of SO 2 .Station 9 is in the agriculture area and 14 and 16 are located in the automobile workshops and small scale industries zones.

CO
The results of simulations run on CO data are shown in Table 5.Since the concentrations of CO were very small, RMSE values were also very small ranging from 0.144 to 0.329 (Figure 3).MAPE values ranged from 6.608 to 51.586 for one station to eight station elimination as shown in Figure 4.The station 10 was the best single station that could be eliminated and the stations 2, 4, 7, 9, 10, 14, 15, and 16 were the best eight stations.However, in order for the MAPE to be within 25, a maximum of only 5 stations could be eliminated as shown in Table 5.
The NSE, ACFT and r 2 values were satisfactory for this 5 station elimination.Figure 9 shows the rest of 12 stations that are needed to make good CO concentration maps.As shown in the figure, most of the stations are located in industrial zones or the highly densely populated areas.

Overall
As observed from Tables 2-5, there is no single station which is common among the list of possible elimination stations.Table 6 outlines the stations required against each pollutant to achieve MAPE level of 25.This indicates that the sources of the pollutants are highly varied with respect to the location.It will be up to the decision maker to prioritize the pollutant and select the stations.

Conclusions
A simple method of optimizing the AQMN is proposed using GIS, interpolation techniques and historical data.
Existing air quality stations are systematically eliminated and the missing data are filled in using the most appropriate interpolation technique.The interpolated data are then compared with the observed data.Pre-defined performance measures RMSE, MAPE and r 2 were used to check the accuracy of the interpolated data.NSE and ACFT supported the validity of the interpolated data.The process was simulated for several sets of observed data using an algorithm developed in GIS environment.In order to achieve a MAPE value of 25 or less, no combination of station could be eliminated for all the pollutants.The pollutants could be prioritized to achieve the most optimal scenario.The results of the prioritization showed that the most optimal scenario was for the SO 2 stations, which achieved MAPE for O 3 , NO x and CO about 28, 51 and 80, respectively.This methodology proves to be useful to the decision makers to find optimal numbers of stations that are needed without compromising the coverage of the concentrations across the study area.Although it is a simple procedure, it does have few limitations.A continuous set of data is required to get the reliable simulation results, owing to the unavailability of such continuous dataset; the staggered dataset is averaged as hourly data for a day and simulated in present case study.Secondly, the process is computing intensive and hence requires large computing resources, though not very expensive these days.Lastly, more parameters can be included in the performance measures to get the most appropriate results.

Figure 3 .
Figure 3. Variation of RMSE with respect to combination of stations that are eliminated.

Figure 4 .
Figure 4. Variation of MAPE with respect to combination of stations that are eliminated.

Figure 5 .
Figure 5. Variation of r 2 with respect to combination of stations that are eliminated.

Figure 7 .
Figure 7. Stations required to generate NO x concentration maps (RMSE < 15 and MAPE < 25).MAPE = 2.155 and r 2 = 0.944.For eight stations, these parameters increased to 9.962, 69.931 and 0.301 respectively as shown in Table4.To limit the MAPE value to <25, a maximum of 5 stations could be eliminated.Figure8shows the rest of 11 stations required to produce the SO 2 concentration maps.The higher number of stations is needed for this, as the sources of SO 2 are distributed with a wide variation in the concentration.The stations in the south i.e. 2, 3, 7, 8 are predominantly located in the industrial area where power plants and refineries are located.Other stations are in the residential area, where exhaust from automobile could be the source of SO 2 .Station 9 is in the agriculture area and 14 and 16 are located in the automobile workshops and small scale industries zones.

Table 1 .
Simulation parameters. 2 and CO, and the results of the simulations are discussed as follows.The interpolation was performed using five methods i.e.IDW, Spline, OK, UK and NN.The IDW and UK outperformed other methods particularly in terms of producing lower RMSE, MAPE values and higher value of r 2 . SO

Table 2
presents the results of the simulation for O 3 .The values of RMSE, r 2 , MAPE, NSE and ACFT are shown along with the number of stations to be eliminated.The variations of RMSE, MAPE and r 2 with respect to stations are illustrated in Figures 3-5 respectively.The station 11 was the best single station that could be eliminated and the stations 4, 5, 6, 8, 11, 12, 15, and 16 were the best eight stations.RMSE and MAPE increased from 3.224 to 7.018 and 4.041 to 49.61 from one station elimination to eight respectively.Similarly NSE and

Table 2 .
Performance measures for O 3 simulations.

Table 3 .
Performance measures for NO x simulations.

Table 4 .
Tables 7-10 illustrate the statistical parameters taking the priority stations for O 3 , NO x , SO 2 and CO, respectively.For the case of O 3 as priority, MAPE for NO x Performance measures for SO 2 simulations.

Table 5 .
Performance measures for CO simulations.

Table 6 .
Pollutant and the stations needed to achieve MAPE < 25.

Table 7 .
Parameters with priority as O 3 .

Table 8 .
Parameters with priority as NO x .

Table 9 .
Parameters with priority as SO 2 .

Table 10 .
Parameters with priority as CO.CO is about 57 and 82 while SO 2 has a very high MAPE value of over 240.Considering NO x as priority, MAPE value for CO and O 3 were about 73 and 85.In this case also SO 2 has very high MAPE (over 198) as shown in and

Table 8 .
The MAPE value of O 3 and NO x were less than 50 in the case of SO 2 priority stations, while CO was little over 80. Taking CO as priority produced a MAPE value of 31 and 60 for NO x and O 3 while SO 2 produced exorbitantly high value of 327 (Table10).