Bayesian Data Fusion (BDF) of Monitoring Data with a Statistical Groundwater Contamination Model to Map Groundwater Quality at the Regional Scale

Abstract

Groundwater contamination by nitrate within an unconfined sandy aquifer was mapped using a Bayesian Data Fusion (BDF) framework. Groundwater monitoring data was therefore combined with a statistical groundwater contamination model. In a first step, nitrate concentrations, measured at 99 monitoring stations irregularly distributed within the study area, were spatialized using ordinary kriging. Secondly, a statistical regression tree model of nitrate contamination in groundwater was constructed using land use, depth to the water table, altitude and slope as predictor variables. This allowed the construction of a regression tree based contamination map. In a third step, BDF was used to combine optimally the kriged nitrate contamination map with the regression tree based model into one single map, thereby weighing the kriged and regression tree based contamination maps in terms of their estimation uncertainty. It is shown that BDF allows integrating different sources of information about contamination in a final map, allowing quantifying the expected value and variance of the nitrate contamination estimation. It is also shown that the uncertainty in the final map is smaller than the uncertainty from the kriged or regression tree based contamination map.

Share and Cite:

Mattern, S. , Raouafi, W. , Bogaert, P. , Fasbender, D. and Vanclooster, M. (2012) Bayesian Data Fusion (BDF) of Monitoring Data with a Statistical Groundwater Contamination Model to Map Groundwater Quality at the Regional Scale. Journal of Water Resource and Protection, 4, 929-943. doi: 10.4236/jwarp.2012.411109.

1. Introduction

Assessing the quality of groundwater is a prerequisite for designing sustainable water management strategies e.g. when implementing the Water Framework Directive [1]. Groundwater quality is, however, a spatially distributed attribute. It is generally accepted that knowledge of the spatial distribution of this attribute allows the design of site specific protection and remediation measures. Therefore robust and validated techniques are needed to map the spatial distribution of groundwater quality within the groundwater body continuum. Unfortunately, properties related to groundwater quality, such as nitrate concentration, can only directly be measured at the local scale. The mapping of groundwater quality within the water body continuum will therefore often build on data interpolation or prediction of properties related to groundwater quality through deterministic or stochastic models, and the adopted technique will impact the results of the final assessment.

Point data can be spatialized with traditional interpolation tools such as Voronoi tessellation, inverse distance weighting or kriging. References [2,3] for instance used ordinary kriging for mapping the spatial variability of nitrate concentrations in a shallow water body. Reference

[4] used Gaussian simulation techniques to introduce local uncertainty for mapping nitrate concentrations within the groundwater bodies of the Po Valley (Italy). Among modern interpolation techniques, the Bayesian Maximum Entropy (BME) framework was developed recently by [5] as a generalization of classical techniques. Reference [6] used BME as a formal spatial modeling framework allowing introducing the temporal variability of groundwater contamination and the sampling rate in a spatial map.

Unfortunately, the locations where groundwater quality properties are measured are typically scarce and sparsely distributed over space. As a result, large uncertainties about the variable of interest arise in the mapping process, especially at prediction points located far away from monitoring stations. Given that the quality of the interpolation decreases rapidly with the distance to the monitoring station, the mapping of a variable of interest through classical geostatistical interpolation techniques will have limited practical utility in regions where a densely distributed monitoring network is not available. Unfortunately, this is rather the rule than the exception, and therefore alternative robust and validated methods are needed to provide more comprehensive and accurate information about the groundwater quality. Data fusion is such a method that can be used to combine optimally various sources of information about groundwater quality in a consistent and accurate model prediction.

Data fusion techniques have already been used in various application fields, including for example biometrics verification systems (e.g. [7,8]), surveillance systems (e.g. [9]), robotics (e.g. [10]), medical imagery (e.g. [11]) or military/civil engineering (e.g. [12,13]). Data fusion has also important applications in classification of remote sensing images (e.g. [14]) and in environmental modeling [15].

Among different data fusion techniques, a Bayesian Data Fusion (BDF) approach was recently proposed by [15]. It was especially designed for spatial predictions problems and provides a consistent framework of fusing an arbitrary large number of information sources that are related to a same variable of interest in order to provide a unique spatial prediction. The main advantage of a Bayesian approach is to put the problem of data fusion into a clear probabilistic framework. Since this original paper, the general method has been applied to various case studies of environmental sciences; remote sensing [16, 17], hydrology [18,19] or air pollution [20]. Similarly to these applications, the present paper relies on some specific assumptions with the aim of spatial mapping of groundwater quality. To the best of our knowledge, BDF has never been used for mapping groundwater pollution by nitrate.

In this paper, we used the BDF approach as a formal framework to map groundwater contamination by nitrates. Groundwater contamination by nitrate remains a critical water quality issue for many groundwater bodies all over the world. The approach was illustrated for mapping groundwater contamination in the water body of the Brusselian sands located in an unconfined sandy aquifer in the centre of Belgium. This water body is considerably affected by nitrate pollution.

Groundwater nitrate concentrations were measured at different monitoring stations irregularly distributed in the study area. In a first step, these point data were interpolated on a grid covering the whole study area by a classical (ordinary kriging) geostatistical technique. With this interpolation method, the uncertainty on the prediction depends partly on the respective geometry of the monitoring station and the estimation point. As a consequence, the uncertainty on the predictions is larger in parts of the study area where data are scarce, making these estimates useless at some points on the map. Secondly, nitrate concentrations were predicted by a statistical model all over the study area. With this model, the prediction uncertainty depends on the uncertainty of underlying predictor variables such as depth to the water table, land use or altitude, along with the model uncertainty. In a third step, nitrate concentrations estimated by the interpolation method were combined with those obtained from the statistical model into a single prediction. In this step, BDF is used. The BDF prediction together with its estimated uncertainty was further spatially mapped.

2. Study Area and Data

This study focuses on the unconfined sandy groundwater body located in the Brusselian aquifer in the center part of Belgium. The aquifer has a surface area of 965 km2 and is of primary importance for drinking water supply. This unconfined aquifer is located in Tertiary sands and is overlaid by a Quaternary loess layer of variable thickness (0 to 15 m). The Brusselian sands outcrop mainly in the valleys where sandy and sandy loam soils develop. Transmissivity of the aquifer varies from 2.9 × 10–5 to 1.2 × 10–2 m2/s and its permeability varies from 1.4 × 10–6 to 6 × 10–3 m/s [21].

This aquifer is characterized by both the presence of intensive arable cropping and intense urban pressure. The 1:10.000 land use map with 65 land use classes of the Walloon Region was provided by the regional administration and depicts the situation of 2005 [22]. The land use is highly fragmented. Typical land uses are urban (generally located in the valleys; about 17% of land use in the study area), grassland and forests (found on valley slopes; about 13% and 10%, respectively), and arable land, mainly wheat, sugar beet, maize and barley (found on loamy soils on the plateau; about 51% of land use).

The depth to the water table was calculated by subtracting the piezometry value from a 30 m resolution digital elevation model (DEM) which was furnished by the regional administration. The piezometry map was calculated by interpolating the water table levels measured in 1984 using ordinary kriging. The calculation of depth to the water table varies from 0 to more than 45 m with a mean value of 10.5 m.

The groundwater nitrate concentration data used in this study was recently collected over 99 monitoring stations in January and February 2009. These monitoring stations are wells, galleries, drains and springs. A one-way analysis of variance was performed for comparing the nitrate concentrations measured in the different types of monitoring stations and no significant differences were detected in the mean concentrations of the different groups at a significance level of 0.05. The nitrate concentrations show a wide spatial variability in the study area, with values going from 6.9 up to 93.4 mg NO3/L. The regional mean and standard deviation are respectively equal to 45.6 and 17.2 mg NO3/L (Table 1). A Lilliefors test was performed using the function implemented in the Statistics ToolboxTM of MatlabTM. This test confirmed the Gaussian shape of the distribution of the measured nitrate concentrations at a significance level of 0.01.

Table 1. Descriptive statistics of the measured nitrate concentrations.

3. Tools and Methods

3.1. Kriging

Kriging is a group of stochastic prediction techniques widely used in geostatistics to interpolate the value of a random field (e.g., the groundwater nitrate concentration) at an unobserved location, based on a linear combination of observed values at nearby locations. Kriging incorporates the spatial dependence of the data in its estimation process through a variogram or a covariance function. The variogram function yields the average dissimilarity between locations separated by different intervals of distances. Kriging is known to provide a linear predictor that corresponds to the Best Linear Unbiased Predictor (BLUP) in the least squares sense. Additionally, it is the best possible predictor when random field is assumed to be multivariate Gaussian. Ordinary kriging is the most commonly used type of kriging. It assumes a constant but unknown mean and enough observations to reliably estimate the variogram.

3.2. Regression Tree

Regression tree modeling is an explanatory technique relying on a process known as binary recursive partitioning. Regression trees became popular in environmental sciences in the early nineties (e.g. [23-25]). The algorithm identifies which of the variables explains most of the variance in the response variable, then determines the threshold value of the explanatory variable that best partitions the variance in the response such that it minimizes the sum of the squared deviations from the mean in the two groups. The process is repeated for each new branch until there is no residual explanatory power, according to the limitations imposed by the user. Suppose that predicttor variables x1, x2, ···, xN and the response variable y are organized as columns in a table. The database table is sorted by the column of the first variable (x1). Then the table is split into two parts (called left and right branch) until the number of samples in the branch to split is lower than a user defined threshold. For each possible split position, the two partitions are compared based on the reduction in non-homogeneity

(1)

that they provide. The non-homogeneity in a group of samples is measured by computing deviations and is defined as

(2)

where is the mean value across all observations yi. Each partition generates left (DL) and right (DR) deviance values. This process is repeated for each of the N predicttor variables. The partition that maximizes the change in deviance ΔD is the partition to choose. Each of the branches obtained after partitioning is partitioned again using the same method.

Regression trees are often used to see whether complex interactions between explanatory variables exists and to identify which one of the predictors have the most important effect on the dependent variable [26]. Major advantages of regression tree models are that 1) they are nonparametric and, hence, Gaussian distribution assumption of predictor variables does not need to be satisfied, 2) they can incorporate categorical data, and 3) they allow possibly complex interactions between the predictor variables to be represented without assumptions of linearity. Furthermore, while multiple linear regression identifies global relationships in the data set, regression tree are able to identify local relationships [27].

In this study, a regression tree was developed for predicting nitrate concentrations as a function of the predicttors listed in Table 2. The regression tree model was

Table 2. Description of the variables used in the regression tree model.

applied to the dataset to predict nitrate concentrations all over the study area at the nodes of a regular 50 m × 50 m grid.

Regression tree algorithms are implemented in various software packages such as MatlabTM, SASTM, R. The “classregtree” function implemented in the Matlab Statistical ToolboxTM was used in this study.

3.3. Bayesian Data Fusion (BDF)

The Bayesian data fusion approach relies on the hypothesis that one can decompose the spatial component (i.e. the prior distribution) from the other information sources (i.e. the likelihood function). Subsequently, one can assume that, conditionally to the true underlying variable, the different information sources are independent. In the Bayesian framework, this implies directly that the likelihood function can be decomposed in a product of conditional distributions. The Bayes’s theorem can thus be applied a second time in order to express the posterior distribution of a given variable Z0 given the other secondary variables Yi either as a function of the conditional distribution or. It is beyond the scope of this paper to present the whole underlying theory of BDF. For more details about the general description of the approach, the reader can consult [15,18] for a specific implementation of the approach.

The general BDF equations can be simplified to simple analytical formulas when it is assumed that the distributions of errors obtained by kriging and by the statistical model are Gaussian [18]. The final predicted mean nitrate concentration μP and its variance are then given by

(3)

where μk, μm and μ0 are the means associated with the kriging prediction, the statistical model prediction and a “rough” estimation of the local mean obtained from the inverse distance method, respectively, and where, and are the variances associated with the kriging prediction (defined as the variance of prediction Var(Zest,0 – Z0), where Zest,0 is the predicted value and Z0 the observed value), the variance associated with the statistical model prediction (defined as the variance of the data at the end of each branch of the regression tree) and the sill of the semi-variogram, respectively.

3.4. Comparison of Methods

The presented interpolation methods yield different cartographic results because of their inherent characteristics. Each map is subject to prediction errors. To elucidate those errors and to illustrate the estimation accuracy, a “leave-one-out” cross-validation approach was performed. The following indicators were chosen for accuracy assessment:

The root mean squared error (RMSE):

(4)

The mean absolute error (MAE):

(5)

The mean error (ME):

(6)

The coefficient of determination (R2):

(7)

where n is the number of observations, is the mean value across all observations yi and ŷi are the predicted values.

It is worth noting that these quality indicators only reflect the general regional accuracy of the predictions. Local improvements of the mapping process must therefore be observed directly on the maps of uncertainties.

All calculations were performed in MatlabTM R2008b, using the geostatistical BMElib package [5]. Data preparation was carried out in ArcGISTM 9.3.

4. Results

4.1. Kriging

The first presented mapping method relies on the interpolation of the measured nitrate concentrations through ordinary kriging. An experimental semi-variogram was modeled as the sum of a nugget effect of 67.2 mg2/L2 and an exponential model with total theoretical variance and range of 332.9 mg2/L2 and 12090.8 m, respectively (Figure 1). About 66.5% of the study area has kriged nitrate concentrations lower than the standard of 50 mg/L, while nitrate concentrations exceeding 80 mg/L are observed in some parts of the study area (Figure 2(a)). The predicttion errors of the kriged concentration is minimum at the location of the monitoring stations and raises up to 20 mg/L were the density of the monitoring stations is the lowest (Figure 2(b)).

Kriged nitrate concentrations were compared to measured nitrate concentrations by a “leave-one-out” cross validation (Figure 3). The data roughly scatter around the 1:1 line. This bad prediction quality is partly due to

Figure 1. Semi-variogram of the nitrate concentrations in the Brusselian sands estimated on the basis of the dataset of January and February 2009. Dots represent the experimental semi-variogram (together with the number of data pairs on each lag interval), plain line represents the fitted semi-variogram.

(a)(b)

Figure 2. Map of kriged groundwater nitrate concentration (a) and associated prediction error (b). The monitoring stations are symbolized by circles on (a).

Figure 3. Comparison of the kriged nitrate concentrations to the measured nitrate concentrations. The plain line represents the 1:1 line.

the very low density of the sampling network in some parts of the study area. As a consequence, when the “leave-one-out” cross validation process removes these isolated points from the dataset, the prediction solely relies on data located far away from that point, thus resulting in an inaccurate prediction and a global poor coefficient of determination. Furthermore, the presence of a nugget effect of about 20% of the total variance is also responsible for poor prediction results.

4.2. Regression Tree

4.2.1. Development of the Regression Tree

Nitrate concentrations measured at the 99 monitoring stations in January and February 2009 were related to the 13 variables listed in Table 2 through a regression tree model (Figure 4(b)). As shown by [28] such models show highly complex interaction patterns, suggesting that it is a complex combination of variables that explains ob-served pollution levels, rather than single explicative variables.

Figure 4(a), which represents a simplified version of the regression tree with only the upper part of the tree,

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] “Directive 2000/60/EC Establishing a Framework for Community Action in the Field of Water Policy,” The European Parliament and Council Official Journal of the European Communities, L327/1, 2000.
[2] K. Hu, Y. Huang, H. Li, B. Li, D. Chen and R. E. White, “Spatial Variability of Shallow Groundwater Level, Electrical Conductivity and Nitrate Concentration, and Risk Assessment of Nitrate Contamination in North China Plain,” Environment International, Vol. 31, No. 6, 2005, pp. 896-903. doi:10.1016/j.envint.2005.05.028
[3] S. Fetouani, M. Sbaa, M. Vanclooster and B. Bendra, “Assessing Ground Water Quality in the Irrigated Plain of Triffa (North-East Morocco),” Agricultural Water Management, Vol. 95, No. 2, 2008, pp. 133-142. doi:10.1016/j.agwat.2007.09.009
[4] S. Cinnirella, G. Buttafuoco and N. Pirrone, “Stochastic Analysis to Assess the Spatial Distribution of Groundwater Nitrate Concentrations in the Po Catchment (Italy),” Environmental Pollution, Vol. 133, No. 3, 2005, pp. 569- 580. doi:10.1016/j.envpol.2004.06.020
[5] G. Christakos, P. Bogaert and M. Serre, “Temporal GIS. Advanced Functions for Field-Based Applications,” Springer, New York, 2002.
[6] S. Mattern, P. Bogaert and M. Vanclooster, “Introducing Time Variability and Sampling Rate in the Mapping of Groundwater Contamination by Means of the Bayesian Maximum Entropy (BME) Method,” In: L. Candela, I. Vadillo and F. J. Elor, Eds., Advances in Subsurface Pollution of Porous Media—Indicators, Processes and Modelling: IAH Selected Papers, Volume 14 (IAH—Selected Papers on Hydrogeology), Taylor and Francis, 2008, pp. 53-68.
[7] B. Duc, E. S. Bigün, J. Bigün, G. Ma?tre and S. Fischer, “Fusion of Audio and Video Information for Multi Modal Person Authentication,” Pattern Recognition Letters, Vol. 18, No. 9, 1997, pp. 835-843. doi:10.1016/S0167-8655(97)00071-8
[8] A. Ross and A. Jain, “Information Fusion in Biometrics,” Pattern Recognition Letters, Vol. 24, No. 13, 2003, pp. 2115-2125. doi:10.1016/S0167-8655(03)00079-5
[9] G. D. Jones, R. E. Allsop and J. H. Gilby, “Bayesian Analysis for Fusion of Data from Disparate Imaging Systems for Surveillance,” Image and Vision Computing, Vol. 21, No. 10, 2003, pp. 843-849. doi:10.1016/S0262-8856(03)00071-4
[10] F. Cremer, K. Schutte, J. G. M. Schavemaker and E. den Breejen, “A Comparison of Decision-Level Sensor-Fusion Methods for Anti-Personnel Landmine Detection,” Information Fusion, Vol. 2, No. 3, 2001, pp. 187-208. doi:10.1016/S1566-2535(01)00034-3
[11] X. B. Song, Y. Abu-Mostafa, J. Sill, H. Kasdan and M. Pavel, “Robust Image Recognition by Fusion of Contextual Information,” Information Fusion, Vol. 3, No. 4, 2002, pp. 277-287. doi:10.1016/S1566-2535(02)00092-1
[12] X. E. Gros, J. Bousigue and K. Takahashi, “NDT Data Fusion at Pixel Level,” NDT & E International, Vol. 32, No. 5, 1999, pp. 283-292. doi:10.1016/S0963-8695(98)00056-5
[13] S. Y. Sohn and S. H. Lee, “Data Fusion, Ensemble and Clustering to Improve the Classification Accuracy for the Severity of Road Traffic Accidents in Korea,” Safety Science, Vol. 41, No. 1, 2003, pp. 1-14. doi:10.1016/S0925-7535(01)00032-7
[14] G. Simone, A. Farina, F. C. Morabito, S. B. Serpico and L. Bruzzone, “Image Fusion Techniques for Remote Sensing Applications,” Information Fusion, Vol. 3, No. 1, 2002, pp. 3-15. doi:10.1016/S1566-2535(01)00056-2
[15] P. Bogaert and D. Fasbender, “Bayesian Data Fusion in a Spatial Prediction Context: A General Formulation,” Stochastic Environmental Research and Risk Assessment, Vol. 21, No. 6, 2007, pp. 695-709. doi:10.1007/s00477-006-0080-3
[16] D. Fasbender, J. Radoux and P. Bogaert, “Bayesian Data Fusion for Adaptable Image Pansharpening,” IEEE Transactions on Geoscience and Remote Sensing, Vol. 46, No. 6, 2008, pp. 1847-1857. doi:10.1109/TGRS.2008.917131
[17] D. Fasbender, D. Tuia, P. Bogaert and M. Kanevski, “Support-Based Implementation of Bayesian Data Fusion for Spatial Enhancement: Applications to ASTER Thermal Images,” IEEE Geoscience and Remote Sensing Let- ters, Vol. 5, No. 4, 2008, pp. 598-602. doi:10.1109/LGRS.2008.2000739
[18] D. Fasbender, L. Peeters, P. Bogaert and A. Dassargues, “Bayesian Data Fusion Applied to Water Table Spatial Mapping,” Water Resources Research, Vol. 44, No. 12, 2008, Article ID: W12422. doi:10.1029/2008WR006921
[19] L. Peeters, D. Fasbender, O. Batelaan and A. Dassargues, “Bayesian Data Fusion for Water Table Interpolation: Incorporating a Hydrogeological Conceptual Model in Kriging,” Water Resources Research, Vol. 46, 2010, pp. 8532-8532. doi:10.1029/2009WR008353
[20] D. Fasbender, O. Brasseur and P. Bogaert, “Bayesian Data Fusion for Space-Time Prediction of Air Pollutants: The Case of NO2 in Belgium,” Atmospheric Environment, Vol. 43, No. 30, 2009, pp. 4632-4645. doi:10.1016/j.atmosenv.2009.05.036
[21] IBW—Intercommunale du Brabant Wallon, “étude des Ressources en Eau du Brabant Wallon,” Contrat Région Wallonne, 1987.
[22] PCNOSW, Projet de Cartographie Numérique de L’occupation du Sol de Wallonie,“Projet Notifié par le Gouver- nement Wallon” Faculté Universitaire des Sciences Agronomiques de Gembloux, 2005.
[23] F. A. Baker, D. L. Verbyla, C. S. Hodges and E. W. Ross, “Classification and Regression Tree Analysis for Assessing Hazard of Pine Mortality Caused by Heterobasidion Annosum,” Plant Disease, Vol. 77, No. 2, 1993, pp. 136- 139. doi:10.1094/PD-77-0136
[24] B. G. Lees and K. Ritman, “Decision-Tree and Rule-Induction Approach to Integration of Remotely Sensed and GIS Data in Mapping Vegetation in Disturbed or Hilly Environments,” Environmental Management, Vol. 15, No. 6, 1991, pp. 823-831. doi:10.1007/BF02394820
[25] H. A. J. Vanlanen, C. A. Vandiepen, G. J. Reinds and G. H. J. Dekoning, “A Comparison of Qualitative and Quantitative Physical Land Evaluations, Using an Assessment of the Potential for Sugar-Beet Growth in the European Community,” Soil Use and Management, Vol. 8, No. 2, 1992, pp. 80-89. doi:10.1111/j.1475-2743.1992.tb00899.x
[26] M. J. Crawley, “Statistics, an Introduction Using R,” Wiley, New York, 2005.
[27] J. J. Rothwell, M. N. Futter and N. B. Dise, “A Classi- fication and Regression Tree Model of Controls on Dissolved Inorganic Nitrogen Leaching from European For- ests,” Environmental Pollution, Vol. 156, No. 2, 2008, pp. 544-552. doi:10.1016/j.envpol.2008.01.007
[28] S. Mattern, D. Fasbender and M. Vanclooster, “Discriminating Sources of Nitrate Pollution in an Unconfined Sandy Aquifer,” Journal of Hydrology, Vol. 376, No. 1-2, 2009, pp. 275-284. doi:10.1016/j.jhydrol.2009.07.039

Copyright © 2024 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.