Evaluation of Unknown Groundwater Contaminant Sources Characterization Efficiency under Hydrogeologic Uncertainty in an Experimental Aquifer Site by Utilizing Surrogate Models

Characterization of unknown groundwater contaminant sources is an important but difficult step in effective groundwater management. The difficulties arise mainly due to the time of contaminant detection which usually happens a long time after the start of contaminant source(s) activities. Usually, limited information is available which also can be erroneous. This study utilizes Self-Organizing Map (SOM) and Gaussian Process Regression (GPR) algorithms to develop surrogate models that can approximate the complex flow and transport processes in a contaminated aquifer. The important feature of these developed surrogate models is that unlike the previous methods, they can be applied independently of any linked optimization model solution for characterizing of unknown groundwater contaminant sources. The performance of the developed surrogate models is evaluated for source characterization in an experimental contaminated aquifer site within the heterogeneous sand aquifer, located at the Botany Basin, New South Wales, Australia. In this study, the measured contaminant concentrations and hydraulic conductivity values are assumed to contain random errors. Simulated responses of the aquifer to randomly specified contamination stresses as simulated by using a three-dimensional numerical simulation model are utilized for initial training of the surrogate models. The performance evaluation results obtained by using different surrogate models are also compared. The evaluation results demonstrate the different capabilities of the developed surrogate models. How to cite this paper: Hazrati-Yadkoori, S. and Datta, B. (2017) Evaluation of Unknown Groundwater Contaminant Sources Characterization Efficiency under Hydrogeologic Uncertainty in an Experimental Aquifer Site by Utilizing Surrogate Models. Journal of Water Resource and Protection, 9, 1612-1633. https://doi.org/10.4236/jwarp.2017.913101 Received: November 29, 2017 Accepted: December 25, 2017 Published: December 28, 2017 Copyright © 2017 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY 4.0). http://creativecommons.org/licenses/by/4.0/ Open Access S. Hazrati-Yadkoori, B. Datta DOI: 10.4236/jwarp.2017.913101 1613 Journal of Water Resource and Protection These capabilities lead to development of an efficient methodology for source characterization based on utilizing the trained and tested surrogate models in an inverse mode. The obtained results are satisfactory and show the potential applicability of the SOM and GPR-based surrogate models for unknown groundwater contaminant source characterization in an inverse mode.


Introduction
Groundwater is a valuable natural resource and its consumption has increased over the years. As a result, the environmental problems associated with groundwater have increased due to widespread improper and unplanned groundwater management worldwide. Groundwater contamination in an aquifer becomes more difficult to remedy as the contamination spreads. The challenge arises due to insufficient information regarding the contaminated aquifers and especially, often lack of knowledge regarding the sources of contamination and its history of activity. Usually, the contaminations are accidentally detected long time after the first contaminant source activities started. As a result, limited and sparse data are available and generally several contaminant sources are considered as the potential contaminant sources. Therefore, developing an efficient methodology for source characterization is essential.
The most frequently applied methodology for source characterization is linked simulation-optimization approach. This approach consists of numerical simulation models and optimization models, with the linked simulation model embedded or implicitly embedded within the optimization model [1]. The main drawback of this approach is that its applications in real-world cases are computationally very intensive. To overcome this drawback, simulation models are replaced by surrogate models to develop Surrogate Models based Optimization  [2]. In the first group, some of the proposed methodologies are random walk particle method [3], the Minimum Relative Entropy (MRE) [4], Tikhonov Regularization (TR) [5], the Backward Beam Equation (BBE) [6]. These developed methodologies applied mostly to characterize one or two-dimensional homogeneous contaminated study areas. In these studies, usually, the contaminant source(s) also considered being a single contaminant source. Their evaluation results also demonstrated that the applied methodologies can be effective in the presence of sufficient and accurate measurements data.
In the second group, consisting of the embedding technique, response matrix and linked simulation-optimization approaches were utilized to incorporate simulation models with optimization models [7]. For example, the embedded technique was used in [8]. They utilized least square regression and linear programming technique which each of them combined with groundwater solute transport simulation models for source identification. A nonlinear optimization model incorporating finite difference discretized governing equations of groundwater flow and transport for unknown contamination source characterization was utilized in [1] [9] and [10]. The embedded techniques have some limitations. For instance, for obtaining the optimal solutions, repeated solutions of the set of discretized groundwater and transport governing equations are required. As a result, these procedures are computationally intensive and inefficient. The main disadvantages of the response matrix approach are that the approach needs relatively large information of the aquifer system and the aquifer responses are generally assumed linear. The approach is also highly sensitive to measurement errors [1] [7] and [11]. The linked simulation-optimization approach which is the most commonly used approach is externally linked the groundwater flow and transport simulation models with an optimization model. Some of the prominent techniques which were utilized in this procedure are: Genetic Algorithm (GA) [12] [13], the Artificial Neural Network (ANN) [14] [15], Simulated Annealing (SA) [16] [17] [18] and [19] and Adaptive Simulated Annealing (ASA) [20], ASA in conjunction with uncertainty modelling [21]. The main advantages of the linked simulation-optimization approach compared to Journal of Water Resource and Protection the other ones are: 1. in this approach, some complex groundwater flow and transport simulation models such as MODFLOW and MT3DMS can be used, and 2. the number of decision variables of the optimization model can be decreased in this approach by eliminating the embedded equations as binding constraints [1], so the solutions can be easier and less intensive in terms of feasibility. However, the main disadvantage of the developed linked simulation-optimization approaches is their computational times which are very high. For example, for solving a real-world case, they may need several days of the iterative solution.
In this study, collected field data from an experimental aquifer site located in the Botany Basin aquifer, New South Wales, Australia are used to evaluate the performance of the developed methodology. The hydrogeologic characteristics of this experimental site are investigated through a few tests [22]. As a result, some measured values for hydraulic conductivity and contaminant concentrations are available. The performance evaluation results at this experimental site demonstrate the potential applicability of the developed surrogate models for source characterization in terms of contaminant source location, magnitudes, and release history.

Groundwater Flow and Transport Simulation Models
The MODFLOW [23] and MT3DMS [24] are groundwater flow and transport numerical simulation codes utilized in this study. MODFLOW [23] which is a finite-difference based groundwater flow model is utilized for numerical flow simulation. The general governing equation of the groundwater flow through porous media is described by Equation (1) [23].
where, , xx yy K K and zz K are the hydraulic conductivity values along the x, y, and z coordinate axes (L/T), h is the potentiometric head (L), S S is the specific storage of the porous media (L −1 ), t is time (T) and W is a volumetric flux per unit volume from aquifer as sources (sinks); the negative value represents withdrawal of the groundwater system and vice versa (T −1 ).
The MT3DMS [24] is the numerical mass transport simulation model used in this study. This model has the capability of simulating the advection, dispersion, and chemical reaction processes of the groundwater contaminants transport.
The governing equation of this model is described in Equation (2) [24]: where, θ is the subsurface porous media porosity (dimensionless), k C is the

Developing Surrogate Models for Source Characterization
Generally, implementation of the simulation models for real-world cases is complex and extensively time-consuming. Therefore, to decrease the high computational cost of the complex simulation models, these computationally intensive simulation models have been replaced by response surface methodologies. It is supposed that by accurately constructing these models, the behavior of more sophisticated simulation models can be approximately emulated with much reduced computational time [25]. Several types of surrogate models have been developed based on Kriging, ANN, MARS and Gaussian Process (GP) as approximate simulators of the physical processes [26]. Surrogate Models based Optimization (SMO) is one of the popular surrogate models which has been suggested to reduce computational burden. This approach replaces the computationally intensive simulation models with a cheaper to run trained surrogate model. Therefore, for obtaining global optimal solution there is no need to run the computational intensive simulation models tens of thousands times [27].

Self-Organizing Map
The Self-Organizing Map (SOM) is an unsupervised learning method that was introduced by T. Kohonen in 1982 [29]. This algorithm is widely utilized to visualize and cluster multidimensional data due to its easy implementation. The main features of this algorithm are its capability of transforming complex non-linear high-dimensional data space into a simple geometric relationship ( Figure 2(a)).Usually, SOM results are represented in two dimensions by preserving the topological structure of the input data [29] and [30]. The main processes of SOM algorithm can be summarized as initialization, competition, cooperation, and adaptation which are briefly described at below. The more detailed information of this algorithm can be found in [31].
1) Initialization: a group of high-dimensional inputs data is quantized by a few weight vectors to a discrete space usually two-dimensional grid [32] and [33]. If X is an m-dimensional continues input data pattern { } 1 2 , , , m X x x x = , these data are mapped to output neurons which usually is a two-dimensional discrete space by the weight matrix { } 1 2 , , , , where m is the size of the input data and 1, , j n = , which n defines the number of the output space neurons.
2) Competition: for each random sample of input space, the output neurons BMU command in SOM algorithm by searching to find the most similar output neuron to the input vector can be used for finding missing values of an input vector (Figure 2(b)). This command in this study is used to characterize unknown groundwater contaminant sources [31].
3) Cooperation: once the winner neuron is obtained, the weight vector of the winning neuron and all other neurons are updated according to Equation (4) to minimize the local error [32].
where ( ) t η : is the learning rate at iteration t; and ( ) , K j t is a suitable neighborhood function. This neighborhood function has the responsibility of preserving topological of input data [32].

4) Adaptation:
The weight adjusting is repeated until a stable map is obtained or the map is converged [33].
Moreover, SOM Map quality could be assessed by various methods. In this study, Quantization Error (QE) which is a widely used criterion for evaluation of SOM Maps is utilized. The QE gradually decreases with increasing map sizes.
The earlier studies indicate that the suitable number of neurons have an essential role in the accuracy and performance of the SOM algorithm [30]. The "SOM Toolbox for Matlab 5" is used in this study for constructing the SOM-based surrogate model [34].

Gaussian Process Regression
Gaussian Process Regression (GPR) models are nonparametric kernel-based probabilistic models. These models are flexible nonlinear interpolating techniques which are based on the training data [35]. This technique is capable of exploring unknown functions of multi-dimensional data which maps inputs data to output data (explore their interactions) [36]. This technique is capable of approximating any multi-dimensional data [37]. These capabilities make GPR a popular and widely used surrogate models technique. The GPR models are defined by two functions: mean function ( ) m X and covariance function ( ) , k X X′ , these functions can be described as [27]: The mean function represents the expected function value for input X [37].
The covariance function models the interactions between the function values at Journal of Water Resource and Protection different inputs points Xand X ′ [36]. And a GP model can be written as [27]:

Site Description, Eastlake Experimental Site
The performance of the developed methodology is evaluated by using the data from an experimental site. A natural gradient tracer experiment carried out at the Eastlakes Experimental Site, located at the Botany Basin, New South Wales, Australia [38]. This site is located in the upper part of the Botany Sands aquifer next to pond 5 of Lachlan ponds in an area about 80 m 2 [22] and [39]. Figure 3 illustrates this site in the Botany Sands aquifer. Although this aquifer is a homogeneous and isotropic on a macroscopic scale, it is heterogeneous and anisotropic on a microscopic scale [38]. This site was founded in 1992 for research studies at the University of New South Wales (UNSW) Groundwater Center [22]. This experimental site is known as East Lake Experimental Site (ELE Site) [22].  The dimension and characteristic values of the study area are presented in Table   1 [39]. This information is obtained from the previous studies reports at this experimental site [22] and [38]. In this study area, the east and west boundaries are    [38]. According to this information, for simulation of the study area from 6 meters above sea level to the groundwa- In the tests carried out in the ELE site, the injected tracer solutions included conservative and reactive inorganic elements such as bromide, calcium, lead, and potassium. Three injection wells, C, D, and E were used in this test. These wells are illustrated in Figure 4. The tracer test was conducted by preparing 300 liters of a solution that included boron, bromide, chloride, and lithium as the conservative tracers and six reactive solutes. The concentrations of conservative tracers needed to be three to four times higher than background concentrations to be properly monitored. To analyze the background chemical concentrations of tested elements, 88 groundwater samples were collected. The analysis results indicated that all of the tested elements concentrations were below the analytical detection limit [22]. The detection limit concentration for bromide was 1.8 mg/l [22]. In this study, bromide is considered as a conservative contaminant. The concentration of bromide in the test was 186 mg/l. The containers of tracer solution were injected over 30 minute's period from 13:00 to 13:30 on 2 nd July 1996.
During the tracer injection, the flow rates of wells were kept low enough to avoid the significant increases in the hydraulic heads at the injection wells [22]. For this reason, in this study, the flow rate of additional potential contaminant source was considered 1 m 3 /day to prevent a significant change of the flow system and hydraulic head distribution [39]. aquifer's hydraulic conductivity [22]. According to the previous studies at ELE Site, monitoring values until 16 days after the injection showed no noticeable chemical transport processes to effect on the natural tracer behaviors [22]. In other words, advection and dispersion were the dominant physical processes of the bromide tracer transport during the monitoring time. Therefore, the total time of simulation is divided into five different stress periods. The first stress period is the only active stress period and its duration is 30 minutes. The second to fourth stress periods are each of two days duration and the last stress period is of eight days duration. In this study, the monitored contaminant concentrations at nine monitoring locations and totaling to 10 values, and belong to stress periods two to five are utilized as described in Table 2. In this study, in addition to the three injection sources, one more potential location is considered as a possible contaminant sources location to evaluate the performance of the developed methodology for identifying unknown contaminant sources in terms of locations and magnitudes.
The hydraulic conductivity values for ELE site were estimated by applying a combination of constant head test and falling head tests [22].  Then, the average values of these three iterations for all the nodes of the study area are utilized as the inputs of simulation models. Figure 6 represents the generated hydraulic conductivity values for layer three of the EES aquifer utilizing IWD interpolation method.

Results and Discussion
In this section, first, the following steps for constructing surrogate models for source characterization are explained. Then, the performance evaluation results of the constructed models are discussed.   (Table 2).
3) Developing the surrogate models: in this step, SOM and GPR algorithms are utilized to develop surrogate models for source characterization.   For training and developing GPR-based surrogate models, first, the predictors and target variables of the system need to be addressed. Since, in source characterization problem, just observed contaminant concentrations data is available, unknown groundwater contaminant sources need to be characterized in an inverse mode. Therefore, in the training process of the GPR-based surrogate model, the contaminant concentration values of the training data are addressed as the predictors of the GPR prediction models. The randomly generated contaminant release concentrations at potential contaminant sources at specific times are considered to be the target variables of the GPR prediction models. Each GPR prediction model can only have one target variable. As a result, for each target variable, separate GPR model is developed. Then, after developing all the GPR prediction models, the constructed GPR prediction models are integrated to develop the GPR-based surrogate model. By providing the measured or simulated contaminant concentration values for the GPR-based surrogate model, unknown contaminant sources can be characterized at potential contaminant sources at specific times.
After developing the SOM and GPR-based surrogate models, the developed surrogate models are independently utilized for unknown source characterization without using an explicit optimization model. 4) Validation of the surrogate models: the developed surrogate models are tested by new sample sets. The contaminant release concentrations of these sample sets are randomly generated by using the LHS method in the range of 0 -300 mg/l. Then, the corresponding concentration values at monitoring locations are obtained by implementing the simulation models.
In order to evaluate the capability and efficiency of the SOM and GPR-based surrogate models to identify the unknown source characteristics, when the field concentration measurements resulting from specified contaminant release concentrations in the study area are specified, the surrogate models are used in in-Journal of Water Resource and Protection verse mode. The simulated contaminant concentration values at specific locations and time of testing data are considered to be the known variables of the system. The developed surrogate models are utilized for source characterization by using information regarding these known variables. Table 4 presents a typical input dataset with missing data for testing the surrogate models.
In the SOM-based surrogate model case, when utilized in the inverse mode, to estimate unknown contaminant sources, the BMU command of the SOM algorithm which searches for the most similar vectors of the SOM-based surrogate model to match the testing input data is utilized for source characterization. The detailed information of the application of this surrogate model was discussed in [31]. While utilizing the GPR based surrogate model, the GPR-based surrogate model acts as a prediction model. This prediction model by using simulated contaminant concentration data at specific monitoring locations and times (testing data) characterize unknown contaminant sources.
The performance of the developed surrogate models is evaluated by utilizing Normalized Absolute Error of Estimation (NAEE) as an error criterion. NAEE can be defined by Equation (8)     2) Source characterization or recovering source injection history: The obtained results at evaluation stage demonstrate that these surrogate models can be utilized for source characterization. Therefore, the developed SOM and GPRbased surrogate models by using the measured bromide concertation data (Table 2) are utilized to recover source injection history from the ELE site. The results are illustrated in Figure 9. The obtained results in terms of NAEE are equal to 24.9% and 24.6% for the SOM and GPR based surrogate models, respectively.

Conclusion
In this study, SOM and GPR algorithms for comparison purpose are used to construct the surrogate models for source characterization. Same training data is 2) The performance evaluation results demonstrate potential applicability of the SOM and GPR algorithms as the surrogate model types in inverse mode, for unknown groundwater source characterization problems under hydraulic conductivity estimation uncertainty and erroneous contaminant concentration data ( Figure 9).
3) Comparison of the performance of the developed surrogate models for characterization of each of the potential contaminant sources (Figure 8) shows more accuracy for the GPR-based surrogate mode in terms of NAEE.

4)
In source characterization problems, SOM algorithm capability in clustering multidimensional input data leads the SOM-based surrogate model to screen  dummy sources, i.e., not actual sources but included as potential sources precisely.
5) The most important conclusion is that these surrogate models may provide a feasible methodology for characterization of unknown groundwater contaminant sources in terms of location, magnitude, and duration of source activity, without the necessity of using a linked simulation-optimization model.
However, these performance evaluation results are limited to specific cases and further evaluations are necessary to establish the applicability of the developed methodology.