Characterization of unknown groundwater contaminant sources is an important but difficult step in effective groundwater management. The difficulty arises mainly because contamination is usually detected a long time after the contaminant source(s) became active. The available information is usually limited and may also be erroneous. This study utilizes Self-Organizing Map (SOM) and Gaussian Process Regression (GPR) algorithms to develop surrogate models that can approximate the complex flow and transport processes in a contaminated aquifer. The important feature of these surrogate models is that, unlike previous methods, they can be applied to characterize unknown groundwater contaminant sources independently of any linked optimization model solution. The performance of the developed surrogate models is evaluated for source characterization at an experimental contaminated aquifer site within a heterogeneous sand aquifer, located in the Botany Basin, New South Wales, Australia. In this study, the measured contaminant concentrations and hydraulic conductivity values are assumed to contain random errors. The responses of the aquifer to randomly specified contamination stresses, simulated with a three-dimensional numerical simulation model, are utilized for initial training of the surrogate models. The performance evaluation results obtained with the different surrogate models are also compared, and they demonstrate the different capabilities of the developed surrogate models. These capabilities lead to the development of an efficient methodology for source characterization based on utilizing the trained and tested surrogate models in an inverse mode. The obtained results are satisfactory and show the potential applicability of the SOM and GPR-based surrogate models for unknown groundwater contaminant source characterization in an inverse mode.

Groundwater is a valuable natural resource and its consumption has increased over the years. As a result, the environmental problems associated with groundwater have increased due to widespread improper and unplanned groundwater management worldwide. Groundwater contamination in an aquifer becomes more difficult to remedy as the contamination spreads. The challenge arises from insufficient information regarding the contaminated aquifers and, especially, the frequent lack of knowledge regarding the sources of contamination and their history of activity. Usually, contamination is detected accidentally, a long time after the first contaminant source activities started. As a result, only limited and sparse data are available, and generally several sources have to be considered as potential contaminant sources. Therefore, developing an efficient methodology for source characterization is essential.

The most frequently applied methodology for source characterization is the linked simulation-optimization approach. This approach consists of numerical simulation models and optimization models, with the linked simulation model embedded or implicitly embedded within the optimization model [

The methodologies proposed earlier for unknown groundwater source characterization can be subdivided into two main categories: 1. methodologies based on statistical and deterministic approaches, which mainly solved this problem in an inverse mode; and 2. approaches based on optimization algorithms, which integrate the groundwater flow and transport simulation models with an optimization algorithm [

In the second group, the embedding technique, response matrix, and linked simulation-optimization approaches were utilized to incorporate simulation models with optimization models [

In this study, collected field data from an experimental aquifer site located in the Botany Basin aquifer, New South Wales, Australia are used to evaluate the performance of the developed methodology. The hydrogeologic characteristics of this experimental site are investigated through a few tests [

The MODFLOW [

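$$ \frac{\partial}{\partial x}\left(K_{xx}\frac{\partial h}{\partial x}\right) + \frac{\partial}{\partial y}\left(K_{yy}\frac{\partial h}{\partial y}\right) + \frac{\partial}{\partial z}\left(K_{zz}\frac{\partial h}{\partial z}\right) \pm W = S_s\frac{\partial h}{\partial t} \tag{1} $$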

where $K_{xx}$, $K_{yy}$ and $K_{zz}$ are the hydraulic conductivity values along the x, y, and z coordinate axes (L/T); $h$ is the potentiometric head (L); $S_s$ is the specific storage of the porous media (L^{−1}); $t$ is time (T); and $W$ is a volumetric flux per unit volume representing sources and sinks of the aquifer (T^{−1}), where a negative value represents withdrawal from the groundwater system and a positive value represents injection.

The MT3DMS [

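$$ \frac{\partial\left(\theta C^{k}\right)}{\partial t} = \frac{\partial}{\partial x_i}\left(\theta D_{ij}\frac{\partial C^{k}}{\partial x_j}\right) - \frac{\partial}{\partial x_i}\left(\theta v_i C^{k}\right) + q_s C_s^{k} + \sum R_n \tag{2} $$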

where $\theta$ is the porosity of the subsurface porous media (dimensionless); $C^{k}$ is the dissolved concentration of species $k$ (ML^{−3}); $t$ is time (T); $x_i$ and $x_j$ are the distances along the respective Cartesian coordinate axes (L); $D_{ij}$ is the hydrodynamic dispersion coefficient tensor (L^{2}T^{−1}); $v_i$ is the seepage velocity (LT^{−1}), which is related to the Darcy flux $q_i$ through the relationship $v_i = q_i/\theta$; $q_s$ is the volumetric flow rate per unit volume of the groundwater system, representing fluid sources (positive) and sinks (negative) (T^{−1}); $C_s^{k}$ is the concentration of the source or sink flux for species $k$ (ML^{−3}); and $\sum R_n$ is the chemical reaction term (ML^{−3}T^{−1}).

Generally, implementation of the simulation models for real-world cases is complex and extensively time-consuming. Therefore, to decrease the high computational cost of the complex simulation models, these computationally intensive simulation models have been replaced by response surface methodologies. It is supposed that by accurately constructing these models, the behavior of more sophisticated simulation models can be approximately emulated with much reduced computational time [

In this study, Self-Organizing Map (SOM) and Gaussian Process Regression (GPR) algorithms are used to construct the surrogate models for characterization of unknown groundwater contaminant sources, and their performances are compared (

1) Problem definition and sampling plan: this stage is crucial and has an essential effect on the accuracy of the results. First, the problem and the most important variables of the system, which depend strongly on the complexity of the original system, are defined. These variables consist of known variables and decision variables. Then, a suitable random sampling methodology needs to be selected and utilized to generate good-quality sampling points for training and testing the surrogate models. In this study, Latin Hypercube Sampling (LHS) is utilized to generate the training and testing sample data. For the source identification problem, LHS is used to generate an adequate number of random contaminant source fluxes. It is also suggested that the sample size be 15 - 20 times the number of dimensions of the problem [

2) Solving the simulation models: at this stage, the groundwater flow and transport simulation models for the contaminated aquifer site are solved for the contaminant source fluxes randomly generated at stage 1. As a result, the contaminant concentration values are obtained as the solution of the groundwater flow and transport simulation models.

3) Building surrogate models: in this stage, at least one important question should be addressed: which tool(s) are to be used for constructing the surrogate model(s) [

4) Model evaluation: in this stage, the performance of the developed surrogate models is evaluated by using a new sample data set that is independent of the training data. The model evaluation results can be used to change the surrogate model types or designs.

5) Source characterization solution/step 3: if the goodness of fit is achieved, the source characterization results are obtained and the procedure stops. Otherwise, go to step 3.
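The sampling plan of stage 1 can be illustrated with a minimal Latin Hypercube Sampling sketch. It assumes four potential sources with release concentrations between 0 and 300 mg/l and 1000 samples, as described later for the study site; the function name `latin_hypercube` and the fixed seed are illustrative choices, not taken from the study.

```python
import random


def latin_hypercube(n_samples, bounds, seed=0):
    """Draw n_samples points so that every dimension is split into
    n_samples equal-width strata and each stratum is sampled once."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One uniformly random point inside each stratum of the unit interval.
        strata = [(k + rng.random()) / n_samples for k in range(n_samples)]
        # Shuffle so that strata are paired at random across dimensions.
        rng.shuffle(strata)
        columns.append([low + u * (high - low) for u in strata])
    # Transpose: one row per sample, one column per potential source.
    return [list(row) for row in zip(*columns)]


# Four potential sources, release concentrations in 0 - 300 mg/l.
bounds = [(0.0, 300.0)] * 4
samples = latin_hypercube(1000, bounds)
print(len(samples), len(samples[0]))
```

Each row of `samples` would then be passed to the flow and transport simulation models in stage 2 to obtain the corresponding concentrations.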

The Self-Organizing Map (SOM) is an unsupervised learning method that was introduced by T. Kohonen in 1982 [

1) Initialization: a group of high-dimensional input data is quantized by a few weight vectors onto a discrete space, usually a two-dimensional grid [

2) Competition: for each random sample of the input space, the output neurons compete to be the winner neuron. The winning neuron, which is the most similar to the input data, is called the Best Matching Unit (BMU). The distances between the random sample of the input space and all weight vectors are calculated by using Equation (3), a Euclidean distance measure.

$$ d_j(x) = \sum_{i=1}^{m} \left(x_i - w_{ji}\right)^2, \quad \forall i = 1, \cdots, m \tag{3} $$

The BMU command in the SOM algorithm, which searches for the output neuron most similar to the input vector, can be used for finding missing values of an input vector (

3) Cooperation: once the winner neuron is obtained, the weight vector of the winning neuron and all other neurons are updated according to Equation (4) to minimize the local error [

$$ w_{ji}(t+1) = w_{ji}(t) + \eta(t)\,K(j,t)\left[x_i - w_{ji}(t)\right] \tag{4} $$

where $\eta(t)$ is the learning rate at iteration $t$, and $K(j,t)$ is a suitable neighborhood function. This neighborhood function is responsible for preserving the topology of the input data [

4) Adaptation: the weight adjustment is repeated until a stable map is obtained, i.e., until the map has converged [

Moreover, SOM map quality can be assessed by various methods. In this study, the Quantization Error (QE), a widely used criterion for the evaluation of SOM maps, is utilized. The QE gradually decreases with increasing map size. Earlier studies indicate that a suitable number of neurons has an essential role in the accuracy and performance of the SOM algorithm [
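The four SOM steps and the quantization error can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the Gaussian neighborhood function, the linearly decaying learning rate and radius, the synthetic data, and the names `train_som` and `quantization_error` are all assumptions made for the example.

```python
import numpy as np


def train_som(data, rows, cols, n_iter=2000, eta0=0.5, sigma0=None, seed=0):
    """Train a rows x cols SOM and return its weight vectors (one per unit)."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    # Initialization: weight vectors are drawn from the training samples.
    weights = data[rng.integers(0, n, rows * cols)].astype(float)
    # Grid coordinates of each map unit, used by the neighborhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    sigma0 = sigma0 or max(rows, cols) / 2.0
    for t in range(n_iter):
        x = data[rng.integers(0, n)]
        # Competition: the BMU minimizes the squared distance of Equation (3).
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
        # Assumed schedules: learning rate and radius decay linearly.
        frac = t / n_iter
        eta = eta0 * (1.0 - frac)
        sigma = sigma0 * (1.0 - frac) + 1e-3
        # Cooperation: Gaussian neighborhood K(j, t) centered on the BMU.
        k = np.exp(-((grid - grid[bmu]) ** 2).sum(axis=1) / (2.0 * sigma ** 2))
        # Adaptation: the update of Equation (4), applied to every unit.
        weights += eta * k[:, None] * (x - weights)
    return weights


def quantization_error(data, weights):
    """Mean Euclidean distance between each sample and its BMU."""
    d = np.sqrt(((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2))
    return d.min(axis=1).mean()


# Synthetic stand-in for training vectors (not data from the study).
rng = np.random.default_rng(1)
data = rng.random((300, 3))
weights = train_som(data, 4, 4)
qe = quantization_error(data, weights)
print(weights.shape, round(float(qe), 3))
```

Repeating the last three lines for several map sizes gives one QE value per candidate map, which is the kind of comparison used to choose the number of map units.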

Gaussian Process Regression (GPR) models are nonparametric kernel-based probabilistic models. These models are flexible nonlinear interpolating techniques which are based on the training data [

$$ m(\vec{X}) = E\left[f(\vec{X})\right] \tag{5} $$

$$ k(\vec{X}, \vec{X}') = E\left[\left(f(\vec{X}) - m(\vec{X})\right)\left(f(\vec{X}') - m(\vec{X}')\right)\right] \tag{6} $$

The mean function represents the expected function value for input X [

$$ f(\vec{X}) \sim GP\left(m(\vec{X}), k(\vec{X}, \vec{X}')\right) \tag{7} $$
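To make the roles of the mean and covariance functions concrete, the following sketch draws sample functions from a Gaussian process prior. It assumes a zero mean function and a squared-exponential covariance function; the text does not state which kernel the study used, so both choices, as well as the function names and the jitter term, are assumptions for illustration only.

```python
import numpy as np


def mean_function(x):
    # Assumption: zero prior mean, m(X) = 0 for every input.
    return np.zeros(len(x))


def sq_exp_kernel(a, b, length=1.0, variance=1.0):
    # Assumed covariance: k(X, X') = variance * exp(-|X - X'|^2 / (2 length^2)).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return variance * np.exp(-0.5 * d2 / length ** 2)


def sample_prior(x, n_draws, jitter=1e-6, seed=0):
    """Draw n_draws functions f ~ GP(m, k) evaluated at the rows of x."""
    rng = np.random.default_rng(seed)
    # A small jitter on the diagonal keeps the Cholesky factorization stable.
    cov = sq_exp_kernel(x, x) + jitter * np.eye(len(x))
    chol = np.linalg.cholesky(cov)
    z = rng.standard_normal((len(x), n_draws))
    return mean_function(x)[:, None] + chol @ z


x = np.linspace(0.0, 5.0, 15)[:, None]
draws = sample_prior(x, n_draws=3)
print(draws.shape)
```

Each column of `draws` is one function drawn from the prior; conditioning this prior on training data is what turns it into a regression model.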

The performance of the developed methodology is evaluated by using the data from an experimental site. A natural gradient tracer experiment was carried out at the Eastlakes Experimental Site, located in the Botany Basin, New South Wales, Australia [^{2} [

The dimension and characteristic values of the study area are presented in

According to the results of previous geological investigations, the experimental site consists of five sedimentologically distinct layers (

Parameter | Unit | Value |
---|---|---|
Maximum length of study area | m | 15.00 |
Maximum width of study area | m | 13.00 |
Thickness of study area | m | 3.50 |
Grid spacing in x-direction | m | 1.00 |
Grid spacing in y-direction | m | 1.00 |
Porosity (layer 1, layer 2, layer 3 and layer 4) | Dimensionless | (0.39, 0.41, 0.36 and 0.41) |
Longitudinal dispersivity (all layers) | m | 0.03 |
Ratio: H/L dispersivity | Dimensionless | 0.10 |
Specific storage (all layers) | 1/m | 0.20 |
Specific yield (all layers) | Dimensionless | 0.20 |
Recharge | m/day | 0.00 |
Flow rate in injection wells | m^{3}/day | 4.40 |
Initial contaminant release concentrations | mg/l | 0 - 300 |

4. Peat material; and 5. Silty/clay sand unit [

In the tests carried out in the ELE site, the injected tracer solutions included conservative and reactive inorganic elements such as bromide, calcium, lead, and potassium. Three injection wells, C, D, and E were used in this test. These wells are illustrated in ^{nd} July 1996. During the tracer injection, the flow rates of wells were kept low enough to avoid the significant increases in the hydraulic heads at the injection wells [^{3}/day to prevent a significant change of the flow system and hydraulic head distribution [

The first samples of contaminant concentrations were collected two days after the injection, on 4^{th} July 1996. Sampling was repeated in nine more sessions, 4, 6, 8, 12, 16, 20, 24, 28 and 32 days after injection. Monitoring of the tracer plume movements demonstrated that the transport of bromide and the other conservative elements is mainly controlled by the variability of the aquifer’s hydraulic conductivity [

The hydraulic conductivity values for ELE site were estimated by applying a combination of constant head test and falling head tests [

ID | Monitoring location (i, j, k)* | Stress period | Contaminant concentration (mg/l) |
---|---|---|---|
1 | M1 (7, 3, 3) | 2 | 12.20 |
2 | M2 (6, 3, 3) | 2 | 15.50 |
3 | M3 (5, 3, 3) | 2 | 0.10 |
4 | M4 (8, 3, 3) | 3 | 9.00 |
5 | M2 (6, 3, 3) | 4 | 19.00 |
6 | M5 (5, 4, 3) | 4 | 0.09 |
7 | M6 (6, 5, 3) | 4 | 0.09 |
8 | M7 (8, 4, 3) | 5 | 0.15 |
9 | M8 (6, 4, 3) | 5 | 13.30 |
10 | M9 (7, 6, 3) | 5 | 0.11 |

*: (i, j, k) are the node coordinates in the X, Y and Z directions, respectively.

1.8 to 50 m/day. Sometimes these variations are observed in short distances [

In this section, first, the following steps for constructing surrogate models for source characterization are explained. Then, the performance evaluation results of the constructed models are discussed.

1) Problem definition and sampling plan: as previously mentioned, four potential contaminant sources are considered in this study. These four sources include three injection wells (

generate 1000 initial sample sets. These sample sets consist of contaminant release concentrations at four potential contaminant sources. The contaminant release concentrations are assumed to be in the range of 0 - 300 mg/l for all potential contaminant sources. The contaminant concentration values at specified locations and times (

2) Solving the numerical simulation models: The numerical flow and transport simulation models, MODFLOW and MT3DMS, respectively (within GMS 7) are solved for randomly generated contaminant release concentrations at the previous stage. The solutions contained the corresponding contaminant concentration magnitudes at selected monitoring locations at specific stress periods (

3) Developing the surrogate models: in this step, SOM and GPR algorithms are utilized to develop surrogate models for source characterization.

The same sets of training data are used for constructing the SOM and GPR-based surrogate models. Due to the different natures of the applied tools, different approaches are utilized to design the training data for developing the surrogate models. In the SOM-based surrogate models, all the training data are used to develop the surrogate models in one shot, i.e., in a single run. At this stage, different SOM-based surrogate models representing different numbers of SOM map units are constructed.

ID | Source 1 (SP1) | Source 2 (SP1) | Source 3 (SP1) | Source 4 (SP1) | M1 (SP2) | M2 (SP2) | M3 (SP2) | M4 (SP3) | M2 (SP4) | M5 (SP4) | M6 (SP4) | M7 (SP5) | M8 (SP5) | M9 (SP5) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 290 | 251 | 8 | 146 | 13.3 | 36.0 | 5.7 | 0.3 | 55.0 | 2.7 | 0.5 | 0.0 | 15.8 | 0.0 |
2 | 163 | 216 | 245 | 157 | 18.3 | 14.9 | 3.7 | 5.4 | 26.1 | 1.2 | 0.1 | 0.2 | 12.4 | 0.0 |
3 | 289 | 0 | 5 | 59 | 0.1 | 24.9 | 3.5 | 0.3 | 42.3 | 0.5 | 0.2 | 0.0 | 15.6 | 0.0 |
4 | 16 | 159 | 102 | 269 | 13.2 | 1.5 | 0.4 | 0.2 | 3.5 | 0.1 | 0.0 | 0.1 | 0.3 | 0.0 |
5 | 55 | 298 | 52 | 84 | 16.8 | 6.7 | 0.0 | 1.6 | 9.2 | 0.0 | 0.1 | 0.1 | 1.5 | 0.0 |

Source columns: contaminant release concentrations (mg/l). M columns: contaminant concentrations (mg/l). SP: stress period.

For training and developing the GPR-based surrogate models, the predictors and target variables of the system first need to be specified. Since only observed contaminant concentration data are available in a source characterization problem, the unknown groundwater contaminant sources need to be characterized in an inverse mode. Therefore, in the training process of the GPR-based surrogate model, the contaminant concentration values of the training data are treated as the predictors of the GPR prediction models, and the randomly generated contaminant release concentrations at the potential contaminant sources at specific times are treated as the target variables. Each GPR prediction model can have only one target variable, so a separate GPR model is developed for each target variable. After all the GPR prediction models have been developed, they are integrated to form the GPR-based surrogate model. When the measured or simulated contaminant concentration values are provided to the GPR-based surrogate model, the unknown contaminant sources can be characterized at the potential contaminant sources at specific times.
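The inverse-mode layout described above, with concentrations as predictors and one single-target model per source, can be sketched as follows. This is an illustration under stated assumptions rather than the study's implementation: the squared-exponential kernel, the noise level, the class and function names, and in particular the linear `response` matrix that stands in for the MODFLOW and MT3DMS simulations are all hypothetical.

```python
import numpy as np


def sq_exp(a, b, length):
    # Assumed squared-exponential covariance between the rows of a and b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * d2 / length ** 2)


class SingleTargetGPR:
    """GP regression posterior mean for one target variable."""

    def __init__(self, length=1.0, noise=1e-6):
        self.length = length
        self.noise = noise

    def fit(self, x, y):
        self.x = x
        self.y_mean = y.mean()  # constant prior mean estimated from the data
        k = sq_exp(x, x, self.length) + self.noise * np.eye(len(x))
        self.alpha = np.linalg.solve(k, y - self.y_mean)
        return self

    def predict(self, x_new):
        return self.y_mean + sq_exp(x_new, self.x, self.length) @ self.alpha


def fit_inverse_surrogate(conc, sources, **kwargs):
    # One model per target: column j of `sources` is the release
    # concentration at potential source j; `conc` holds the predictors.
    return [SingleTargetGPR(**kwargs).fit(conc, sources[:, j])
            for j in range(sources.shape[1])]


def characterize(models, conc_obs):
    # Integrate the single-target models into one multi-output estimate.
    return np.column_stack([m.predict(conc_obs) for m in models])


# Hypothetical linear stand-in for the numerical simulation models.
rng = np.random.default_rng(0)
response = rng.uniform(0.0, 0.05, size=(4, 10))
train_sources = rng.uniform(0.0, 300.0, size=(200, 4))
train_conc = train_sources @ response
scale = train_conc.std(axis=0)  # put all predictors on a comparable scale
models = fit_inverse_surrogate(train_conc / scale, train_sources,
                               length=3.0, noise=1e-2)

test_sources = rng.uniform(0.0, 300.0, size=(5, 4))
estimates = characterize(models, (test_sources @ response) / scale)
print(np.round(estimates, 1))
```

Comparing `estimates` with `test_sources` row by row shows how well the integrated models recover the release concentrations for this synthetic case; the kernel length and noise would need tuning for real data.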

After developing the SOM and GPR-based surrogate models, the developed surrogate models are independently utilized for unknown source characterization without using an explicit optimization model.

4) Validation of the surrogate models: the developed surrogate models are tested by new sample sets. The contaminant release concentrations of these sample sets are randomly generated by using the LHS method in the range of 0 - 300 mg/l. Then, the corresponding concentration values at monitoring locations are obtained by implementing the simulation models.

To evaluate the capability and efficiency of the SOM and GPR-based surrogate models in identifying the unknown source characteristics from concentration measurements that result from specified contaminant release concentrations in the study area, the surrogate models are used in inverse mode. The simulated contaminant concentration values of the testing data at specific locations and times are considered to be the known variables of the system. The developed surrogate models are utilized for source characterization by using the information on these known variables.

When the SOM-based surrogate model is utilized in the inverse mode to estimate unknown contaminant sources, the BMU command of the SOM algorithm, which searches for the vectors of the SOM-based surrogate model that are most similar to the testing input data, is utilized for source characterization. The detailed information on the application of this surrogate model was discussed in [
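A minimal sketch of this inverse use of the BMU search is given below. It assumes that each SOM weight vector stores the source components followed by the concentration components, and that the BMU is found by comparing only the known (concentration) components; the toy codebook, its values, and the function names are hypothetical and serve only to show the mechanics.

```python
import numpy as np


def bmu_with_missing(codebook, x):
    """Index of the best matching unit, ignoring NaN components of x."""
    known = ~np.isnan(x)
    d2 = ((codebook[:, known] - x[known]) ** 2).sum(axis=1)
    return int(np.argmin(d2))


def fill_missing(codebook, x):
    """Replace the NaN components of x by those of its BMU."""
    j = bmu_with_missing(codebook, x)
    missing = np.isnan(x)
    filled = x.copy()
    filled[missing] = codebook[j, missing]
    return filled


# Toy codebook: 2 source components followed by 3 concentration components.
codebook = np.array([
    [0.0,   0.0,   0.0, 0.0, 0.0],
    [300.0, 0.0,   20.0, 5.0, 0.1],
    [0.0,   300.0, 1.0, 9.0, 4.0],
])
# Observed concentrations with unknown sources marked as NaN.
obs = np.array([np.nan, np.nan, 18.5, 4.0, 0.3])
estimate = fill_missing(codebook, obs)
print(estimate[:2])  # estimated source components: [300.   0.]
```

In this toy case the second unit is closest to the observed concentrations, so its source components are returned as the estimate.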

The performance of the developed surrogate models is evaluated by utilizing Normalized Absolute Error of Estimation (NAEE) as an error criterion. NAEE can be defined by Equation (8) [

$$ \mathrm{NAEE}(\%) = \frac{\sum_{i=1}^{S}\sum_{j=1}^{N}\left|\left(q_i^j\right)_{est} - \left(q_i^j\right)_{act}\right|}{\sum_{i=1}^{S}\sum_{j=1}^{N}\left(q_i^j\right)_{act}} \times 100 \tag{8} $$

where $(q_i^j)_{act}$ and $(q_i^j)_{est}$ are the actual and estimated source fluxes at source number $i$ in stress period $j$, respectively. $S$ and $N$ are the total number of potential contaminant sources and transport stress periods, respectively.
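As a worked example of Equation (8), the following sketch computes the NAEE for a small set of actual and estimated source fluxes. The numbers are hypothetical and chosen only to make the arithmetic easy to follow; the function name `naee_percent` is likewise illustrative.

```python
def naee_percent(estimated, actual):
    """Normalized Absolute Error of Estimation, in percent.

    Both arguments are nested lists indexed as [source i][stress period j].
    """
    abs_error = sum(abs(e - a)
                    for est_row, act_row in zip(estimated, actual)
                    for e, a in zip(est_row, act_row))
    total_actual = sum(a for row in actual for a in row)
    if total_actual == 0:
        raise ValueError("NAEE is undefined when all actual fluxes are zero")
    return 100.0 * abs_error / total_actual


# Hypothetical values: S = 4 sources, N = 1 stress period.
actual = [[100.0], [200.0], [0.0], [100.0]]
estimated = [[90.0], [220.0], [5.0], [95.0]]
print(naee_percent(estimated, actual))  # (10 + 20 + 5 + 5) / 400 * 100 = 10.0
```

Note that a dummy source (actual flux of zero) still contributes to the numerator, so non-zero estimates at dummy sources increase the NAEE.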

The performance evaluations of the different developed SOM-based surrogate models representing different numbers of SOM map units are illustrated in

ID | Source 1 (SP1) | Source 2 (SP1) | Source 3 (SP1) | Source 4 (SP1) | M1 (SP2) | M2 (SP2) | M3 (SP2) | M4 (SP3) | M2 (SP4) | M5 (SP4) | M6 (SP4) | M7 (SP5) | M8 (SP5) | M9 (SP5) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | | | | | 10.1 | 7.6 | 0.7 | 1.7 | 10.7 | 0.2 | 0.0 | 0.4 | 7.4 | 0.0 |
2 | | | | | 2.8 | 5.7 | 0.0 | 0.4 | 11.1 | 0.0 | 0.1 | 0.0 | 3.4 | 0.0 |
3 | | | | | 2.9 | 21.9 | 5.4 | 6.2 | 23.8 | 1.6 | 0.2 | 0.1 | 16.5 | 0.0 |
4 | | | | | 13.1 | 21.7 | 0.1 | 3.3 | 29.8 | 0.0 | 0.2 | 0.1 | 18.0 | 0.0 |
5 | | | | | 16.7 | 11.7 | 0.1 | 4.0 | 16.1 | 0.0 | 0.1 | 0.1 | 2.5 | 0.0 |

Source columns: contaminant release concentrations (mg/l). M columns: contaminant concentrations (mg/l). SP: stress period.

The performance of the developed GPR-based surrogate model for source characterization is also evaluated by using the same testing data. In terms of NAEE, the performance evaluation results of the SOM and GPR-based surrogate models for the testing data are 15.8% and 16.2%, respectively. These results show that the selected SOM-based surrogate model and the GPR-based surrogate model have similar accuracy. Although the two surrogate models perform similarly on average in terms of source identification accuracy, their abilities to screen dummy sources differ. The SOM-based surrogate model screened the dummy sources accurately in 98 percent of the cases, against only six percent correct inference by the GPR-based surrogate model. Nevertheless, the approximations of the GPR-based surrogate model for the dummy sources are not unsatisfactory: it estimated the dummy sources (not actual sources) as very low magnitudes, but not exactly as zero flux values.

The obtained average NAEE for each source for all the developed surrogate models are compared and presented in

2) Source characterization or recovering source injection history: the results obtained at the evaluation stage demonstrate that these surrogate models can be utilized for source characterization. Therefore, the developed SOM and GPR-based surrogate models by using the measured bromide concentration data (

results are illustrated in

In this study, SOM and GPR algorithms are used to construct surrogate models for source characterization, and their performances are compared. The same training data are used to develop the SOM and GPR-based surrogate models. Limited performance evaluations of the developed SOM and GPR-based surrogate models are conducted to test their efficiency for source characterization at an experimental contaminated aquifer site. This site constitutes a portion of a heterogeneous aquifer with uncertainties in hydraulic conductivity values and errors in measured contaminant concentration values. The main conclusions that can be drawn from these performance evaluation results are:

1) SOM and GPR based surrogate models are potentially effective tools to approximate the groundwater flow and transport simulation processes in a multi-layer heterogeneous experimental contaminated aquifer site.

2) The performance evaluation results demonstrate potential applicability of the SOM and GPR algorithms as the surrogate model types in inverse mode, for unknown groundwater source characterization problems under hydraulic conductivity estimation uncertainty and erroneous contaminant concentration data (

3) Comparison of the performance of the developed surrogate models for characterization of each of the potential contaminant sources (

4) In source characterization problems, the capability of the SOM algorithm in clustering multidimensional input data enables the SOM-based surrogate model to precisely screen dummy sources, i.e., sources that are not actual sources but are included as potential sources.

5) The most important conclusion is that these surrogate models may provide a feasible methodology for characterization of unknown groundwater contaminant sources in terms of location, magnitude, and duration of source activity, without the necessity of using a linked simulation-optimization model.

However, these performance evaluation results are limited to specific cases and further evaluations are necessary to establish the applicability of the developed methodology.

The second author thanks CRC-CARE, Australia for providing financial support for this research through Project No. 5.6.0.3.09/10(2.6.03), CRC-CARE-Bithin Datta which partially funded the Ph.D. scholarship of the first author.

Hazrati-Yadkoori, S. and Datta, B. (2017) Evaluation of Unknown Groundwater Contaminant Sources Characterization Efficiency under Hydrogeologic Uncertainty in an Experimental Aquifer Site by Utilizing Surrogate Models. Journal of Water Resource and Protection, 9, 1612-1633. https://doi.org/10.4236/jwarp.2017.913101