Adaptive Surrogate Model Based Optimization (ASMBO) for Unknown Groundwater Contaminant Source Characterizations Using Self-Organizing Maps

Characterization of unknown groundwater contaminant sources in terms of location, magnitude and duration of source activity is a complex problem. In this study, an alternative to the methodologies proposed earlier is developed to increase the efficiency and accuracy of source characterization. This methodology, Adaptive Surrogate Modeling Based Optimization (ASMBO), uses the capabilities of the Self-Organizing Map (SOM) algorithm to design surrogate models and adaptive surrogate models for source characterization. Its most important advantage is that it can be applied directly to groundwater contaminant source characterization without the necessity of utilizing a linked simulation optimization model. The validation of the SOM based surrogate models and SOM based adaptive surrogate models demonstrates that the quantity and quality of the initial sample sets, as well as the designed monitoring locations, have a crucial role in the accuracy of the solutions. The performance of the proposed methodology is evaluated using error free and erroneous concentration measurement data. The results demonstrate that the developed methodology could approximate groundwater flow and transport simulation models, and substitute for the optimization model in characterizing unknown groundwater contaminant sources in terms of location, magnitude and duration of source activity.


Introduction
Groundwater has a fundamental role in human life as one of the main renewable sources of fresh water. Unfortunately, in recent decades, because of increasing anthropogenic activities and improper management worldwide, groundwater has been subjected to several kinds of pollution, such as seepage from chemical and petrochemical infrastructure, waste water collection systems, and industrial, mining and agricultural fields. Groundwater contamination usually remains undetected for a long time and is often detected accidentally, through changes in the quality of regional surface water or through chemical analysis of water collected from drinking water wells. Therefore, identifying the unknown characteristics of these contaminant sources and remediating the contaminated groundwater is a necessity. However, identifying unknown groundwater contaminant source characteristics (contaminant magnitudes, locations and release times) is usually time consuming and inaccurate because of the uncertainties in the available hydrogeologic information and the sparsity of measurement data. Also, the solutions may be non-unique because of high sensitivity to the monitoring data and model parameters.
The methodologies proposed earlier to identify unknown groundwater contaminant source characteristics can be classified into two major groups: methods based on statistical estimation, and methods based on optimization approaches. An extensive literature review of these methodologies can be found in [1]-[6]. Among the optimization-based approaches, the most effective method to tackle this problem is the linked simulation optimization approach. Linked simulation optimization procedures consist of two main components: 1) models for simulation of groundwater flow and contaminant transport processes, and 2) an optimization model with an optimization algorithm. The optimization algorithms utilized include linear programming and multiple regression techniques [7]; a nonlinear optimization model with an embedding technique [8] [9] [10]; Genetic Algorithm (GA) [11] and [12]; Artificial Neural Network (ANN) [13] and [14]; a hybrid methodology based on GA [15] and [16]; a classical nonlinear optimization algorithm [17]; Simulated Annealing (SA) [18] [19] [20] [21] and Adaptive Simulated Annealing (ASA) [22]; Genetic Programming (GP) [23] and [24]; and ASA in conjunction with uncertainty modeling [25] and [26]. Application of these methodologies to real-world cases is generally computationally intensive, and may need days or weeks of CPU time to obtain an optimal solution.
Therefore, Surrogate Modeling Based Optimization (SMBO) methodologies have been proposed to reduce the enormous computing cost and time associated with repeated runs of the numerical simulation models within the optimization algorithm. Surrogate models based on ANN, GA, Kriging, and regression techniques have been proposed as approximate simulators of the physical processes [27]. Surrogate models are trained by using the results of numerical simulation models. Once trained, the surrogate model can approximate the physical process simulation. Therefore, linked simulation optimization models that rely on computationally intensive numerical simulation models can be replaced by optimization models linked with surrogate models [12]. Because linked simulation optimization models require repeated solution of the simulation models, replacing the numerical simulation models by surrogate models can substantially improve computational efficiency and feasibility [28].
In the present study, an alternative approach to the linked simulation optimization model and SMBO for optimal characterization of unknown groundwater contaminant sources is proposed and evaluated for potential applicability. In this methodology, the linked simulation optimization model is replaced by a trained Self-Organizing Map (SOM) based surrogate model or adaptive surrogate model to characterize unknown groundwater contaminant sources. This methodology, Adaptive Surrogate Modeling Based Optimization (ASMBO), uses the capabilities of the SOM algorithm to design the surrogate models and adaptive surrogate models, and to improve the efficiency of solving the inverse problem of source characterization. The surrogate models approximate the groundwater flow and transport simulation models, and ASMBO eliminates the need for a formal optimization model for source characterization in terms of location, magnitude and duration of source activity. The specific main objective of this study is to develop an efficient methodology to characterize unknown groundwater contaminant sources, especially where measurement data are sparse and erroneous.

Groundwater Flow and Transport Simulation Models
In this study, the numerical simulation model MODFLOW is utilized to simulate the groundwater flow process in a contaminated aquifer. The governing equation of this model is given by Equation (1), which describes the three-dimensional movement of groundwater under nonequilibrium, anisotropic and heterogeneous conditions [29]. Except in a few simple cases, an analytical solution of Equation (1) is very difficult to obtain. Therefore, numerical models are applied to reach approximate solutions. MODFLOW uses the finite-difference method to solve Equation (1).
$$\frac{\partial}{\partial x}\left(K_{xx}\frac{\partial h}{\partial x}\right) + \frac{\partial}{\partial y}\left(K_{yy}\frac{\partial h}{\partial y}\right) + \frac{\partial}{\partial z}\left(K_{zz}\frac{\partial h}{\partial z}\right) + W = S_s\frac{\partial h}{\partial t} \quad (1)$$
where: $S_s$ is the specific storage of the porous media ($L^{-1}$); t is time (T).
Moreover, MT3DMS is utilized to simulate the three-dimensional transport of contaminants in groundwater. The governing equation of MT3DMS is Equation (2), a partial differential equation that describes the fate and transport of contaminants of species k in a 3-D, transient groundwater flow system [30].
$$\frac{\partial\left(\theta C^{k}\right)}{\partial t} = \frac{\partial}{\partial x_i}\left(\theta D_{ij}\frac{\partial C^{k}}{\partial x_j}\right) - \frac{\partial}{\partial x_i}\left(\theta v_i C^{k}\right) + q_s C_s^{k} + \sum R_n \quad (2)$$
where $\theta$ is the porosity of the subsurface medium, dimensionless; $C^{k}$ is the concentration of species k dissolved in groundwater, $ML^{-3}$; t is time, T; $x_i$ and $x_j$ are the distances along the respective Cartesian coordinate axes, L; $D_{ij}$ is the hydrodynamic dispersion coefficient tensor, $L^{2}T^{-1}$; $v_i$ is the seepage velocity, $LT^{-1}$; $q_s$ is the volumetric flow rate per unit volume of the groundwater system, representing fluid sources (positive) and sinks (negative), $T^{-1}$; $C_s^{k}$ is the concentration of the source or sink flux for species k, $ML^{-3}$; and $\sum R_n$ is the chemical reaction term, $ML^{-3}T^{-1}$.
In this equation, advection, dispersion and chemical reaction of contaminants in groundwater are considered. To solve this equation, the seepage velocity, which is related to the Darcy flux $q_i$ through the relationship $v_i = q_i/\theta$, should be known. Therefore, calculating the hydraulic head using MODFLOW is necessary.

Self-Organizing Map
The Self-Organizing Map (SOM) is an algorithm introduced by Kohonen to visualize multidimensional data. It maps complex, non-linear, multidimensional statistical data onto a low dimensional, usually two-dimensional, display [31] and [32], transforming high dimensional data into low dimensional data while preserving the main characteristics and relationships of the input data [33]. Because of its capabilities in reducing the dimensions of data and visualizing them, the SOM algorithm is widely used in various complex fields such as statistics, data mining, machine learning, signal processing, financial analysis, chemistry and social networks [32] and [34].
The SOM algorithm consists of a set of processing units, "neurons", which are commonly arranged in a 2-dimensional rectangular or hexagonal grid. Each neuron has a location on the grid and a weight vector that connects the input to the output; the weights are initialized randomly and adjusted over several iterations until a stable map is reached. In other words, this algorithm clusters the training data based on similarity and topology without any external supervision [35]. The main steps of Kohonen's SOM algorithm are initialization, competition, cooperation and adaptation [35] [36] [37] [38], which are described as follows:
1) Initialization: in this step, it is assumed that the set of input data with N units is represented by $X = \{x_i : i = 1, \ldots, N\}$; then, each neuron in the output space will map to the corresponding units in the input space. The connection weight vector between input units i and output neuron j can be written as $W_j = \{w_{ji} : j = 1, \ldots, M;\ i = 1, \ldots, N\}$, where M denotes the number of output neurons.
2) Competition: for each input pattern X, the output neurons compete to declare the winner neuron. The winner neuron or Best Matching Unit (BMU) is the neuron closest, or most similar, to the input vector. The discriminant function used for this step is defined by Equation (3), which is the squared Euclidean distance between the input vector X and the weight vector $W_j$:
$$d_j(X) = \sum_{i=1}^{N}\left(x_i - w_{ji}\right)^{2} \quad (3)$$
3) Cooperation: according to the results of neurobiological studies, there is a lateral interaction between the winner neuron and a set of excited neurons around it. This interaction decays with distance. Therefore, the winning neuron and its topological neighbours update their weights according to Equation (4) and are moved to decrease their distance from the input units:
$$\Delta w_{ji} = \eta(t)\, K(j, t)\, \left(x_i - w_{ji}\right) \quad (4)$$
where $\eta(t)$ is the learning rate at iteration t; and $K(j, t)$ is a suitable neighbourhood function.
4) Adaptation: the excited neurons decrease their discriminant function values to reach an appropriate alignment with the input pattern. For this step, the process repeats steps 2 to 4 until the feature map stops changing.
The SOM algorithm visualizes nonlinear relationships in high dimensional data in a low dimensional display while preserving the main characteristics of the input data. The algorithm is capable not only of clustering and visualizing high dimensional data but also of generalization. In other words, SOM can interpolate between the initial data and predict missing values of the system's vectors [33]. Figure 1 shows how this algorithm is utilized for predicting the missing components of a new vector (Z) of the system based on its known components. In this study, Z represents the vector of measured concentrations and unknown contaminant sources that need to be estimated. The software "SOM Toolbox for Matlab 5" [39] is utilized for constructing the SOM based surrogate model and the SOM based adaptive surrogate model.
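The four steps and the missing-component prediction described above can be illustrated with a minimal sketch in plain Python. It is not the SOM Toolbox implementation used in this study: the function names, the Gaussian neighbourhood function, the linear decay of the learning rate and neighbourhood radius, and the toy data set (vectors (a, b) with b = 1 - a) are all assumptions made for illustration only.

```python
import math
import random


def bmu(weights, x, known):
    """Index of the best matching unit, using only the components in `known`."""
    def dist(w):
        return sum((x[i] - w[i]) ** 2 for i in known)  # squared Euclidean distance
    return min(range(len(weights)), key=lambda j: dist(weights[j]))


def train_som(data, rows, cols, iterations=2000, eta0=0.5, seed=0):
    """Train a small rectangular SOM; returns one weight vector per neuron."""
    rng = random.Random(seed)
    all_dims = range(len(data[0]))
    # Initialization: random weights in [0, 1) for every neuron.
    weights = [[rng.random() for _ in all_dims] for _ in range(rows * cols)]
    grid = [(j // cols, j % cols) for j in range(rows * cols)]
    sigma0 = max(rows, cols) / 2.0
    for t in range(iterations):
        frac = t / iterations
        eta = eta0 * (1.0 - frac)                # decaying learning rate eta(t)
        sigma = sigma0 * (1.0 - frac) + 0.5      # shrinking neighbourhood radius
        x = rng.choice(data)
        win = bmu(weights, x, all_dims)          # competition
        wr, wc = grid[win]
        for j, (r, c) in enumerate(grid):        # cooperation and adaptation
            d2 = (r - wr) ** 2 + (c - wc) ** 2
            k = math.exp(-d2 / (2.0 * sigma ** 2))   # Gaussian K(j, t)
            w = weights[j]
            for i in all_dims:
                w[i] += eta * k * (x[i] - w[i])
    return weights


def predict_missing(weights, z, known):
    """Fill the unknown components of z with those of the BMU found on the known ones."""
    win = weights[bmu(weights, z, known)]
    return [z[i] if i in known else win[i] for i in range(len(z))]


# Toy system (hypothetical): vectors (a, b) with b = 1 - a.
data = [[i / 99.0, 1.0 - i / 99.0] for i in range(100)]
weights = train_som(data, rows=1, cols=20)
# Only the first component of z is known; the second is a placeholder.
completed = predict_missing(weights, [0.3, 0.0], known=[0])
```

In the source characterization setting, the known components would be the measured concentrations and the components filled in from the BMU would be the source fluxes.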

Application of Adaptive Surrogate Model Based Optimization for Source Characterization
Surrogate models function essentially by developing a relationship between the inputs and outputs of the system based on training of the model. If a surrogate model is constructed accurately, it can mimic the behavior of more sophisticated simulation models at substantially reduced computational time [40]. Several methodologies have been developed to improve the accuracy and efficiency of surrogate modelling, such as Adaptive Surrogate Model Based Optimization (ASMBO). This methodology utilizes adaptive training of the surrogate models [41] and has been suggested as an efficient methodology for problems involving time-consuming computer models. The main idea of this procedure is that direct optimization is substituted by an iterative process comprising construction, optimization and updating of the surrogate model [42]. Moreover, adaptive sampling, which is based on the preliminary results of the surrogate model, increases the efficiency of the surrogate models. In ASMBO, after a certain number of parameter sets are sampled in the initial stage, additional samples that can effectively increase the accuracy of the surrogate model results are added. An adaptive sampling methodology improves the speed of obtaining accurate variable values [43]. In this study, a new type of ASMBO is developed to characterize unknown groundwater contaminant sources. The developed methodology uses a SOM based surrogate model or a SOM based adaptive surrogate model to characterize unknown groundwater contaminant sources in terms of location, magnitude and activity time.
1) Initial sampling: first, the main variables of the defined system are chosen according to their degree of importance, based on preliminary experiments [44]. The main question in this stage is how to design the surrogate models to accurately mimic the behavior of the defined system with a limited number of inputs. In this stage, it is crucial to ensure that samples are drawn from the entire domain of input values; Latin Hypercube Sampling (LHS) has this characteristic [45] and is therefore utilized in this study. Also, the upper and lower bounds of these variables are assumed to be known.
2) Generating training data: the numerical simulation models are solved to generate solution results for randomly generated initial samples in previous
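The initial sampling stage can be sketched with a basic Latin Hypercube Sampling routine. This is a generic illustration in plain Python, not the sampling code used in the study; the function name `latin_hypercube` and the seed are hypothetical, and the example bounds mirror the 0-100 kg/day flux range and the three sources over four stress periods adopted later for the illustrative study area.

```python
import random


def latin_hypercube(n_samples, bounds, seed=0):
    """Return n_samples points; each variable's range is split into n_samples
    equal strata and every stratum is sampled exactly once."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One random point inside each of the n_samples equal-width strata.
        strata = [(k + rng.random()) / n_samples for k in range(n_samples)]
        rng.shuffle(strata)  # decouple the stratum order between variables
        columns.append([low + u * (high - low) for u in strata])
    # Transpose so that each row is one sample set.
    return [list(row) for row in zip(*columns)]


# 3 sources x 4 stress periods = 12 flux variables, each in 0-100 kg/day.
bounds = [(0.0, 100.0)] * 12
samples = latin_hypercube(1000, bounds)
```

Because every stratum of every variable is visited once, the samples cover the whole range between the known lower and upper bounds, which is the property this stage relies on.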

Performance Evaluation
In this study, the performance of the developed methodology is evaluated utilizing synthetic hydrogeologic and geochemical data for an illustrative contaminated aquifer. The advantage of using synthetic data is that the errors in the measurement data can be quantified and need not be treated as unknown quantities for evaluation purposes. The Normalized Absolute Error of Estimation (NAEE) is utilized as a measure of the normalized error of estimation. Equation (5) represents NAEE [22]:
$$\mathrm{NAEE}\,(\%) = \frac{\sum_{i=1}^{S}\sum_{j=1}^{N}\left|\left(q_i^{j}\right)_{est} - \left(q_i^{j}\right)_{act}\right|}{\sum_{i=1}^{S}\sum_{j=1}^{N}\left(q_i^{j}\right)_{act}} \times 100 \quad (5)$$
where: S is the number of pollution sources; N is the number of transport stress periods; $(q_i^{j})_{act}$ is the actual source flux at source number i in stress period j; $(q_i^{j})_{est}$ is the estimated source flux at source number i in stress period j.
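A short sketch of the NAEE calculation follows. It assumes the usual reading of Equation (5), namely the sum of absolute deviations between estimated and actual fluxes divided by the sum of actual fluxes, expressed as a percentage. The function name `naee` and the flux values are hypothetical.

```python
def naee(actual, estimated):
    """NAEE in percent; inputs are S x N nested lists of source fluxes."""
    abs_err = sum(abs(e - a)
                  for row_a, row_e in zip(actual, estimated)
                  for a, e in zip(row_a, row_e))
    total = sum(a for row in actual for a in row)
    if total == 0:
        raise ValueError("NAEE is undefined when all actual fluxes are zero")
    return 100.0 * abs_err / total


# Hypothetical fluxes (kg/day) for 2 sources over 2 stress periods.
actual = [[50.0, 30.0], [0.0, 20.0]]
estimated = [[45.0, 36.0], [4.0, 15.0]]
print(naee(actual, estimated))  # absolute errors 5 + 6 + 4 + 5 = 20, actual total 100 -> 20.0
```

Note that an inactive source (actual flux of zero) still contributes to the numerator if a non-zero flux is estimated for it, so NAEE penalizes false detections as well as magnitude errors.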

Study Area
The illustrative study area utilized for the performance evaluation of the proposed methodology is a homogeneous aquifer which consists of one confined layer (Figure 3).
Table 1. Hydrogeologic characteristics of the study area.

Application of the SOM Based Surrogate Model for Source Identification
In this study, SOM based surrogate models and SOM based adaptive surrogate models are utilized to characterize unknown groundwater contaminant sources as an inverse problem. The following steps are followed to select the best SOM based surrogate model among the models constructed for the illustrative study area; then, the SOM based adaptive surrogate model is developed.
1) Scenarios for initial sampling: LHS is used to randomly generate two groups of 1000 initial sample sets. These sample sets are generated by assuming that all three potential sources are active through the first four stress periods, SP1 to SP4. Also, three groups of 100 sample sets are generated by assuming that in each group at least one of the sources is inactive. The contaminant source fluxes are assumed to be in the range of 0-100 kg/day for all potential sources.
For all of the generated sample sets, the three potential contaminant source fluxes at five different stress periods and their corresponding contaminant concentration magnitudes at specified monitoring locations and specific stress periods are selected as the variables of the surrogate models for this study area. As mentioned in the methodology section, the definition of the BMU of the SOM algorithm (Equation (3)) is similar to the definition of the implicit objective function of the source identification problem. Therefore, the BMU of the SOM algorithm is utilized for estimating the unknown characteristics (magnitude, location and duration) of potential contaminant sources. Using the information in the known components of the input vector, the algorithm estimates the unknown components of the input vector. In this study, this capability of the SOM algorithm is utilized to characterize unknown groundwater contaminant sources as an inverse problem. It is also utilized to estimate contaminant concentration values at specified locations and times when the contaminant sources and their characteristics are known.
For performance evaluation of the source characterization capabilities of the trained SOM surrogate models, the contaminant concentration values at monitoring locations at specific times are considered as the known variables of an input vector. This vector needs to have the same number of variables as the input vectors of the training phase. Table 5 represents a typical input for testing data when the SOM based surrogate model is utilized to characterize unknown contaminant sources as an inverse problem. In this table, magnitudes of contaminant concentration values at six monitoring locations (ML1 to ML6) at five periods (SP1 to SP5) are assumed as known variables of the SOM based surrogate

Results
To evaluate the effect of the initial sample sets on the results of the surrogate models, different surrogate models are constructed using different numbers of initial sample sets, ranging from 1000 to 2300. The concentration measurement data corresponding to the 6 existing monitoring locations are used to construct these surrogate models. The number of SOM map units is kept constant (100 × 100 units). The best results are obtained by using 2300 initial sample sets; the average NAEE for 100 sample sets is equal to 30.
a is the maximum deviation expressed as a percentage; and b is a random fraction between +1 and −1 obtained by utilizing the LHS.
The source characterization results obtained with these erroneous concentration measurements are shown in Figure 9. These results demonstrate that the source characterization performance does not change substantially between the error free scenario and the scenarios with 5, 10, 15 and 20 percent concentration measurement errors. Figure 9 also indicates that the accuracy of the estimated source fluxes decreases significantly when the incorporated errors are 25 percent or larger.
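The way erroneous measurements can be produced from error free simulated concentrations is sketched below. The exact perturbation expression is not recoverable from the text, so the multiplicative form used here, perturbed value = c × (1 + a × b), is an assumption based only on the definitions of a (maximum deviation, as a percentage) and b (random fraction between −1 and +1). The function name `perturb` and the concentration values are hypothetical, and b is drawn from a plain uniform generator rather than by LHS.

```python
import random


def perturb(concentrations, max_dev_percent, seed=0):
    """Return concentrations with relative errors of at most max_dev_percent."""
    rng = random.Random(seed)
    a = max_dev_percent / 100.0
    # b is drawn uniformly from [-1, 1]; the study obtains it by LHS instead.
    return [c * (1.0 + a * rng.uniform(-1.0, 1.0)) for c in concentrations]


clean = [12.0, 0.0, 48.5, 3.2]   # hypothetical simulated concentrations
noisy = perturb(clean, 10.0)     # up to 10 percent measurement error
```

Under this assumed form, a zero concentration stays zero and every perturbed value remains within the stated maximum deviation of its error free counterpart.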

Discussion
The performance evaluation results of the SOM based surrogate model are not entirely satisfactory. These very limited results show that it could approximate groundwater flow and transport simulation models properly. However, to increase the efficiency of the developed methodology, additional training incorporating different actual source location scenarios was developed. The evaluation results also indicated that the quantity and quality of the initial sample sets and the number of SOM map units have a crucial role in the efficiency of the model (Table 6 and Figure 6). In order to improve the accuracy of the solution results, the following strategies are suggested:

Conclusions
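Different scenarios corresponding to different surrogate models with various numbers of initial sample sets and Self-Organizing Map (SOM) map units are considered. Also, the performance of the developed methodology is evaluated by utilizing the SOM based surrogate model to identify potential contaminant sources for an ideal scenario of error free concentration data, as well as for scenarios with different degrees of erroneous concentration measurement data. In addition, an improved version of the SOM based surrogate model, i.e., the SOM based adaptive surrogate model (ASMBO), is constructed to characterize potential contaminant sources. The main conclusions that can be drawn from these limited performance evaluation results are:
1) SOM based surrogate models are potentially efficient methods to approximate groundwater flow and transport simulation models. The developed methodology can be used as an alternative methodology for unknown groundwater contaminant source characterization, which can potentially eliminate the necessity of using other widely used methodologies, i.e., the linked simulation optimization methodology.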
2) The quality and size of the initial sample are important. The sample should be adequate in size and cover the whole plausible range of contaminant source fluxes for all the potential contaminant sources.
3) The number of SOM map units is important. The best size should be selected according to the memory of the computer used, the number of variables, and the initial sample size.
4) The performance evaluation results do show comparatively large errors in terms of the specific error criterion utilized. However, a comparison of the source estimates and the actual source characteristics shows a good match.
5) The most important conclusion is that the SOM based surrogate models may provide a feasible methodology for characterization/identification of unknown groundwater contaminant sources in terms of location, magnitude and duration of source activity, without the necessity of using a linked simulation optimization model, when the ASMBO methodology is adopted. However, it appears likely that the accuracy of characterization may not be adequate in real life scenarios with multiple sources, complex aquifer hydrogeology, and parameter estimation uncertainties.
6) The SOM based models seem to perform satisfactorily when concentration measurement data are erroneous.
7) The performance evaluation results presented in this study are very limited in scope, being based on very limited scenarios. More rigorous performance evaluations, incorporating random heterogeneity of hydrogeologic parameters and more complex geochemical processes, are necessary to establish the applicability of the proposed methodology for source identification without using any optimal decision model.

$K_{xx}$, $K_{yy}$ and $K_{zz}$ are the hydraulic conductivities along the x, y, and z coordinate axes (L/T); h is the potentiometric head (L); W is a volumetric flux per unit volume representing sources and sinks of water, where a negative value represents withdrawal from the groundwater system and a positive value represents inflow ($T^{-1}$).
Figure 1. (a) The SOM algorithm for clustering and visualization; (b) The prediction process for missing components of system's new input vectors.

Figure 2 illustrates the main stages of constructing a SOM based surrogate model and SOM based adaptive surrogate model for source identification. These stages are briefly discussed in the following paragraphs.

Figure 2. Key elements of the Adaptive Surrogate Model based Optimization (ASMBO) procedure for source identification as an inverse method.

(ML1 to ML6) and two abstraction wells (W1 and W2); these important features are shown in Figure 3. The total time of simulation is divided into 5 different stress periods (SP1 to SP5). The first four stress periods are each of 183 days duration, and the last stress period is of 2200 days duration. Potential contaminant sources are assumed to be active only in the first four stress periods. The abstraction rates for each stress period at the abstraction wells are presented in Table 3.

Figure 3. Illustrative study area represents potential contaminant source locations, abstraction wells and monitoring locations.

2) Generating training data: the solution results of the numerical simulation models for the generated initial sample sets are obtained in this step. The numerical flow and transport simulation models MODFLOW and MT3DMS (within GMS 7) are solved to obtain adequate sample data for training and testing of the surrogate models. Figure 4 shows a typical contaminant plume 732 days after the start of the first source activity. The training data consist of randomly generated contaminant source fluxes and their corresponding contaminant concentration values at the specified monitoring locations at specified times.

Figure 4. A typical concentration plume 732 days after start of first source activity.

Figure 5. The results obtained from SOM based surrogate model for estimating the contaminant concentration values at selected monitoring locations (NAEE is equal to 15 percent).

Figure 6. Required times for developing different SOM based surrogate models representing different numbers of SOM map units.

Figure 7. The results obtained from the selected SOM based surrogate models for source identification of actual contaminant source fluxes (NAEE is equal to 31 percent).

Figure 8. The performance evaluation of the SOM based adaptive surrogate models and the selected SOM based surrogate model in terms of NAEE for characterizing unknown contaminant sources (NAEE equal to 20 and 31 percent, respectively).

Figure 9. The performance evaluation results of the SOM based surrogate models and SOM based adaptive surrogate models in terms of NAEE.
utilizing the SOM based surrogate model, to identify potential contaminant sources, for an ideal scenario of error free concentration data, as well as scenarios with different degrees of erroneous concentration measurements data. In addition, an improved version of SOM based surrogate model, i.e., SOM based adaptive surrogate model (ASMBO) is constructed to characterize potential contaminant sources. Main conclusions that can be drawn from these limited performance evaluation results are: 1) SOM based surrogate models are potentially efficient methods to approximate groundwater flow and transport simulation models. The developed me-
Table 1 shows the aquifer characteristic values and dimensions of this study area. In this study area, the north and south boundaries are considered as specified head boundaries, with specified heads of 35 m and 25 m for the north and south boundaries, respectively, whereas the east and west boundaries are variable head boundaries. In this case, only a conservative contaminant is considered, and three potential contaminant source locations are considered (S1, S2, and S3). The locations and actual contaminant fluxes of these three potential contaminant sources are presented in Table 2. There are six monitoring locations

Table 2. The locations and actual contaminant fluxes of three potential contaminant sources.

Table 3. Abstraction well locations and abstraction rates in different stress periods.

Table 4 represents a typical input for training of a SOM based surrogate model. This input consists of five sample sets. Each set consists of randomly generated contaminant source fluxes for three potential contaminant sources at four stress periods (SP1 to SP4).

Table 4. Typical input vectors for training a SOM based surrogate model.

Table 5. A typical input vector with missing data for testing a SOM based surrogate model.
5) The selected SOM based surrogate model: the selected SOM based surrogate model is used to characterize the unknown groundwater contaminant sources as an inverse problem and for further performance evaluation.
6) SOM based adaptive surrogate model: it is supposed that SOM based adaptive surrogate models could improve the source characterization results. Therefore, the SOM based adaptive surrogate model is constructed for the contaminated aquifer by adding new sample sets based on the preliminary results of the selected SOM based surrogate model (i.e., new sample patterns are randomly generated with emphasis on the preliminary or latest source estimation results). 500 new sample sets are generated by utilizing LHS and considering the results obtained by utilizing the SOM based surrogate model for source identification.
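One possible way to generate the additional sample sets for the adaptive surrogate model is sketched below. The text does not state how the preliminary estimates steer the new samples, so the approach shown here, LHS within a window around each estimated flux clipped to the original 0-100 kg/day bounds, is purely an assumption. The names `lhs` and `adaptive_samples`, the window half-width and the preliminary flux values are hypothetical.

```python
import random


def lhs(n, bounds, rng):
    """Latin Hypercube Sampling: one point per stratum for every variable."""
    cols = []
    for low, high in bounds:
        u = [(k + rng.random()) / n for k in range(n)]
        rng.shuffle(u)
        cols.append([low + x * (high - low) for x in u])
    return [list(row) for row in zip(*cols)]


def adaptive_samples(estimate, n_new, half_width, lower=0.0, upper=100.0, seed=0):
    """New sample sets concentrated around a preliminary source estimate."""
    rng = random.Random(seed)
    # Narrow each variable's range to a window around the latest estimate,
    # clipped to the original bounds (the window size is an assumption).
    bounds = [(max(lower, q - half_width), min(upper, q + half_width))
              for q in estimate]
    return lhs(n_new, bounds, rng)


preliminary = [40.0, 5.0, 95.0, 60.0]   # hypothetical flux estimates, kg/day
new_sets = adaptive_samples(preliminary, n_new=500, half_width=15.0)
```

The new sample sets would then be run through the flow and transport simulation models and appended to the training data before the SOM is retrained.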

Table 6. The performance evaluation of different scenarios representing different numbers of SOM map units.
It can be concluded that the SOM based surrogate model and the SOM based adaptive surrogate model could be utilized to identify unknown characteristics of potential contaminant sources in contaminated aquifers. They could also be applied to estimate the contaminant concentration values at specified monitoring locations if the contaminant sources are known. In particular, if additional information based on earlier estimates of the contaminant source characteristics is incorporated in the training stage, the new samples can increase the efficiency in terms of more accurate estimation. This is essentially the adaptive surrogate model based optimization approach. One of the advantages of this methodology is the consistency of solution results for ideal scenarios (error free concentration measurements) and realistic scenarios (when the contaminant concentration measurements incorporate errors of up to 20 percent). This observation may be relevant only when limited numbers of initial samples are utilized. Therefore, the method selected to generate relevant initial sample sets has an important role in the solution results. Also, utilizing a sufficient number of sample sets is necessary.