Characterization of Groundwater Pollution Sources with Unknown Release Time History

Characterizations of unknown groundwater pollution sources in terms of source location, source flux release history and sources activity initiation times, from sparse observation concentration measurements are a challenging task. Optimization-based methods are often applied to solve groundwater pollution source characterization problem. These methods are effective only when the starting times of activity of the sources are precisely known, or the possible time window within which the sources activity actually start is known with a fair degree of certainty. However, in real life scenarios, the starting time of the activity of the sources is either unknown or can lie anywhere within a time window of years or decades. Absence of any prior information about the span of time window, within which the sources become active, makes existing source identification methodologies inefficient. As an alternative, an optimization-based source identification model is proposed, to simultaneously estimate source flux release history and sources activity initiation times. The method considers source flux release history and sources activity initiation times as explicit decision variables, optimally estimated by the decision model. Performance of the developed methodology is evaluated for an illustrative study area having multiple sources with different source activity initiation times, missing observation data and transient flow conditions. These evaluation results demonstrate the potential applicability of the proposed methodology and its capability to correctly estimate the unknown source flux releasing history and sources activity initiation times.


Introduction
In the event of detection of any pollutant in the groundwater system, the next step is to design an effective remediation strategy for reclaiming a polluted aquifer.This requires precise information of the source characteristics in terms of source location, source flux release history and sources activity initiation times.The process of flow and transport in the groundwater system is so slow, that many times pollution in an aquifer is detected much after the pollution sources have become active or may even have ceased to exist.Therefore, even a rough guess of the time span within which the source activity has started may be very difficult.
Optimization based methodologies developed so far for unknown source flux characterization which relies on some knowledge of the approximate starting time of the sources activities.In all real world problems, the starting time of the activity of the sources is either unknown or can lie anywhere within a time window of years or decades.Absence of any prior information about the starting time of activity of the pollution sources, or the span of time window, within which the sources becomes active, renders the existing source identification methodologies inefficient.To overcome this, a methodology is developed for simultaneously estimating source flux releasing history and sources activity initiation times, using a new optimal decision model that can efficiently solve the source characterization problem, when it is impossible to specify a small time span within which the source activity may have started.
All existing optimization based source identification techniques implicitly assume that the starting time of the activity of the sources is precisely known, or the time span for the possible start of the sources activity is known with fair degree of certainty.This is because, in the existing methodologies, the starting time is not treated as an explicit decision variable, but generally estimated indirectly by solving for source fluxes magnitudes at defined discretised time intervals.The following scenarios represent the existing methods and their limitations.
1) The actual starting time of the source activity is assumed to be precisely known.This is an impractical assumption and in all real world scenarios, there is seldom any clue about the actual starting time of the sources.As a result most of these developed methods cannot be applied to such real world scenarios.
2) In scenarios where the time span for the possible start of the sources activities is known with fair degree of certainty, a typical range being of 1 to 5 years, the precise actual starting time of the sources is estimated indirectly by estimating the source fluxes magnitude for entire time span, discretised into smaller stress periods.As a result, large number of decision variables is added to the optimization problem.Each of these decision variables represent the source flux magnitude for a potential/actual source for a given stress period.Hence, the optimization problem needs to be solved for a large number of source flux magnitudes at different time intervals.Such a solution may theoretically seem possible but with every added decision variable, the dimension of the search space increases, leading to exponential rise in the computational time.This indirect approach does not represent an efficient formulation of the optimum decision model.
3) In the third scenario where there is no information of time span for the possible start of the sources activities, solution of source identification becomes highly challenging.If the solution approach as discussed for the previous scenario is applied to solve for the actual starting time, would mean solving for hundreds of source flux magnitudes for a very large number of stress periods, for a very large time span.This would not only increase the size of the optimization problem, but may render it computationally infeasible.
4) The solution methodology for all the above mentioned scenarios implicitly assumes that the actual starting time of the sources will always lie within a specified time span considered in the study.If this is untrue, the existing methods fail to give any meaningful result.
To address these limitations, a new methodology is proposed for simultaneously identifying the starting times of the activity of the sources and their flux release history.A new optimum decision model is formulated and solved for this purpose.SA is used for solving the optimization problem with starting times of the sources as the explicit decision variable.Performance of the proposed methodology is evaluated by solving an illustrative example problem for different sources activity initiation time scenarios, transient flow conditions and missing observation data to demonstrate its potential applicability.

Methodology
The proposed methodology incorporates the starting time of the activity of the source as explicit unknown variable in the optimization based decision model.Simulated Annealing is used for solving the optimization problem with an objective of minimizing the residual error between the simulated and observed pollutant concentrations measurements at observed locations.The optimization model is solved such that the unknown source fluxes and the stating times of the activity of the sources are the direct solutions of the source identification problem.

Source Identification Using Linked Simulation-Optimization Model
A linked simulation optimization model simulates the physical process within the optimization model.Therefore the simulation model is treated as a set of binding constraints for the optimization model.Therefore any feasible solution of the optimization model needs to satisfy the flow and the transport simulation model.The advantage of this approach is that it is possible to link any complex numerical model to the optimization model.In this simultaneous optimal source flux and activity starting time identification model, the flow and transport simulation models are linked to the optimization model using the SA algorithm for solution.

Groundwater Flow Simulation Model
A three dimensional numerical model MODFLOW [38] is used to simulate the groundwater flow in the polluted aquifer.MODFLOW is a computer program that numerically solves the three-dimensional groundwater flow equation for a porous medium by using a finite-difference method.The partial differential equation for groundwater flow [39] is given by Equation (1): where K xx , K yy and K zz represent the values of hydraulic conductivity along the x, y and z coordinate axes (LT −1 ) h is the potentiometric head (L) W is the volumetric flux per unit volume representing sources and/or sinks (T −1 ) S s is the specific storage of the porous material (L −1 ) t is time (T) x, y and z are the cartesian co-ordinates (L)

Solute Transport Model in Groundwater System
A three dimensional modular pollutant transport model MT3DMS [40] is used to simulate the solute transport in the polluted aquifer system.The transport model (MT3DMS) utilizes the flow field generated by the flow model (MODFLOW) to compute the pollutant plume.The partial differential equation describing three-dimensional transport of pollutants in groundwater [41] is given by Equation ( 2).
( ) where C is the concentration of pollutants dissolved in groundwater (ML −3 ) t is time (T) x i is the distance along the respective Cartesian coordinate axis (L) D ij is the hydrodynamic dispersion coefficient tensor (L 2 T −1 ) v i is the seepage or linear pore water velocity (LT −1 ); it it is related to the specific discharge or Darcy flux through the relationship, v i = q i /θ q s is volumetric flux of water per unit volume of aquifer representing fluid sources (positive) and sinks (negative) (T −1 ) C s is the concentration of the sources or sinks (ML −3 ) θ is the porosity of the porous medium (dimension less) ∑ is chemical reaction term for each of the N species considered (ML −3 T −1 ).

Optimization Model
Simulated Annealing (SA) is used as an optimization algorithm to solve the optimization problem.Simulated Annealing was first introduced by [42], is an extension of the metropolis algorithm [43].The basic concept of SA is derived from thermodynamics.Each step of SA algorithm replaces the current solution by a random nearby solution, chosen with a probability that depends on the difference between the corresponding function values and algorithm control parameters (initial temperature, temperature reduction factor etc.).

Simultaneous Source Flux and Activity Starting Time Identification Model
In source identification problem where the starting times of the activity of the sources is known, temporal pollutant source fluxes from all the potential sources, represented by the term s s q C in the transport Equation ( 2) are the only explicit decision variable.Source flux identification using linked simulation-optimization is solved by minimizing the difference between the simulated and the observed concentration measurements in space and time.Hence, source identification problems need observed concentration measurement data at different locations and times.Actually, typical concentration measurements at a given observation location, represents only a small portion of the entire concentration versus time plot (breakthrough curve) measured at discrete time intervals, as shown in Figure 1.If multiple sources are active at different locations and times, the concentration measurements at an observation location over a period of time represent a portion of the combined breakthrough curve at that observation location.
If the starting time of the activity of the sources is assumed to be known, the simulated concentration measurements can be correctly mapped in time to the observed concentration measurements to calculate the residual error at the observed locations.However, in real life scenario, the starting time of the source is mostly unknown.Hence it becomes unclear which temporal concentration measurement values on the simulated breakthrough curve, corresponds in time to the observed concentration measurements, and can be used to calculate the residual error at the observed locations.This can be understood from the illustrative example shown below.
Figure 2 shows pollutant plumes in the polluted aquifer at 1200, 1400, 1600 and 1800 days since the start of any source activity.It is evident from Figure 2, that the pollutant plume is dynamic in nature.This implies that the observed concentration measurement at an observation location will vary in magnitude when taken at different times.However, if the starting time of the sources is unknown, it cannot be determined if the resulting plumes and the corresponding concentration measurements are for 1200, 1400, 1600 and 1800 days since the start of any source activity, or for a different set of days.A meaningful comparison between the observed pollutant plumes and the simulated pollutant plumes is only possible when they correspond to the same time.Hence a correct estimate of the source fluxes is only possible when the actual and estimated plumes are compared co-  rrectly in time and space.Hence, the time lags between the start of the activity of the sources and the observed concentration measurement needs to be ascertained in order to estimate the sources activity starting times.
To overcome the limitation of the earlier proposed methodologies, an explicit decision variable, time lag T ∆ as explained below is defined in the source identification model.T ∆ is the time between the first concentra- tion measurement and the actual source activity initiation time.Figure 1 shows a typical breakthrough curve at any observation location iob with the source activity plotted on the x axis.The x axis represents the time axis, primary y axis represents the source flux (g/s) and the secondary y axis represents the concentration of the dissolved pollutant in groundwater at observation location iob (g/lit).Only a small portion of the breakthrough curve, shown by solid line in Figure 1, represents concentration measurements of the pollutant at observation location iob .
The following variables are used in the formulation of the source identification model: The objective function is defined as: ( ) f q C t represents the simulated concentration obtained from the transport simulation model at an observation location at time t and source flux s s q C : nob is the total number of concentration measurement wells; Abs is the absolute difference.The constraint set in Equation ( 6) essentially represents the linking of the optimization algorithm with the numerical groundwater flow and transport simulation model through the decision variables.The optimization algorithm searches for optimal values of unknown pollutant source fluxes, s s q C the lag time T ∆ and

( )
T i δ (i is an integer variable varying from 1 to maximum number of sources) by generating candidate solutions of these decision variables in optimization algorithm.Candidate values of fluxes s s q C are used as input for simu- lations of flow and transport models.Thus the lag time between the start of any source ( ) S i and first concentration measurement obtained from the site is given by ( ) This optimal search process with different candidate values of unknown variables results in an optimal solution that minimizes residual error between the simulated and observed pollutant concentrations.It is to be noted that T ∆ and ( ) δ are decision variables whose value is to be optimally determined to ascertain the times of initiation of the pollutant source fluxes.The optimal value of lag time T ∆ and ( ) gives the best estimate of the actual source fluxes starting times act T obtained as solution.

Performance Evaluation of Developed Methodology
The performance of the proposed methodology is evaluated for an illustrative polluted aquifer study area as shown in Figure 3.The performance is evaluated for different scenarios by varying source activity initiation times, varying the degree of spatial heterogeneity in the study area and different combination of observation locations.The study area is specified as comprising of heterogeneous, anisotropic, and confined aquifer.This study area has total dimension of 2100 meters in the x direction and 1950 meters in the y direction.The entire study area is discretised into smaller grids of size Δx, Δy and Δz in x, y and z direction respectively.The study area contains three different hydro-geological zones with different values of hydraulic conductivity K xx and po- rosity θ.Groundwater flow and solute transport processes are simulated using the value for saturated thickness of the aquifer b, longitudinal dispersivity αL, transverse dispersivity αT and horizontal anisotropyas given in Table 1.
In the discretised study area, cells marked with red star represent the grid locations containing a pollutant source S(i) where i represents the source number.Cells marked with green and purple circles are the grid locations containing an observation well.
In the first scenario, all the three sources are assumed to have different activity initiation times (Table 2) and activity duration of 900 days.The entire activity duration of the sources is divided into three equal stress periods of 300 days.The pollutant flux from the sources is assumed to be constant over a stress period.The pollutant flux from each of the sources is represented as S(i)(j), where i represents the source number and j represents the stress period number.A total of nine source fluxes S11, S12, S13, S21, S22, S23, S31, S32, S33, T ∆ , ( ) Since all flow and pollutant transport in groundwater system is inherently transient, the performance of the developed methodology was evaluated for transient flow and solute transport condition in the second scenario.Three different sources are considered and it is assumed that all the three sources start activity at the same time.It is also assumed that the starting time of the activity of the sources is same as the starting time of the simulation sim T , hence ( ) ( ) ( ) . The activity duration of the sources is divided into three equal stress periods of 364 days in this scenario.The pollutant flux from the sources is assumed to be constant over a

Simulation of Observed Concentration
In actual field application, the concentration measurements are to be obtained from field data.However, for performance evaluation purpose, these measurements are synthetically generated by simulation for assumed actual pollutant sources.The observed aquifer responses are simulated by numerical simulation models, MO-DFLOW and MT3DMS in GMS7.0.Initial and boundary conditions i.e. initial heads, boundary heads are specified in the numerical simulation models.Actual source fluxes are utilized to simulate these observed concentration measurement data at specified measurement locations.The time of start of the numerical simulation model for synthetically generating the observed concentration measurements is designated as the initial time ( ) act T .This initial time can be defined with respect to a calendar date, which in practice would correspond to the actual starting time of the sources.The time lag between the start of the numerical simulation model for synthetically generating the observed concentration measurements, and the first concentration measurement used for source identification is also specified.
In the first evaluation scenario considered, the time lag between the time of first pollutant concentration measurement ( ) m t used in the source identification model, and the actual starting time of the sources activity ( ) ( ) act T i , is 500, 200 and 800 days for source S1, S2 and S3, respectively.Concentration measurements are taken at discrete time intervals starting from the first pollutant concentration measurement time m t for different observation wells as shown in Table 3.The cross mark indicates the missing data for an observation well at a given observation time step.A total of four temporal pollutant concentration measurements from each of the six observation wells shown by purple circle (Figure 3) are utilized.Groundwater flow is assumed to be steady state.
In the second evaluation scenario all the sources are assumed to start at the same time and the time lag be- concentration measurements from each of the six observation wells are utilized.

Evaluation of Methodology Using Erroneous Concentration Measurement Data
In order to reflect real life condition, where the pollutant measurements are erroneous, the numerically simulated concentrations were perturbed to incorporate measurement errors.The observed concentrations generated using MODFLOW and MT3DMS were perturbed by adding error terms to the simulated measurement data to represent the effect of random measurement errors.These errors were added in order to incorporate realistic measurement errors.The observed pollutant concentration data is perturbed with random measurement error with maximum deviation of 10 percent of the measured concentration value as shown in Equation ( 7).
( ) err per rand where a Pert t iob cobs is the perturbed simulated erroneous concentration measurement at location iob at time ; a t err is error term; per µ is maximum deviation expressed as percentage; rand is a random fraction between −1.0 and +1.0 generated using a Latin hypercube distribution.Latin hypercube distribution is chosen for generating random error data evenly distributed across all class intervals, thus eliminating any clustering of sample data in few of the class intervals.

Solution Procedure
The proposed methodology for source flux identification and estimation of source activity initiation time is evaluated using synthetically generated observed concentration measurements as explained in the previous section.The source identification model simulates the aquifer response for a period of 5000 days.Concentration measurements are recorded at interval of 200 days for scenario 1 and 182 days for Scenario 2, at specified observa- ∆ for individual sources.Optimal source identification is evaluated for scenarios (Table 2), first using er- ror free concentration measurement data and then using concentration measurement data perturbed with random error.

Results and Discussion
The evaluation results of scenarios 1 using error free concentration measurement data and perturbed erroneous concentration measurement data is presented in 800 -0 = 800 days.In Scenario 2, the actual lag time is directly given by the estimated value of ∆T which is same for all three sources as they all started at the same time.
Even while using perturbed concentration measurement data in the identification model, the time lag between the start of source activity and the first pollutant concentration measurement is accurately estimated leading to accurate identification of the starting time of the activity of the sources.The solution results for source flux estimates using erroneous concentration measurements data show large errors of estimation in comparison to error free measurement data.These deviations between the actual and the estimated value of the source fluxes show the effect of errors in concentration measurement data, which accounts for random measurement errors in a real world scenario.
It can be seen in Figure 5 that while using perturbed concentration measurement the source fluxes estimated in Scenario 2 having transient flow condition shows large deviation in estimating S11, S21 and S31.On analysing the concentration measurements from the observation locations it was found that after a lag time of 1638 days, as was the case in Scenario 2, most of the pollutant plume had already passed over the observation well locations used in this scenario.As a result, only the tail end of the breakthrough curve was captured in the observed pollutant concentrations.It is also highly probable that the observed pollutant concentration from the tail end of the breakthrough curve was due to the later part of source activity, and the effect of the initial source activity was not captured by the tail end of the breakthrough curve.This could explain the reason for large deviations in estimating the flux values from all the three sources during stress period one using erroneous concentration measurement data.This also stresses the need for optimal monitoring well location for source identification.

Summary and Conclusions
This developed methodology for simultaneous identification of the source fluxes and their starting time appears to perform satisfactorily in estimating the unknown groundwater pollution source fluxes and their starting times for the illustrative example.These results show that this proposed methodology is capable of overcoming major limitations in the earlier methods.The main limitations of these earlier proposed source identification methods were: 1) It assumed that the starting time of the activity of the sources is precisely known; 2) In scenarios where the time span for the possible start of the sources activities is known with fair degree of certainty, actual starting time was estimated indirectly by solving for source flux magnitudes; 3) In scenarios, where there was no information of the time span for the possible start of the sources activities, solving indirectly for a large number source flux magnitudes over a very large time span, rendered the earlier methods computationally infeasible.
The proposed methodology is applicable to real world problems of source identification in polluted aquifer sites, where no information is available about the starting times of the sources activity and the uncertainty spans over a large time period.This methodology in its current form can estimate the activity starting times for all the potential sources both in steady state and transient conditions.It can also handle multiple sources having different source activity initiation times and missing observation data.
The main advantage of the proposed methodology is its capability to treat the starting time of source activity as an explicit decision variable.This capability has the potential to render many real life unknown groundwater pollution source identification problems computationally feasible.This will ultimately enhance the capability of addressing groundwater pollution remediation issue in complex contaminated aquifer study areas, where there is very little prior knowledge of the source location, source flux release history, and the source activity initiation time.

Notation
The following symbols are used in this paper:

Figure 1 .
Figure 1.Details of variables with respect to an observed breakthrough curve at a monitoring location.

Figure 2 .
Figure 2. Pollutant plume in the polluted aquifer study area at different times.
date representing the actual starting time of the activity of a source; sim T = a calendar date representing the starting time of the simulation for any particular candidate solution in the optimization model; ( ) T i δ = time interval between the starting time of the simulation and the starting time of the activity of source ( )S i , for any particular candidate solution in the optimization model (i is an integer variable varying from 1to maximum number of sources); m t = a calendar date representing the first concentration measurement obtained from the site; t ∆ = time interval between concentration measurements; (T) N = number of concentration measurements available (assumed same for all measurement locations) such that measurement at observation location iob at time a t (ML −3 ); s t iob cest = corresponding simulated concentration at observation location iob at time s t (ML −3 ).

Figure 3 .
Figure 3. Plan view of the illustrative study area.
explicit unknown variables in the optimization problem.The hydraulic conductivity K xx and effective porosity θ values in the three zones are kept different to show heterogeneity in the aquifer system.The scenario is further complicated by considering irregular monitoring for pollutant concentration measurements and missing observation measurement data Cobs.
tween the time of first pollutant concentration measurement ( ) m t used in the source identification model, and the actual starting time of the sources activity ( ) act T is 1638 days.A second set of monitoring well locations shown by green circle in Figure 3 is used as observation locations.Concentration measurements are taken every 182 days, starting from the first pollutant concentration measurement time t m .A total of four temporal pollutant tion locations.Since the actual starting time ( ) act T of the activity of the sources is not known to the source identification model, it assumes the simulation starting time ( ) calendar date before or after the actual starting time ( ) act T of the sources.Candidate values of unknown source fluxes and lag times are generated within the optimization model.The generated flux values are used for simulation of flow and transport process as a part of the linked simulation-optimization model.Value of time lag T ∆ obtained as solution of the source identification model deter- mines the temporal spacing between the assumed starting time ( ) sim T of the sources, and first concentration measurement ( ) m t , in Scenario 2, as all the sources start at the same time and coincides with the simulation starting time.In Scenario 1, the starting time of the individual sources is determined by subtracting

Figure 4 . 5 .
The solution results of identification using error free data and perturbed error data are compared to the actual source flux and lag time.Each of the unknown source flux variables S(i)(j) and lag time ∆T, the x axis having three corresponding bars.The first bar is the actual value.The second bar represents the estimated values using error free concentration measurements.The third bar represents the estimated values using concentration measurements with perturbed error.The evaluation results of source flux identification and starting times for Scenario 2 having transient flow condition is shown is Figure The same convention is followed for Scenario 2. However, unlike Scenario 1 as all the three sources are assumed to start at the same time corresponding to the simulation starting time ( ) sim T, hence only ∆Tlag variables is estimated as other lag variables Results of source flux identification using error free data closely match with the actual source fluxes values for all source fluxes in both the scenarios.The time lag between the start of source activity and the first pollutant concentration measurement is precisely identified by the model in both the scenarios.In Scenario 1, the estimated time lags can be used to estimate the starting time of the activity of the sources by subtraction ( ) T i δ from ∆T for each of the respective sources ( ) S i .Using these estimates of time lags in Scenario 1, the actual lag time between the starting of the source activity and the first concentration measurement is estimated as; ( ) 1 S 800 -300 = 500 days; ( ) 2 S 800 -600 = 200 days; ( ) 3 S
the actual starting time of the activity of the sources; sim T = calendar date representing the starting time of the simulation for any particular candidate solution in the optimization model; T ∆ = time lag between the first measurement of concentration and a candidate starting time for the source; m t = calendar date representing the first concentration measurement obtained from the site; the starting time of the simulation and the starting time of the activity of source ( ) water per unit volume of aquifer representing sources; for each of the N species considered; observation location iob at time t s ; at observation location iob at time t a ;

Table 1 .
Hydro-geological parameters for study area.

Table 3 .
Missing concentration measurement data for Scenario 1.