Coupling Singular Spectrum Analysis with Artificial Neural Network to Improve Accuracy of Sediment Load Prediction

Sediment load estimation is generally required for the study and development of water resources systems. In this regard, the artificial neural network (ANN) is the most widely used modeling tool, especially in data-constrained regions. This research attempts to combine singular spectrum analysis (SSA) with ANN, hereafter called the SSA-ANN model, with the expectation of improving the accuracy of sediment load predicted by the existing ANN approach. Two different catchments located in the Lower Mekong Basin (LMB) were selected for the study, and the model performance was measured by several statistical indices. Compared with ANN, the proposed SSA-ANN model performs better consistently in both catchments. In the validation stage, the Nash-Sutcliffe Efficiency of SSA-ANN is larger than that of ANN by about 24% in Ban Nong Kiang catchment and 7% in Nam Mae Pun Luang catchment. The other statistical measures of SSA-ANN are better than those of ANN as well. This improvement reveals the importance of SSA, which filters the noise contained in the raw time series and transforms the original input data toward a near-normal distribution that is favorable to model simulation. The coupled model is also recommended for the prediction of other water resources variables because extra input data are not required; only one additional computation, time series decomposition, is needed. The proposed technique could potentially be used to minimize the costly operation of sediment measurement in the LMB, which is relatively rich in hydrometeorological records.


Introduction
Quantification of sediment load is necessary for the study and development of water resources systems such as reservoir storage, dams, irrigation/navigation channels, soil and water conservation measures, environmental impact assessment, etc. [1][2][3][4][5]. Sediments are the end products of land surface erosion governed mainly by hydrometeorology, topography, geology and land use/cover [1,2]. Sediment data are lacking for rivers in many areas of the world, especially in developing and remote regions [6]. However, sediment load can be estimated with the aid of modeling approaches. The hydrologic and terrain conditions of a river basin change spatio-temporally, and this causes difficulties in determining their effects on sediment erosion and transport. This drawback has encouraged the application of black box models, e.g. the artificial neural network (ANN). ANN forecasts outputs using experience learned from historical data. Its application can be found in many sectors including finance, medicine, water resources, and so forth. There are many types of ANN, and the recognized ones are feedforward, Kohonen and Hopfield networks [7]. In predicting and forecasting water resources variables, feedforward networks are almost exclusively applied [8]. The term "ANN" used in this paper refers to the feedforward artificial neural network.
The ANN model is commonly used in river basins with data scarcity because it does not require detailed physical information of the system. Given only hydrometeorological information as inputs, ANN can predict sediment load at the watershed outlets with high accuracy. Kisi and Shiri [5] applied ANNs to predict suspended sediment concentration (SSC) in Eel River (USA) with rainfall and discharge as inputs and obtained very satisfactory results, with Nash-Sutcliffe Efficiency (NSE) between 0.80 and 0.84 in the validation stage. Sediment yield of various sub-watersheds in Kapgari River Basin (India) is modeled well by ANN (input: rainfall and temperature), with NSE ranging from 0.76 to 0.83 in the validation stage [9]. In Pari River (Malaysia), ANNs (input: discharge) perform very well in simulating suspended sediment load (SSL), with NSE equal to 0.99 and 1.00 in the validation stage [10]. ANN can also be employed to analyze the hysteretic phenomenon of sediment transport [11]. It is a very practical and promising modeling tool in the context of sediment load prediction [12], and its outputs can potentially be used for design and management purposes in water-related development projects [7].
Although ANN has been shown to perform well in modeling sediment load and other hydrological variables, many studies have gone further to improve its accuracy by coupling it with other methods. Sediment load is generally predicted from hydrometeorological variables, the most common of which are rainfall and discharge. Naturally, the time series of such variables are very noisy due to the effects of climate variation and human activities. Thus, one common way to improve the prediction accuracy of ANN is to pre-process the inputs, and this requires another method. This kind of technique is known as a coupled approach and has been attracting more interest recently. Kisi [13] developed a range-dependent neural network (RDNN) for predicting sediment load at two stations operated by the US Geological Survey. RDNN splits the original data series into three ranges, which are afterward used as ANN inputs. In terms of model efficiency measured by the determination coefficient (R²), RDNN is slightly better than ANN, with R² larger by about 0.5% at Santa Clara Station, and both models perform comparably at Calleguas Station. In terms of root mean square error (RMSE) and mean absolute error (MAE), RDNN is much better than ANN at both stations.
The method selected for input pre-processing should ideally match the specific learning problem. In this study, singular spectrum analysis (SSA) was proposed because it is generally seen as an adaptive noise-reduction algorithm [14]. SSA is a tool that decomposes a time series into a number of components with simple structures, which can often be identified as trends, seasonality and other oscillatory series, or noise components, and it does not require any statistical assumptions while performing the analysis [15,16]. Applications of SSA in analyzing hydrometeorological time series (e.g. rainfall, discharge, temperature) can be found in Hanson et al. [17] and Marques et al. [18]. In particular, this method can be used to extract the main components of rainfall and discharge series and to provide good forecasts for them [18]. Sivapragasam et al. [14] combined SSA with the support vector machine method (hereafter called the SSA-SVM approach) to predict rainfall at Station 23 (Singapore) and runoff from Tryggevaelde catchment (Denmark), and the results were compared with those of the non-linear prediction (NLP) method. For rainfall prediction, SSA-SVM performs much better than NLP, with RMSE lower by 36% in the calibration stage and 28% in the validation stage. For runoff prediction, SSA-SVM is also superior to NLP, with RMSE lower by 64% in the calibration phase and 59% in the validation phase. To our knowledge, no study has yet associated SSA with ANN for predicting sediment load.
The present study attempts to combine SSA with ANN, hereafter called the SSA-ANN model, for sediment load prediction, with the expectation of obtaining more accurate results than with ANN alone. The specific objectives are to examine the application of the SSA-ANN model in predicting monthly average SSL (SSL m ) and to compare its performance with that of the existing ANN approach. The case study was first carried out in Ban Nong Kiang (BNK) catchment. In order to show consistency, another case study was conducted in Nam Mae Pun Luang (NMPL) catchment. Both catchments are located in the Lower Mekong Basin (LMB).

Study Catchments
The LMB is a trans-boundary river basin which partially covers four Southeast Asian countries: Lao PDR, Thailand, Cambodia and Vietnam. This basin is relatively rich in hydrometeorological records except for sediment [19].
As illustrated in Figure 1  soon as well but the amount is much higher than that in BNK catchment because NMPL catchment is oriented in the windward direction. Sediment yield of this catchment is about 58 t/year/km². The larger sediment yield can be explained by the topographic features of each catchment. Mosaics and shrub cover dominate the land use in the catchment, and the dominant soil type is Orthic Acrisols.
Gauging stations that are poor in terms of data availability and completeness are commonly found in developing and remote regions such as the LMB. These two catchments were selected based on data availability: 20 years (1982-2003, no data in 1986 and 1987) in BNK catchment and 22 years (1980-2001) in NMPL catchment.

Data
The main data used in this study are rainfall (R), discharge (Q) and suspended sediment load (SSL). SSL is the product of Q and SSC. The daily time series of R, Q and SSC were obtained from the Mekong River Commission. The R and Q series are continuous, but the SSC series are discontinuous, with few samples per month. The average sampling frequency in BNK and NMPL catchment is about 2 and 4 samples per month, respectively. This motivated conducting the study on a monthly basis. The monthly average SSL (SSL m ) is the product of monthly average Q (Q m ) and monthly average SSC (SSC m ). Q m and R m (monthly average rainfall) were employed as inputs for model calculation, and SSL m was used for comparison with the model outputs. Due to data limitation, the model inputs consist of only R m and Q m . Rainfall and discharge are the main erosion and transport agents [1,2], and both variables are generally used in many existing studies. Some case studies (e.g. Mustafa et al. [10], Memarian and Balasundram [20]) employ only discharge or rainfall as the input. There is no reason for this other than data unavailability. However, the model accuracy must pass the minimum satisfactory level. In this study, the entire dataset in each catchment was divided into two parts, the first 75% for calibration and the remaining 25% for validation. This combination (75:25) is very common in sediment modeling studies [21].
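The data preparation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `unit_factor` argument and the rounding rule for the split point are assumptions. The source does not state the units of Q and SSC; the factor 0.0864 applies only if Q is in m³/s and SSC in mg/L, in which case the product is converted to t/day.

```python
import numpy as np

def monthly_ssl(q_m, ssc_m, unit_factor=1.0):
    """Monthly average SSL as the product of monthly average Q and SSC.

    unit_factor is an assumption: 0.0864 converts (m^3/s) * (mg/L) to t/day.
    """
    return unit_factor * np.asarray(q_m, dtype=float) * np.asarray(ssc_m, dtype=float)

def chronological_split(series, calib_fraction=0.75):
    """Split a series in time order: first part for calibration, rest for validation."""
    series = np.asarray(series)
    n_calib = int(round(calib_fraction * len(series)))
    return series[:n_calib], series[n_calib:]

# Hypothetical example: 240 monthly values (20 years of record)
months = np.arange(240)
calib, valid = chronological_split(months)
print(len(calib), len(valid))
```

The split is chronological rather than random, which matches the description of "the first 75%" and "the remaining 25%".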
Land use changes and other human activities might cause great variation of sediment load over the simulation period (about 20 years), and this could lead to low accuracy of the model results. Based on the Mann-Kendall and Pettitt tests (0.01 significance level) on the SSL annual series, no significant trends or change points were detected in either of the two catchments. Therefore, it can be concluded that the SSL m data series used in this study are not significantly influenced by these effects.
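For readers unfamiliar with the trend test mentioned above, a minimal sketch of the Mann-Kendall test is given below. It uses the normal approximation without a correction for tied values, which is a simplifying assumption; the function name is hypothetical, and the Pettitt change-point test is not shown.

```python
import math
import numpy as np

def mann_kendall(x):
    """Mann-Kendall trend test (normal approximation, no tie correction).

    Returns the S statistic, the standardized Z score and the two-sided p-value.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = 0.0
    for i in range(n - 1):
        # Count later values that are larger (+1) or smaller (-1) than x[i]
        s += np.sign(x[i + 1:] - x[i]).sum()
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value
    return s, z, p

# Hypothetical annual series; a trend is declared significant if p < 0.01
s, z, p = mann_kendall([12.0, 15.0, 11.0, 14.0, 13.0, 16.0, 12.5, 14.5])
print(s, round(z, 3), p < 0.01)
```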
At the catchment outlets, it is very likely that there is a lag-time between R and SSL as well as between Q and SSL due to the clockwise hysteretic effect [22][23][24]. Hence, considering R and Q from the previous time step could improve the model accuracy. The present study, however, was conducted on a monthly time scale, so considering antecedent R and Q would have little effect on the model results because the lag-time is just a few days. Melesse et al. [3] applied an ANN model to simulate daily and weekly SSL in the Mississippi and Missouri Rivers (USA) by considering two different input combinations (I1 and I2). I1 includes one-day antecedent Q and I2 does not. On a daily basis, the model prediction using I1 is just slightly better than the one using I2: NSE (I1) is larger than NSE (I2) by about 6% in the Mississippi River and 3% in the Missouri River. On a weekly basis, the model efficiency decreases dramatically compared with the daily simulation, and NSE (I1) becomes less than NSE (I2) in the Missouri River. A similar situation is also observed in the case study of four rivers in Turkey conducted by Kisi et al. [25]. Consequently, the model performance is not expected to differ much for monthly simulation, which is why this research does not take antecedent R and Q into account.

ANN and SSA-ANN Model
ANN is a flexible and promising tool for modeling non-linear processes such as sediment transport. ANN structures differ mainly in their network architectures, training algorithms and transfer/activation functions. In this study, the multi-layer perceptron with the back-propagation algorithm and sigmoid transfer function was selected. This kind of structure is commonly used in water resources modeling and provides better results than others [7,8,26]. As presented in Figure 2, the designed model structure consists of 1 input, 1 hidden and 1 output layer. The input layer has two nodes, one for R m and another for Q m . The number of nodes in the hidden layer was determined by trial and error because, so far, there is no guideline for this purpose. The single node in the output layer is SSL m .
Firstly, each input node receives a set of input data (x), in this case R m and Q m . The connections between the input and hidden layer contain weights (w) which are determined through the system training. Then, in the hidden layer, the weighted sum of the inputs (z) is computed by using the summation function [21]:

$$z = \sum_{i} w_i x_i + \beta \qquad (1)$$

where w i is the weight vector, x i is the input vector and β is the bias term. Afterward, z is transferred to y (the output, in this case SSL m ) through the sigmoid transfer function [21]:

$$y = \frac{1}{1 + e^{-z}} \qquad (2)$$

In the output layer, y (the predicted SSL m ) is compared with the target value (the observed SSL m ) in order to detect the error, or difference, between the predicted and observed SSL m . Subsequently, the error is corrected by adjusting w. After assigning the new w, the same calculation steps are performed. This procedure is repeated until a desirable y or an acceptable level of error is obtained. To sum up, ANN model training is a process of weight adjustment attempting to produce a desirable outcome with minimum residuals.
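The training loop described above (weighted sum, sigmoid transfer, error detection, weight adjustment) can be sketched with NumPy as below. This is an illustrative sketch under several assumptions that are not stated in the source: inputs and targets are scaled to [0, 1], plain full-batch gradient descent is used, and the learning rate, epoch count, number of hidden nodes and function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(x, target, n_hidden=3, lr=0.5, epochs=3000, seed=0):
    """Train a 2-layer perceptron (inputs -> hidden -> 1 output) by back-propagation."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(0.0, 0.5, (x.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    w2 = rng.normal(0.0, 0.5, (n_hidden, 1))
    b2 = np.zeros(1)
    for _ in range(epochs):
        h = sigmoid(x @ w1 + b1)                # hidden layer: weighted sum + sigmoid
        y = sigmoid(h @ w2 + b2)                # output layer: predicted (scaled) SSL_m
        err = y - target                        # difference from the observed values
        d_out = err * y * (1.0 - y)             # gradient of half the squared error
        d_hid = (d_out @ w2.T) * h * (1.0 - h)  # error propagated back to the hidden layer
        w2 -= lr * h.T @ d_out / len(x)
        b2 -= lr * d_out.mean(axis=0)
        w1 -= lr * x.T @ d_hid / len(x)
        b1 -= lr * d_hid.mean(axis=0)
    return w1, b1, w2, b2

def predict(x, params):
    w1, b1, w2, b2 = params
    return sigmoid(sigmoid(x @ w1 + b1) @ w2 + b2)

# Synthetic stand-in for scaled (R_m, Q_m) inputs and a scaled SSL_m target
rng = np.random.default_rng(1)
x = rng.random((60, 2))
t = 0.2 + 0.6 * x[:, :1] * x[:, 1:]
params = train_mlp(x, t)
print(float(np.mean((predict(x, params) - t) ** 2)))
```

Because the output node also uses a sigmoid, targets outside (0, 1) cannot be reproduced, which is why scaling is assumed here.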
For the SSA-ANN model, the methodology is similar to that of ANN, but a new form of R m and Q m was used as inputs instead of the original one. SSA was applied to decompose the original datasets of R m and Q m into a number of components, which are then input to the ANN model for predicting SSL m . The SSA algorithm for one-dimensional time series analysis consists of 1) transformation of the original time series $F = (f_0, \ldots, f_{N-1})$ into the trajectory matrix $\mathbf{X} = [X_1 : \ldots : X_K]$, where $X_j = (f_{j-1}, \ldots, f_{j+L-2})^{\mathrm{T}}$ and $K = N - L + 1$, by means of a one-parameter (window length L) delay procedure; 2) singular value decomposition of the trajectory matrix into a sum of elementary matrices; 3) split of the elementary matrices into m groups and, within each group, determination of the summed matrices; and 4) transfer of each summed matrix into a new one-dimensional series of the same length N. The first two steps make up the decomposition stage and the remaining two the reconstruction stage. In short, the initial time series F is decomposed into the sum of m time series:

$$F = \sum_{k=1}^{m} F^{(k)}$$

The basic concept and detailed methodology of SSA can be found in Golyandina et al. [15].
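The four SSA steps listed above can be sketched in a few lines of NumPy. This is a basic SSA sketch, not the authors' implementation: the function name, the window length of 12 and the grouping (leading eigentriple as C1, all remaining ones as C2) are assumptions, since the source does not report the window length or the grouping rule.

```python
import numpy as np

def ssa_decompose(series, window, groups):
    """Basic SSA: embedding, SVD, grouping and diagonal averaging."""
    f = np.asarray(series, dtype=float)
    n = len(f)
    k = n - window + 1
    # Step 1: embedding into the L x K trajectory (Hankel) matrix
    traj = np.column_stack([f[j:j + window] for j in range(k)])
    # Step 2: singular value decomposition
    u, s, vt = np.linalg.svd(traj, full_matrices=False)
    components = []
    for idx in groups:
        idx = list(idx)
        # Step 3: sum the elementary matrices belonging to this group
        x_group = (u[:, idx] * s[idx]) @ vt[idx, :]
        # Step 4: diagonal averaging over anti-diagonals (i + j = const)
        flipped = x_group[::-1, :]
        comp = np.array([flipped.diagonal(d - (window - 1)).mean() for d in range(n)])
        components.append(comp)
    return np.array(components)

# Hypothetical monthly series: trend + annual cycle + noise
t = np.arange(120)
rng = np.random.default_rng(0)
f = 0.02 * t + np.sin(2 * np.pi * t / 12) + 0.1 * rng.standard_normal(120)
c1, c2 = ssa_decompose(f, window=12, groups=[[0], range(1, 12)])
print(np.allclose(c1 + c2, f))
```

When the groups together cover every eigentriple, as in this example, the components add back up to the original series, which mirrors the statement that F is decomposed into a sum of m series.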
In this paper, the original time series of R m and Q m were each decomposed into two components. Since this is a first trial study, numbers of components other than two were not examined, because many components would make model training difficult (time consuming). Optimizing the number of components is left for future study. In addition, interpreting the physical meaning of each extracted component is beyond the scope of this research. The main purpose here is to examine the potential of SSA in combination with ANN for SSL m prediction. The model structure of SSA-ANN designed for this particular study is illustrated in Figure 3.

Model Evaluation and Comparison
The efficiency of each model was measured by NSE, which is the most widely used goodness-of-fit indicator in predictive hydrological models. Basically, NSE compares the residual variance with the observed data variance and, at the same time, reflects the prediction accuracy of the modeling approach relative to the observed mean value [27]. A negative NSE indicates that the observed mean value is a better predictor than the model being used. With NSE greater than 0.50, the model performance is judged as satisfactory [28]. RMSE and MAE were also used to evaluate the SSL m prediction. Because the total suspended sediment load (SSL t ) is important for dam-reservoir management [13,26], the model performance for this purpose was also investigated, with absolute percentage bias (APBIAS) as the indicator. SSL t is the integral of the SSL m series within a particular period (calibration or validation period). The model result for SSL t prediction is considered acceptable if APBIAS is less than 55% [28]. NSE, RMSE, MAE and APBIAS were calculated using Equations (3)-(6), respectively [21,28]:

$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n} (O_i - P_i)^2}{\sum_{i=1}^{n} (O_i - O_{avg})^2} \qquad (3)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (O_i - P_i)^2} \qquad (4)$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| O_i - P_i \right| \qquad (5)$$

$$\mathrm{APBIAS} = \frac{\left| O_t - P_t \right|}{O_t} \times 100 \qquad (6)$$

where O is the observed SSL m with the mean value O avg , P is the predicted SSL m , n is the sample size, O t is the observed SSL t , and P t is the predicted SSL t .

Copyright © 2013 SciRes.

The statistical difference between the original time series and its extracted components is presented below.
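One common form of the four indicators, written in terms of the symbols O, P, O t and P t defined above, can be sketched as follows. The function names are hypothetical, and SSL t is approximated here by a plain sum of the monthly values, which is an assumption; weighting each month by its number of days would be a refinement.

```python
import numpy as np

def nse(obs, pred):
    """Nash-Sutcliffe Efficiency: 1 minus residual variance over observed variance."""
    obs, pred = np.asarray(obs, dtype=float), np.asarray(pred, dtype=float)
    return 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)

def rmse(obs, pred):
    obs, pred = np.asarray(obs, dtype=float), np.asarray(pred, dtype=float)
    return float(np.sqrt(np.mean((obs - pred) ** 2)))

def mae(obs, pred):
    obs, pred = np.asarray(obs, dtype=float), np.asarray(pred, dtype=float)
    return float(np.mean(np.abs(obs - pred)))

def apbias(obs, pred):
    """Absolute percentage bias of the period total (plain sum used as SSL_t)."""
    o_t, p_t = float(np.sum(obs)), float(np.sum(pred))
    return abs(o_t - p_t) / o_t * 100.0

# Hypothetical observed and predicted SSL_m values (t/day)
obs = np.array([5.0, 20.0, 80.0, 150.0, 60.0, 10.0])
pred = np.array([8.0, 18.0, 70.0, 140.0, 65.0, 12.0])
print(nse(obs, pred), rmse(obs, pred), mae(obs, pred), apbias(obs, pred))
```

With these definitions a prediction equal to the observed mean gives NSE = 0, which is the sense in which a negative NSE means the mean is the better predictor.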

Statistical Analysis of Datasets
The results of statistical analysis for both calibration and validation datasets are summarized in Table 1 (ANN datasets) and Table 2 (SSA-ANN datasets). The statistical parameters are the maximum (Max), minimum (Min), average (Mean), correlation coefficient (CC) between the inputs and the observed outputs, standard deviation (SD) and skewness coefficient (SKEW). SD is a measure of how widely the data are dispersed from the average value (Mean), while SKEW indicates the degree of asymmetry of a data distribution [29]. A normal distribution corresponds to a SKEW value of about zero. In BNK catchment, the extent of the validation datasets (both ANN and SSA-ANN datasets) is overall within the range of the calibration datasets. Although there are some over-extrapolations, e.g. the upper bound of the R m dataset of ANN (14.07 mm/day in the validation stage and 13.93 mm/day in the calibration stage), they are not significant. Discharge generally exhibits higher CC than rainfall, which suggests that SSL m is more dependent on discharge. Since C1 is the main component (for both rainfall and discharge), it has a higher CC value than C2. The values of SD and SKEW are generally low. It should be noted that high values of SD and SKEW have a negative effect on the model performance [3,30]. The SD and SKEW values of the calibration datasets are rather comparable with the corresponding ones of the validation datasets. This is appropriate for modeling because a great difference would lead to poor model performance in the validation stage [30].
Remarkably, the SSA-ANN inputs are characterized by lower SD and SKEW values than the ANN inputs, and this condition is favorable to the model simulation. This reveals the potential of SSA from a statistical point of view.
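The descriptive statistics used in this comparison can be computed as sketched below. The exact estimators behind Tables 1 and 2 are not stated in the source, so the choices here are assumptions: sample standard deviation (n - 1 denominator), moment-based skewness, and Pearson correlation for CC. The function name and the example data are hypothetical.

```python
import numpy as np

def describe(x, observed_output):
    """Max, Min, Mean, CC with the observed output, SD and SKEW of one input series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(observed_output, dtype=float)
    dev = x - x.mean()
    m2 = np.mean(dev ** 2)
    return {
        "Max": float(x.max()),
        "Min": float(x.min()),
        "Mean": float(x.mean()),
        "CC": float(np.corrcoef(x, y)[0, 1]),    # Pearson correlation
        "SD": float(x.std(ddof=1)),              # sample standard deviation
        "SKEW": float(np.mean(dev ** 3) / m2 ** 1.5),  # moment-based skewness
    }

# Hypothetical monthly discharge and sediment load values
q_m = [2.0, 3.5, 10.0, 40.0, 25.0, 6.0, 2.5, 1.5]
ssl_m = [1.0, 2.0, 15.0, 120.0, 60.0, 5.0, 1.5, 0.8]
print(describe(q_m, ssl_m))
```

Applying the same function to an original input and to its SSA components would allow the reduction in SD and SKEW to be checked directly.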
In NMPL catchment, the inputs of both ANN and SSA-ANN in the validation period do not extend beyond the range of the corresponding ones in the calibration period. The opposite holds for SSL m , for which over-extrapolation is significant at the upper bound (435.42 t/day in the validation period and 273.60 t/day in the calibration period). If this particular event (435.42 t/day) is excluded, both data ranges become similar. Therefore, this single unfavorable data point would have little effect on the model results. The event occurred in August 2001, during the rainy season. Moreover, NMPL catchment is characterized by steep terrain. Consequently, this particular event might be associated with a local extreme phenomenon (e.g. slope failure, debris flow) that occurs episodically and brings a huge amount of sediment in a short time. For the lower bound, the difference is not significant. A similar situation is observed for CC. Both calibration and validation datasets also have low SD and SKEW values and show similar characteristics. The effect of SSA is the same as observed in BNK catchment.

Model Performance in BNK Catchment
The performance of each model is summarized in Table 3. It can be seen that both the ANN and the SSA-ANN model yield satisfactory results for SSL m and SSL t prediction because the NSE and APBIAS values are respectively greater than 0.50 and less than 55%. NSE and APBIAS of ANN are equal to 0.81 and 5.06% in the calibration stage, and 0.52 and 48.04% in the validation stage. SSA-ANN has NSE and APBIAS values of about 0.84 and 0.09% in the calibration period, and 0.64 and 38.25% in the validation period. The predicted SSL m from each model is graphically compared with the observed data in Figure 5(a). Visually, the predicted time series of both models show a similar trend to the observed one. Figure 5(b) (ANN) and Figure 5(c) (SSA-ANN) depict the scatter plots of the predicted versus observed SSL m , which were used to distinguish the model performance in estimating low, medium and high values. In order to clearly investigate the whole extent, from low to high values, both figures were plotted in log-log scale. These two scatter plots clearly demonstrate that both models overestimate the low values. For medium and high values, the points are distributed uniformly around the ideal fit line. SSA-ANN predicts better not only the low but also the medium and high SSL m , through reduction of the overestimates at low values and the underestimates at medium and high values. The better prediction of SSA-ANN at medium and high values can be confirmed respectively by the lower MAE and RMSE values (Table 3).
For SSL m prediction, SSA-ANN is superior to ANN, with NSE higher by 4%, RMSE lower by 9% and MAE lower by 22% in the calibration stage. In the validation stage, SSA-ANN is better, with NSE higher by 24%, RMSE lower by 14% and MAE lower by 18%. For SSL t prediction, SSA-ANN is more powerful, with APBIAS lower by 98% in the calibration phase and 20% in the validation phase.
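The percentages quoted above appear to be relative differences with respect to the ANN value; this reading is an assumption, and the helper name below is hypothetical. Using the rounded NSE and APBIAS values reported in the text for BNK catchment, the calculation can be checked as follows.

```python
def relative_change(ann_value, ssa_ann_value):
    """Change of the SSA-ANN indicator relative to the ANN indicator, in percent."""
    return 100.0 * (ssa_ann_value - ann_value) / abs(ann_value)

# Rounded values reported in the text for BNK catchment
cases = {
    "NSE, calibration": (0.81, 0.84),
    "NSE, validation": (0.52, 0.64),
    "APBIAS, calibration": (5.06, 0.09),
    "APBIAS, validation": (48.04, 38.25),
}
for name, (ann, ssa_ann) in cases.items():
    print(f"{name}: {relative_change(ann, ssa_ann):+.1f}%")
```

The rounded values reproduce the reported 4%, 98% and 20%. For validation NSE they give about 23% rather than the reported 24%, presumably because the published percentage was computed from unrounded values.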

Model Performance in NMPL Catchment
From Table 3 and Figure 6, a similar situation is observed. Both models also perform well in this catchment, and the advantage of SSA-ANN over ANN also exists. For SSL m prediction, SSA-ANN is superior to ANN, with NSE higher by 1%, RMSE lower by 4% and MAE lower by 3% in the calibration stage. In the validation stage, SSA-ANN is better, with NSE higher by 7%, RMSE lower by 4% and MAE lower by 2%. For SSL t prediction, SSA-ANN is more powerful, with APBIAS lower by 65% in the calibration phase and 6% in the validation phase. The advantage of SSA-ANN in this catchment is somewhat smaller than in BNK catchment. This is because the ANN inputs (original datasets) in NMPL catchment are characterized by lower SD and SKEW values. Therefore, when they are transformed into SSA-ANN inputs using SSA, SD and SKEW do not decrease as much as in BNK catchment, especially for C1, which is the main component. For instance, in the calibration stage, the decrease of SKEW from Q m to Q m -C1 is 56% in BNK catchment but just 49% in NMPL catchment. Similarly, in the validation stage, it is 54% and 30% in BNK and NMPL catchment, respectively. In the calibration period, the efficiency of both models in NMPL catchment is slightly better than that in BNK catchment. The difference in model performance between these two catchments may be attributed to different spatial variation of sediment load, which can be explained by the difference in SD and SKEW values. The SSL m dataset in NMPL catchment is characterized by lower values of SD (50.99) and SKEW (2.45) and is therefore easier to calibrate. In the validation period, the NSE value of both methods is lower than that in BNK catchment. This could be due to different temporal variation of the SSL m data, which can be explained statistically by the difference between the calibration and validation datasets in each individual catchment. The more similar these two datasets are, the better the model performance in the validation period. The difference in SKEW value is comparable in both catchments, but the difference in SD value is more significant in NMPL catchment.

Conclusions
This research proposed a coupled model (SSA-ANN) to predict sediment load in two catchments, located in the LMB, with different hydrological and terrain characteristics. The performance of this model was compared with that of the existing ANN approach. Satisfactory results were obtained from both methods, but SSA-ANN performs better consistently in both catchments. This improvement reflects the importance of SSA. SSA filters the noise contained in the raw time series. It reduces the values of SD and SKEW and transforms the original input data toward a near-normal distribution, which is favorable to modeling. Instead of ANN, the proposed SSA-ANN model is also recommended for the prediction of other water resources variables because extra input data are not required; only one additional computation, time series decomposition, is needed. This new technique could potentially be used to minimize the costly operation of sediment sampling in the LMB, which is relatively rich in hydrometeorological records.
In this study, the model simulation was conducted on a monthly basis only; other time scales should therefore be tested. The present research employed SSA to decompose the raw inputs into two components only. A larger number of components should be examined in order to investigate the potential of SSA-ANN more extensively. The authors expect that the model accuracy will improve further with a larger number of components.

Figure 1. Map of the study catchments.

Figure 4 shows the results of SSA in decomposing R m and Q m in BNK catchment. For R m (Figure 4(a)), the first component (C1) exhibits lower frequency than the second one (C2), and it is also apparent that SSA removes the discontinuity characterized by many zeros (dry periods) in the original time series. For Q m (Figure 4(b)), the time series of C1 also contains lower frequency than that of C2. From Figure 4, it is clearly seen that C1 is the main component. This situation is also found in NMPL catchment.

Figure 4. Results of SSA in BNK catchment (no data in 1986 and 1987): (a) R m and (b) Q m .

Figure 5. Comparison of the predicted versus observed SSL m in BNK catchment (no data in 1986 and 1987): (a) Time series comparison; (b) Scatter plot of ANN results and (c) Scatter plot of SSA-ANN results.

Table 3. Model performance indicated by NSE, RMSE, MAE and APBIAS.
NSE, RMSE and MAE for evaluating SSL m prediction; APBIAS for evaluating SSL t prediction; Architecture (optimum): number of nodes in the input-hidden-output layers.