Generalized Additive Mixed Modelling of River Discharge in the Black Volta River ()
1. Introduction
Most important tasks in problem solving in hydrology have been taken over by mathematical models [1] . According to [2] , modelling in environmental science is the representation of a complex natural system in a simplified form through the use of logical mathematical statements. Most hydrologic systems are extremely complex, and we cannot hope to understand them in detail without modelling [3] .
Many different reasons account for the development of hydrologic models for a catchment. They therefore have many different forms despite the fact that they are in general developed to meet at least one of two primary objectives [3] . One objective is to gain a better understanding of the hydrologic phenomena operating in a catchment and of how changes in the catchment may affect these phenomena whiles the other objective is to generate synthetic sequences of hydrologic data for facility design or for use in forecasting [3] .
River discharge and other components of the hydrologic system are affected by many variables. Key among them is rainfall and its variation in space and time in response to various climatic factors. Other variables that potentially can affect river discharge include rock and soil type, land use, relief and weather conditions such as temperature and humidity. Establishing a relationship among these variables is the central focus of hydrological modelling from its simple form of unit hydrograph to rather complex models based on fully dynamic flow equations [4] .
Hydrologic models can be classified into two broad classes, namely physical and abstract models [5] . Physical models can further be categorised into two categories, namely scale and analog models. A scale model refers to a scaled down model of a real system whiles an analog model refers to a physical system having the same characteristic as the original sample. Abstract models on the other hand, are used to show a system in a mathematical form. The model is operated with a set of equations, input and output data. These models are data- driven in nature, as they do not require knowledge of the underlying process beforehand and are solely based on empirical equations calibrated to field data [6] .
Quite recently, [7] argued that hydrologic models may be seen as black-box, conceptual or deterministic models. Black-box models explain the relationship between the input and output data mathematically [8] and are often good for modelling with available and analyzed data for a specific catchment. Deterministic models have complex physical theory and need to have a large amount of data and computational time. Conceptual models are formulated with a number of conceptual elements which are simple representations of a reference system [9] .
A significant number of physically based and data-driven models have been developed and implemented. Examples include [10] - [21] . Although it is easier understanding the separate hydrological processes that govern the whole system using the physically based models, in many occasions the input data may be unavailable, expensive or time consuming to collect [22] . Also, a number of variables still need to be determined through model calibration. This makes the operation of physically-based models difficult and time consuming as opposed to data-driven models [23] .
According to [24] , the various physical mechanisms governing the river discharge dynamics act on a wide range of spatial and temporal scales. However, an important observation that can be made from the studies conducted thus far on the applications of both the physically-based models and the black-box models for river discharge forecasting is that none of these studies has looked at the influence of both spatial and temporal variability on river discharge forecasting simultaneously. This forms the basis of the present study. Giving the peculiar location of the Black Volta River, quantifying changes of river discharge both in space and time is fundamental in addressing issues of flooding, power generation and survival of ecosystem downstream [25] .
In this study, we propose generalized additive mixed models (GAMMs) [26] [27] [28] incorporating a smooth interaction of space and time for modelling space-time variations in river discharge in the Black Volta River, to extract space-time signals for the entire study area. GAMMs are appealing for their flexibility and the straight forward way in which smooth effects of covariables can be incorporated along-side the smooth space time effect and random effects [29] .
2. Materials and Methods
2.1. Study Area
The Black Volta river basin (Figure 1) stretches from North to South through Mali, Burkina Faso, Ghana and Cote d’Ivoire, and from West to East through
Burkina Faso, Cote d’Ivoire and Ghana. Geographically, it lies between 7˚00'00"N and 14˚30'00"N and Longitude 5˚30'00"W and 1˚30'00"W. The watershed has an area of about 130,400 km2 constituting about 32.6% of the Volta basin which occurs when some portion of Bamboi which belongs to the Lower Volta is added to the basin. The portion of the watershed within Ghana has an area of about 18,384 km2 which is about 14% of the basin. The annual rainfall varies from 1043 mm to 1270 mm. The Wettest month in the basin is September whiles the driest month is March. The estimated mean runoff in the basin is at 7 km3 per annum. The mean monthly temperature in the basin is around 26˚C. The hottest month is March and the coolest is August [30] . Four gauge stations in the Black Volta basin namely Lawra, Chache, Bui and Bamboi, were all used for modelling and analysis.
2.2. Data and Variables
The data contains information on the four gauge stations along the Black Volta River namely, Lawra, Chache, Bui, and Bamboi. For each gauge station, latitude and longitude, year, month, elevation, land use, soil type, rainfall, humidity, and discharge are reported. Land use data is obtained from the land use map of Ghana in Figure 2. The lands in the Black Volta basin are mostly used for agriculture with bush fallow food crop cultivation. Except in the dry season, where livestock owners/herdsmen migrate with their animals in search of water and feed in nearby communities, animal grazing in the basin is mostly done on free range [31] .
River discharge data from January 2000 to December 2009 for the four gauge stations was obtained from the hydrological services department of Ghana whiles rainfall and humidity data for the same period was obtained from the meteorological services department of Ghana. Data on Soil type was extracted from the soil map of Ghana in Figure 3. Ferric Luvisols is the most dominant soil in the Ghana portion of the basin. It is characterized mainly by Savannah Ochrosols and patches of Savannah Ochrosols-Lithosols [30] .
The response variable for these analyses was river discharge (disch) measured in cubic metre per second whiles the independent variables were time (month & year) and space (loc) which are the various gauge stations along the Black Volta River considered in this study, namely Lawra, Chache, Bui and Bamboi. The covariates included rainfall (rain) measured in millimetres, relative humidity (humid), elevation (elev) measured in meters, soil type (soil) and land use (luse) which was considered as a random effect. Interactions between some of these variables were also considered especially the space-time interactions.
2.3. Models and Analyses
After checking the relationship between river discharge (disch) and all predictors, independent models were constructed for all covariates to determine their effect on disch and, if it resulted significant, its nature (linear or nonlinear) was also determined. If a covariate was significant as a unique predictor, but not significant jointly with other predictors, it was removed. For instance, it was observed in the case of elevation (elev), which had a significant parameter as a unique predictor, but was eliminated in models which had several predictors because its effect resulted no more significant. This preliminary analysis is omitted from the results for the sake of brevity.
The response variable ‘river discharge in gauge station i’ (
) is modelled using a generalized additive mixed model (GAMM) [26] [27] [28] , as shown in Table 1.
2.4. Parameter Estimation
The GAMMs in Table 1 can be expressed as generalized linear mixed models (GLMMs);
(1)
where
is a row of the model matrix containing all components of the model. That is, all explanatory variables of fixed and random effects, and all the basic functions evaluated at observation i. The parameter
contains the coefficients of fixed terms, the random land use effects and the bases. We estimate parameters with maximum likelihood (ML) estimation of the smoothness parameters, by integrating out the part of
in the log likelihood function that is in the range space as described in [32] .
2.5. Model Selection and Validation
Model selection was based on Akaike information criterion (AIC), Bayesian information criterion (BIC), adjusted R-squared, the root mean squared prediction error (RMSPE), and Nash-Sutcliffe efficiency (NSE). However, the key indicators of performance were the RMSPE which is independent of the likelihood and NSE. The RMSPE and NSE are calculated using Equations ((2) and (3)) respectively.
Table 1. Summary of models considered.
where
and
is assumed to follow the Gaussian distribution.
is the intercept parameter,
and
are parameter estimates of fixed effects while
are smooth functions of the covariates which are represented using a cyclic cubic regression spline.
indexes the land use at the ith gauge station and
is a random land use effect.
is a tensor product of cyclic cubic regression splines.
(2)
(3)
where
and
are the observed and predicted river discharges for n months,
is the mean of the observed data.
For model checking and investigating whether the final selected model has disentangled spatial and temporal correlation in residuals, several diagnostics were used. The QQ-plot and histogram of residuals were used to check normality of residuals. Homoscedasticity of residuals was checked using scatter plot of residuals versus predictors. Also, the relationship between response and fitted values was checked as a visual goodness of fit verification using scatter plot of response versus fitted values.
2.6. Software
Data analysis was done in the R programming environment version 3.2.4 [33] and models were fit using the MGCV package [28] .
3. Results and Discussion
3.1. Descriptive Statistics
In generalized regression models, it is important and necessary to study the distribution function of the response variable (disch) in order to select both response distribution and link fuction. Boxplots for disch and log (disch) are reported in the upper panels of Figure 4 while the normal QQ-plots for disch and log (disch) are reported in the lower panels. We observe from the figure that log (disch) gives a good approximation to the normal distribution. Hence the Gaussian distribution was considered as the underlying theoretical distribution of disch in the GAMMs with log link function.
The time series plots of river discharge at the various gauge stations are shown in Figure 5, which indicates an obvious seasonality in discharge at all four gauge stations. This suggests that smooth functions may be represented using cyclic cubic regression splines.
3.2. Model Selection
The first GAMM (Model 1) included fixed effect of soil type, random effect of land use, and smooth functions of rainfall and humidity but excluded space and time effects, and was able to explain only about 19.2% (R-sq. (adj) = 0.192) of the variability in river discharge as shown in Table 2. The second GAMM (Model 2) added the main effects of space and time to model 1 which resulted in explaining about 40.1% of the variability in river discharge. The third GAMM (Model 3) added only the space-time interaction effect to model 1 and resulted in explain-
Figure 4. In the upper panels: Boxplots for river discharge (on the left) and the log transformation of river discharge (on the right); in the lower panels: Normal QQ-plots for river discharge (on the left) and the log transformation of river discharge (on the right).
Figure 5. Time series plots for river discharge at the various gauge stations.
ing about 72.4% of the variability in river discharge while the final GAMM (Model 4) added both the main and interaction effects of space and time to model 1 and resulted in explaining about 82.1% of the variability in river discharge. This provides an indication of the very significant role space-time effects play in modelling river discharge, but are usually ignored.
Furthermore, the RMSPEs and NSE values in Table 2 indicate satisfactory performance for all GAMMs considered. However, a comparison among them using AIC and BIC values clearly indicate a much better performance by the GAMM which included both the main and interaction effects of space and time (Model 4).
3.3. Parameter Estimates of the Selected Model
Parameter estimates of the selected GAMM are reported in Table 3 and Table 4. We observe from those Tables that, parameter coefficients (both smooth terms and non-smooth terms) were all significant at the 0.05 level.
3.4. Diagnostic Checks
Basic diagnostic plots of the selected GAMM are reported in Figure 6. The QQ-plot of residuals shows an evident arrival of residual quantiles at the theoretical normal quantiles and a near symmetry observed in the histogram of residuals as well. The scatter-plot of residuals versus the linear predictor indicates
Table 4. Approximate significance of smooth terms.
Figure 6. Diagnostic for selected GAMM. In the upper panels: QQ-plot of residuals (on the left) and plot of residuals vs. Linear predictor (on the right); In the lower panels: Histogram of the residuals (on the left) and plot of the response vs. Fitted values (on the right).
an accentuated homoscedasticity of residual variance while that of the response versus fitted values shows independence of the residuals. All in all, diagnostics of the selected GAMM are quite good.
4. Conclusions
We have effectively used GAMMs for modelling space-time river discharge data in this paper. GAMMs provide a flexible framework which allows for smooth effects of covariates and smooth effects of space and time. In other applications such as repeated observations of weather station data, the use of spatio-temporal dynamic models or state-space models have been proposed. Four GAMMs were explored, two with space-time interactions and two without space-time interactions. The comparison of the performance of the models with space-time interactions and those without space-time interactions based on AIC and BIC suggests that in this application, the former is better overall and in particular for modelling variations in river discharge data. Further, a model with space and time main effects performed better compared with one without space and time main effects.
Acknowledgements
This study was supported by the Akosombo Kpong Dams Reoperation and Reoptimization Study hosted by the Water Resources Commission (WRC) of Ghana and funded by the African Development Bank (ADB). We sincerely thank WRC and ADB for the support.