Comparison of Methods of Estimating Missing Values in Time Series

This paper proposes new methods of estimating missing values in time series data and compares them with existing methods. The new methods are based on the row, column and overall averages of time series data arranged in a Buys-Ballot table with m rows and s columns. The methods assume that 1) only one value is missing at a time, 2) the trending curve may be linear, quadratic or exponential and 3) the decomposition method is either Additive or Multiplicative. The performances of the methods are assessed by comparing accuracy measures (MAE, MAPE and RMSE) computed from the deviations of the estimates of the missing values from the actual values used in simulation. Results show that, under the stated assumptions, estimates from the new method based on full decomposition of a series are the best (in terms of the accuracy measures) when compared with the other two new methods and the existing methods.


Introduction
The analysis of time series data constitutes an important area of statistics, especially in identifying the nature of the phenomenon represented by the sequence. However, missing observations in time series data are very common [1] [2]. This happens when an observation cannot be made at a particular time, owing to faulty equipment, lost records or a mistake which cannot be rectified until later. When this happens, it is necessary to obtain estimates of the missing value for a better understanding of the nature of the data and to make possible a more accurate forecast.
In time series analysis, a problem frequently encountered in data collection is a missing observation. Missing observations may be virtually impossible to obtain, either because of time or cost constraints. To obtain estimates of these observations, different options are available to the researcher. One option is to replace them by the mean of the series. The missing observation may also be replaced with a naive forecast or with the average of the last two known observations that bound the missing observation [3].
Using the Bode-Shannon representation of random processes and the "state-transition" method of analysis of dynamic systems, Kalman [4] worked on classical filtering and prediction in relation to missing values in time series. His work, which hinged on the state space modelling approach, was later extended to observational error and missing observations by Jones [5]. Further on, Harvey and Pierse [6] highlighted the relevance of state space modelling and the Kalman filter to the problems of missing data in time series. Their work discussed the maximum likelihood estimation of the parameters in an ARIMA model under missing observations and the estimation of the missing observations. Using the univariate form of the modified Kalman filter, Kohn and Ansley [7] defined and efficiently computed the marginal likelihood of an ARIMA model with missing observations. Their work shed light on how to predict by interpolating missing observations and how to obtain the mean squared error of the estimates.
As the literature reveals, missing values in time series have attracted much research attention. Several approaches to determining missing values, such as the use of ARIMA models as well as other techniques, have continued to evolve. Among them are the optimal linear combination of the forecast and back forecast method (Damsleth [8]), a method for the estimation of models for discrete time series in the presence of missing data (Robinson and Dunsmuir [9]), forecasting techniques to estimate missing observations in time series (Abraham [10]), etc. A number of alternative procedures for estimating missing observations in stationary time series for autoregressive moving average models were provided by Ferreiro [11]. Following these alternative procedures in stationary time series, Rosen and Porat [12] introduced the general formulae for the asymptotic second-order moments of the sample covariances for missing values.
Another easily applicable spectral estimator for missing data is the method of Scargle [13]. This computes Fourier coefficients as the least squares fit of sines and cosines to the available remaining observations. The Lomb-Scargle spectrum is accurate in detecting strong spectral peaks, but this assumption biases the description of slopes and background shapes in the spectrum according to Bos et al. [14] and Broersen et al. [15].
Brockwell and Davis [16] gave the option that missing values at the beginning or the end of the time series are simply ignored, while intermediate missing values are considered serious flaws in the input time series. Intermediate values are therefore interpolated using interpolation algorithms: linear, polynomial, smoothing, spline and filtering.
Open Journal of Statistics
Yuan et al. [17] compared the Normal-distribution-based maximum likelihood (ML) and multiple imputation (MI) procedures for analyzing data with missing values. The paper compared these two procedures with respect to bias and efficiency of parameter estimates. Their results showed that ML is preferable to MI in practice, although parameter estimates by MI might still be consistent.
Cheema [18]

In time series, it is assumed that the data consist of observations made sequentially in time, comprising a systematic pattern (usually a set of identifiable components) and random noise (error). So, when some observations are missing, the condition for the application of a time series model is violated. The systematic pattern includes the trend (denoted as $T_t$), seasonal (denoted as $S_t$) and cyclical (denoted as $C_t$) components, $t = 1, 2, \ldots, n$. Cyclical variation, which refers to the long-term oscillation or swings about the trend, appears to an appreciable magnitude only in long-period sets of data. The pseudo-additive model is used when the original time series contains very small or zero values. However, this work will discuss only the additive and multiplicative models.
Missing values can lead to erroneous conclusions about data. Substitution of missing values may introduce inaccuracies. It can lead to false results and forecasts, and errors or data skews can proliferate across subsequent runs, causing a larger cumulative error effect. Most analytical methods cannot be performed if there are missing values in the data. Furthermore, existing methods did not consider the model structure (i.e. whether the model is Additive or Multiplicative) or trending curves other than the linear (quadratic, exponential, etc.). Moreover, as can be assessed from the literature, the seasonal component of the time series data was not taken into consideration in developing estimation methods. Therefore, the ultimate objective of this study is to develop methods of estimating missing values which take into consideration the model structure and trending curve. The specific objectives are:
1) To review existing methods of estimating missing values.
2) To develop new methods of estimating missing values in time series.
3) To assess the performance of the methods of estimating missing values.
4) To compare results from the existing methods of estimation of missing values with results from the new methods developed, using simulated data.
Based on the results, recommendations are made.
The rationale for this study is to fill the gap in the existing methods of estimation of missing values by providing analysts with a better method for the estimation of missing values irrespective of model structure and functional relationship.

Existing Methods of Estimating Missing Values
The new methods proposed in this study assumed that the series are arranged in a Buys Ballot Table with m rows (periods) and s columns (seasons), for m s > .
Under this arrangement, the observation $X_t$ made at time $t$ is identified by the period $i$ ($i = 1, 2, \ldots, m$) and season $j$ ($j = 1, 2, \ldots, s$), and $t$ becomes $t = (i-1)s + j$. Thus, the observations in the $i$-th row (period) are $X_{(i-1)s+1}, X_{(i-1)s+2}, \ldots, X_{(i-1)s+s}$ and the observations in the $j$-th column (season) are $X_j, X_{s+j}, X_{2s+j}, \ldots, X_{(m-1)s+j}$. For details of the Buys-Ballot table see Iwueze and Nwogu [22] [23] and Iwueze et al. [24]. Therefore, for consistency, the existing methods have been presented using the Buys-Ballot format.
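The indexing described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names `to_buys_ballot`, `time_index` and `period_season` are hypothetical (they do not come from the paper), and the toy series with $X_t = t$ is made up for the example.

```python
import numpy as np

def to_buys_ballot(x, s):
    """Arrange a series of length n = m*s into an m-by-s Buys-Ballot table.

    Row i-1 (0-based) holds X_{(i-1)s+1}, ..., X_{(i-1)s+s}.
    """
    x = np.asarray(x, dtype=float)
    if x.size % s != 0:
        raise ValueError("series length must be a multiple of s")
    return x.reshape(-1, s)

def time_index(i, j, s):
    """1-based time t = (i-1)s + j for period i and season j (both 1-based)."""
    return (i - 1) * s + j

def period_season(t, s):
    """Inverse mapping: 1-based (period, season) for 1-based time t."""
    i, j = divmod(t - 1, s)
    return i + 1, j + 1

# Toy series X_1, ..., X_12 with X_t = t, arranged with s = 4 seasons (m = 3).
x = np.arange(1, 13)
table = to_buys_ballot(x, s=4)
third_column = table[:, 2]  # observations X_3, X_7, X_11 of season j = 3
```

In this layout, the row and column averages used by the estimation methods are simply means along the two axes of `table`.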
Some of the existing methods of estimating missing values in time series analysis are Mean Imputation (MI), Series Mean (SM), Linear Interpolation (LI) and Regression Imputation (RI). Assuming an observation $X_{(i-1)s+j}$ is missing in the Buys-Ballot table at the point $t = (i-1)s + j$, it is estimated using the different methods listed above as follows:

1) Mean Imputation (MI)
Mean imputation entails replacing the missing value with the mean of the values before the missing position. This is achieved by summing the values before the missing position and dividing by their number:
$$\hat{X}_{(i-1)s+j} = \frac{1}{(i-1)s+j-1} \sum_{t=1}^{(i-1)s+j-1} X_t$$
where $(i-1)s + j - 1$ is the number of observations preceding the missing observation.

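2) Series Mean (SM)
Series mean estimates the missing value with the mean of the remaining series. Symbolically, the series mean is given by
$$\hat{X}_{(i-1)s+j} = \frac{1}{n-1} \sum_{t=1,\ t \neq (i-1)s+j}^{n} X_t$$
where $n = ms$.

3) Linear Interpolation (LI)
This method of linear interpolation for estimating missing values is given by
$$\hat{X}_{(i-1)s+j} = \frac{X_{(i-1)s+j-1} + X_{(i-1)s+j+1}}{2}$$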

4) Regression Imputation (RI)
This method estimates the missing value by the estimate of the trend at the point of the missing value. Thus, the remaining values of the series are used to determine estimates of the trend parameters, and the estimate of the missing value at $t = (i-1)s + j$ is the value of the fitted trend at that point, $\hat{X}_{(i-1)s+j} = \hat{T}_{(i-1)s+j}$.
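The four existing methods described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. The function names are hypothetical, positions are 0-based, the missing position is assumed to be an interior point (so that both neighbours exist), and the regression trend is assumed to be a polynomial fitted with `numpy.polyfit`.

```python
import numpy as np

def mean_imputation(x, t):
    """MI: mean of the observations preceding 0-based position t (t >= 1)."""
    x = np.asarray(x, dtype=float)
    return float(np.mean(x[:t]))

def series_mean(x, t):
    """SM: mean of all remaining observations (position t left out)."""
    x = np.asarray(x, dtype=float)
    return float(np.mean(np.delete(x, t)))

def linear_interpolation(x, t):
    """LI: average of the two observations that bound position t."""
    x = np.asarray(x, dtype=float)
    return float((x[t - 1] + x[t + 1]) / 2.0)

def regression_imputation(x, t, degree=1):
    """RI: polynomial trend fitted to the remaining values, evaluated at t."""
    x = np.asarray(x, dtype=float)
    times = np.arange(1, x.size + 1, dtype=float)
    keep = np.arange(x.size) != t          # leave the missing position out
    coefs = np.polyfit(times[keep], x[keep], degree)
    return float(np.polyval(coefs, times[t]))

# Toy series X_t = 2 + 3t for t = 1..6; treat the 4th value (14) as missing.
x = np.array([5.0, 8.0, 11.0, 14.0, 17.0, 20.0])
t_missing = 3
estimates = {
    "MI": mean_imputation(x, t_missing),
    "SM": series_mean(x, t_missing),
    "LI": linear_interpolation(x, t_missing),
    "RI": regression_imputation(x, t_missing),
}
```

On this purely linear toy series, LI and RI reproduce the hidden value, while MI and SM do not, because they ignore the trend.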

New Methods of Estimating Missing Values
The new methods proposed in this work are Row Mean Imputation, Column Mean Imputation and Decomposing Without the Missing Value. The new methods are given as follows:

1) Row Mean Imputation (RMI)
The row mean imputation method computes the estimate of the missing value as the mean of the remaining observations in the row (period) containing the missing value. Thus, the missing value is estimated by
$$\hat{X}_{(i-1)s+j} = \frac{1}{s-1} \sum_{k=1,\ k \neq j}^{s} X_{(i-1)s+k}$$

2) Column Mean Imputation (CMI)
The column mean imputation method computes the estimate of the missing value as the mean of the remaining observations in the column (season) containing the missing value. Thus, the missing value is estimated as:
$$\hat{X}_{(i-1)s+j} = \frac{1}{m-1} \sum_{k=1,\ k \neq i}^{m} X_{(k-1)s+j} \qquad (2.9)$$

3) Decomposing Without the Missing Value (DWMV)
In this method, estimates of the trend parameters and seasonal indices, obtained from the remaining observations using any of the methods of time series decomposition, are substituted into the expression for the missing value. Hence, the estimates of the missing values by this method are given by:
a) For Additive Model
$$\hat{X}_{(i-1)s+j} = \hat{T}_{(i-1)s+j} + \hat{S}_j$$
b) For Multiplicative Model
$$\hat{X}_{(i-1)s+j} = \hat{T}_{(i-1)s+j} \times \hat{S}_j$$
The trend-cycle components of the DWMV method for the linear, quadratic and exponential curves, with $t = (i-1)s + j$, are:
i) Linear Trend
$$\hat{T}_t = \hat{a} + \hat{b}t$$
ii) Quadratic Trend
$$\hat{T}_t = \hat{a} + \hat{b}t + \hat{c}t^2$$
iii) Exponential Trend
$$\hat{T}_t = \hat{b}e^{\hat{c}t}$$
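The three new methods can be sketched in Python as below. This is a hedged illustration rather than the authors' procedure. The function names are hypothetical and indices are 0-based. The paper allows any decomposition method for DWMV; the sketch assumes one particular choice for the Additive model, namely a joint least-squares fit of a polynomial trend plus one constant per season, in which the seasonal constants absorb the trend intercept. The toy table and its seasonal indices are made up.

```python
import numpy as np

def row_mean_imputation(table, i, j):
    """RMI: mean of the remaining observations in row (period) i."""
    return float(np.delete(table[i], j).mean())

def column_mean_imputation(table, i, j):
    """CMI: mean of the remaining observations in column (season) j."""
    return float(np.delete(table[:, j], i).mean())

def dwmv_additive(table, i, j, degree=1):
    """DWMV sketch for the Additive model.

    Fits trend terms t, ..., t**degree and s seasonal constants by least
    squares on the remaining observations, then evaluates the fit at the
    missing position.
    """
    m, s = table.shape
    y = np.asarray(table, dtype=float).ravel()
    t = np.arange(1, m * s + 1, dtype=float)
    season = np.tile(np.arange(s), m)
    keep = np.ones(m * s, dtype=bool)
    keep[i * s + j] = False                      # drop the missing value
    trend_cols = np.column_stack([t ** p for p in range(1, degree + 1)])
    dummies = np.eye(s)[season]                  # one indicator per season
    design = np.hstack([trend_cols, dummies])
    beta, *_ = np.linalg.lstsq(design[keep], y[keep], rcond=None)
    return float(design[i * s + j] @ beta)

# Toy additive series X_t = 2 + 0.5 t + S_j with made-up seasonal indices.
m, s = 5, 4
t_grid = np.arange(1, m * s + 1, dtype=float).reshape(m, s)
table = 2.0 + 0.5 * t_grid + np.array([1.0, -2.0, 0.5, 0.5])
actual = table[2, 1]                             # value at t = 10
table[2, 1] = np.nan                             # hide it
rmi = row_mean_imputation(table, 2, 1)
cmi = column_mean_imputation(table, 2, 1)
dwmv = dwmv_additive(table, 2, 1)
```

On this noise-free toy table, the DWMV sketch recovers the hidden value because it models both trend and seasonality, whereas RMI averages across seasons and is pulled away from it.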

Assessing Performance of the Methods
To assess the performance of the estimation methods, accuracy measures are computed from the deviations of the estimates of the missing values from the actual values, that is, the deviations of $\hat{X}_{(i-1)s+j}$ from the actual value $X_{(i-1)s+j}$. The accuracy measures discussed are the Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE) and Root Mean Square Error (RMSE) (Makridakis and Hibon, 1995). Given a data set of size $n = ms$, we considered one missing value at a time for $m_0 < n$ different positions, $n > 1$. Writing $X_k$ and $\hat{X}_k$ for the actual and estimated values at the $k$-th of these positions, the accuracy measures are defined as follows:

1) Mean Absolute Error (MAE)
The MAE is defined as
$$\mathrm{MAE} = \frac{1}{m_0} \sum_{k=1}^{m_0} \left| X_k - \hat{X}_k \right|$$

2) Mean Absolute Percentage Error (MAPE)
The MAPE is defined as:
$$\mathrm{MAPE} = \frac{100}{m_0} \sum_{k=1}^{m_0} \left| \frac{X_k - \hat{X}_k}{X_k} \right|$$

3) Root Mean Square Error (RMSE)
This is calculated as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{m_0} \sum_{k=1}^{m_0} \left( X_k - \hat{X}_k \right)^2}$$

For all the selected trending curves, DWMV gave the lowest values of the accuracy measures, followed by the LI. Each estimation method (in comparison with the others, using MAE, MAPE and RMSE) was consistent in its performance across the 106 data sets simulated for this study. This implies that the DWMV method of estimation of missing values yielded the best estimates (in terms of the accuracy measures) among the methods investigated in this work. This observation may be attributable to the fact that DWMV combines the effects of both the trending curves and the seasonal effect in estimating the missing values. That DWMV takes into account the seasonality of the missing value is supported by the literature. For the real life data, the DWMV method also out-performed the other methods of estimation of missing values, even though the assumption of normal distribution of error terms is not met in real life data.
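The three accuracy measures used in this section can be computed as in the following minimal Python sketch. The function name `accuracy_measures` and the toy numbers are hypothetical. The measures are taken in their standard forms, and MAPE assumes that no actual value is zero.

```python
import numpy as np

def accuracy_measures(actual, estimate):
    """Return MAE, MAPE (in percent) and RMSE of estimates against actuals."""
    actual = np.asarray(actual, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    dev = actual - estimate                       # one deviation per position
    return {
        "MAE": float(np.mean(np.abs(dev))),
        "MAPE": float(100.0 * np.mean(np.abs(dev / actual))),
        "RMSE": float(np.sqrt(np.mean(dev ** 2))),
    }

# Toy example: three positions, each hidden and estimated one at a time.
measures = accuracy_measures([10.0, 20.0, 40.0], [12.0, 18.0, 40.0])
```

Each entry of `actual` and `estimate` corresponds to one position that was treated as missing and then estimated, so the averages run over the positions examined rather than over the whole series.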

Concluding Remark
The results of the analysis indicate that, for all trending curves and both model structures, DWMV yielded the best estimates of the missing values (in terms of the accuracy measures) when compared with both the existing methods and the two other proposed new methods (RMI and CMI). This is perhaps because DWMV combines the effects of both the trending curves and the seasonal indices, unlike the other methods. Cheema [18] also observed that the multiple regression imputation method of handling missing data performed well when the analytical method was multiple regression, because using regression-imputed data in a regression equation was like fitting a regression equation twice to predict the same dependent variable.
In view of this, it is recommended that the DWMV method be used in estimating missing values in time series analysis when one observation is missing at a time, until further studies prove otherwise. It is also recommended that this study be extended to cases where more than one data point is missing at a time, and to examine the effects of different sample sizes and distributions on the estimation of missing values.

The random noise (or error, irregular component) is denoted as $I_t$ or $e_t$, where $t$ stands for the particular point in time. These four classes of time series components (trend, seasonal, cyclical and irregular) may or may not coexist in real-life data. These components can adopt different specific functional relationships. They can be combined in an additive (additive seasonality) or a multiplicative (multiplicative seasonality) fashion, and can also take other forms such as the pseudo-additive/mixed model (combining elements of both the additive and multiplicative models). The Additive model, Multiplicative model and Pseudo-Additive/Mixed model are given in Equations (1.1)-(1.3) respectively:
$$X_t = T_t + S_t + C_t + I_t \qquad (1.1)$$
$$X_t = T_t \times S_t \times C_t \times I_t \qquad (1.2)$$
$$X_t = T_t \times S_t \times C_t + I_t \qquad (1.3)$$

This section presents some empirical examples to illustrate the application of the methods of estimating missing values discussed in Section 2. The empirical examples consist of both simulated and real life data. The simulated series consist of 106 data sets of 120 observations each, simulated from the Additive model. The seasonal indices used for the Additive and Multiplicative models are as shown in Table 1. The real life example used is the monthly time series data on Airline Passengers for a period of twenty (20) years. The summary of the accuracy measures for the seven methods of estimating missing values considered is shown in Tables 2-4 for the selected trending curves. The summary of accuracy measures for the simulated Additive and Multiplicative models, shown in Table 2 and Table 3 respectively, indicates that DWMV has the lowest values of the accuracy measures (MAE, MAPE and RMSE) for all the selected trending curves.

Table 1. Seasonal indices used for simulation. Note: $S_{j(Add)}$ denotes the seasonal indices for the Additive model and $S_{j(Mult)}$ the seasonal indices for the Multiplicative model.

Table 2. Summary result of estimation of missing value for additive model.

Table 3. Summary result of estimation of missing value for multiplicative model.

Table 4. Summary result of estimation of missing value using the airline passenger data.