Constructing Prediction Intervals : A Re-Edition of the Williams-Goodman Method

The aim of this paper is to develop and validate a procedure for constructing prediction intervals. These forecasts are produced by Box-Jenkins processes with external deterministic regressors and prediction intervals are based on the procedure proposed by Williams-Goodman in 1971. Specifically, the distributions of forecast error at various lead-times are determined using post-sample forecast errors. Fitting a density function to each distribution provides a good alternative to simply observing the errors directly because, if the fitting is satisfactory, the quantiles of the distribution can be estimated and then the interval bounds computed for different time origins. We examine a wide variety of probability densities to search the one that best fit the empirical distributions of forecast errors. The most suitable mathematical form results to be Johnson’s system of density functions. The results obtained with several time series suggest that a Box-Jenkins process combined with the Williams-Goodman procedure based on Johnson’s distributions, provide accurate prediction intervals.


Introduction
Energy companies are strongly affected by uncertain price conditions, as they are exposed to the different risks from liberalized energy markets in combination with important and, to a large extent, irreversible investments.Price predictions, however, are usually expressed as point forecasts that give little guidance as to their accuracy, whereas, the planning process needs to take into account the entire probability distribution of future prices or at least intervals that have a pre-specified nominal coverage rate i.e. a given probability of containing the future prices.It is the aim of this paper to resume the prediction intervals (PIs) proposed by Williams & Goodman [1], which anticipated bootstrap techniques, but was introduced at a time when its heavy computational demands slowed down practical applications.
A literature review in the field of short-term forecasting of electricity prices, reveals that limited research has addressed the issue of PIs.Misiorek et al. and Weron [2] [3] were the first to consider interval forecasts.Other few exceptions are Wu et al., Nowotarski & Weron, and Bunn et al. [4] [5] [6].However, interval forecasts remain an underdeveloped topic.See [7].Under the very restrictive assumptions of independent and identically distributed Gaussian errors, PIs can be obtained by means of standard formulae.Since the distribution of forecast errors is unknown, they must be estimated to calculate interval bounds.
The main problem with assessing the reliability of price forecasts is that the magnitude of post-sample errors cannot be exactly evaluated until the prices are observed.In order to simulate such a situation, the time series under study can be split into two parts: the "training" period, which ignores a number of the most recent time points and the "validation" period, which comprises only the ignored time points.For the purpose of this study, we have not used the entire time series, but kept the very last time points untouched because they serve as a benchmark (target period) against which the quality of the PIs is to be judged.The training period is used to identify and estimate one of the large variety of electricity price models described in literature.Borovkova & Schmeck [8] observe that, despite the voluminous literature on modeling electricity prices, a clear "winner" model has not emerged.Here, we use a Box-Jenkins process.
Williams & Goodman [1] had the simple and ingenious idea of rolling ahead the training period and repeating the forecasting procedure until it is not possible to make any more multi-step predictions.The collection of multi-step-ahead errors forms the empirical distribution of the forecast errors for each lead-time.This allows us to estimate the quantiles necessary to compute the interval bounds.To implement the Williams & Goodman (WG) procedure, it is necessary to find a family of probability distributions having a member which gives a good fit to the empirical distribution of forecast errors.We will consider various density functions to be combined with SARIMAX processes for time series data concerning Italian electricity hourly zonal prices.Our aim is to find the most effective way of mixing WG procedure and SARMAX processes.
The organization of the paper is as follows.In the next section we address important aspects of data-preparation.Section 3 provides a brief review of the Box Jenkins approach to compute point forecasts.In Section 4, the Williams-Goodman procedure is discussed and in Section 5 it is combined with various density functions purportedly useful in describing the empirical forecast error distribution.
In this same section is presented an application to Italian hourly zonal prices.
Conclusions are drawn in Section 6.

Data Preparation
In this article we analyze data on hourly zonal prices traded at the day ahead Open Journal of Statistics Italian energy market.Because of transmission capacity constraints, Italy is partitioned into six zones: North, Centre-North, Centre-South, South, Sardinia and Sicily with a separate price for each zone.When there are no transmission congestions, arbitrage opportunities restrict the prices in each zone to be equal.See [9].The hourly marginal prices can differ across zones because of transmission limits or because of dissimilar behavior of the consumers, but it is the same within a zone.The sizes of the zones are not equal.In fact, North (which includes the Italian regions of Val D'Aosta, Piemonte, Liguria, Lombardia, Trentino, Veneto, Friuli Venezia Giulia, Emilia Romagna) constitutes a large fraction of Italian national surface (40%) and its population (46%).The large islands of Sicily and Sardinia suffer from poor interconnections and frequent congestions.
In the present article, we prefer to work with zonal data because they show far wilder randomness than the national price and are complex enough to challenge the statistical methods used for making predictions.
According to principles of decentralization and subsidiarity, creatively extended to long time series, we will treat each hour as a separate time series, such that 24 different models are estimated.All the time series go from 1 am on Monday, 7/1/2013 to 24 pm on Sunday, 26/2/2017 and hence cover a total of 24 hourly prices in 1148 days for six zones.We reconstructed the values corresponding to the changes of the daylight-saving time by the arithmetic average of the two neighboring hours, while the "doubled" values corresponding to the switch from the daylight-saving time, were replaced by the arithmetic mean of the two neighboring prices.
Time series of electricity prices display characteristics not frequently observed in other commodity markets.Pronounced daily, weekly, monthly and multi-monthly seasonal cycles; heteroskedasticity, mean reversion, and a high number of spikes (very sharp peaks or extremely deep valleys) in a short period of time.Knittel & Roberts [10] note that electricity prices also contain what they refer to as, an "inverse leverage effect".Electricity price volatility tends to rise more so with positive shocks than negative shocks.Most of these characteristics stem from the fact that power cannot be economically stored and, consequently, accumulation and selling of stocks/inventories have reduced intervention potential to smooth supply or demand shocks across over time.Additionally, electricity markets face stringent distribution and transmission constraints.
To attenuate these effects, prices are log-transformed so that all upward or downward spikes are closer to the mean of the time series.The attenuation does not absolve us from trying to use more effective albeit more invasive treatment of aberrant prices.On the one hand, even if the removal of legitimate data points could be accepted as a permissible practice, the number of values suspectable of being anomalous is too large to justify their exclusion.Extreme price swings, in fact, need not be treated as enemies, because they are very significant for energy market participants.On the other hand, as noted by Fildes [11], the choice of a forecasting method should not be dependent on such extremes, unless they contain information we cannot afford not to consider.
To deal with spike prices, we construct an artificial time series by decomposing the original time series into trend-cycle (expressed through orthogonal polynomials) and periodicities (expressed as a sum of harmonics with random phases).Deviations between observed and artificial prices outside the range: median plus/minus a factor of the median absolute deviation from the median, are considered anomalous residuals which may indicate abnormal price.These prices, are considered missing and replaced by a weighted average of the observed prices and the corresponding artificial prices.Although infrequent, negative or virtually zero prices do occur.These unusual prices can create problems with log transformation.So, prices less than one e/MWh are treated as missing values and imputed using the artificial time series.

Point Forecast
The generic time series is represented by 1 2 , , , n P P P  where n is the number of observations, which, in this section, comprises the training and the validation periods (which, taken together, form the fit period).The index of the hour is suppressed, but it is understood that t P refers to a daily time series of one of the hours.
There is no general consensus at present on the best method to be used for electricity prices modeling.In this context, we apply Box-Jenkins forecasting method which has proven to be flexible enough to accommodate the electricity price behavior satisfactory.See [12] [13].We do not investigate other time series approaches which are potentially useful, but lie beyond the scope of the present study.For a recent comprehensive survey see [14].
The general form of a sarimax model is where t 1 Some of the parameters may be zero or otherwise constrained, so that (2) could be a multiplicative seasonal are polynomials of order, respec-tively, p, P, q, Q and s indicates the length of the periodicity (seasonality).The same notation may be used to take into account multiple seasonality effects if necessary.Moreover, indicates m variables observed at day t influencing the price of electricity; j β is a parameter measuring how the price t P is related to the j-th variable , t j X .
To keep the problem of estimating Equation ( 1) tractable, we use only deterministic exogenous variables so we know exactly what they will be at any future time (e.g.calendar variables, polynomials or sinusoids in time).The choice of known or non-stochastic regressors simplifies the inferential procedures, including estimation and testing of the parameters.This choice is also suggested by the fact that stochastic exogenous regressors, which must also be forecast, is one of the possible causes of inefficiency in prediction intervals.See [15] [Sect.The estimation of the parameters can be realized by optimizing the log-likelihood function of (1), provided that , , , p q P Q are known and errors are Gaussian random variables.Since we ignore the order of the polynomials, the estimation should be repeated for different values of , , , p q P Q .If 0 , 0 , 0 , 0 ) distinct processes to be explored for each time series.We execute the search of the best process in automatic mode, over a limited set of distinct variations by using the function auto.arima() of the R package forecast with the option of a stepwise search to reduce the high computational cost of brute force search.
A common index to compare rival models is the bias-corrected version of the Akaike criterion where 2 ˆa σ denotes the estimated error variance of the candidate process.The process associated with the smallest AICC is presumed to be the best process.Let 0 L > be the number of prices to be foreseen (lead-time).

P E P I k L
+ + = =  .It turns out that, under reasonably weak conditions, the optimal forecast is the expected value of the series being forecast, conditional on available information.
Forecasting the regression term in (1) does not present particular difficulties because of the perfectly predictable nature of the regressors.The future values of the stochastic process term can be computed by using the infinite moving-average representation of the optimal process ( ) ( ) ( ) ( ) where (this constraint is equivalent to the requirement of roots outside the unit circle).The coefficients i ψ in ( 5) are functions of the parame- ters in (2) and can be easily obtained by recursive equations.See [17].In practice, however, the parameters of (2) have to be estimated, and it is customary to substitute estimated values into all the formulae.

Prediction Intervals
Short-term point forecasts cannot reflect all the uncertainties in the price of energy.In this regard, it is far more interesting to have information on how reliable the extent of the prediction is.In short, given a time series of n prices , , , n P P P  , we seek forecast limits such that the probability is ( ) , where , n k P is the price (/kWh) at a given hour of k days after day n and n is the last period at which a price is available.The point forecast , ˆn k P is obtained by identifying and estimating a SARMAX process to the fit period (i.e.training plus validation periods).
, , n k Q α is the quantile of order α of the distribution of the forecast error , n k e at origin n and lead-time k.If the hypothesis of Gaussianity is accepted for each k, then PIs can be derived from the standard formulae given by Box & Jenkins [18] , 2 ˆˆ. where with zero mean and variance one.Moreover, PIs in (7), typically called Box-Jenkins prediction intervals (BJ PIs), are the most commonly used even in the cases there are no specific reasons to assume a Gaussian distribution of the errors.[15] [Sect.7.7] illustrates various possible reasons why PIs in (7) are inadequate to encompass the required proportion of Open Journal of Statistics future prices and the lack of Gaussianity of the forecast errors is indicated as one of the causes.See also [19].

Williams Goodman Procedure
To simulate distribution of forecast errors, the time series is split into two parts: the "training" period and the "validation" period.As a preliminary, we choose a window size, ν (the number of consecutive daily prices) which, together with the maximum lead-time L, establish the complexity of the Williams Goodman (WG) procedure.
Initially, the training period contains prices for day 1 through ν whereas, the prices from ( ) act as the first validation period.The class of SARMAX model discussed in Section 3 is fit to the training time series to find the best process which minimizes the AICC criterion (4).The selected process is then used to calculate the L-step-ahead point forecasts: The post-sample forecast errors are obtained from differ- ence with the corresponding values of the validation period: Note that, in this case, 1,k P ν + is a real price and not a random quantity.
In the successive step, a block of γ contiguous prices is dropped from the start of the training period and, simultaneously, γ contiguous prices from the start of the validation period is shifted back to the end of the training period so that the second window contains prices for day ( ) The second validation period includes prices from ( ) , The procedure is iterated until the last training period ( ) ( ) and the last validation period ( ) : achieve the end n of the fit time series.Overall, the procedure forms ( ) L-step-ahead forecast prices and post-sample forecast errors.We can arrange errors as a matrix.
Rows correspond to different time origins and columns to different lead-times.
If the forecast error distributions are the same, then column k g can be intended as a sample of size r of the forecast errors that would have been made by the selected SARMAX process, at lead-time k across horizon origins 1, 2, , r ν ν ν + + +  .The construction of PIs requires knowledge of the quantiles of the forecast error distribution, which are typically unknown and have to be estimated.An obvious way to generate PIs is to assume k-step-ahead forecast errors follow a continuous distribution function.If the fitting is satisfactory, the quantiles of the distribution can be estimated and then prediction bounds determined for each lead-time.
Chatfield [15] [Sect.7.5.5],notice that, although the WG procedure is attractive in principle, it seems to have been underutilized, not only because of the lack of time series long enough to give credibility to the fit of empirical distributions, but also because of the heavy computational requirements involved.Of course, the length of the time series is not a problem with time series collected in the electricity market and analyzed in the present study.In addition, the effort required to implement WG for time series of moderate length (1000 -1500 time points) is compatible with the hardware/software resources generally at disposal.An R script is available from the authors upon request.

Density Selection
In the framework of electricity price forecasting, it might be reasonably argued that prices are not Gaussian (see, [7] [10]) but it is not clear what is precisely their distribution.In this subsection, however, we discuss the distributional properties of post-sample forecast errors whose behavior cannot be automatically deduced with certainty from observed values and/or in-sample forecast errors.If the empirical situation does not suggest an obvious choice, one can be selected among myriad examples of probability density functions (pdfs).Williams & Goodman [1] adapted a gamma density to the absolute value of the forecast errors.Isengildina-Massa et al. [20] found that forecast errors from the data set used in their study most often followed a logistic distribution.Bordignon & Lisi [21] and Lee & Scholtes [22] propose the Gaussian distribution.We choose the mathematical function in the first row of Table 1 after a long and complicated search for a powerful and versatile curve.Obviously, we are aware that, trying many densities and keeping the "best fitting" one, does not guarantee that another model will not look better than those we have already seen.
In all the densities, 1 θ controls the location of the distribution; 2 0 θ > af- fects the scale, 3 θ and 4 θ are shape parameters.The densities are referred to In using (10), a first problem to be solved is to determine which of the three families should be used and, once the class is selected, the next problem to be solved is to estimate the parameters.In both problems, we follow the technique proposed by Wheeler [24] as implemented in package in the package SuppDists with quantiles ( ) 0.1, 0.25, 0.5, 0.75, 0.9 .The column headed "R package" refers to the package used for parameter estimation.The notation "stats" indicates standard computational algorithms.The last column of Table 1 reports the technique of estimation of the parameters: mle (maximum likelihood), qme (quantile method), mme (method of moments).
The usual strategy behind fitting a given distribution to data is to identify the type of density curve and estimate the parameters that give the highest probability of producing the observed values.Instead, we follow an indirect approach: we compare the different density curves by testing how accurately the PIs generated by a SARMAX process, in tandem with Williams-Goodman method, capture the true prices.

Forecasting Accuracy
Let us consider the matrix of estimated forecast errors G discussed in the preceding section.For each lead-time ˆk v Q α be the α-th estimated quantile of the v-th distribution, 1, 2,3 v = .The quantiles are used to derive ( ) , : The means and standard deviations are computed over the post-sample errors , Notice that the mean of post-sample errors is not necessarily zero.
To assess the performance of the various PIs, we compare the prediction interval actual coverage (PIAC) to ( ) . The PIAC is measured by counting the number of true hourly prices of the target period enclosed in the bounds (11) , 100 where 0 otherwise If the PIs are accurate, then ( ) All other things being equal, nar-row PIs are desirable as they reduce the uncertainty associated with forecast-based decision making.However, there is a trade-off between PI widths and PIAC: the wider the PI, the higher the corresponding PIAC and hence the greater is the accuracy of predictions, at least to a certain extent, because very wide PIs are not practically useful.On the other hand, very sharp PIs with a low coverage probability are useless as well.It is necessary to introduce, in this connection, a scoring rule for addressing the sharpness of PIs.We use a score function of the form proposed by Winkler & Murphy [25] ( ) ( ) ( ) , The use of ratios facilitates comparability across price levels.The symbol ( )  13) and ( 15) should be judged keeping in mind the stochastic behavior of the electricity prices.Here, we have a potentially severe problem.Price peaks and valleys have been smoothed for the training and validation periods, but the same has not been done for the target period.These prices, in fact, are left as they are observed, to simulate real conditions.Spike prices, however, are recurring events and, therefore, it would not be surprising to find some of them in the target time series.Our SARMAX processes, being developed within a cleaned-up data set, have hardly any possibility to predict satisfactorily all, or at least a good part, of the outliers.Remaining outliers imply poor prediction intervals in practice.Further research is required to formulate a model which is not only generally enough to merge Box-Jenkins processes, WG prediction intervals and spike prices, but also it is numerically tractable to provide a quantitative description of the complex patterns of electricity market time series.

Predictive Performance
To this end, we analyze 144 24  , which include 81 different processes.Each process is combined with the WG procedure applied to any one of the density functions in Table 1.For completeness, we include the BJ PIs described in (7) and Tchebycheff PIs ( ) , with 1 where 2 0.5 e k σ δ = is the correction factor proposed by [Baskerville, 1972] and ˆk σ is the standard deviation introduced in (8).Expression ( 17) is a genuine ( ) is a monotone function of n k P + , but they do not have necessarily the same structure.For example, the interval ( 17) is asymmetrical even though the interval in the log scale is symmetrical.Furthermore, the anti-logarithms of forecasts are biased.See [26].On a first general examination, we note the consistent behavior between actual coverage rate (PIAC) and mean relative scores (MS), with the latter decreasing as the former increases.Naturally, this is a confirmation of the expected behavior of the score function (14).Tchebycheff intervals show, perhaps not surprisingly, the largest widths.Box-Jenkins prediction intervals (BJ PIs) appear to be the most conservative approach, i.e. it yields largest coverage rates, but with prevalently smaller widths than the Tchebycheff PIs.The substantial reliability in performance of the WG procedure based on Johnson's system, gamma, logistic and Gaussian distributions is due to actual coverage rates which are decidedly lower than the corresponding nominal coverage rates than BJ and Tchebycheff PIs.But, above all, the formers have a much smaller sharpness than the latters at all confidence levels.The distributions nested within Johnson's system come more frequently on top in terms of actual coverage probability and sharpness of the intervals and hence can be considered the optimal probability density within the experimental set up of our study.

Conclusions
Prediction intervals (PIs) are random sets designed to contain a future value with a given probability.The principal reason for constructing them is to provide an indication of the reliability of point forecasts avoiding a complete description of the probability distribution of the uncertainty associated with a prediction.Box-Jenkins or BJ PIs (the procedure in common use currently) assume Gaussian errors, known parameters and intervals are centered about the conditional expectation.Consequently, BJ PIs cannot take into account the variability due to parameter estimation and behave poorly when the errors are not Gaussian.Our findings confirm these observations.The primary concern in this paper is with the Williams & Goodman [1] (WG) procedure and the gain of accuracy brought by WG PIs in comparison to con-ventional BJ PIs.Our findings show point forecasts for day-ahead hourly prices in Italian wholesale electricity market may be of greater utility if accompanied by PIs calculated using WG centered around a member of the Johnson system of density functions.This mix offers PIs having a coverage rate greater than the nominal rate, but, what is more important is that this desirable conservativeness is not at the expense of widening the intervals, which instead are admirably sharp.The procedure has proven to be very competitive on two other procedures: Box-Jenkins and Tchebycheff PIs.

6. 5 ]
. According to the preparatory work for the present research, the most influential calendar variables are: days of the week, public holidays (official and religious) and daylight-saving time.Days immediately before and immediately after holidays are considered Saturdays and Mondays, respectively.Calendar effects are accounted for in the model by incorporating sets of dummy variables,where one of the categories is omitted to prevent complete collinearity.The dummy, or more precisely, binary variables in the process (1), precludes using the difference operators.It follows that, from now on the Box-Jenkins processes will be associated to the acronym SARMAX.The "burden of regular and seasonal non-stationarity" is placed entirely on the estimated parameters.In this sense, we require the roots of the polynomials ( ) unit circle, with no single root common to both polynomials.If this condition is satisfied then errors are stationary with finite variance.
inclusion of the next block of prices taken sequentially from the time points of the validation period not yet processed.The same class of models as in the initial step is fitted to the new training period, the new L-step-ahead forecasts calculated and the corresponding post-sample errors obtained at the time origin 1 ν γ + + as , 1, , density is fit to the absolute value of post-sample errors and hence 1 0 θ = .The system proposed by Johnson[23] contains three classes of distributions which are based on the transformation a standard Gaussian random variable and g (with derivative g′ ) represents an indicator function taking one if the argument is true and zero otherwise.The first addend in (14) reflects a cost associated with the width of the interval.The cost decreases as ( ) 1 α − increases, to compensate the tendency of the bounds to be broader as the confidence level increases.The other two addends penalize PIs if the target is outside the interval.The penalty increases with increasing distance from the nearest interval endpoints.The average of (14) across time points provides an indication of the sharpness of PIs

6 =×
different time series, one for each hour of the day and each zone of the Italian electricity market.All the daily time series are long 1148 days, but the last three weeks ( 21 L = ) are reserved for assessing the predictive accuracy of the intervals.Thus, only the first 1127 days are used for estimation and validation of SARMAX models.The size of the rolling win-dow is fixed at 959 ν = (15%) which leads to 168 r = samples of 21-step ahead forecasts.The search of the SARMAX processes is conducted within the bounds 2 The selected process serves to compute, standing at time n, forecast ˆn k

Table 1 .
Density functions for post-sample forecast errors.

Table 2
shows the results of PIAC and MS at ( )

Table 2 .
Quality of PiIs.Frct" denotes the percentage of times, out of the 144 cases studied, the corresponding density determined PIs with the lowest magnitude among all the PIs associated with an actual coverage rate greater than or equal to () " indicates that the corresponding distribution has never taken the first place in the two rankings of forecast accuracy used in this study.