Evaluating Volatility Forecasts with Ultra-High-Frequency Data—Evidence from the Australian Equity Market

Due to the unobserved nature of the true return variation process, one of the most challenging problems in the evaluation of volatility forecasts is finding an accurate benchmark proxy for ex-post volatility. This paper uses ultra-high-frequency data from the Australian equity market to construct an unbiased ex-post volatility estimator, and then uses it as a benchmark to evaluate various practical, GARCH-based volatility forecasting strategies. These forecasting strategies allow for skewed distributions of innovations and use various estimation windows in addition to the standard GARCH volatility models. In out-of-sample tests, we find that forecasting errors across all model specifications are systematically reduced when using the unbiased ex-post volatility estimator rather than realized volatility based on sparsely sampled intra-day data. In particular, we show that the three benchmark forecasting models outperform most of the modified strategies with different innovation distributions and estimation windows. Comparing the three standard GARCH class models, we find that the asymmetric power ARCH (APARCH) model exhibits the best forecasting power in both normal and financial turmoil periods, which indicates the ability of the APARCH model to capture the leptokurtic returns and stylized features of volatility in the Australian stock market.

Theoretical Economics Letters

Volatility forecasting is of central importance across different financial instruments and markets around the world. Important financial decisions such as portfolio optimisation, derivative pricing, risk management and financial regulation depend heavily on volatility forecasts. In derivative pricing, such as in the estimation of the Black-Scholes option pricing model, volatility is the only parameter that needs to be forecast. The prediction of volatility is also crucial in the development of Value at Risk (VaR) and a variety of systemic risk models, as well as in banking and finance regulation.
For example, under the Basel II and III Accords, all financial institutions must predict the volatility of their financial assets to incorporate the risk exposure into their capital requirements.
The focus of our study is the predictive ability of the popular AutoRegressive Conditional Heteroscedastic (ARCH) class of models, which originated from the seminal Nobel Prize-winning article by [1]; [2] generalized this framework to obtain the GARCH model. ARCH and GARCH models are the standard volatility forecasting models in econometrics and finance.
Documented stylized features of return variation, such as volatility clustering and the long memory effect, can be captured by GARCH class models, and the model parameters are relatively easy to estimate. A comprehensive survey of the GARCH family of models can be found in [3]. The current study selects three popular GARCH class models from the literature: the standard GARCH, the threshold GARCH (TGARCH) of [4] and [5], and the asymmetric power ARCH (APARCH) of [6]. In addition to the three standard models, we consider 12 corresponding forecasting strategies, which involve different estimation windows and error distributions.
The predictive power of a volatility model is evaluated with an out-of-sample test in which the predicted volatility generated by the model is compared with ex-post volatility measurements. Superior volatility forecasting models are expected to have small forecasting errors, measured as the difference between predicted and actual volatility. However, unlike the return, the volatility process cannot be observed. Therefore, in out-of-sample evaluation of volatility forecasting models, the crucial task is to find an accurate proxy for the underlying unobserved volatility process. In the mid-1990s, a series of empirical studies noted that while GARCH-type models fit time series of returns well, they failed to explain much of the variability in ex-post volatility measured by squared returns in out-of-sample tests. Hence, the practical usefulness of GARCH models was challenged. [7] responds to this critique and argues that the unsatisfactory empirical results are due to the noisy volatility proxies used in these studies, namely squared or absolute returns. In out-of-sample forecast evaluation, a common approach for assessing the practical performance of any model is to compare the fitted predictions derived from the model with the subsequent realizations. However, volatility is not directly observed and is treated as a latent variable in financial modelling, so the squared innovation return is usually employed as a proxy, which yields apparently poor explanatory power. This finding is not surprising and should not be taken as evidence of the poor predictive ability of a volatility model: although in statistical terms the squared innovation return is an unbiased estimator of the underlying variance, it is not an accurate one and displays a large degree of observation-by-observation variation. In summary, empirical findings of poor forecasting performance are due to unsatisfactory volatility proxies rather than to the predictive power of GARCH class models.
The recent availability of nearly continuous high-frequency transaction data makes it possible to compute market volatility explicitly from intra-day data, which is referred to as realized volatility in the literature. [7] shows that the basic GARCH(1,1) model performs rather well when it is evaluated against a volatility measure constructed from high-frequency data. This stems from the fact that high-frequency volatility is a more accurate measure of daily volatility than estimates from low-frequency data: instead of using only the opening or closing price of a trading day, as squared daily returns do, high-frequency volatility exploits the richer information contained in intra-day trading data. In the literature, five-minute intra-day data are commonly used to construct realized volatility. However, in our study we show that while contamination with microstructure noise can be reduced by constructing realized volatility from sparsely sampled high-frequency data, the resulting estimator is still biased.
Because of the crucial role of ex-post volatility measurement in evaluating forecasting performance, in addition to using five-minute tick data we calculate the realized variance from ultra-high-frequency data, relying on the Two Time Scale Realized Volatility (TSRV) estimator proposed by [11] and [12], which is shown to be an unbiased ex-post variance estimator. Unlike an arbitrarily and subjectively selected sampling frequency such as five minutes, the TSRV employs all available high-frequency intra-day data and therefore exploits the full information about return variation contained in the ultra-high-frequency data. Our results show that forecasting errors based on TSRV and ultra-high-frequency data are significantly lower than those based on sparsely sampled intra-day data.
This paper attempts to mimic volatility forecasting strategies using GARCH family models as applied in practice, and examines the natural questions arising from employing such strategies in the Australian stock market: which model specification forecasts best, whether modified innovation distributions or estimation windows help, and whether the answers change during the Global Financial Crisis (GFC) of 2008. These questions are of concern for financial practitioners. The current paper responds to them by examining the empirical predictive power of various strategies in pure out-of-sample tests, relying on a recently developed unbiased ex-post volatility proxy and using ultra-high-frequency trading data in the Australian equity market.
Our research is also motivated by the fact that no such study has previously been conducted on the Australian stock market [13]. The Asymmetric Power ARCH model nests at least seven other GARCH-type models; see [6]. We note that the volatility forecasting procedure presented in this study is not limited to GARCH class models: in practice, the out-of-sample predictive performance of any volatility model can be evaluated with our procedure.
The remainder of the paper is organized as follows: Section 2 provides a brief discussion of the theory of measuring high-frequency volatility. Section 3 and Section 4 respectively describe the forecasting models and the detailed procedures used to compare prediction performance. Section 5 describes the daily data and tick data used in this paper. The main empirical results are presented in Section 6 and conclusions are drawn in Section 7.

Theoretical Set-Up
Unlike the raw return, the actual daily return volatility process usually cannot be directly observed, because there is just one daily return per trading day.
Conventionally, volatility is treated as a latent variable (one that cannot be directly observed and must be inferred from other observed variables) in parametric models such as GARCH-type and stochastic volatility (SV) models, which are estimated from ex-post low-frequency return data. Volatility measurement by these models rests on specific distributional assumptions and usually involves complex procedures for estimating model parameters. [15] introduced the concept of realized volatility. The realized volatility is a non-parametric estimator that does not rely on distributional assumptions or estimated parameters.
In its standard form, realized volatility is a second-order sample moment, that is, the sum of squared high-frequency returns over a fixed period, say one day: $RV_t = \sum_{i=1}^{n} r_{t,i}^2$, where $r_{t,i}$ is the $i$th high-frequency return for day $t$.
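As a concrete illustration, the realized variance for one day is just a sum of squared intra-day returns. The sketch below uses simulated five-minute returns rather than real ASX data, and the function name is ours:

```python
import numpy as np

def realized_volatility(intraday_returns):
    """Daily realized variance RV_t: the sum of the squared
    high-frequency returns r_{t,i} over the trading day."""
    r = np.asarray(intraday_returns, dtype=float)
    return float(np.sum(r ** 2))

# Illustration with simulated five-minute returns: 72 intervals
# over a 6-hour ASX trading day (10 am to 4 pm).
rng = np.random.default_rng(0)
five_min_returns = rng.normal(0.0, 0.001, size=72)
rv_t = realized_volatility(five_min_returns)
```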
In financial asset pricing models, the asset price is assumed to be driven by a continuous-time diffusion process, $\mathrm{d}p_t = \mu_t\,\mathrm{d}t + \sigma_t\,\mathrm{d}W_t$, where $p_t$ is the log price, $W_t$ is a standard Brownian motion, and $\mu_t$ and $\sigma_t$ are the drift and the instantaneous volatility. [18] prove that if the underlying asset log price is a semi-martingale, quadratic variation theory ensures that $RV_t$ converges in probability to the quadratic variation ($QV_t$) of the asset return, which in the pure diffusion case equals the integrated variance $QV_t = \int_{t-1}^{t} \sigma_s^2\,\mathrm{d}s$; this is the actual underlying volatility we would like to measure in the continuous framework. Thus, non-parametric realized volatility provides an efficient measure of daily market volatility and allows us to treat realized volatility as an observable variable rather than a latent one. (In the literature, realized volatility and realized variance are often used interchangeably.)
However, high-frequency raw data are contaminated by microstructure noise reflecting market frictions such as bid-ask bounce and price discreteness. Mathematically, the observed log price can be decomposed into two parts, $Y_t = X_t + \varepsilon_t$, where $X_t$ is the latent price process and $\varepsilon_t$ denotes microstructure noise, which is independent of $X_t$. It can then be shown that $E[RV_t] = QV_t + 2nE[\varepsilon^2]$ (Equation (1)), where $n$ is the sampling frequency. Note that in Equation (1), as the sampling frequency increases, the integrated variation that measures the actual volatility of the true price process is swamped by the error term. Hence, it may be unwise to sample the data too often when estimating RV. One way to mitigate the contamination caused by microstructure noise is to sample the high-frequency data sparsely, for example every five minutes instead of every second. This reduces the bias term, because $n_{sparse} < n$. In empirical work, a five-minute interval is widely used as the sampling frequency. However, although the effect of the bias term can be mitigated, sparse sampling cannot completely remove it. Moreover, too much data are thrown away if high-frequency data are sampled every five minutes, which is statistically inefficient. In recent years, a few consistent estimators have been proposed that accurately calculate realized volatility by directly modelling microstructure noise; theoretical and simulation studies show that they improve the estimation to a large extent (see [11] [12] [19] [20]).
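The effect of the noise term can be seen in a small simulation. The sketch below uses our own illustrative parameters (not the paper's data): a latent Brownian log price with daily integrated variance $0.01^2$ plus i.i.d. noise, comparing tick-frequency RV with five-minute RV:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_observed_prices(n, daily_vol=0.01, noise_sd=0.0005):
    """Latent log price X (Brownian increments whose variances sum to
    daily_vol**2) plus i.i.d. microstructure noise eps."""
    x = np.cumsum(rng.normal(0.0, daily_vol / np.sqrt(n), size=n + 1))
    eps = rng.normal(0.0, noise_sd, size=n + 1)
    return x + eps

def rv(prices, step=1):
    """Realized variance from prices sampled every `step` ticks."""
    return float(np.sum(np.diff(prices[::step]) ** 2))

n = 21600                         # one tick per second, 6-hour session
true_qv = 0.01 ** 2               # integrated variance of latent price
prices = simulate_observed_prices(n)

rv_all = rv(prices)               # every tick: bias about 2*n*E[eps^2]
rv_sparse = rv(prices, step=300)  # five-minute sampling: smaller bias
```

With these numbers the tick-by-tick RV is dominated by the bias term $2nE[\varepsilon^2]$, while the five-minute RV is close to, but still above, the true integrated variance, illustrating why sparse sampling mitigates the bias without removing it.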

An Unbiased Ex-Post Volatility Estimator Using Ultra-High-Frequency Data
As discussed above, estimating daily volatility ($RV_t$) using all the high-frequency observations leads to a rather unreliable result. Sparse sampling can mitigate the effect of microstructure noise, but at the cost of discarding a huge amount of data, which is not advisable; moreover, sparsely sampled estimators remain statistically biased. [11] proposes a method that utilizes the full data set and provides a consistent estimator, the TSRV, derived under the assumption that the noise is independently and identically distributed (i.i.d.) and independent of the price process. It involves three steps. 1) Sub-sampling: partition the full grid of observations into sub-grids and compute a realized volatility estimator on each sub-grid.
2) Averaging: average these sub-sampled estimators. The averaged estimator carries a bias of $2\bar{n}E[\varepsilon^2]$, which is smaller than the original bias $2nE[\varepsilon^2]$ because $\bar{n}$, the average sub-sample size (approximately $n/M$ for $M$ sub-grids), is much smaller than the full sample size $n$.
3) Bias correction: to remove the remaining bias, the noise variance $E[\varepsilon^2]$ must be estimated. According to [11], it can be estimated by $\widehat{E[\varepsilon^2]} = RV_t^{(all)}/(2n)$, where $RV_t^{(all)}$ is the realized variance computed from the full data set. To the best of our knowledge, TSRV is the first proposed consistent estimator of quadratic variation ($QV_t$). [21] further proposes a multi-scale RV (MSRV) that generalizes the TSRV in Equation (2) by combining more than two time scales. However, MSRV is difficult to implement in practice, and it does not significantly outperform TSRV. [12] examine the TSRV after relaxing the assumption of i.i.d. microstructure noise and find that TSRV works even when the noise is serially dependent.
In implementing the TSRV calculation, we first partition the full grid of data points into $M$ sub-grids. For each sub-sample $m$ we calculate the RV by summing the squared returns, and then average the estimators: $RV_t^{(avg)} = \frac{1}{M}\sum_{m=1}^{M} RV_t^{(m)}$. The next step is to subtract the noise term from this estimator to render it unbiased. To perform this final step, we estimate the noise variance from the realized variance computed over the full data set, $\widehat{E[\varepsilon^2]} = RV_t^{(all)}/(2n)$. Hence the final unbiased estimator is $TSRV_t = RV_t^{(avg)} - \frac{\bar{n}}{n}\,RV_t^{(all)}$ (Equation (2)).
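The three steps can be sketched in code. This is our illustrative implementation under the i.i.d.-noise assumption, with the $M$ sub-grids formed by taking every $M$th observation at $M$ different offsets; function and variable names are ours, not the paper's:

```python
import numpy as np

def tsrv(log_prices, M):
    """Two Time Scale Realized Volatility: average the sub-sampled
    RVs (slow scale) and subtract the noise bias estimated from the
    full-grid RV (fast scale)."""
    p = np.asarray(log_prices, dtype=float)
    n = len(p) - 1                          # number of tick returns

    # Fast scale: RV on the full grid, dominated by 2*n*E[eps^2]
    rv_all = float(np.sum(np.diff(p) ** 2))

    # Slow scale: average RV over the M offset sub-grids
    rv_avg = float(np.mean([np.sum(np.diff(p[m::M]) ** 2)
                            for m in range(M)]))

    # Bias correction with the average sub-sample size n_bar
    n_bar = (n - M + 1) / M
    return rv_avg - (n_bar / n) * rv_all
```

With one observation per second and M = 300, each sub-grid is effectively a five-minute grid, but all 300 offsets are used instead of one arbitrary choice, so no data are discarded.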

Forecasting Models
This paper examines the empirical accuracy of various approaches to predicting the conditional daily volatility $\sigma_t$. Let $I_t$ denote the information set at time $t$. Then for $h$-step-ahead volatility forecasts we have $\sigma_{t+h}^2 = \mathrm{Var}(r_{t+h} \mid I_t) = \mathrm{Var}(a_{t+h} \mid I_t)$, where $r_t$ is the log return series, $a_t = \sigma_t \varepsilon_t$ are the innovations, and $\varepsilon_t$ follows a distribution with zero mean and unit variance. The predicted value of $\sigma_{t+h}$ is obtained from an alternative GARCH-type model estimated by maximum likelihood estimation (MLE) using the historical daily returns in the sample. The natural choice for the benchmark model is the GARCH model proposed by [2]. The simple GARCH(1,1) has the form $\sigma_t^2 = \omega + \alpha a_{t-1}^2 + \beta \sigma_{t-1}^2$. GARCH models greatly improve practical usability by adequately describing the features of asset return volatility while limiting the number of parameters to be estimated compared with the ARCH model of [1]. It has been documented that the simple GARCH(1,1) model can capture most of the volatility dynamics of financial asset returns. A comprehensive study by [22] also shows that most improved versions of GARCH models do not significantly outperform GARCH(1,1) in forecasting volatility.
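To make the recursion concrete, a minimal sketch of the GARCH(1,1) variance filter and its $h$-step-ahead forecast is shown below; the parameter values are illustrative, whereas in the paper they come from MLE on the daily returns:

```python
import numpy as np

def garch11_filter(returns, omega, alpha, beta):
    """Conditional variance recursion of GARCH(1,1):
    sigma2_t = omega + alpha * a_{t-1}^2 + beta * sigma2_{t-1}.
    Initialized at the unconditional variance."""
    a = np.asarray(returns, dtype=float)     # demeaned returns
    sigma2 = np.empty(len(a) + 1)
    sigma2[0] = omega / (1.0 - alpha - beta)
    for t in range(len(a)):
        sigma2[t + 1] = omega + alpha * a[t] ** 2 + beta * sigma2[t]
    return sigma2

def garch11_forecast(sigma2_next, omega, alpha, beta, h):
    """h-step-ahead variance forecast: mean-reverts from the one-step
    forecast sigma2_next to the unconditional variance at rate
    (alpha + beta) per step."""
    uncond = omega / (1.0 - alpha - beta)
    return uncond + (alpha + beta) ** (h - 1) * (sigma2_next - uncond)
```

The forecast formula makes the persistence explicit: for $h = 1$ it returns the one-step forecast, and as $h$ grows it converges to the unconditional variance $\omega/(1-\alpha-\beta)$.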
However, the GARCH model treats price increases and drops symmetrically and fails to capture asymmetric effects on volatility. Empirical evidence shows that volatility usually responds differently to a large positive return than to a large negative return. To overcome this weakness of the GARCH model and capture the asymmetric effect, the TGARCH model was proposed by [4] and [5].
TGARCH(1,1) assumes the form $\sigma_t^2 = \omega + (\alpha + \gamma \mathbf{1}_{\{a_{t-1} < 0\}}) a_{t-1}^2 + \beta \sigma_{t-1}^2$, where the indicator $\mathbf{1}_{\{a_{t-1} < 0\}}$ allows negative innovations to carry the extra weight $\gamma$. The last model examined in this paper is the APARCH model of [6]. This model is commonly used in practice, and [23] shows that it nests many other GARCH models and can capture the long memory feature of the volatility process. The APARCH has the form $\sigma_t^\delta = \omega + \alpha (|a_{t-1}| - \gamma a_{t-1})^\delta + \beta \sigma_{t-1}^\delta$. Here, $\varepsilon_t$ follows a general distribution with zero mean and unit variance, and $\delta$ is a positive real number. Table 1 summarises the in-sample performance of the GARCH, TGARCH and APARCH models over our sample period spanning 2002-2012. All of the estimated parameters are significantly different from zero, mostly at the 1% level. These volatility models are estimated by the conditional maximum likelihood method; Quasi Maximum Likelihood estimation is also used and yields similar results. The adequacy of the fitted models can be checked through the standardized residual series: the $a_t$ process should be serially uncorrelated if the volatility model adequately captures the variation in asset returns, and our results indicate that, after scaling by the fitted conditional volatility, the residuals are indeed uncorrelated. Moreover, the skewness coefficient of our sample log returns is −0.582633, which suggests the log return is negatively skewed. We performed a formal test for the skewness: in the case of the GARCH model, the skew parameter is 0.8487 with a standard error of 0.0248, giving a t-ratio of (0.8487 − 1)/0.0248 ≈ −6.1 against the symmetric value of one, so symmetry is strongly rejected. Therefore, to deal with the skewness of daily returns, the volatility models are also re-estimated with skewed innovation distributions, as described in Section 4.
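For comparison, the one-step variance updates of the two asymmetric models can be sketched as follows. The parameter names are ours; note that with $\delta = 2$ and $\gamma = 0$ the APARCH update collapses to the plain GARCH(1,1) update:

```python
def tgarch_update(sigma2_prev, a_prev, omega, alpha, gamma, beta):
    """TGARCH/GJR update of the conditional variance sigma^2:
    negative innovations receive the extra weight gamma."""
    ind = 1.0 if a_prev < 0 else 0.0
    return omega + (alpha + gamma * ind) * a_prev ** 2 + beta * sigma2_prev

def aparch_update(sigma_prev, a_prev, omega, alpha, gamma, beta, delta):
    """APARCH update, written on the sigma^delta scale:
    sigma_t^d = omega + alpha * (|a| - gamma * a)^d + beta * sigma_{t-1}^d.
    Returns the conditional volatility sigma_t (not its square)."""
    s_pow = (omega + alpha * (abs(a_prev) - gamma * a_prev) ** delta
             + beta * sigma_prev ** delta)
    return s_pow ** (1.0 / delta)
```

A quick check of the asymmetry: with $\gamma > 0$, a negative innovation of the same magnitude produces a larger next-period volatility than a positive one in both models.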

Comparison Procedure
To compare the predictive performance of the various volatility forecasting models, the benchmark strategy uses a growing estimation window that employs all available data on and before each trading day in the forecasting sample, fitting the model by maximising the Gaussian likelihood function.
In addition, instead of the growing estimation window of the benchmark, forecasting strategies with a medium rolling window of three years and a short rolling window of one year are tested. Further, to handle the skewness of daily returns, we also re-estimate and evaluate the models with Student-t and skew Student-t innovations. Overall, we compare three GARCH-type models, each under five prediction strategies.
We adopt the recursive forecasting method to obtain forecasted daily volatility.
In the implementation, for each trading day t in the out-of-sample forecasting period we estimate the GARCH class volatility models using all available daily data before date t. Then, for each GARCH-type model and each forecasting strategy, the fitted model is used to generate forecasts at multiple horizons: one day, one week, two weeks and one month (usually, one week has 5 trading days and one month has 22 trading days). In this way, a series of overlapping forecast paths is generated for each strategy. As the objective of the paper is to evaluate the empirical accuracy of each strategy, we compare the predicted daily volatility along the forecasting path with the high-frequency volatility, which is treated as a proxy for the true underlying return variation process. The high-frequency daily volatility is calculated using both five-minute trading data (5-min RV) and the TSRV introduced in Section 2. As 5-min RV is noisy while TSRV is a robust and unbiased estimator of high-frequency volatility, our results rely mainly on the latter.
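The recursive procedure above can be sketched as a generic loop. Here `fit` and `forecast` are placeholders of ours, standing in for the MLE of a GARCH-type model and its variance forecast; `window=None` reproduces the growing-window benchmark, while a finite `window` gives the rolling strategies:

```python
import numpy as np

def forecast_paths(returns, fit, forecast, horizons=(1, 5, 10, 22),
                   window=None):
    """Recursive out-of-sample forecasting loop.

    fit(sample) -> fitted model; forecast(model, h) -> variance
    forecast h steps ahead. The second half of the sample is used as
    the out-of-sample period, re-estimating the model each day."""
    paths = {h: [] for h in horizons}
    for t in range(len(returns) // 2, len(returns)):
        start = 0 if window is None else max(0, t - window)
        model = fit(returns[start:t])      # re-estimate on day t
        for h in horizons:
            paths[h].append(forecast(model, h))
    return paths
```

For a one-year rolling window one would pass roughly `window=252` trading days, and `window=756` for the three-year window.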
The comparison of empirical forecasting accuracy is based on loss functions, which measure the magnitude of the difference between the predicted volatility and the realized ex-post volatility proxy; the model with the smaller forecasting loss is considered superior in predictability. We take the two loss functions suggested by [24] and [25], the Mean Squared Error (MSE) and the Quasi-Likelihood loss (QL), defined as $MSE = (RV_{t+h} - \hat{\sigma}^2_{t+h|t})^2$ and $QL = \frac{RV_{t+h}}{\hat{\sigma}^2_{t+h|t}} - \log\frac{RV_{t+h}}{\hat{\sigma}^2_{t+h|t}} - 1$, where $RV_{t+h}$ is the ex-post realized volatility at time $t+h$ and $\hat{\sigma}^2_{t+h|t}$ is the corresponding predicted value. Empirical studies by [22] [24] and [26] suggest that QL is the more robust function for comparing forecasting losses, specifically across periods with different volatility levels, because the QL loss depends only on the scaled residual whereas MSE is determined by the additive error. As a robustness test, we apply the Diebold-Mariano test (for details see [27]) to statistically examine the losses (MSE and QL) of each forecasting strategy; the null hypothesis is that the two forecasts have the same accuracy, against the alternative that one forecast is more accurate. The ASX uses a call auction procedure at the opening and closing of each trading day. Therefore, to avoid abnormal trading patterns around the start and end of the trading day, trading data for the opening and closing call auctions are discarded, and only trading data collected during normal trading hours (10 am to 4 pm) are considered. We also exclude the overnight return. Due to recording errors in the tick data, the intra-day data for some trading dates are missing; we reconcile the tick data and daily data to match the volatility for each date. In its final form, our sample has 2791 trading days with 1,211,122 trading records.
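The two loss functions and the DM statistic can be written down directly. This sketch uses the simple one-step DM variance; overlapping multi-step forecasts would need a HAC (e.g. Newey-West) variance instead:

```python
import numpy as np

def mse_loss(rv, f):
    """MSE: squared difference between ex-post RV and forecast."""
    return (np.asarray(rv) - np.asarray(f)) ** 2

def ql_loss(rv, f):
    """QLIKE: depends only on the ratio rv/f, so it compares more
    fairly across high- and low-volatility periods."""
    x = np.asarray(rv) / np.asarray(f)
    return x - np.log(x) - 1.0

def diebold_mariano(loss_a, loss_b):
    """DM statistic for H0: equal accuracy. Positive values favour
    forecast b; compare with standard normal critical values.
    (One-step version; no HAC correction applied.)"""
    d = np.asarray(loss_a) - np.asarray(loss_b)
    return float(np.mean(d) / np.sqrt(np.var(d, ddof=1) / len(d)))
```

Note that a perfect forecast gives QL = 0, and QL is minimized only at `f = rv`, which is why it is a valid ("robust") loss for ranking variance forecasts.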

Empirical Results
In this section, forecasting results for each separate GARCH-type model with modified strategies are reported. We also directly compare and test the forecasting accuracy of all three GARCH-type models. We examine whether the ranking of forecasting models changes during the GFC in 2008, and the results are shown in the Appendix.

Comparing Forecasting Accuracy for Various Volatility Forecasting Strategies
We begin by comparing the out-of-sample prediction performance of the various strategies with respect to the benchmark for the full period from 2005 to 2013.
The benchmark is the GARCH-type model with Gaussian innovations estimated using all available data from 2002, and the competing strategies are modifications of the benchmark with different estimation windows and innovation distributions. Both the 5-min RV and the more robust TSRV are calculated from intra-day trading data according to the specifications in Section 2. They are treated as proxies for the underlying volatility process and used to compute prediction losses by comparison with the predicted value from each volatility forecasting model specification.
The forecasting performance of the GARCH, TGARCH and APARCH volatility models, which are recursively estimated, is reported in Table 2 to Table 4 respectively. The prediction losses in the tables are based on the two loss functions (QL and MSE) and the two realized volatility measures (5-min RV and TSRV). It is apparent from Tables 2-4 that 5-min RV yields systematically higher losses than TSRV, as expected. This suggests that TSRV is a more precise proxy for ex-post volatility than the sparsely sampled 5-min RV, because five-minute sampling discards too many high-frequency observations along with the information they contain. Thus, we interpret the results relying on the out-of-sample forecasting losses yielded by the TSRV. We note further from Tables 2-4 that, except for the standard GARCH model under the skew Student-t distribution, there are few systematic improvements in forecasting accuracy when the benchmark strategy is modified. In other words, the daily volatility forecasts obtained from the modified GARCH-type strategies rarely improve on the benchmark (Table 2 to Table 4). In Table 3, the TGARCH with skewed innovation distribution appears to reduce forecasting accuracy for the QL loss with the TSRV proxy, compared with the benchmark. A possible reason is that the effect of negative skewness is already incorporated in the standard TGARCH, because the TGARCH is designed to capture the heavier influence of past negative returns.
Using rolling estimation windows dramatically worsens forecasting performance. In the case of QL loss with the TSRV proxy, accuracy decreases by more than 20%, and in other cases the reduction is as much as 80%.
The one-year rolling estimation window has the largest forecasting losses; the medium three-year rolling window does better, but is still much worse than the full-sample growing estimation window. This suggests that the information contained in a short estimation period is not sufficient to capture the dynamics of the volatility process.
Finally, we move to the direct comparison of the three benchmark GARCH models. Table 5 displays the forecast performance of each GARCH-type model over the sample period 2005-2013. As discussed earlier, the benchmark strategy for each model works well, so only the loss functions for the benchmark at each forecast horizon are shown in Table 5, with the best performance (smallest prediction loss) highlighted. We also use the Diebold-Mariano test to evaluate the significance of improvements in forecast accuracy. From the results, we see that both the TGARCH and APARCH have significantly better prediction performance than the GARCH. Further, the APARCH usually gives the best forecast; in other words, the asymmetric specification captures the asymmetry in return volatility in the Australian equity market. The MSE results favour the GARCH for the one-day-ahead forecast, but this should not be considered a discrepancy because the result is not significant and the losses are very similar. We also examine the performance of our selected volatility forecasting strategies during the recent financial turmoil period, applying the same procedure to the GFC period in 2008. We find that during financial turmoil the forecast loss of each model increases significantly, but the overall ranking of the models does not change. The details and results are provided in the Appendix.

Conclusion
Volatility models are constructed to predict volatility, which is an essential input in financial asset pricing models and risk management practice. In this paper, we empirically investigate the predictive ability of various volatility forecasting strategies employing GARCH class models in the Australian equity market, which has several distinguishing features. We specifically examine which specifications of GARCH models provide the best forecasts; whether we should allow for heavy-tailed and skewed distributions of innovations in our estimation; and whether we should use growing or rolling estimation windows.
Because of the crucial role of ex-post volatility estimators in out-of-sample tests of the predictive ability of volatility models, and because realized volatility constructed from five-minute intra-day data is arbitrary, subjective and biased, our analyses rely primarily on an unbiased volatility estimator, the TSRV, which utilizes ultra-high-frequency intra-day data.
In the pure out-of-sample test, we evaluate the predictive abilities of the volatility forecasting strategies over the full sample as well as during the recent GFC, when the volatility level was high.

Appendix
We examine the forecast ability of the GARCH models during the recent GFC. Figure A1 plots the time series of the daily return; as expected, the variation in asset returns is dramatically higher during the turmoil around the autumn of 2008. We choose a test period of three months beginning 15 September 2008, when Lehman Brothers Holdings filed for bankruptcy protection, and the three-month period from April to July 2008 as a control period. A procedure similar to that for the full sample period is applied to these periods. The annualized unconditional daily variance for the GFC period is 0.224, roughly four times the control-period value of 0.055. Tables A1-A3 report the forecast losses for the control period April-July 2008; we repeat the procedure for the turmoil period September-December 2008, with results presented in Tables A4-A6. From the "benchmark" column in each table, we can see that the forecast losses increase dramatically during the GFC. For instance, the TGARCH benchmark strategy has a QL loss of 0.4 for the second quarter of 2008, which rises to 0.6 in the autumn. However, the results show a similar tendency to the full sample period (2005-2013). While the benchmark strategies across all three GARCH models are seldom beaten by their corresponding variations, the skewed Student-t likelihood can significantly improve the forecast accuracy of each GARCH model. We also observe that during the GFC the one-year rolling estimation window improves the forecasts at longer horizons, because historical data from earlier periods have less predictive power during financial turmoil; however, once the stronger effect of negative returns is accounted for in the TGARCH and APARCH models, these improvements disappear.
Table A7 provides the results of the direct comparison of the GARCH class models during the normal and turmoil periods. Similar to the full sample period, the TGARCH and APARCH models perform better than the GARCH model. The APARCH model usually provides the best forecast accuracy, particularly at the longer forecast horizons during the normal period in 2008, and it wins across all cases in the GFC period. This reflects the important role of large negative returns in predicting future volatility during times of financial turmoil. Overall, although forecast accuracy is reduced during the GFC, the ranking of the forecasting models is unchanged.
Figure A1. Plot of daily log return of ASX200.
Notes to the appendix tables: the first column under "forecasting strategies" reports the out-of-sample losses (QL and MSE) of the benchmark forecasting strategy at multiple steps ahead (1, 5, 10 and 22 trading days). The second through last columns report the percentage gains or losses achieved by modifying the estimation strategy with 1) Student-t innovations, 2) skew Student-t innovations, 3) a one-year rolling estimation window, and 4) a three-year rolling estimation window. Asterisks after the percentage gains denote the significance of the Diebold-Mariano test of whether the modification improves forecasting accuracy: *10%, **5%, ***1%. The same layout applies to the table evaluating the APARCH volatility forecast strategies for the period September-December 2008.