Forecasting Density Function: Application in Finance

With increasing availability of data, in many situations it is now possible to reasonably estimate the probability density function (pdf) of a random variable. This is far more informative than us-ing a few summary statistics like mean or variance. In this paper, we propose a method of forecasting the density function based on a time series of estimated density functions. The proposed method uses kernel estimation to pre-process the raw data followed by dimension reduction using functional principal components analysis (FPCA). Then we fit Vector ARMA models to the reduced data to make a prediction of the principal component scores, which can then be used to obtain the forecast for density function. We need to transform and scale the forecasts to ensure non-nega-tiveness and integration to one. We compared our method to [1] for histogram forecasts, on simulated data as well as real data from S&P 500 and the Bombay Stock Exchange. The results showed that our method performed better on both the datasets and the simulation using uniform and Hilbert distance. The time dependence and complexity of density function are different for the two markets, which is captured by our analysis.


Introduction
Contemporaneous aggregation is often the only way to analyze temporal data, for example, considering the observations of a variable measured through time in a population, e.g. the monthly output of firms in a country.If the individuals considered are not the same through time, then it is not possible to deal with the longitudinal data.However if one is interested in the overall evolution of all firms, histograms or densities can still be studied.[2] displayed a time series of the weekly returns of the firms in the S&P 500, summarized by histograms.This is an interesting precedent which shows that in some cases, histograms are preferable to averages or totals.[1] used smoothing and non-parametric method to estimate and predict histogram time series on S&P 500 data.In this paper, we follow-up on this idea, but replace histograms by density estimates.Kernel density estimates are smooth estimates of the probability density function and do not depend on the choice of end-points as opposed to histograms.
Density function estimation has been widely used in many areas such as finance [3], energy market [4] and meteorology [5].Although there is a rich literature on density estimation, there is very little work on density function forecast.The problem studied in this paper is different from the above references, in that, we consider replications of density estimates observed over time as opposed to just one density.To this end, we utilize the technique of Functional Data Analysis (FDA).
FDA is a popular statistical technique that treats entire curves as units of data (see [6] for an introduction).The theory of estimation of individual functions from discrete observations is quite well developed and these curves are eventually used in various applications like regression, clustering, time series etc.
The difference between a general function and a density function lies in the fact that density functions are nonnegative everywhere and their integral over the whole space is always equal to one.These restrictions pose challenges to using functional data methodology to densities directly.[7] [8] used a differential equation method to fit a smooth and monotone function and with the exponential transformation and post hoc procedure to finally obtain a fitted density function.[9] used FPCA to fit density functions and viewed them as independent and identically distributed to make statistical inference.Other possibilities are to deal with the cumulative density function or quantile function (see [10]) instead of density, which also requires addressing challenges like smoothness, monotonicity and fixed support etc.
The paper is organized as follows.In Section 2 an overview of the methodology used in this paper is given, followed by a detail introduction of the three main statistical methods used: kernel estimation, Functional Principal Component Analysis(FPCA) and Vector Autoregression and Moving Average (VARMA).A simulation analysis is conducted on Section 3 to validate the performance of the proposed method and compare it to the performance of the method of [1] for prediction with a histogram time series using uniform and Hilbert norm distances.Two applications on main stock index of U.S. and India-S&P 500 and Bombay Stock Exchange are presented in Section 4 and 5 respectively.We present our conclusions in Section 6.

Methodology
Each dataset consists a set of t n observations from a distribution t F and density t f for 1, , t T =  .The aim is to obtain a forecast for the distribution 1 T f + .For each dataset, first we estimate a density function for each time point ˆt f using kernel based estimation method as described in detail in section 2.1.In the next step, we do FPCA on the density functions as explained in section 2.2.In section 2.3, we outline the modeling of the principal component scores ti ξ 's as a multivariate time series and fitting a VARMA (p, q) model.The asymptotic results for time series of functional data are available in [11].Since we are dealing with density functions, we need extra steps of transformation and normalization, so that the forecast functions are nonnegative and integrate to one.These are explained in section 2.4.The specification of distance between two functions is handled in section 2.5.

Kernel Density Estimation
We are interested in estimating the probability density function (pdf) f F′ = .In the rest of the paper we use density to signify the pdf.
As a motivation of the following method, observe that: where the approximation holds for "small" h.Replacing F by the empirical distribution and h by a sequence ( ) as n → ∞ we obtain the estimator: ( ) This density estimator can be understood as follows.First observe that the discrete empirical distribution gives mass 1 n at i X .The point mass is replaced by a uniform distribution centered at i X , with support of length n h .These uniform distributions are averaged to obtain the proposed estimator.Instead of spreading the mass uniformly, an arbitrary pdf 1 X can be used.This leads to a class of estimators of the form: ( ) where K is a pdf, i.e.
( ) 0 K x > , for all x and . Such as estimator is called kernel estimator, with the kernel K and bandwidth n h , both chosen by the user.Observe that f automatically is a pdf, as we have ( ) ˆ0 n f x ≥ for all x, and the integration of ( ) The choice of the kernel usually is not that crucial.The estimator in ( 1) is a special case with , the uniform kernel.Some common choices for K are Uniform, Normal, Logistic and Epanechnikov.The bandwidth n h is also called the smoothing parameter, as it determines how far the mass is spread out locally and hence how smooth the resulting estimator becomes.The choice of the bandwidth has a big influence on the performance of the estimator.For instance, for the uniform kernel, if the bandwidth is very large, the estimator would be flat.On the other hand, as the bandwidth becomes smaller, ˆn f consists of larger spikes around the observations.As estimators of a pdf, both of the extreme cases are undesirable.The bias increases with increasing bandwidth, and the variance increases with decreasing bandwidth.This trade off indicates that an optimal choice of the bandwidth balancing between bias and variance, can be found through a minimization problem using standards like mean squared error.Under certain mild regularity condition, if ( ) as n → ∞ , then the kernel estimator as given in (2) satisfies: where ( ) .

Principal Component Analysis of Functional Data
Consider a sample of T smooth random trajectories , , 1, , Following [12], throughout we assume that f is an element of the Hilbert space ( ) and the norm , s.The sample trajectories are modeled as realization of a stochastic process ( ) There is an orthogonal expansion of G in terms of eigenfunctions i φ and non-increasing eigenvalues i λ as: , , , , , , α α are the domain of each trajectory.The Karhunen Loève theorem then provides a representation of individual random pdfs, given by , , , 1, , , where the i ξ are uncorrelated random variables with zero mean and variance The deviation of each sample trajectory from the mean is thus a sum of orthogonal curves with uncorrelated random amplitudes.
Often it is realistic to incorporate uncorrelated measurement errors with mean zero and constant variance 2 σ into the model, reflecting additional variation in the measurements, compare [13].Let ( ) , , , where In special cases, one might assume in addition that the , ti tj W ξ are all jointly normally distributed, but generally we do not make such assumption.
Under Equation ( 5) and with indicator function ( ) . I , we can get: This implies that the smooth mean function µ and the smooth covariance function ( ) , G x y can be consistently estimated from pooling the sample of T trajectories and smoothing the resulting scatter plot.Well known procedure exists to infer eigenfunctions and eigenvalues [14].
Processes f are then approximated by substituting estimates and using a chosen finite number of principal components.The specific number of principal components to be retained in the model is chosen by some optimization criterion like cross-validation, AIC, BIC or a scree plot.

VARMA Modeling
, ,0 is an  white noise and the operator : θ →   is linear and compact.A higher order of autoregression process-ARH (p) (see [15]), could now be defined as: A natural extension would be to consider the series of functions follows the ARMAH (p, q) model with mean where . . ., is  white noise and 1 , , p θ θ  are linear functions.The expansion in Equation ( 4) is still valid, as long as the process is second-order stationary.Combining ( 4) and (7) we have: . . . . .
Combining all the terms involving µ and f µ into µ  and using vector representation, we have the following result: . . . ., where , , ( ) , , Since the columns of are orthonormal, we can left-multiply Equ- ation (8) by T  Φ to get: which implies a VARMA (p, q) structure on the vector of principal component scores Ξ .We can use this theory to model the time series structure of functions as time series structure of the first few principal component scores, which largely reduces the complexity of the problem.

Rescaling and Non-Negativeness
As mentioned before, the difference between density function estimation and general function estimation lies in that density function are required to be non-negative everywhere and integrate to one.However, the fitted function after FPCA estimation is not guaranteed to be positive everywhere.To address this, we took logarithm transformation of the fitted kernel density function before the FPCA estimation and used exponential transformation after the FPCA to guarantee the non-negativeness.In order to ensure that the fitted function integrates to one, we referred to [7] and imposed post hoc to rescale the fitted function.

Distance Measure between Two Functions
To compare the performance of FDA method and Arroyo's method, we used two different distance measures between predicted functions and actual functions.These are the uniform distance U D and the Hilbert distance H D , defined as follows: ( ) where f is the actual function and f is the predicted function.

Comparison to Existing Methods
In [1], the authors implemented exponential smoothing and k nearest neighbor methods to address the histogram time series (HTS) prediction problem.The authors defined the histogram data as: where ij π is the frequency for interval [ ] ij x .The distance between two histogram is defined as is the cumulative weight associated with the interval i.
By using this definition of the CDF of a histogram, the Wasserstein and Mallows distances formula can be written as functions of the centers and radii of the histogram bins:

Exponential Smoothing
The idea of exponential smoothing is to predict the next observation by a weighted average of previous observation and its estimate.Let 1, 2, , t X h t T =  be a histogram time series, the exponential smoothing forecast is given by: ( ) The authors show that the forecast is also the solution to the following optimization problem: where ( ) , D ⋅ ⋅ is the Mallows distance.The use of the Wasserstein distance is not suitable in this case because of the properties of the median, which will ignore the weighting scheme.For t large, the exponential smoothing formula can be approximated by: ( ) ( ) In the analysis below, we let 1 ˆ0 X h = and used the training data to estimate the α .Subsequently we plug in the estimated α in the approximation rule in Equation ( 9) to get the prediction.

k-Nearest Neighbor
The k-Nearest Neighbor (k-NN) method is a classic pattern recognition procedure that can be used for time series forecasting.The k-NN forecasting method in classic time series consists of two steps: identification of the k sequences in the time series that are more similar to the current one, and computation of the forecast as the weighted average of the sequences determined in the previous step.
The adaptation of the k-NN method to forecast HTS can be described in the following steps: 1) The HTS, { } , , , , where d ∈  is the number of lags and , , .
The dissimilarity between the most recent histogram valued vector  ( ) , is the Mallows or the Wasserstein distance of order q.
3) Once the dissimilarity measures are computed for each , 1, 2, , , , ,  are averaged by means of the barycenter approach to obtain the final forecast 1 ˆT X h + in the following minimization problem: ( ) where p w is the weight assigned to the neighbor p, with 0 In the analysis, we used equal weights when performing the minimization.The optimal parameter k and d , which minimize the mean distance error defined in the previous section in the estimation period, are obtained by conducting a two-dimensional grid search.Then the estimated parameter are plugged in the whole procedure again to get the prediction.

Simulation Results
Simulation was carried out to compare the performance of the proposed FDA method to the method of [1] for prediction with a histogram time series using uniform and Hilbert norm distances.
The data was simulated following Autoregressive Hilbertian (ARH) process as described in Equation ( 6).Suppose ( ) • Used Equation ( 6) with ( ) i t  's being i.i.d normally distributed with standard deviation equals to 0.05 to simulate 48 ARH (1) process each with length T equals to 250.
• Used FDA method and Arroyo's method to fit models on the first 200 density functions to predict the next 50 density functions.• Evaluated the performance of FDA method and compare the performance of FDA method and Arroyo's method.
The performance evaluation of FDA method and comparison with Arroyo's method are shown in Table 1.We observe that: • Most of time (40 48, 83%), FDA method chose the correct underlying process (ARH ( 1)).
• The choice of number of principal components varied.
• The FDA method outperformed Arroyo's to a great extent in all metrics and both in uniform measure and Hilbert's measure.Specifically, using uniform distance, FDA method is 90% less in average mean distance and 24% less in average standard deviation of distance than Arroyo's method; using Hilbert's distance, FDA method is 92% less in average mean distance and 89% less in average standard deviation of distance than Arroyo's method.

S&P 500 Data Analysis
The Standard & Poor's 500 (S&P 500) is a free-float capitalization-weighted (movements in the prices of stocks with higher market capitalizations have a greater effect on the index than companies with smaller market caps) index of the prices of 500 large-cap common stocks actively traded in the United States.The dataset we have is daily returns of all the constituents of the S&P 500 for 245 days from August 21, 2009 to August 20, 2010.This is the same data used by [1] and can be downloaded at http://pages.swcp.com/stocks/.Figure 1 shows the histogram of the first 4 days returns of all constituents.The first 3 days' histograms look like a bell shape, indicating possibly normal distribution while the fourth day's histogram is very centralized.

Kernel Density Estimation
After using the ksdensity function of Matlab on the S&P 500 data, we found out that over 40% of the fitted density function contains many extremely small (less than 0.0001) probability points, no matter how big bandwidth is, mainly due to some extreme returns each day.Example of fitted density functions that contain many extremely small probability points can be seen on Figure 2. Retaining these points is a problem since we need to take logarithm of the extremely small numbers, which will make the fitting procedure later nonapplicable.Therefore, we drop the top 5% and bottom 5% fitted density points and keep the other 90% of it.After using this procedure, we get rid of the extremely small probability points problem completely.However, one thing we need to keep in mind is that the method in this section cannot be used for problems in extremes like value-at-risk and expected shortfall.

Principal Component Analysis of Functional Data
We use the PACE program in MATLAB ( [16]) for this step.The program has AIC, BIC, and FVE (fraction of variance explained) method to determine to number of principal component functions.We observe that the AIC or BIC based method is not conservative enough and includes some extra functions which are very generic.Therefore, we decided to use FVE method and use scree plot to select the optimal number of principal component functions.See Figure 3 for the scree plot of the fitting of the S&P 500 data.Based on the scree plot, we chose 2 components for the fitting.

VARMA Modeling
We fitted multiple VARMA models of different order using Maximum Likelihood, Yule-Walker estimation methods as well as state-space models.These are not presented here, but are available from the authors on request.We observe that: • VARMA (1,1) is the best model when considering either AIC or BIC.
• The AIC/BIC performance are usually better when using the Maximum Likelihood Estimation approach than the Yule-Walker approach, except when the model considered is VAR (1) model.In that case, the AIC/BIC performance are the same for both approaches.• The most promising procedure of state space model fitting in this data set is the brute force technique.
The daily S&P 500 Data has been reduced to a 2 dimensional time series in the previous procedure.Therefore, a VARMA (1,1) model only needs to estimate 4 2 8 × = parameters, which has good power and accuracy.Based on this and the corresponding AIC/BIC performance, we decided to use VARMA (1,1) model for this data.

Comparison between FDA Method and Arroyo's Method on S&P 500 Data
To compare the prediction result of FDA method and Arroyo's method, we divided the S&P 500 sample into 185 days (around 75% of all data) as training period and 60 days (around 25% of all data) as prediction period.In the k-NN procedure, we also kept away the first 50 days' data from the training period, since the estimation needs to begin with more data when k and d are large.
We used the training data to fit the FPCA model and obtained the corresponding 2 estimated principal component functions, the mean function, and the estimated principal component scores.Then we used VARMA (1,1) model of the principal component scores for next-day prediction.After getting the next-day prediction of principal component scores for 60 days, we combined those with the principal component functions and mean function obtained in previous training steps to get the predicted densities for each of the 60 days.Finally, we used Uniform Norm and Hilbert Norm to measure the distance between the predicted densities and the original densities.The distance between the predicted densities and the original densities (in both histogram and kernel form) using Arroyo's method are also computed for comparison.
The time series plot of the Hilbert Norm distance of the 60 Days' prediction period is shown in Figure 4. From the plot, we can clearly see that the one using FDA method outperforms the one using Arroyo's method, using distance between the original density and the predicted density.It not only has a small value of distance on almost every day, but also has more stable performance.The descriptive statistic of the Uniform Norm Distance and Hilbert Norm distance of the 60 days' prediction period is given in Table 2 and Table 3.For Uniform Norm Distance measure, the distance using FDA method has smallest mean, second smallest median, and smallest standard deviation which indicates better and more stable performance.The fact that the minimum distance of the sixty days using FDA method is largest does not show any disadvantage of FDA method since we are looking at the overall performance across the 60 days.When it comes to Hilbert Norm Distance measure, the distance using FDA method has much smaller mean and median when compared with distance using Arroyo's method.The standard deviation using FDA method is the largest, which is due to the fact that FDA method has result in a couple of much smaller distance values which actually indicates good performance.In all, from the time series plot and descriptive statistic, the overall performance of FDA method is better than Arroyo's method, in both Uniform Norm Distance measure and Hilbert Norm Distance measure.

BSE Data Analysis
The Bombay Stock Exchange (BSE) is a stock exchange located in Mumbai, India and is the oldest stock exchange in Asia.The equity market capitalization of the companies listed on the BSE was US$1.7 trillion as of January 2015, making it the 4th largest stock exchange in Asia and the 11th largest in the world.The BSE has the largest number of listed companies in the world with over 5500 listed companies.The dataset we had was weekly returns of 507 stocks of the BSE from from January 1997 to December 2004, totally 365 weeks.Figure 5 shows the histogram of the first 4 weeks' returns of all the stocks.The first and third week's histograms look like a bell shape, indicating possibly normal distribution while the second and forth week's histogram is very centralized.There is also more skewness than in the S&P data.

Kernel Density Estimation
We used the same procedure as discussed in Section 1 on the BSE data.The weekly BSE data also suffers from the small probability points problem after applying the ksdensity function in Matlab (over 43% of the fitted density function contains many extremely small, less than 0.0001, probability points) and similar procedure was used to bypass this problem.
Examples of fitted density functions that contain many extremely small probability points can be seen in Figure 6.

Principal Component Analysis of Functional Data
FVE method of PACE package and scree plot is used again to select the optimal number of principal component functions.See Figure 7 for the scree plot of the fitting of the BSE data.Based on the scree plot and FVE procedure, we chose 4 components for the fitting.

VARMA Modeling
We did the similar VARMA modeling analysis on the BSE Data, namely fitted multiple models using different estimation methods and compared their AIC/BIC score.We observe that: • VAR ( 6) is the best model when considering AIC only.
• VAR ( 1) is the best model when considering BIC only.
• The AIC/BIC performance are usually better when using the Maximum Likelihood Estimation approach than the Yule-Walker approach, except when the model considered is VAR (1) model.In that case, the AIC/BIC performance are the same for both approaches.• The most promising procedure of state space model fitting in this data set is also the brute force technique.
• Model chosen by AIC or BIC criteria has MA degree zero.
The daily BSE Data has been reduced to a 4 dimensional time series in the previous procedure.Therefore, a VAR (6) model needs to estimate 16 6 96 × = parameters while a VAR (1) model only needs to estimate 16 parameters, which has far more power and accuracy.Based on this and the corresponding AIC/BIC performance, we decided to use VAR (1) model for this data.

Comparison between FDA Method and Arroyo's Method on BSE Data
For BSE data, from time series plots (Figure 8) using uniform distance, we can clearly see that the one using FDA method outperforms the one using Arroyo's method, using distance between the original density and the predicted density.
The descriptive statistic of the Uniform Norm Distance and Hilbert Norm distance of the 50 days' prediction period is given in Table 4 to Table 5.Under Uniform Norm Distance measure, the distance using FDA method has better performance in all metrics.When it comes to Hilbert Norm Distance measure, FDA method has best performance in terms of mean and median although it suffers from the maximum being considerably large (4.2083) and has much larger standard deviation.In all, from the time series plot and descriptive statistic, the overall performance of FDA method is better than Arroyo's method, in both Uniform Norm Distance measure and Hilbert Norm Distance measure.

Conclusion
The paper proposes tools from Functional Data Analysis to forecast the probability density function.The technique is found to perform better than the method of [1] to forecast histograms in simulation and real data examples.For both real datasets, the density estimates have long tails.For the components of S&P 500, 2  principal components are enough to explain most of the variation in the shapes of the kernel densities.For the stocks traded on the Bombay Stock Exchange, 4 principal components are required.Also, the time dependence in the first dataset is ARMA (1,1), whereas for the second it is AR (1).This reflects the variation across markets (mature vs emerging), nature of stocks (large cap vs all) and frequency of observation (daily vs weekly).The method is flexible enough to accommodate these variations.In all the real data examples, forecasts using the FDA method are more efficient than the existing method.

.
For example, the weights may be assumed to be equal for all the neighbors or inversely proportional to the distance between the last sequence

Figure 2 .
Figure 2. Fitted density function that contains many extremely small probability points of S&P 500-1.

Figure 3 .
Figure 3. Scree plot of the fitting of the S&P 500 data.Left: Cumulative variance; Right: Variance.

Figure 4 .
Figure 4. Time series plot of hilbert norm distance of the 60 days' prediction period on S&P 500 data.Blue: FDA method; Red: Arroyo's with exponential smoothing; Green: Arroyo's with k-NN.

Figure 5 .
Figure 5. Histogram of weekly returns of selected 507 stocks of BSE of January 1997.

Figure 6 .
Figure 6.Fitted density function that contains many extremely small probability points of BSE.

Figure 7 .
Figure 7. Scree plot of the fitting of the BSE data.Left: Cumulative variance; Right: Variance.

Figure 8 .
Figure 8.Time series plot of uniform norm distance of the 50 days' prediction period on BSE data.Blue: FDA method; Red: Arroyo's with exponential smoothing; Green: Arroyo's with k-NN.
is the order and is a distance measure such as Wasserstein or the Mallows distance.It is assumed that the data points are uniformly distributed within each bin of the histogram.Under this assumption, the CDF

Table 1 .
Summary of performance comparison between the FDA method and arroyo's method on simulation results using both uniform distance and Hilbert's distance.U-Uniform distance; H-Hilbert's distance.

Table 2 .
Descriptive statistic of uniform norm distance of the 60 days' prediction period on S&P 500 data.(1)and(2)denote exponential smoothing and k-NN respectively.

Table 3 .
Descriptive statistic of uniform norm distance of the 60 days' prediction period on S&P 500 data.(1)and(2)denote exponential smoothing and k-NN respectively.

Table 4 .
Descriptive statistic of uniform norm distance of the 50 days' prediction period on BSE data.(1) and (2) denote exponential smoothing and k-NN respectively.

Table 5 .
Descriptive statistic of hilbert norm distance of the 50 days' prediction period on BSE data.(1) and (2) denote exponential smoothing and k-NN respectively.