With the increasing availability of data, it is now possible in many situations to estimate the probability density function (pdf) of a random variable reasonably well. This is far more informative than a few summary statistics such as the mean or variance. In this paper, we propose a method of forecasting the density function based on a time series of estimated density functions. The proposed method uses kernel estimation to pre-process the raw data, followed by dimension reduction using functional principal components analysis (FPCA). We then fit vector ARMA models to the reduced data to predict the principal component scores, which are used to obtain the forecast of the density function. The forecasts must be transformed and scaled to ensure non-negativity and integration to one. We compared our method to that of [1] for histogram forecasts, on simulated data as well as real data from the S&P 500 and the Bombay Stock Exchange. Our method performed better on the simulation and on both datasets under the uniform and Hilbert distances. The time dependence and the complexity of the density functions differ between the two markets, and our analysis captures this difference.

Contemporaneous aggregation is often the only way to analyze temporal data, for example, when considering the observations of a variable measured through time in a population, e.g. the monthly output of firms in a country. If the individuals considered are not the same through time, then it is not possible to deal with the longitudinal data. However, if one is interested in the overall evolution of all firms, histograms or densities can still be studied. [

Density function estimation has been widely used in many areas such as finance [

FDA is a popular statistical technique that treats entire curves as units of data (see [

The difference between a general function and a density function lies in the fact that density functions are nonnegative everywhere and their integral over the whole space is always equal to one. These restrictions pose challenges to applying functional data methodology to densities directly. [

The paper is organized as follows. Section 2 gives an overview of the methodology used in this paper, followed by a detailed introduction to the three main statistical methods used: kernel estimation, Functional Principal Component Analysis (FPCA) and Vector Autoregression and Moving Average (VARMA). A simulation analysis is conducted in Section 3 to validate the performance of the proposed method and compare it to the performance of the method of [

Each dataset consists of a set of

Suppose

As a motivation of the following method, observe that:

where the approximation holds for “small” h. Replacing F by the empirical distribution and h by a sequence

This density estimator can be understood as follows. First observe that the discrete empirical distribution

gives mass

length

mass uniformly, an arbitrary pdf

mators of the form:

where K is a pdf, i.e.

with the kernel K and bandwidth

have

The choice of the kernel usually is not that crucial. The estimator in (1) is a special case with

uniform kernel. Some common choices for K are Uniform, Normal, Logistic and Epanechnikov.
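The kernel estimator described above can be illustrated with a short sketch. It assumes the standard form in which the estimate at a point x is the average of K((x - X_i)/h) over the sample, divided by h. The function names, the simulated returns and the bandwidth value are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def kde(x_grid, sample, h, kernel="normal"):
    """Kernel density estimate on x_grid: (1 / (n h)) * sum_i K((x - X_i) / h)."""
    x_grid = np.asarray(x_grid, dtype=float)
    sample = np.asarray(sample, dtype=float)
    u = (x_grid[:, None] - sample[None, :]) / h  # scaled distances, shape (grid, n)
    if kernel == "uniform":
        k = 0.5 * (np.abs(u) <= 1.0)
    elif kernel == "epanechnikov":
        k = 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)
    elif kernel == "normal":
        k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    else:
        raise ValueError(f"unknown kernel: {kernel}")
    return k.sum(axis=1) / (sample.size * h)

def trapezoid_area(x, y):
    """Trapezoidal approximation of the integral of y over x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

# Hypothetical cross-section of daily returns (simulated, for illustration only).
rng = np.random.default_rng(0)
returns = rng.normal(0.0, 0.02, size=500)
grid = np.linspace(-0.1, 0.1, 401)
density = kde(grid, returns, h=0.005)  # h chosen by hand for this sketch
print(trapezoid_area(grid, density))
```

Switching `kernel` between the three options usually changes the estimate far less than changing `h`, which is consistent with the remark that the choice of kernel is not crucial.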

The bandwidth

where

MSE in (3) is minimized for

Consider a sample of T smooth random trajectories

Following [

inner product

modeled as realization of a stochastic process

function

and non-increasing eigenvalues

where

where the

The deviation of each sample trajectory from the mean is thus a sum of orthogonal curves with uncorrelated random amplitudes.

Often it is realistic to incorporate uncorrelated measurement errors with mean zero and constant variance

into the model, reflecting additional variation in the measurements, compare [

servations of the random function

where

normally distributed, but generally we do not make such assumption.

Under Equation (5) and with indicator function

This implies that the smooth mean function

Processes f are then approximated by substituting estimates and using a chosen finite number of principal components. The specific number of principal components to be retained in the model is chosen by some optimization criterion like cross-validation, AIC, BIC or a scree plot.
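As a rough illustration of the truncation step, the sketch below runs a plain discretized FPCA on curves observed on a common, equally spaced grid and keeps the smallest number of components whose fraction of variance explained (FVE) reaches a threshold. This is a simplified stand-in, not the PACE procedure: it does no smoothing and ignores measurement error. The function name `fpca`, the 0.95 threshold and the simulated curves are assumptions made for the example.

```python
import numpy as np

def fpca(curves, dx, fve_threshold=0.95):
    """Discretized FPCA for curves of shape (T, m) on a grid with spacing dx."""
    curves = np.asarray(curves, dtype=float)
    mean = curves.mean(axis=0)
    centered = curves - mean
    cov = centered.T @ centered / curves.shape[0]  # sample covariance surface
    # Discretized eigenproblem of the covariance operator.
    evals, evecs = np.linalg.eigh(cov * dx)
    order = np.argsort(evals)[::-1]
    evals = np.clip(evals[order], 0.0, None)
    evecs = evecs[:, order]
    fve = np.cumsum(evals) / evals.sum()
    k = min(int(np.searchsorted(fve, fve_threshold)) + 1, evals.size)
    phi = evecs[:, :k] / np.sqrt(dx)   # eigenfunctions with unit L2 norm
    scores = centered @ phi * dx       # inner products <f_t - mean, phi_j>
    return mean, phi, scores, evals[:k]

# Hypothetical curves built from two modes of variation.
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 101)
dx = grid[1] - grid[0]
a = rng.normal(0.0, 1.0, size=(100, 1))
b = rng.normal(0.0, 0.5, size=(100, 1))
curves = 1.0 + a * np.sin(2 * np.pi * grid) + b * np.cos(2 * np.pi * grid)

mean, phi, scores, evals = fpca(curves, dx)
reconstructed = mean + scores @ phi.T
print(phi.shape[1], float(np.max(np.abs(reconstructed - curves))))
```

The rows of `scores` form the low-dimensional time series that is later modeled with a vector time series model.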

A sequence

where

A higher order of autoregression process-ARH (p) (see [

A natural extension would be to assume that the series of functions follows the ARMAH (p, q) model with mean

where

Using linearity of

Combining all the terms involving

where

ation (8) by

which implies a VARMA (p, q) structure on the vector of principal component scores

As mentioned before, the difference between density function estimation and general function estimation is that density functions are required to be non-negative everywhere and to integrate to one.

However, the fitted function after FPCA estimation is not guaranteed to be positive everywhere. To address this, we took the logarithm of the fitted kernel density function before the FPCA estimation and applied the exponential transformation after the FPCA to guarantee non-negativity. In order to ensure that the fitted function integrates to one, we referred to [
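The two constraints can be sketched as follows. The log and exponential steps follow the description above. The rescaling step is an assumption: the source's reference for it is not recoverable here, so the sketch simply divides by the numerical integral. The function names and the small floor used to avoid log(0) are likewise illustrative.

```python
import numpy as np

def area(values, dx):
    """Trapezoidal integral on an equally spaced grid."""
    return float(np.sum(values[1:] + values[:-1]) * dx / 2.0)

def to_log_scale(density, floor=1e-10):
    """Log-transform a density, flooring tiny values to avoid log(0)."""
    return np.log(np.maximum(density, floor))

def back_to_density(log_curve, dx):
    """Exponentiate (non-negativity) and rescale so the curve integrates to one."""
    positive = np.exp(log_curve)
    return positive / area(positive, dx)

# Hypothetical fitted log-density: a standard normal pdf plus a small distortion,
# standing in for the output of FPCA and forecasting on the log scale.
grid = np.linspace(-4.0, 4.0, 801)
dx = grid[1] - grid[0]
normal_pdf = np.exp(-0.5 * grid**2) / np.sqrt(2.0 * np.pi)
fitted_log = to_log_scale(normal_pdf) + 0.1 * np.sin(grid)

recovered = back_to_density(fitted_log, dx)
print(recovered.min() > 0.0, area(recovered, dx))
```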

To compare the performance of FDA method and Arroyo’s method, we used two different distance measures between predicted functions and actual functions. These are the uniform distance

and

where f is the actual function and
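A minimal sketch of the two distance measures on a common grid follows. The uniform distance is taken as the largest absolute difference between the two functions. The Hilbert distance is assumed here to be the L2 norm of the difference, since the formula is not recoverable from the text. Function names are illustrative.

```python
import numpy as np

def uniform_distance(f, g):
    """Largest absolute difference between two functions sampled on one grid."""
    return float(np.max(np.abs(np.asarray(f) - np.asarray(g))))

def hilbert_distance(f, g, dx):
    """Assumed L2 distance: square root of the integral of (f - g)^2."""
    sq = (np.asarray(f, dtype=float) - np.asarray(g, dtype=float)) ** 2
    return float(np.sqrt(np.sum(sq[1:] + sq[:-1]) * dx / 2.0))

# Toy check: the constant functions 1 and 0 on [0, 1] are at distance 1 in both senses.
grid = np.linspace(0.0, 1.0, 101)
dx = grid[1] - grid[0]
f = np.ones_like(grid)
g = np.zeros_like(grid)
d_uniform = uniform_distance(f, g)
d_hilbert = hilbert_distance(f, g, dx)
```

The uniform distance reacts to the single worst point of the forecast, whereas the L2-type distance averages the error over the whole support, so the two can rank forecasts differently.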

In [

where

where

distance. It is assumed that the data points are uniformly distributed within each bin of the histogram. Under this

assumption, the CDF

where

By using this definition of the CDF of a histogram, the Wasserstein and Mallows distances formula can be written as functions of the centers and radii of the histogram bins:

The idea of exponential smoothing is to predict the next observation by a weighted average of the previous observation and its estimate. Let

The authors show that the forecast is also the solution to the following optimization problem:

where

In the analysis below, we let

the estimated

The k-Nearest Neighbor (k-NN) method is a classic pattern recognition procedure that can be used for time series forecasting. The k-NN forecasting method in classic time series consists of two steps: identification of the k sequences in the time series that are more similar to the current one, and computation of the forecast as the weighted average of the sequences determined in the previous step.

The adaptation of the k-NN method to forecast HTS can be described in the following steps:

1) The HTS,

where

2) The dissimilarity between the most recent histogram valued vector

computed by implementing the following distance measure

where

3) Once the dissimilarity measures are computed for each

4) Given the k-closest vectors, their subsequent values,

barycenter approach to obtain the final forecast

where

may be assumed to be equal for all the neighbors or inversely proportional to the distance between the last

sequence

In the analysis, we used equal weights when performing the minimization. The optimal parameter
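The four steps can be sketched as below, under simplifying assumptions that are not taken from the source: each histogram is represented by a vector of quantiles on a fixed grid of probability levels, the dissimilarity between two windows of length d is the root mean squared difference of these quantile vectors, and the equal-weight barycenter is computed as the plain average of the neighbors' successors. The function name `knn_forecast` and the toy series are hypothetical.

```python
import numpy as np

def knn_forecast(series, d, k):
    """Forecast the next element of a (T, m) series from its k nearest d-length windows."""
    series = np.asarray(series, dtype=float)
    T = series.shape[0]
    if T < d + 1:
        raise ValueError("series is too short for the chosen window length d")
    last = series[T - d:]  # most recent window of d elements
    # Candidate windows end at t = d-1, ..., T-2 so that a successor t+1 exists.
    ends = np.arange(d - 1, T - 1)
    dists = np.array([
        np.sqrt(np.mean((series[t - d + 1:t + 1] - last) ** 2)) for t in ends
    ])
    nearest = ends[np.argsort(dists, kind="stable")[:k]]
    # Equal-weight combination of the successors of the k closest windows.
    return series[nearest + 1].mean(axis=0)

# Toy series of quantile vectors that repeats with period 4.
base = np.array([
    [-1.0, 0.0, 1.0],
    [-2.0, 0.0, 2.0],
    [-1.5, 0.5, 2.5],
    [-0.5, 0.0, 0.5],
])
series = np.array([base[t % 4] for t in range(40)])
forecast = knn_forecast(series, d=3, k=2)
print(forecast)
```

Because the toy series is exactly periodic, the nearest windows match the latest one perfectly and the forecast reproduces the next element of the cycle.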

Simulation was carried out to compare the performance of the proposed FDA method to the method of [

The data was simulated following Autoregressive Hilbertian (ARH) process as described in Equation (6).

Suppose

Specifically, in our simulation, we used

Then our simulation consisted of the following steps:

・ Considered 16 different initial density functions

・ Used Equation (6) with

・ Used FDA method and Arroyo’s method to fit models on the first 200 density functions to predict the next 50 density functions.

・ Evaluated the performance of the FDA method and compared it with that of Arroyo’s method.

The performance evaluation of FDA method and comparison with Arroyo’s method are shown in

・ Most of the time (40 out of 48, 83%), FDA method chose the correct underlying process (ARH (1)).

・ The choice of number of principal components varied.

・ The FDA method outperformed Arroyo’s method by a wide margin on all metrics, under both the uniform and the Hilbert distance. Specifically, under the uniform distance, the FDA method is 90% lower in average mean distance and 24% lower in average standard deviation of distance than Arroyo’s method; under the Hilbert distance, it is 92% lower in average mean distance and 89% lower in average standard deviation of distance.

The Standard & Poor’s 500 (S&P 500) is a free-float capitalization-weighted index of the prices of 500 large-cap common stocks actively traded in the United States; capitalization weighting means that movements in the prices of stocks with higher market capitalizations have a greater effect on the index than those of companies with smaller market caps. It has been widely regarded as the best single gauge of the large-cap U.S. equities market since the index was first published in 1957. The stocks included in the S&P 500 are those of large publicly held companies that trade on either of the two largest American stock exchanges: the New York Stock Exchange and the NASDAQ. These 500 companies cover about 75% of the American equity market by capitalization. The index covers various leading industries in the United States, including energy (e.g. Exxon Mobil Corp.), materials (e.g. Dow Chemical), industrials (e.g. General Electric Co.), consumer discretionary (e.g. McDonald’s Corp.), consumer staples (e.g. Procter & Gamble), health care (e.g. Johnson & Johnson), financials (e.g. JPMorgan Chase & Co.), information technology (e.g. Apple Inc.), telecommunication services (e.g. AT&T Inc.), and utilities (e.g. PG&E Corp.). Though the list of the 500 companies is fairly stable, Standard & Poor’s does update the components of the S&P 500 periodically, typically in response to acquisitions, or to keep the index up to date as various companies grow or shrink in value. For example, TRIP (TripAdvisor Inc.) was added to replace TLAB (Tellabs Inc.) on Dec 20, 2011 because Expedia Inc. spun off TripAdvisor Inc., and WPX (WPX Energy Inc.) was added to replace CPWR (Compuware) on Dec 31, 2011 due to market cap changes.

The dataset we have is daily returns of all the constituents of the S&P 500 for 245 days from August 21, 2009 to August 20, 2010. This is the same data used by [

Method | Avg. Mean (U) | Avg. SD (U) | Avg. Mean (H) | Avg. SD (H) |
---|---|---|---|---|
FDA | 0.0874 | 0.0076 | 0.3662 | 0.0222 |
Arroyo | 0.8405 | 0.0103 | 4.552 | 0.2109 |

After applying the ksdensity function of Matlab to the S&P 500 data, we found that over 40% of the fitted density functions contain many extremely small (less than 0.0001) probability points, regardless of the bandwidth, mainly because of a few extreme returns each day. Examples of fitted density functions that contain many extremely small probability points can be seen on

Therefore, we drop the top 5% and bottom 5% of the fitted density points and keep the remaining 90%. This procedure completely removes the problem of extremely small probability points. However, it also means that the method in this section cannot be used for problems concerning extremes, such as value-at-risk and expected shortfall.

We use the PACE program in MATLAB ( [

We fitted multiple VARMA models of different order using Maximum Likelihood, Yule-Walker estimation methods as well as state-space models. These are not presented here, but are available from the authors on request. We observe that:

・ VARMA (1,1) is the best model when considering either AIC or BIC.

・ The AIC/BIC performance is usually better with the Maximum Likelihood Estimation approach than with the Yule-Walker approach, except when the model considered is the VAR (1) model. In that case, the AIC/BIC performance is the same for both approaches.

・ The most promising procedure of state space model fitting in this data set is the brute force technique.

The daily S&P 500 Data has been reduced to a 2 dimensional time series in the previous procedure. Therefore, a VARMA (1,1) model only needs to estimate

To compare the prediction result of FDA method and Arroyo’s method, we divided the S&P 500 sample into 185 days (around 75% of all data) as training period and 60 days (around 25% of all data) as prediction period. In the k-NN procedure, we also kept away the first 50 days’ data from the training period, since the estimation needs to begin with more data when k and d are large.

We used the training data to fit the FPCA model and obtained the corresponding 2 estimated principal component functions, the mean function, and the estimated principal component scores. Then we used VARMA (1,1) model of the principal component scores for next-day prediction. After getting the next-day prediction of principal component scores for 60 days, we combined those with the principal component functions and mean function obtained in previous training steps to get the predicted densities for each of the 60 days. Finally, we used Uniform Norm and Hilbert Norm to measure the distance between the predicted densities and the original densities. The distance between the predicted densities and the original densities (in both histogram and kernel form) using Arroyo’s method are also computed for comparison.
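The forecasting step can be sketched as follows. For brevity the sketch fits a VAR (1) model to the principal component scores by least squares, which is a simplified stand-in for the VARMA (1,1) model used in the paper. The function names, the simulated scores and the sine and cosine component functions are all illustrative assumptions.

```python
import numpy as np

def fit_var1(scores):
    """Least-squares fit of s_t = c + A s_{t-1} + e_t to scores of shape (T, k)."""
    scores = np.asarray(scores, dtype=float)
    lagged = np.hstack([np.ones((scores.shape[0] - 1, 1)), scores[:-1]])
    coef, *_ = np.linalg.lstsq(lagged, scores[1:], rcond=None)
    return coef[0], coef[1:].T  # intercept c and transition matrix A

def forecast_next_curve(mean, phi, scores):
    """One-step-ahead curve: mean function plus components weighted by forecast scores."""
    scores = np.asarray(scores, dtype=float)
    intercept, transition = fit_var1(scores)
    next_scores = intercept + transition @ scores[-1]
    return mean + phi @ next_scores

# Hypothetical two-dimensional score series simulated from a known VAR (1).
rng = np.random.default_rng(1)
true_A = np.array([[0.5, 0.1], [0.0, 0.3]])
scores = np.zeros((2000, 2))
for t in range(1, 2000):
    scores[t] = true_A @ scores[t - 1] + rng.normal(size=2)

# Hypothetical mean function and two component functions on a grid of 50 points.
grid = np.linspace(0.0, 1.0, 50)
phi = np.column_stack([np.sin(2 * np.pi * grid), np.cos(2 * np.pi * grid)])
mean = np.zeros_like(grid)

intercept, est_A = fit_var1(scores)
next_curve = forecast_next_curve(mean, phi, scores)
print(np.round(est_A, 2))
```

When the FPCA is carried out on log densities, the forecast curve is on the log scale and still has to be exponentiated and rescaled before distances to the observed densities are computed.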

The time series plot of the Hilbert Norm distance of the 60 Days’ prediction period is shown in

The descriptive statistics of the Uniform Norm distance and the Hilbert Norm distance over the 60 days’ prediction period are given in

Overall, the time series plot and the descriptive statistics show that the FDA method performs better than Arroyo’s method under both the Uniform Norm distance and the Hilbert Norm distance.

The Bombay Stock Exchange (BSE) is a stock exchange located in Mumbai, India and is the oldest stock exchange in Asia. The equity market capitalization of the companies listed on the BSE was US$1.7 trillion as of January 2015, making it the 4th largest stock exchange in Asia and the 11th largest in the world. The BSE has the largest number of listed companies in the world, with over 5500 listed companies. The dataset we had consisted of weekly returns of 507 stocks of the BSE from January 1997 to December 2004, a total of 365 weeks.

We used the same procedure as discussed in Section 1 on the BSE data. The weekly BSE data also suffers from the small probability points problem after applying the ksdensity function in Matlab (over 43% of the fitted density functions contain many extremely small, less than 0.0001, probability points), and a similar procedure was used to bypass this problem.

Examples of fitted density functions that contain many extremely small probability points can be seen in

FVE method of PACE package and scree plot is used again to select the optimal number of principal component functions. See

Method | Mean | Median | Std. Dev. | Maximum | Minimum |
---|---|---|---|---|---|
FDA | 0.3200 | 0.2756 | 0.1614 | 0.8288 | 0.1233 |
Arroyo (1) | 0.3551 | 0.3725 | 0.2076 | 0.7272 | 0.0284 |
Arroyo (2) | 0.3728 | 0.2676 | 0.2312 | 0.8366 | 0.1060 |

Method | Mean | Median | Std. Dev. | Maximum | Minimum |
---|---|---|---|---|---|
FDA | 0.6578 | 0.6554 | 0.1215 | 0.9327 | 0.4028 |
Arroyo (1) | 0.8236 | 0.8564 | 0.1044 | 1.0031 | 0.5521 |
Arroyo (2) | 0.8167 | 0.8587 | 0.1069 | 0.9301 | 0.5294 |

We carried out a similar VARMA modeling analysis on the BSE data, namely fitting multiple models using different estimation methods and comparing their AIC/BIC scores. We observe that:

・ VAR (6) is the best model when considering AIC only.

・ VAR (1) is the best model when considering BIC only.

・ The AIC/BIC performance is usually better with the Maximum Likelihood Estimation approach than with the Yule-Walker approach, except when the model considered is the VAR (1) model. In that case, the AIC/BIC performance is the same for both approaches.

・ The most promising procedure of state space model fitting in this data set is also the brute force technique.

・ The model chosen by either the AIC or the BIC criterion has MA degree zero.

The weekly BSE data has been reduced to a 4-dimensional time series in the previous procedure. Therefore, a VAR (6) model needs to estimate

For BSE data, from time series plots (

The descriptive statistics of the Uniform Norm distance and the Hilbert Norm distance over the 50-week prediction period are given in

Overall, the time series plot and the descriptive statistics show that the FDA method performs better than Arroyo’s method under both the Uniform Norm distance and the Hilbert Norm distance.

Method | Mean | Median | Std. Dev. | Maximum | Minimum |
---|---|---|---|---|---|
FDA | 0.2136 | 0.2263 | 0.0803 | 0.3301 | 0.0200 |
Arroyo (1) | 0.2984 | 0.2496 | 0.1967 | 0.7199 | 0.0313 |
Arroyo (2) | 0.4033 | 0.3458 | 0.2148 | 0.8623 | 0.1238 |

Method | Mean | Median | Std. Dev. | Maximum | Minimum |
---|---|---|---|---|---|
FDA | 0.6485 | 0.3715 | 0.7318 | 4.2083 | 0.0927 |
Arroyo (1) | 0.8089 | 0.8266 | 0.1726 | 1.2892 | 0.4414 |
Arroyo (2) | 0.8469 | 0.8750 | 0.1141 | 1.0395 | 0.4808 |

The paper proposes tools from Functional Data Analysis to forecast the probability density function. The technique is found to perform better than the method of [

principal components are enough to explain most of the variation in the shapes of the kernel densities. For the stocks traded on the Bombay Stock Exchange, 4 principal components are required. Also, the time dependence in the first dataset is ARMA (1,1), whereas for the second it is AR (1). This reflects the variation across markets (mature vs emerging), nature of stocks (large cap vs all) and frequency of observation (daily vs weekly). The method is flexible enough to accommodate these variations. In all the real data examples, forecasts using the FDA method are more efficient than the existing method.

Rituparna Sen, Changie Ma (2015) Forecasting Density Function: Application in Finance. Journal of Mathematical Finance, 05, 433-447. doi: 10.4236/jmf.2015.55037