Aggregating density estimators: an empirical study

We present some new density estimation algorithms obtained by bootstrap aggregation like Bagging. Our algorithms are analyzed and empirically compared to other methods found in the statistical literature, like stacking and boosting for density estimation. We show by extensive simulations that ensemble learning are effective for density estimation like for classification. Although our algorithms do not always outperform other methods, some of them are as simple as bagging, more intuitive and has computational lower cost.


Introduction
Ensemble learning is one of the most challenging recent approaches in statistical learning. Bagging ( [1]), Boosting ( [5]), Stacking ( [2]), and Random forests ( [3]) have been declared to be the best of the chelf classifiers achieving very high performances when tested over tens of various datasets selected from the machine learning benchmark. All these algorithms had been designed for supervised learning, sometimes initially restricted to regression or binary classification. Several extensions are still under study: multivariate regression, multiclass learning, and adaptation to functional data or time series. Very few developments exist for ensemble learning for unsupervised framework, clustering analysis and density estimation. Our work concerns the latter case which may be seen as a fundamental problem in statistics. Among the last developments, we found some extensions of boosting ( [5]) and stacking ( [11]) to density estimation.
In this paper we suggest some simple algorithms for density estimation in the same spirit of bagging and stacking where the weak learners are histograms. We show by extensive simulations that aggregation gives rise to effective better estimates. We compare our algorithms to several algorithms for density estimation, some of them are simple like Histogram and kernel density estimators (KDE) and others rather complex like stacking and boosting which will be described in details. As we will show in the experiments we do, although the accuracy of our algorithm is not systematically higher than other ensemble methods, it is with no doubt simpler, more intuitive and computationally less expensive. Boosting algorithms and stacking for density estimation are described in section 2. Section 3 describes our algorithms. Simulations and results are given in section 4 and concluding remarks and future work in section 5.

A review of the existing algorithms
In this section we review some density estimators obtained by aggregation. They may be classified in two categories depending on the aggregation form. The first type has the form of linear or convex combination: where g m is typically a parametric or non parametric density model, and in general different values of m refer typically to • different parameters values in the parametric case or, • different kernels, or • different bandwidths for a chosen kernel for the kernel density estimators.
The second type of aggregation is multiplicative and is based on the ideas of high order bias reduction for kernel density estimation ( [6]). The aggregated density estimator has the form:

Linear or convex combination of density estimators
This kind of estimators (1) has been used in several works with different construction schemes.
• In [10], [8] and [12] the weak learners g m are introduced sequentially in the combination. At step m, g m is chosen to maximize the log likelihood of where g m is a density selected among a fixed class H.
In [10] g m is selected among a non parametric family of estimators, and in [8] and [12], it is taken to be a Gaussian density or a mixture of Gaussian densities whose parameters are estimated. Different methods are used to estimate both density g m and the mixture coefficient α. In [8] g m is a Gaussian density and the log likelihood of (3) is maximized using a special version of Expectation Maximization (EM) taking into account that a part of the mixture is known.
The main idea underlying the algorithms given by [10] and [12] is to use Taylor expansion around the negative log likelihood that we wish to minimize: For α small we have the approximation thus, minimizing the left side term is equivalent to maximizing i gm(x i ) f m−1 (x i ) . All the algorithms described above are sequential and the number of weak learners aggregated may be fixed by the user.
This matrix is used to compute the coefficients α 1 , . . . , α M using the Expectation-Maximization algorithm. Finally, for the output model, we re-estimate the individual densities g 1 , . . . , g M from the whole data L.
• In [9] the densities {g m } m=1,...,M are fixed in advance like for stacking (KDE estimators with different bandwidths). The dataset is split in two parts. The first sample is used to estimate the densities g m , whereas the coefficients α m are optimized using the second sample. The splitting process is repeated and the aggregated estimators for each data split are averaged. The final model has the form where S is the set of all the splits used and is the aggregated estimator obtained from one split s of the data, g s m is the individual kernel density function estimated over the learning sample obtained from the split s. This algorithm is called AggPure.

Multiplicative aggregation
The only algorithm giving rise to this form of aggregation is the one described in [4] called BoostKde. It is a sequential algorith where at each step m the weak learner is computed as follows:ĝ where K is a fixed kernel, h its bandwidth, and w m (i) the weight of observation i at step m. Like for boosting, the weight of each observation is updated: , where C is a normalization constant. The Algorithm is resumed in figure 1.

Aggregating Histograms
We suggest three new density estimators obtained by linear combination like in (1), all of them use histograms as weak learners. The first two algorithms may be parallelized and randomize the histograms. The third one is just a modification of Stacking. The first algorithm is similar to Bagging ( [1]). At each step m, a bootstrap sample of the original dataset is generated and g m is the histogram obtained from the generated bootstrap sample with a fixed number of equally spaced breakpoints. We will refer to this algorithm as BagHist, it is detailed in figure 2.
The second algorithm (AggregHist) works as follows: let g 0 be the histogram obtained with the data set at hand using equally spaced breakpoints denoted B = b 1 , b 2 , ..., b L . Each weak learner g m is an histogram constructed over the same initial data set but using a randomly modified set of breakpoints; γ is a tuning parameter which controls the variance of the perturbations. The algorithm is detailed in figure 3.
Finally, we introduce a third algorithm called StackHist where we replace in the stacking algorithm described in the previous section, the six kernel density estimators by three histograms with fixed number of breaks.
The values of the parameters used in these algorithms will be optimized. The procedure used will be described in the experiments section. 2. For m = 1 to M :

Experiments
We test several simulation models based on classical distributions and mixture models mostly used in the cited works and algorithms described above. The sample size is fixed at n = 100, 500, 1000. We first show the estimates obtained using BagHist and AggregHist and analyze the effect of the number M of histograms aggregated, and then compare them to the other algorithms.
Up to our knowledge the existing algorithms for density estimation by aggregation have never been compared over a common benchmark simulation data.

Models used for the simulations
We denote by M1, . . . , M11 the different simulation models which will be grouped according to their difficulty level.  • Mixtures density with highly inhomogeneous smoothness as in [9]: All the simulations are done with the R software, and for models M8 and M9 we use the benchden package.
We show below in figures 4, 5 and 6, the true densities for the eleven models as well as their estimates obtained using the three algorithms AggregHist, BagHist and StackHist for n = 500 observations and M = 300 histograms for the two first algorithms.
AggregHist and BagHist give more smooth estimators than StackHist.   • For AggPure six kernel density estimators are aggregated having bandwidths 0.001, 0.005, 0.01, 0.05, 0.1 and 0.5. We use the EM algorithm to optimize the coefficients of the linear combination. The final estimator is a mean over S = 10 random splits of the original data set.

Tuning the algorithms
• For Boostkde, we use 5 steps for the algorithm aggregating kernel density estimators whose bandwidths are optimized using Silverman rule. Normalization of the output is done using numerical integration. Extensive simulations showed that more steps give rise to less accurate estimators.
Simple one kernel density estimators are also used in our comparisons using optimized bandwidths following Silverman rule (KdeNrd0) and unbiased cross validation (KdeUCV). For the Histogram, fixed breaks are systematically used and their number is optimized over a fixed grid, retaining the one which maximizes the log likelihood of the obtained histogram over a test sample drawn from the same distribution as the learning sample. The tuning parameters for our algorithms, the number of breakpoints and the value of γ are optimized testing different values for each of them over a fixed grid. We test 10, 20 and 50 equally spaced breakpoints for each case. For γ we chose the grid 0.5, 1, 1.5, 2, 2.5. The best combination retained for each model is the one which maximizes the log likelihood over 100 independent test samples drawn from the corresponding model. For BagHist and AggregHist we aggregate M = 300 histograms. The optimal values for the histogram and for our algorithms are give in table 1. We denote the optimal number of breaks L H , L BH , L AH for the Histogram, BagHist and AggregHist respectively, and γ AH the perturbation coefficient for AggregHist.  Finally, for StackHist we aggregate six histograms having 5, 10, 20, 30, 40 and 50 equally spaced breakpoints. A ten fold cross validation is used.

Results
The performance of each model is evaluated using the Mean Integrated Squared Error (MISE). It is estimated as the average of the integrated squared error over 100 simula-tions. First, for both AggregHist and BagHist we analyze the effect of the number M of histograms aggregated. Figures 7, 8 and 9 show how the MISE varies when increasing the number of histograms. These graphics show clearly the contribution of the aggregation to the reduction of the MISE. For all the models, the error does not decrease significantly after about 100 iterations.     We compare now our algorithms BagHist, AggregHist and StachHist to the following methods: the Histogram, KdeNrd0, KdeUCV, Stacking, AggPure and BoostKde. We have limited the comparisons to the ensemble methods which aggregated non parametric density estimators. Tables 1 to 3 give the values of 100 × MISE for each method and simulation model for the three values of n. The best performances are indicated in bold. In most cases, aggregation models have higher accuracy than simple methods like the histogram and KDE. However this is not the case for model M3: KDE has a better performance, that is probably due to border effects. BoostKde gives in general better estimates for mixture models. All the methods have better accuracy in general when increasing the sample size n. BagHist and AggregHist always outperform the optimal Histogram and in general both have better accuracy than StackHist. For n = 100 BagHist and AggregHist outperform the other algorithms over the complicated models (M8 − M10). For n = 500 and n = 1000, AggPure outperforms the other methods for the last two models. Although our algorithms do not always outperform the other methods, their precision are not far from the best ones.

Conclusion
In this work we present three new algorithms for density estimation aggregating histograms. Two of them aggregate histograms over bootstrap samples of the data or randomly perturbed breakpoints. The third is a simple adaptation of the stacking algorithm where histograms are used instead of kernel density estimators. We have shown using extensive simulations that these algorithms and the other ensembles techniques have better accuracy than the histogram or KDE. The first two algorithms BagHist and AggregHist are very simple to implement, depend on very few parameters, and their computation complexity is proportional to that of a histogram. Theoretical properties of these algorithms are under study. Most of the algorithms described in this work may be easily generalized to the multivariate case.