Multivariate Modality Inference Using Gaussian Kernel

The number of modes (also known as modality) of a kernel density estimator (KDE) draws lots of interests and is important in practice. In this paper, we develop an inference framework on the modality of a KDE under multivariate setting using Gaussian kernel. We applied the modal clustering method proposed by [1] for mode hunting. A test statistic and its asymptotic distribution are derived to assess the significance of each mode. The inference procedure is applied on both simulated and real data sets.


Introduction
Mode is defined as the local maximum of a probability density.Modality, which is the number of the modes, is an important feature of any probability distribution.The natural evolution of multimodality occurs when a distribution is composed by several sub-populations.In practice, it is important to learn how many sub-populations the data have.In general, there are three different, but related, research areas that address this question: 1) the inference on the number of components in the finite mixture model; 2) estimating the number of clusters and/or merging the clusters of a clustering output; and 3) the modality inference.Each of these three approaches addresses the question of how many components the data have from its own angle.For the inference on the number of components in the mixture distribution, the hypothesis is usually 0 : where K is the parameter of the number of components.The likelihood ratio test (LRT) is often used to assess the significance of the hypothesis.However, in general, the distribution of the LRT is very complicated.More details can be found in e.g.[2].Depending on the clustering method, the inference procedures on the number of clusters could be different.[3] proposed the GAP statistic, which contains information regarding the distance between the data points within each cluster.In particular, the authors applied the method on the K-means clustering [4].[5] proposed the EM-clustering method, a model-based clustering method.The authors selected the ideal number of clusters by using the Bayesian information criterion (BIC), which is a model selection tool.However, as it is well established, BIC does not follow the regularity conditions and is inappropriate to use in the problem of determining the number of components.
The modality inference, which is used to assess the number of modes of the data, is often a robust nonparametric approach.There is a lot of existing literatures that address the problem of the modality of univariate probability distribution.These methods can be classified as the test of unimodality, bimodality or multimodality.Alternatively, these methods can be grouped as a global or local test.The global test considers the modality of the entire distribution.In contrast, the local test focuses on the specific region of the density that contains the particular investigated mode instead of considering the entire distribution.
In the case of the global test, [6] proposed the most commonly used critical bandwidth parameter, the smallest value of the bandwidth parameter h for which the KDE with Gaussian kernel ( ) is k-modal, where ( ) .φ is the probability distribution function (pdf) of the standard normal distribution.To assess the significance, [6] suggested using crit h as the bandwidth parameter, denoted as 0 h , and using the non- parametric bootstrap method proposed by [7] to sample the reference data, consequently to get the distribution of the test statistic under the null hypothesis.
Among the local tests, the one proposed by [8] is widely used.Denoting the mode as 2k u and the saddle point 2 1 k u − , the test statistic is defined as:

∫
where i is the ith investigated mode and ( ) . It can be thought as the probability mass of the mode above the higher of the two saddle points or antimodes around it.The advantage of this statistic is that it does consider not only the heights of the mode and saddle point, but also the distance between them.The reference distribution is generated by forcing the distribution flat (uniform) between the point 1 i u − and 1 i u + and keeping the rest of the distribution the same.
The generalization to multivariate data is sparse.[9] proposed a mode hunting tool together with a further test of the significance for the existence of these modes, by using the k-nearest neighbor (KNN) density estimate for multivariate data.The authors proposed an iterative nearest neighbor method for selecting the initial modal candidates and then thinning out this list of modal candidates to form a set of modal candidates, M. For the formal pairwise test of existence of the modes in M, [9] proposed the following statistic where troduced in this paper.[1] introduced a set of tools to detect the modes and saddle point between the two modes of the KDE.Section 3 provides a review of these tools.Section 4 proposes the test statistic and its asymptotic distribution.Section 5 discusses the choice of the bandwidth parameter.Section 7 applies the method on some simulated and real data sets.Section 8 closes the paper with some discussion and future work.

Kernel Density Estimate
In this section, we review some basic properties of the multivariate KDE, which provides the foundation of the modality inference introduced in the next chapter.The multivariate KDE is the most commonly used non-parametric estimate of the probability density of a multivariate random variable.Suppose the d-dimensional vectors ( ) are i.i.d samples from the population with some unknown probability density f.The ( ) where ( ) K ⋅ , a real-valued multivariate kernel function, has the properties of ( ) However, if the data have different scales on different dimensions, the KDE of (1.4) will lead to a poor estimation.To avoid the scaling problem, the sphering transformation can be applied to the data.

Sphering Transformation
Sphering transformation, also known as Whitening transformation, is a linear transformation that makes the data have the identity covariance matrix [10].To carry out the transformation, one computes the spectral decomposition of the sample covariance matrix of X, ˆ= T Σ PΛP .Let

=
X PΛ Y , one can transform the data back to the original scale.In this thesis, we will use this transformation and the kernel density estimator (1.4). Figure 1 illustrates two examples of original data and sphering transformed data.The original data is shown on the left panel while the transformed data is shown on the right panel.The plots show that after the transformation, both the scale and the "direction" of the data has changed, while the clustering or grouping information of the data is still preserved.

Bandwidth Selection
Traditional method to select the "optimal" choice of the bandwidth is to minimize the asymptotic mean integrated squared error (AMISE) of ( ) Thus the AMISE is minimized as h is: ( ) AMISE .
The "normal reference rule" [11] is widely used.It is given as: ( ) where j σ is the standard deviation of the jth dimension and can be estimated by the corresponding sample standard deviation.

Asymptotic Distribution of KDE
In this section, we review the details of the asymptotic properties of KDE defined in (1.4).
( ) showing that ( ) f x is asymptotically unbiased.We apply the normal reference rule on the sphering trans- formed data to choose the bandwidth parameter h as (2.2), ( ) . If we choose h with the optimal convergent rate, which is 2), for example, using normal reference rule, then The kernel density estimator has the following asymptotic normality property of: The proof of the above property is based on the Liapunov Central Limit Theorem.[12] provide the details of the proof.From the Property 1, it can be seen that by choosing the h at optimal convergent rate, the KDE ( ) , ˆn h f x asymptotically is a biased estimator of ( ) f x .It underestimates the local maxima since the term ( ) in bias is negative at maxima and it overestimates at local minima due to the same reason.
If we choose an h that converges faster than the optimal rate, i.e., where 1 the bias term can be made negligible.The asymptotic distribution then becomes: To satisfy the conditions If the h converges slower than the optimal rate, the bias term cannot converge as n → ∞ .
Note that the variance term in the asymptotic distribution contains the unknown parameter ( ) f x .Simply by applying the delta method and using the transformation function ( ) , the variance is stabilized and becomes invariant with ( ) Further, it is easy to verify the following property: x and ( ) x are uncorrelated.
Proof: We consider the covariance between ( ) x and ( ) n n We want to show that For the second last step, since ( ) x x u u is a convolution and hence another density, therefore its integration is bounded by 1.Now we consider the term Thus, it follows Therefore, we proved that ( ) x and ( ) x are asymptotically uncorrelated.Since under the same condi- tion of Property 3, ( ) x and ( ) x asymptotically follow normal distributions, we can claim they are asymptotically independent.[1] proposed a set of comprehensive tools to explore the geometric feature of the KDE.In this section, we review the basic quantities of the modality inference, which are the mode, the saddle point and the ridgeline.We will also discuss the algorithms to determine these quantities under the KDE.

Mode
Mode is defined as the local maximum of a probability density.Traditional techniques of finding local maxima, such as hill climbing, work well for univariate data.However, multivariate hill climbing is computationally expensive, thereby limiting its use in high dimensions.[1] proposed an algorithm that solves a local maximum of a KDE by ascending iterations starting from the data points.Since the algorithm is very similar to the expectation maximization (EM) algorithm [13], it is named as the modal expectation maximization (MEM).The finite mixture model can be expressed as x , the MEM solves the local maxima of the mixture density by alternating the following two steps until it meets some user defined stopping criterion.
Step 1: Let Step 2: ( ) ( ) Details of convergence of the MEM approach can be found in [1].The above iterative steps provide a computationally simpler approach than the grid search method for hill climbing from any starting point D x ∈  , by exploiting the properties of density functions.Given a multivariate kernel K, let the density of the data be given by ( ) ( ) , where Σ is the matrix of smoothing parameters.Moreover, in the special case of Gaussian kernels, i.e., ( ) ( ) Σ , where ( ) φ ⋅ is the pdf of a Gaussian distribution, the update of ( ) This allows us to avoid the numerical optimization of Step 2. Due to this reason, the normal kernel function is used throughout the methods introduced in this thesis.However, in general, one can also use other kernel functions.
The MEM algorithm can be naturally used to define clusters.If we start the algorithm from each data point, we can cluster the data that converges to the same mode as one group.[1] denotes this algorithm the Mode Association Clustering (MAC).If we choose a sequence of bandwidth parameters h, then we can get the Hierarchical MAC (HMAC).[14] provided the explicit formula for the ridgeline between the two means of the mixture of two multivariate normal distributions.The mixture density of two d-dimensional multivariate normal distributions is:

Saddle Point and Ridgeline
; , 1 ; , , where the 1 µ and 2 µ are the mean vectors and 1 Σ and 2 Σ are the two covariance matrices of the two mixed multivariate normal components respectively.The ridgeline of the distribution in (1.6) from one mean to another is given by: ( ) where [ ] 0,1 α ∈ and 1 α α = − .[14] showed that all the critical points of the d-dimensional distribution, including the modes and saddle points, are the points on the ridgeline.With different choices of the parameters, the probability density can be unimodal, bimodal or in some special cases, trimodal.See some examples in [15].The ridgeline provides a useful tool to discover the modality of a mixture of multivariate normal mixtures.
[1] also provided the algorithm to find the ridgeline of the KDE between the two modes identified by MEM, named the Ridgeline EM (REM).Here we provide a brief description of the REM: Let the density of the two clusters represented by the two modes of interest be 1 f and 2 f .We can consider both 1 f and 2 f as the mixtures of L parametric distributions: Starting from an initial value ( ) x , the REM updates x by iterating the following two steps: Step 1: Compute: Step 2: Update ( ) arg max 1 log log In the special case where , the multivariate normal distribution, the second step becomes illustrates one example of the ridgeline.The point on the ridgeline with the lowest density is the detected saddle point.The REM and MEM introduced in this section provide useful tools to detect the mode and saddle point of the KDE, which provides the basis of the inferential framework introduced in the following section.

Test Statistic and Its Asymptotic Distribution
We denote the one of and m 2 x with minimum density.We use Ridgeline EM (REM), which was reviewed in Section 1.3 to determine the saddle point s x .To identify the interested pair of modes, in practice, when several modes are identified by MEM, it starts with the one with the lowest density and its neighbor mode.Or, one can select the particular pair of modes based on the context of the study.After identifying m x and s x , the hypothesis in (1.2) can be restructured as:  x to make the inference, where ( ) f ⋅ is the kernel density estimate of ( ) and m x and s x are the estimated mode and saddle point of the KDE detected by the Modal EM (MEM) algo- rithm, which was reviewed in Section 3. We believe m x is a good estimation of the modal region and is close to the true population mode, and same for the s x .Theorem 1.Using Gaussian kernel function, under Proof: Using Property 2, Property 3 and density in 1.5, we can show that: , we have ( ) . Thus we prove the theorem.This is the test statistic of Hypothesis (1.8) and its asymptotic distribution.

Choice of the Bandwidth Parameter
In order to use the asymptotic distribution (1.9), the conditions of Property 2 must be satisfied.To satisfy the conditions γ < < + , is still wide if the dimension of the data is not high, e.g., 1 3 Theoretically, the bias of the KDE can be negligible if γ is within this interval.However, in practice, the selection of γ affects the variance-bias trade off, which affects the inference dramatically.We demonstrate the phenomenon using logctA20 data set.The description of the data set can be found in R package Modalclust, which will be described in the next chapter.logctA20 is a two-dimensional data with 2166 observations.The scatter plot of the data is shown in Figure 3.
Using the normal reference rule, the bandwidth parameter used for the MEM is: ( ) Using the MAC to cluster the data, the output shows that there are four major clusters.Figure 4 shows the clustering output as well as the modes, saddle points and ridgeline between the modes.The next step is to test if  the four modes are significant.We consider three tests for the three adjacent pairs of modes: the test of Mode 4 against Mode 3, Mode 3 against Mode 1 and Mode 2 against Mode 1.As mentioned at the beginning of this section, in order to use the asymptotic distribution (1.9), we should choose the bandwidth parameter h so that it converges faster than the optimal rate.Thus, we should choose Figure 5 provides the plot of the p-value of the modality test against the choice of the γ .It clearly demon- strates that the value of the γ affects the conclusion of the inference.For Mode 2, the lower values of γ lead to rejecting the null hypothesis, whereas the higher values of γ lead to not rejecting the null hypothesis, In practice, if we choose a small value of γ , the bias term still exists, even though asymptotically it will converge to 0. If we choose a large value of γ , the variance is relatively large and could mislead the conclusion.We suggest to use a small value of γ , which will lead to a large value of h * .Remark: Recall in Section 2, we reviewed that the bias term in Property 1 is Note that 0 b < at ˆm x and 0 b > at ˆs x .Thus, under 0 H of hypothesis in (1.8), the expectation of the test statistic is negative if the bias exists.Therefore, it makes the test conservative.
From the analysis in Figure 5, we conclude that Mode 4 is not significant, while Mode 2 and Mode 3 are.Group 4 can be merged with Group 3. The final plot after merging Group 4 with Group 3 is given in Figure 6.
Note that the resulting modes are all significant.When we have several mode candidates and when we want to inference on the entire distribution to see how many significant modes, there is a multiplicity issue.One can refer [15] for some method to control the overall Type I error rate.We focus on the local significance and do not provide an overall significance of the final result.

The Procedure of the Mode Hunting and Inference
The inference procedure proposed in the previous section, along with the MEM and REM reviewed in Section 3, provides a comprehensive tool for mode hunting and follow-up inference of a data set.In this section, we summarize the procedure.Step 1: Sphering transformed the data; Step 2: Use KDE to estimate the density of the data with bandwidth parameter chosen by some standard method, e.g., the normal reference rule, etc.; Step 3: Identify the modes of KDE using MEM.After determining the pair of modes ˆ1 m x and ˆ2 m x , identify the corresponding saddle point ˆs x by REM; Step 4: Use

Application
This section provides the application of the modality inferential framework on some real and simulated data sets.We start by providing a description of the data sets and follow up by providing the conclusion of the inferential framework.

Four Discs
The four disks data is a simulated data.It contains 10,000 observations and the data is a mixture of four bivariate normal distributions.The mean vectors are ( ) . The data contains multiple layers of the clusters.There are three main clusters with one of them having two sub-clusters.By the simulation design, the Group 1 and 2 shown in Figure 7 are two distinct groups.The p-value of Mode 1 compared with Mode 2 is 0.0194.However, Group 1 and 2 are relatively close compared to the other groups.After merging these two groups together, the resulting clusters and the ridgeline between each pair of modes are shown in Figure 8.The multiple layers of clusters are common in real life application.The decision of how many clusters the data has is often related to the application area and research question.

3-Dimensional Two Half Discs
The 3-dimensional two half discs is another simulated data set with 800 samples.It is formed by two half discs with equal size, i.e. 400 samples for each disc.Using 800 n = and 3 d = for ( )  There are 4 major clusters and the number of samples of each cluster is given in Table 1.
The inference between some major clusters is carried out and the resulting p-values are given in Table 2.It is straightforward to conclude that Group 1 is not significantly distinct from Group 3, and Group 6 is not distinct from Group 5. Group 3 is significantly separated from Group 5 and 6.Group 5 is significantly separated from Group 1 and 3. Thus, we get the conclusion that there are two main groups in this data set.

Flow Cytometry Data
Flow cytometry is a technology that simultaneously measures and then analyzes multiple physical characteristics of single cell, as they flow in a fluid stream through a beam of light.Flow cytometry is one of the most commonly used platforms in clinical and research labs worldwide.It is used to identify and characterize types and functions of cell populations e.g., dead or live cells, in a sample by measuring the expression of specific proteins on the surface and within each cell.
Flow cytometry data consists of per cell measurements (or events) in the form of scattering of light and fluorescence intensity from the fluorophore-conjugated markers.In a typical flow data analysis workflow, a human analyst visually inspects 2-dimensional scatter plots of a sample, where the dimensions could be scatters, marker intensities, or a combination of these, and it demarcates (or gates) specific populations of interest such as live cells, lymphocytes, etc. Often, gates are drawn around visually discernible congregations of events.For instance, for live gating, the dead cells or debris could be discerned by their small cell size and granularity, which appear as a distribution of points with low forward-and side-scatter values.Forward-scatter light (FSC) and Side-scatter light (SSC) reflects two features of the cells and forms a two-dimensional scatter plot.FSC is proportional to cell-surface area or size.SSC is proportional to cell granularity or internal complexity.The manual approach to gating is, however, labor-intensive and subjective, and gating results can vary considerably from one analyst   to another.[16] have used the MAC to gate flow cytometry data.However, the inference is distinctly missing.Figure 10 is one example of flow cytometry data.The data contains 4905 cells.In this plot, the dead cells have a relatively smaller size compared with the live cells, which shows that the dead cells have a smaller value of FSC and SSC.In the scatter plot, the dead cells are at the bottom left corner.For this data set, using the inference procedure introduced in Section 6, we applied the MAC on the data with ( ) and got the two major clusters.It is suspected that the cluster on the bottom left represents the dead cells, while the rest are the live cells.The p-value of the mode existence inference is p 0.0001 < .Thus, the procedure can automatically identify the dead cell population distinctly from the live cells.

Swiss Banknotes
The data set contains 6 measures of 200 Swiss banknotes, where 100 are real and 100 are counterfeit.The 6 measures are: 1 X : Length of the bank note, 2 X : Height of the bank note, measured on the left, 3 X : Height of the bank note, measured on the right, 4 X : Distance of inner frame to the lower border, 5 X : Distance of inner frame to the upper border, 6 X : Length of the diagonal of the inner image.All measurements are in millimeters.The original banknote image and the measurements are shown in Figure 11.In this data set, we know the truth of whether the banknote is real or forged.More information about the data set can be found in [17].We use the spectral degrees of freedom concept, which was proposed by [18], and supply 1.022 h = for the MAC.The MAC output shows there are two major clusters and can capture the two groups well.The output is shown in Table 3. Group 1 and Group 4 are the two major clusters.Using 0.517 h = , the p-value of the corresponding modality inference is 0.001774, which indicates the two clusters are clearly separated.

Discussion
In this paper, we developed the inference procedure to test the significance of a specific mode.The asymptotic distribution of the test statistic is derived based on the asymptotic normality of the KDE to assess the significance of the mode.The traditional method to assess the significance of the modality of the data is to determine the test statistic and decide the reference distribution under the null hypothesis.Then, a large scale simulation is performed to simulate the reference data and compute the test statistic of the simulated reference data to form   the null distribution of the test statistic.The method we introduced uses the asymptotic distribution of the statistic, thus, we can avoid the bootstrap testing, which could be computationally expensive.
Combined with the research work in [1], we provided a comprehensive mode hunting and inference tool for the investigated data set.The mode hunting and inference procedure is based on the KDE and using the normal density as the kernel function.It is important to select the bandwidth parameter h.There are two steps to select the bandwidth parameters.It is acknowledged that there is no best choice of h for estimating a density.For mode hunting, we chose to use the normal reference rule.For the inference, h has to satisfy the conditions of the asymptotic normality of the KDE.Due to the curse of the dimensionality, this method is limited to low or moderate dimensions.
We can apply this inference procedure on each pair of modes to assess how many modes the data have.In the MAC algorithm, the number of modes is the same as the number of clusters.It is difficult but worthwhile to generate the automated algorithm to decide on how many clusters/modes of the data have, based on the modality inference procedure.The difficulty here is that the method is based on the KDE.The outliers of the data could easily form the spurious modes, especially for the high dimensional data, which make it difficult to generate automated algorithm.

Figure 1 .
Figure 1.Two examples of original data and sphering transformed data.
the mixing components.Given any initial value ( ) 0 by m x .To test the hypothesis defined in (1.2), a nat- ural choice is to compare the density of m x against the density of the saddle point s x , which is the point on the ridgeline between 1 m x

∫
1.10) ˆm x and ˆsx are correlated.However, based on Property 3, as long as ˆ≠ m K u u.For univariate standard normal kernel function: ( ) for d-dimensional multivariate normal kernel function:

Figure 4 .
Figure 4. Mode, saddle point and ridgeline of logctA20 data.

Figure 6 .
Figure 6.Mode, saddle point and ridgeline of the example data after merging.

5 :
Make the inference of the Hypothesis (1.8) based on the asymptotic distribution (1.9).

Figure 7 .
Figure 7.The first layer of four discs data.

Figure 8 .
Figure 8.The second layer of four discs data.

Figure 10 .
Figure 10.One example of flow cytometry data clustered by MAC.

Figure 11 .
Figure 11.6 measurements of Swiss banknote data.
chosen as the standard multivariate normal density function.H is the d d× non-singular positive definite bandwidth matrix and H is the determinant of H.In practice, it is difficult to choose the values of H due to such a large number of parameters over d-dimensional space.It is more practical to keep the number of bandwidth parameters as small as possible, but retaining enough to provide good estimates.One approach to reducing the number of bandwidth parameters is to use the simplest model that contains only one bandwidth parameter:

Table 1 .
Cluster size of two half discs data.

Table 2 .
p-value of modality inference on two half discs data.

Table 3 .
MAC output of emphSwiss banknotes data.