We introduce and develop a novel approach to outlier detection based on an adaptation of random subspace learning. Our proposed method handles both high-dimension low-sample size (HDLSS) and traditional low-dimension high-sample size (LDHSS) datasets. Essentially, we avoid the computational bottleneck of techniques like the Minimum Covariance Determinant (MCD) by computing the needed determinants and associated measures in much lower-dimensional subspaces. Both the theoretical and computational development of our approach reveal that it is computationally more efficient than regularized methods in the high-dimensional low-sample size setting, and that it often competes favorably with existing methods as far as the percentage of correctly detected outliers is concerned.

We are given a dataset

which

It is also further assumed that the data set

where

Experimenters often address the outlier detection task in such situations using either the so-called Minimum Covariance Determinant (MCD) algorithm [

Minimum Covariance Determinant (MCD)

Step 1. Select h observations, and form the dataset

Step 2. Compute the empirical covariance

Step 3. Compute the Mahalanobis distances

Step 4. Select the h observations having the smallest Mahalanobis distance;

Step 5. Update
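The concentration steps above can be sketched in code. This is a minimal illustrative implementation, not the authors' reference code; the function name, the fixed iteration cap, and the convergence check are our own choices.

```python
import numpy as np

def mcd_c_step(X, h, n_iter=20, seed=0):
    """Sketch of the MCD concentration (C-) steps described above.

    X : (n, p) data matrix; h : size of the retained subset (h < n).
    Names and defaults here are illustrative, not from the paper.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Step 1: select an initial subset of h observations at random.
    idx = rng.choice(n, size=h, replace=False)
    for _ in range(n_iter):
        sub = X[idx]
        # Step 2: empirical mean and covariance of the current subset.
        mu = sub.mean(axis=0)
        S = np.cov(sub, rowvar=False)
        # Step 3: squared Mahalanobis distances of all n points to (mu, S).
        diff = X - mu
        d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(S), diff)
        # Steps 4-5: keep the h smallest distances and update the subset.
        new_idx = np.argsort(d2)[:h]
        if set(new_idx) == set(idx):
            break  # converged: the covariance determinant can no longer decrease
        idx = new_idx
    return idx, mu, S
```

Each iteration provably does not increase the determinant of the subset covariance, which is why iterating these steps concentrates on the h most central observations.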

The MCD algorithm can be formulated as an optimization problem. The seminal MCD algorithm proposed by [

where

It turns out that even the above Regularized MCD cannot be contemplated when

in such cases. That added difficulty is addressed by solving

where the regularized covariance matrix

With

where

From which the principal component scores

We herein propose a technique that combines the concept underlying Random Subspace Learning (RSSL) by [

where m is the number of iterations needed for the MCD algorithm to converge.

The number h of observations in each subset is required to be

Random Subspace Learning in its generic form is designed for precisely this kind of procedure. In a nutshell, RSSL combines instance bagging (bootstrapping, i.e. sampling observations with replacement) with attribute bagging (sampling indices of attributes without replacement) to allow efficient ensemble learning in high-dimensional spaces. Random Subspace Learning (attribute bagging) proceeds very much like traditional bagging, with the crucial added step of selecting a subset of the variables from the input space for training, rather than building each base learner using all p original variables.

Random Subspace Learning (RSSL): Attribute-bagging step

Step 1. Randomly draw the number

Step 2. Draw without replacement the indices of d of the original p variables;

Step 3. Perform learning/estimation in the d-dimensional subspace.

This attribute-bagging step is the main ingredient of our outlier detection approach in high dimensional spaces.
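The attribute-bagging step can be sketched as follows; the function name and the fallback rule for drawing the subspace dimension are our own illustrative choices, not the paper's.

```python
import numpy as np

def attribute_bag(X, d=None, seed=None):
    """Sketch of the RSSL attribute-bagging step (names are ours).

    Draws d of the p variable indices without replacement and returns
    the data restricted to that random d-dimensional subspace.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    if d is None:
        # Step 1: draw the subspace dimension d (here: uniformly below p).
        d = int(rng.integers(2, p))
    # Step 2: draw d of the original p variable indices without replacement.
    cols = rng.choice(p, size=d, replace=False)
    # Step 3: learning/estimation then proceeds in this d-dimensional subspace.
    return X[:, cols], cols
```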

Random Subspace Outlier

Step 1. Draw with replacement

Step 2. Start for

Draw without replacement from

Drop unselected variables from

Build the b-th covariance determinant

End for

Step 3. Sort the ensemble

Step 4. Form

Step 5. Compute

We can build the robust distance by

The RSSL outlier detection algorithm computes a determinant of covariance for each subsample, with each subsample residing in a subspace spanned by the d randomly selected variables, where d is usually selected to be

along with the corresponding determinants. Then the best subsample, meaning the one with the smallest covariance determinant, is singled out. It turns out that in the LDHSS context
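The Random Subspace Outlier steps above can be sketched as follows. All names, the default rule for d, and the use of raw squared distances as outlier scores are our reading of the procedure, not the authors' reference implementation.

```python
import numpy as np

def rssl_outlier_scores(X, B=200, d=None, seed=0):
    """Sketch of the Random Subspace Outlier procedure (HDLSS version).

    For b = 1..B: bootstrap the rows (instance bagging), keep d randomly
    chosen variables (attribute bagging), and record the determinant of the
    subsample covariance. The subsample with the smallest determinant
    supplies the robust location/scatter used for the final distances.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    if d is None:
        # keep d well below n so the subsample covariances stay invertible
        d = max(2, min(p, n // 5))
    best = None
    for _ in range(B):
        rows = rng.choice(n, size=n, replace=True)    # instance bagging
        cols = rng.choice(p, size=d, replace=False)   # attribute bagging
        S = np.cov(X[np.ix_(rows, cols)], rowvar=False)
        det = np.linalg.det(S)
        if best is None or det < best[0]:
            best = (det, rows, cols)                  # Steps 3-4: best subsample
    _, rows, cols = best
    sub = X[np.ix_(rows, cols)]
    mu, S = sub.mean(axis=0), np.cov(sub, rowvar=False)
    # Step 5: robust squared Mahalanobis-type distances for all n points.
    diff = X[:, cols] - mu
    return np.einsum("ij,jk,ik->i", diff, np.linalg.inv(S), diff)
```

Because each determinant lives in a d-dimensional subspace with d far below p, the per-iteration cost stays low even when p is very large.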

Random Subspace Learning for Outlier Detection when

Step 1. Draw with replacement

Step 2. Start for

Draw without replacement from

Drop unselected variables from

Build the b-th covariance determinant

End for

Step 3. Sort the ensemble

Step 4. Keep the k subsamples with the smallest determinants, chosen via the elbow method, to form

Step 5. Start for

Select

End for

Step 6. Form

Step 7. Compute

We can build the robust distance by the same way:

Rather than selecting only the subsample with the smallest covariance determinant, we select a certain number of subsamples and achieve variable selection through a kind of voting process. The most frequently appearing variables are elected to build an optimal subspace that allows us to compute our robust estimators. The simulation results and other details are discussed later.
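The voting step just described can be sketched as follows. The function name, the fixed k (the paper picks k by an elbow rule), and the number of elected variables are our own illustrative defaults.

```python
import numpy as np
from collections import Counter

def rssl_variable_vote(X, B=200, d=3, k=20, n_elect=None, seed=0):
    """Sketch of the LDHSS voting step: rank the B random subsamples by
    covariance determinant, keep the k smallest, and elect the variables
    appearing most often among those k subsamples (all names are ours)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    results = []
    for _ in range(B):
        rows = rng.choice(n, size=n, replace=True)     # instance bagging
        cols = rng.choice(p, size=d, replace=False)    # attribute bagging
        det = np.linalg.det(np.cov(X[np.ix_(rows, cols)], rowvar=False))
        results.append((det, cols))
    results.sort(key=lambda t: t[0])                   # Step 3: sort by determinant
    votes = Counter()
    for _, cols in results[:k]:                        # Step 4: k best subsamples
        votes.update(cols.tolist())
    if n_elect is None:
        n_elect = d
    # elect the most frequent variables to span the working subspace
    return [v for v, _ in votes.most_common(n_elect)]
```

Robust location and scatter are then computed in the subspace spanned by the elected variables, as in the single-subsample version.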

Conjecture 1. Let

Sketch 1. Let

strapped sample from

present in

In other words, if

have

Since

sample

The assumption of multivariate Gaussianity of the

subject to

using

so that any

OCSVM has been extensively studied and applied by many researchers among which [

In this section, we conduct a simulation study to assess the performance of our algorithm based on various important aspects of the data, and we also provide a comparison of the predictive/detection performance of our method against existing approaches. All our simulated data are generated according to the ε-contaminated multivariate Gaussian introduced via Equation (1) and Equation (2). In order to assess the effect of the covariance between the attributes, we use an AR-type covariance matrix of the following form

where
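A sketch of this simulation setup follows. We assume here the common AR(1) form Σ_ij = ρ^|i−j| for the AR-type covariance and a mean-shift contamination; the paper's exact parameterization is given by its own equations, and all names and defaults below are ours.

```python
import numpy as np

def ar_cov(p, rho):
    """AR(1)-type covariance matrix: Sigma_ij = rho ** |i - j|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def contaminated_sample(n, p, rho=0.5, eps=0.1, shift=5.0, seed=0):
    """Draw n points from an eps-contaminated Gaussian mixture:
    (1 - eps) * N(0, Sigma) + eps * N(shift * 1, Sigma)."""
    rng = np.random.default_rng(seed)
    Sigma = ar_cov(p, rho)
    labels = rng.random(n) < eps                 # True marks an outlier
    mean = np.where(labels[:, None], shift, 0.0) # per-row mean shift
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n) + mean
    return X, labels
```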

As can be seen in

Since each selected bootstrapped sample has only a small chance of being affected by the outliers, we can select the dimensionality that maximizes this benefit. In our HDLSS simulations, determinants are computed on all the randomly selected subspaces and are dominated by predominantly small values, which implies the robustness of the classifier.

By Equation (20), it should be understood that we need to isolate the precious subsample

As indicated in our introductory section, we use the Mahalanobis distance as our measure of proximity. Since we are operating under the assumption of multivariate normality, we use the traditional distribution quantiles

than

where
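The quantile-based cutoff just described can be sketched as follows; under multivariate normality the squared Mahalanobis distance is chi-square distributed, and 0.975 is a common choice of quantile (the paper's exact quantile is an assumption here).

```python
import numpy as np
from scipy.stats import chi2

def flag_outliers(d2, dim, alpha=0.975):
    """Flag observations whose squared (robust) Mahalanobis distance d2
    exceeds the chi-square quantile chi2_{dim, alpha} -- the classical
    cutoff under multivariate normality (names/defaults are ours)."""
    return d2 > chi2.ppf(alpha, df=dim)
```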

function used here is the basic zero-one loss defined by
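The zero-one loss referred to above is simply the misclassification indicator, averaged over observations; a minimal sketch (function name ours):

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """Zero-one loss: L(y, yhat) = 0 if y == yhat, else 1,
    averaged over all observations."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))
```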

It will be seen later that our proposed method produces predictively accurate outlier detection results, typically competing favorably against other techniques and usually outperforming them. First, however, we show in

The improvement of our random subspace learning algorithm in low dimensional data with dimensionality such that

the absolute value of their kurtosis coefficients. This method is shown to yield good performance when dealing with small shifts of the mean and scatter of the covariance matrix. However, if the outliers lie on larger

We have presented what we can rightfully claim to be a computationally efficient, scalable, intuitively appealing and highly accurate outlier detection method for both HDLSS and LDHSS datasets. As an adaptation of both random subspace learning and the minimum covariance determinant, our proposed approach can be readily used on a vast number of real-life problems where both of its component building blocks have been successfully applied. The particular appeal of the random subspace learning aspect of our method comes in handy for many outlier detection tasks on high-dimension low-sample size datasets, such as DNA microarray gene expression datasets, for which the MCD approach proves computationally untenable. As our computational demonstrations above reveal, our proposed approach competes favorably with other existing methods, sometimes outperforming them predictively despite its straightforwardness and relatively simple implementation. Specifically, our proposed method is shown to be very competitive for outlier detection in both low-dimensional and high-dimensional spaces and is computationally very efficient. We are currently seeking out interesting real-life datasets on which to apply our method. We also plan to extend our method beyond settings where the underlying distribution is Gaussian.

Ernest Fokoué wishes to express his heartfelt gratitude and infinite thanks to Our Lady of Perpetual Help for her ever-present support and guidance, especially for the uninterrupted flow of inspiration received through her most powerful intercession.

Bohan Liu, Ernest Fokoué (2015) Random Subspace Learning Approach to High-Dimensional Outliers Detection. Open Journal of Statistics, 5, 618-630. doi: 10.4236/ojs.2015.56063