
Dimension reduction is defined as the process of projecting high-dimensional data onto a much lower-dimensional space. Dimension reduction methods are variously applied in regression, classification, feature analysis and visualization. In this paper, we review in detail the most recent versions of these methods, which have been extensively developed over the past decade.

Any progress in efficiently using data processing and storage capacities requires control over the number of useful variables. Researchers working in domains as diverse as computer science, astronomy, bio-informatics, remote sensing, economics and face recognition are constantly challenged with reducing the number of data variables. The original dimensionality of the data is the number of variables that are measured on each observation. High-dimensional representations are generated especially when signals, processes, images or physical fields are sampled. High-dimensional data-sets present many mathematical challenges as well as some opportunities, and are bound to give rise to new theoretical developments [

In many cases, these representations are redundant and the variables are correlated, which means that eventually only a small sub-space of the original representation space is populated by the sample and by the underlying process. This is most probably the case when very narrow process classes are considered. To enable low-dimensional representations with minimal information loss, dimension reduction methods are accordingly needed.

Hence, in this paper we review the most important dimension reduction methods, from traditional methods such as principal component analysis (PCA) and non-linear PCA up to current state-of-the-art methods published in various areas, such as the signal processing and statistical machine learning literature. The survey is organized as follows: Section 2 reviews the linear nature of principal component analysis and its relation to multidimensional scaling (classical scaling). Section 3 introduces non-linear or kernel PCA (KPCA) via the kernel trick. Section 4 is about linear discriminant analysis (LDA); we give an optimization model of LDA which measures the power of this method. In Section 5 we summarize another higher-order linear method, namely canonical correlation analysis (CCA), which finds a low-dimensional representation maximizing the correlation, together with its optimization formulation. Section 6 reviews a relatively new version of PCA, the so-called oriented PCA (OPCA), introduced by Kung and Diamantaras [

1) Input Layer, 2) Hidden Layer (bottleneck), 3) Output Layer, and an auto-associative neural network with five layers:

1) Input Layer, 2) Hidden Layer, 3) Bottleneck, 4) Hidden Layer, 5) Output Layer. A very nice optimization formulation is also given. In Section 9, we review the Nystroem method, a very useful and well-known technique based on the numerical solution of an integral equation. In Section 10, we look at multidimensional scaling (MDS) from a modern and more exact point of view; in particular, a well-defined objective stress function arises in this method. Section 11 summarizes the locally linear embedding (LLE) method, which addresses the problem of nonlinear dimensionality reduction by computing low-dimensional, neighborhood-preserving embeddings of high-dimensional data. Section 12 is about one of the most important classes of dimension reduction methods, namely graph-based methods. Here we will see how well the adjacency matrix works as a powerful tool for obtaining a small space, which is in fact the eigen-space of this matrix. Section 13 gives a summary of Isomap, and the most important references on Dijkstra's algorithm and Floyd's algorithm are given. Section 14 is a review of the Hessian eigenmaps method, one of the most important methods in so-called manifold embedding. This section requires more mathematical background. Section 15 reviews more recently developed methods such as

• vector quantization

• genetic and evolutionary algorithms

• regression

We emphasize that all references given in the body of this survey are used, and that they are the most important or original references for the related subject. To provide more mathematical context, we include an appendix on the most important background on fractal and topological dimension definitions, which are also important for understanding the notion of intrinsic dimension.

Principal Component Analysis (PCA) [

In the mathematical language, PCA attempts to find a linear mapping

Why is the above optimization problem equivalent to the eigenvalue problem (1.1)? Consider the convex form

It is interesting to see that in fact PCA is identical to the multidimensional scaling (classical scaling) [

For the given data

in which

Now consider the multiplication of the principal eigenvectors of the double-centered squared Euclidean distance matrix (i.e., the principal eigenvectors of the Gram matrix) with the square root of their corresponding eigenvalues; this gives exactly the minimum of the cost function in Equation (1.2).
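As a numerical illustration of this equivalence, the following sketch (the data sizes and random data are illustrative assumptions) compares the classical-scaling coordinates obtained from the double-centered squared-distance matrix with the PCA scores; the two agree up to the sign of each column.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)

# Classical scaling: double-center the squared Euclidean distance matrix.
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
n = len(X)
J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
G = -0.5 * J @ D2 @ J                        # Gram matrix of the centered data
gvals, gvecs = np.linalg.eigh(G)
idx = np.argsort(gvals)[::-1][:2]
Y_mds = gvecs[:, idx] * np.sqrt(gvals[idx])  # eigenvectors times sqrt(eigenvalues)

# PCA: project the centered data onto the top covariance eigenvectors.
cvals, cvecs = np.linalg.eigh(Xc.T @ Xc)
Y_pca = Xc @ cvecs[:, ::-1][:, :2]
```

Each column of `Y_mds` matches the corresponding PCA score column up to an arbitrary sign, which is exactly the identity stated above.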

It is well known that the eigenvectors

The probabilistic formulation of PCA leads to an EM-algorithm that may be computationally more efficient for very high-dimensional data. By using Gaussian processes, probabilistic PCA may also be extended to learn nonlinear mappings between the high-dimensional and the low-dimensional space [

PCA and classical scaling suffer from two main drawbacks. First, in PCA, the size of the covariance matrix is proportional to the dimensionality of the data-points. As a result, the computation of the eigenvectors might be infeasible for very high-dimensional data. In data-sets in which

Non-linear or kernel PCA (KPCA) is in fact the reconstruction of linear PCA in a high-dimensional feature space that is constructed using a given kernel function [

Kernel PCA computes the kernel matrix

where

The centering operation corresponds to subtracting the mean of the features in traditional PCA: it subtracts the mean of the data in the feature space defined by the kernel function

In order to obtain the low-dimensional data representation, the data is projected onto the eigenvectors of the covariance matrix

where

An important weakness of Kernel PCA is that the size of the kernel matrix is proportional to the square of the number of instances in the data-set. An approach to resolve this weakness is proposed in [
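The KPCA computation described above (kernel matrix, centering in feature space, eigendecomposition) can be sketched as follows; the RBF kernel with width `gamma` and the noisy-circle toy data are assumptions made for illustration only.

```python
import numpy as np

def kernel_pca(X, k=2, gamma=1.0):
    """KPCA sketch: RBF kernel matrix, double centering, then eigendecomposition."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)                         # kernel (Gram) matrix
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                                  # centering in feature space
    vals, vecs = np.linalg.eigh(Kc)
    idx = np.argsort(vals)[::-1][:k]
    # projections onto the feature-space principal components
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

rng = np.random.default_rng(2)
theta = rng.uniform(0, 2 * np.pi, 100)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(100, 2))
Z = kernel_pca(X, k=2, gamma=1.0)
```

Note that the kernel matrix is n x n, which is the quadratic cost discussed above.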

Kernel PCA has been successfully applied to, e.g., face recognition [

The main reference here is [

Now we define three scatter matrices:

The between-class scatter matrix

Hence in this way the dimension is reduced from

form the columns of U as above for LDA.
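The LDA construction above, with within-class and between-class scatter matrices followed by a generalized eigenproblem, can be sketched as follows; the two-class Gaussian toy data and the small regularization term added to the within-class scatter are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def lda_directions(X, y, k):
    """Discriminant directions from the generalized eigenproblem S_b u = lambda S_w u."""
    mean = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))
    Sb = np.zeros_like(Sw)
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                    # within-class scatter
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)   # between-class scatter
    # small ridge keeps S_w invertible for the generalized eigensolver
    vals, vecs = eigh(Sb, Sw + 1e-8 * np.eye(len(Sw)))
    return vecs[:, np.argsort(vals)[::-1][:k]]

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(3, 1, (50, 4))])
y = np.array([0] * 50 + [1] * 50)
U = lda_directions(X, y, 1)      # columns of U span the discriminant subspace
```

Projecting the data onto the columns of U reduces the dimension while keeping the classes separated.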

CCA is an old method dating back to the work of Hotelling in 1936 [

Let

where

Actually, CCA amounts to the following optimization problem:

which can be modified as

Assume the pair of projective directions

repeating the above process

In fact we can obtain this

and the eigenvectors

compose the feature sets extracted from
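A standard way to compute the projective directions described above is an SVD of the whitened cross-covariance; in the following sketch, the toy data with one shared latent signal and the small regularization terms are assumptions for illustration. It returns the leading pair of directions together with the corresponding canonical correlation.

```python
import numpy as np

def cca(X, Y, k=1):
    """CCA sketch: SVD of the whitened cross-covariance matrix."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = len(X)
    Cxx = Xc.T @ Xc / n + 1e-8 * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + 1e-8 * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    Lx, Ly = np.linalg.cholesky(Cxx), np.linalg.cholesky(Cyy)
    M = np.linalg.solve(Lx, np.linalg.solve(Ly, Cxy.T).T)  # whitened cross-covariance
    U, s, Vt = np.linalg.svd(M)
    A = np.linalg.solve(Lx.T, U[:, :k])      # canonical directions for X
    B = np.linalg.solve(Ly.T, Vt.T[:, :k])   # canonical directions for Y
    return A, B, s[:k]                        # s holds the canonical correlations

rng = np.random.default_rng(4)
z = rng.normal(size=(300, 1))                # shared latent signal
X = np.c_[z + 0.1 * rng.normal(size=(300, 1)), rng.normal(size=(300, 1))]
Y = np.c_[rng.normal(size=(300, 1)), -z + 0.1 * rng.normal(size=(300, 1))]
A, B, s = cca(X, Y, k=1)
```

The leading canonical correlation is close to one here because both views share the latent signal z.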

Oriented PCA was introduced by Kung and Diamantaras [

where

By definition, principal curves are smooth curves that pass through the middle of multidimensional data sets; see [

Given the

we can associate to the curve

corresponding to the point on the curve

We say

It is shown in [

Theorem 1 A curve

Of course, solving (or even estimating) the minimization (0.7) is a complex problem; to estimate

[

• For a fixed

•

• Fix

One can find in [

Given Input variables

where the weights

1) Input Layer 2) Hidden Layer (bottleneck)

3) Output Layer. There are two steps here:

• In order to obtain the data at node

• Here we have to repeat step (1), changing the original data

We observe that the first part of the network reduces the input data into the lower-dimensional space just as linear PCA does, while the second part decodes the reduced data back into the original domain [
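This observation can be checked with a minimal linear auto-associative network; the sketch below (layer sizes, learning rate and iteration count are illustrative assumptions) trains a three-layer linear network with a two-unit bottleneck by plain gradient descent on the squared reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))   # correlated toy data
X = X - X.mean(axis=0)

d, k = X.shape[1], 2
W1 = 0.1 * rng.normal(size=(d, k))   # encoder: input layer -> bottleneck
W2 = 0.1 * rng.normal(size=(k, d))   # decoder: bottleneck -> output layer
lr = 1e-5                            # small step size, chosen for illustration

losses = []
for _ in range(500):
    Z = X @ W1                        # encode (the "first part" of the network)
    Xhat = Z @ W2                     # decode back to the original domain
    E = Xhat - X
    losses.append(float((E ** 2).mean()))
    W2 -= lr * (Z.T @ E)              # gradient steps on the reconstruction error
    W1 -= lr * (X.T @ (E @ W2.T))
```

With linear activations, the optimal bottleneck representation spans the same subspace as the top-k principal components, which is the sense in which the encoder acts like linear PCA.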

1) Input Layer 2) Hidden Layer 3) Bottleneck 4) Hidden Layer 5) Output Layer If

As we have seen in the last section the function

The Nystroem method is a well-known technique for finding numerical approximations to generic integral equations, and especially to eigenfunction problems of the following form:

We can divide the interval

Now consider the simple quadrature rule:

which

Without loss of generality we can shift the interval

where

We can extend above arguments for

Motivated by (0.9), our main question is whether

The Nystroem method gives a positive answer to this question. Actually, we can assume that the

Hence

where

The last equation is called the “bottleneck” form. There is a very interesting application of this form in spectral grouping, where it was possible to construct the exact eigen-decomposition of
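A minimal sketch of the Nystroem idea for a kernel eigenproblem follows; the RBF kernel, the number of landmark points and the retained rank are all illustrative assumptions. The eigenvectors of the landmark block are extended to all points and yield a low-rank approximation of the full kernel matrix.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 10.0)           # full RBF kernel matrix, kept only for comparison

m, r = 60, 10                    # number of landmarks and retained eigenpairs
A = K[:m, :m]                    # landmark-landmark block
B = K[:m, m:]                    # landmark-rest block
vals, U = np.linalg.eigh(A)
idx = np.argsort(vals)[::-1][:r]
vals, U = vals[idx], U[:, idx]

# Nystroem extension of the landmark eigenvectors to all data points
U_ext = np.vstack([U, B.T @ U / vals])
K_approx = (U_ext * vals) @ U_ext.T
err = np.linalg.norm(K - K_approx) / np.linalg.norm(K)
```

The product `(U_ext * vals) @ U_ext.T` corresponds to the low-rank “bottleneck” form mentioned above: the full n x n matrix is never eigendecomposed, only the small m x m landmark block.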

Given

• Define an objective stress function and stress factor

• Now if for a given

• Determine the optimal data set

If we use Euclidean distance and take
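The stress-based MDS procedure outlined above can be sketched by direct gradient descent on the raw stress; the learning rate, iteration count and the five-point toy configuration are illustrative assumptions.

```python
import numpy as np

def mds_stress(D, k=2, iters=500, lr=0.01, seed=0):
    """Gradient descent on the raw stress  sum_{i<j} (||y_i - y_j|| - D_ij)^2."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    Y = rng.normal(scale=0.1, size=(n, k))
    history = []
    for _ in range(iters):
        diff = Y[:, None, :] - Y[None, :, :]
        dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)
        np.fill_diagonal(dist, 1.0)                     # avoid dividing by zero
        history.append(float(((dist - D) ** 2)[np.triu_indices(n, 1)].sum()))
        ratio = (dist - D) / dist
        np.fill_diagonal(ratio, 0.0)
        Y -= lr * (ratio[:, :, None] * diff).sum(axis=1)   # stress gradient step
    return Y, history

# five points in the plane, recovered from their distance matrix alone
P = np.array([[0, 0], [1, 0], [1, 1], [0, 1], [0.5, 0.5]], dtype=float)
D = np.sqrt(((P[:, None, :] - P[None, :, :]) ** 2).sum(-1))
Y, history = mds_stress(D)
```

The stress decreases over the iterations, and the recovered configuration matches the original points up to rotation, reflection and translation.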

Locally linear embedding is an approach which addresses the problem of nonlinear dimensionality reduction by computing low-dimensional, neighborhood-preserving embeddings of high-dimensional data. A data set of dimensionality

• Identify neighbors of each data point

• Compute the weights

• Find the low-dimensional embedding vector

After finding the nearest neighbors in the first step, the second step must compute a local geometry for each locally linear sub-manifold. This geometry is characterized by linear coefficients that reconstruct each data point from its neighbors.

where

This objective can be restated as

where

The solution for
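The three LLE steps can be sketched as follows; the neighborhood size, the regularization added to the local Gram matrix and the swiss-roll-like toy data are illustrative assumptions.

```python
import numpy as np

def lle(X, n_neighbors, k=2):
    """Basic LLE sketch: reconstruction weights, then the embedding eigenproblem."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:n_neighbors + 1]      # skip the point itself
        Z = X[nbrs] - X[i]                               # local coordinates
        G = Z @ Z.T                                      # local Gram matrix
        G += 1e-3 * np.trace(G) * np.eye(len(nbrs))      # regularization
        w = np.linalg.solve(G, np.ones(len(nbrs)))
        W[i, nbrs] = w / w.sum()                         # weights sum to one
    # embedding: bottom eigenvectors of (I - W)^T (I - W)
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 1:k + 1]                              # skip the constant eigenvector

rng = np.random.default_rng(7)
t = np.sort(rng.uniform(0, 3 * np.pi, 120))
X = np.c_[t * np.cos(t), rng.uniform(0, 5, 120), t * np.sin(t)]  # swiss-roll-like
Y = lle(X, n_neighbors=10, k=2)
```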

As before given a data set

we say

•

•

To assign weights to the edges as well, there are two variations:

• Heat kernel, which

•

• Simple adjacency with parameter

We assume our graph, defined as above, is connected; otherwise, apply the following procedure to each connected component. Set

Compute eigenvalues and eigenvectors for the generalized eigenvector problem:

let

We leave out the eigenvector (trivial eigenfunction) corresponding to eigenvalue 0, which is a vector with all components equal to

which
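Putting the steps above together, a minimal Laplacian-eigenmaps sketch follows; the heat-kernel width, the neighborhood size and the noisy-circle toy data are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(8)
theta = np.linspace(0, 2 * np.pi, 80, endpoint=False)
X = np.c_[np.cos(theta), np.sin(theta), 0.05 * rng.normal(size=80)]  # noisy circle

n, n_neighbors, t = len(X), 6, 1.0
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.zeros((n, n))
for i in range(n):
    for j in np.argsort(d2[i])[1:n_neighbors + 1]:
        w = np.exp(-d2[i, j] / t)                # heat-kernel weight
        W[i, j] = W[j, i] = w                    # keep the graph symmetric
D = np.diag(W.sum(axis=1))
L = D - W                                        # graph Laplacian

vals, vecs = eigh(L, D)                          # generalized problem L f = lam D f
Y = vecs[:, 1:3]                                 # drop the trivial constant eigenvector
```

The smallest eigenvalue is (numerically) zero for a connected graph, and the embedding is read off from the next eigenvectors, exactly as described above.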

Like LLE the Isomap algorithm proceeds in three steps:

• Find the neighbors of each data point in high-dimensional data space.

• Compute the geodesic pairwise distances between all points.

• Embed the data via MDS so as to preserve those distances.

Again like LLE, the first step can be performed by identifying the

The geodesic distances
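The three Isomap steps can be sketched as follows; the neighborhood size and the half-circle toy data are illustrative assumptions, and the shortest paths are computed with SciPy rather than with a hand-written Dijkstra or Floyd routine.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

t = np.linspace(0, np.pi, 60)
X = np.c_[np.cos(t), np.sin(t)]                  # points along a half circle

n, n_neighbors = len(X), 5
d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
G = np.zeros((n, n))                             # zero entries mean "no edge"
for i in range(n):
    for j in np.argsort(d[i])[1:n_neighbors + 1]:
        G[i, j] = G[j, i] = d[i, j]              # kNN graph with Euclidean weights

D = shortest_path(G, directed=False)             # geodesic (graph) distances

# classical MDS on the geodesic distance matrix
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
vals, vecs = np.linalg.eigh(B)
idx = np.argsort(vals)[::-1][:2]
Y = vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))
```

Because the graph is connected, all geodesic distances are finite and the MDS step can preserve them.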

High-dimensional data sets arise in many real-world applications. These data points may lie approximately on a low-dimensional manifold embedded in a high-dimensional space. Dimensionality reduction (or, as in this case, manifold learning) aims to recover a set of low-dimensional parametric representations for the high-dimensional data points, which may be used for further processing of the data. More precisely, consider a d-dimensional parametrized manifold

for some

Of course, this problem is not well defined for a general nonlinear map

Given that the map

provides a (locally) isometric coordinate system for

where

The main references for vector quantization are [

These algorithms introduced in [

• Random Versus Deterministic Operation

• Population Versus Single Best Solution

• Creating New Solutions Through Mutation

• Combining Solutions Through Crossover

• Selecting Solutions Via “Survival of the Fittest”

• Drawbacks of Evolutionary Algorithms

In [
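As a concrete illustration of the ingredients listed above (a population of candidate solutions, mutation, crossover and fitness-based selection), the following sketch applies a tiny genetic algorithm to a toy feature-selection objective; the objective function and all parameter values are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(9)

# Toy objective (an assumption for illustration): reward keeping "signal"
# features while penalizing the size of the selected subset.
signal = np.array([1, 1, 0, 0, 1, 0, 0, 0], dtype=float)

def fitness(mask):
    return float((mask * signal).sum() - 0.2 * mask.sum())

pop = rng.integers(0, 2, size=(20, 8))            # random initial population
for _ in range(30):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[::-1][:10]]  # survival of the fittest
    children = []
    for _ in range(10):
        a, b = parents[rng.integers(10, size=2)]
        cut = rng.integers(1, 8)
        child = np.r_[a[:cut], b[cut:]]           # one-point crossover
        flip = rng.random(8) < 0.05               # mutation
        child[flip] = 1 - child[flip]
        children.append(child)
    pop = np.vstack([parents, children])          # elitism keeps the best parents

best = pop[np.argmax([fitness(m) for m in pop])]  # best feature subset found
```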

We can use regression methods for dimension reduction when we are looking for a variable function

• The Wrapper method in machine learning community [

• Projection pursuit regression [

• Generalized linear models [

• Adaptive models [

• Neural network models and sliced regression and Principal hessian direction [

• Dimension reduction for conditional mean in regression [

• Principal manifolds and non-linear dimension reduction [

• Sliced regression for dimension reduction [

• Canonical correlation [

Our research has received funding from the (European Union) Seventh Framework Programme ([FP7/2007- 2013]) under grant agreement n [

The main reference for this appendix is [

To begin at the very beginning: How can we best define the dimension of a closed bounded set

• When

• For more general sets

• Points, and countable unions of points, have zero dimension.

Local (or topological) Methods (2): The earliest attempt to define the dimension:

Definition 1 We can define the Topological dimension

Local (or topological) Methods (3):

Definition 2 Given

Example 1 For

Local (or topological) Methods (4): The Hausdorff dimension

Definition 3 Consider a cover

where the infimum is taken over all open covers

• Fact 1: For any countable set

• Fact 2:

Local (or topological) Methods (4) as shown in

Local (or topological) Methods (5):

Example 2 (von Koch curve: [

For

Example 3 (

For the middle-third Cantor set, both the box dimension and the Hausdorff dimension are
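The box dimension can be estimated numerically by counting occupied boxes at a sequence of scales and reading off the slope of log N(eps) against log(1/eps); the sketch below (the recursion depth and the range of scales are illustrative choices) recovers a value near log 2 / log 3 ≈ 0.631 for the middle-third Cantor set.

```python
import numpy as np

def cantor_points(level):
    """Endpoints of the intervals of the middle-third Cantor construction."""
    pts = np.array([0.0, 1.0])
    for _ in range(level):
        pts = np.r_[pts / 3, pts / 3 + 2 / 3]
    return pts

def box_dimension(points, eps):
    """Slope of log N(eps) versus log(1/eps), with boxes of size eps."""
    counts = [len(np.unique(np.floor(points / e))) for e in eps]
    slope, _ = np.polyfit(np.log(1 / eps), np.log(counts), 1)
    return slope

pts = cantor_points(10)
eps = 3.0 ** -np.arange(2, 9)       # box sizes matched to the ternary construction
dim = box_dimension(pts, eps)       # lands near log 2 / log 3
```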

The set