
In dealing with high-dimensional data, such as global climate models, facial data analysis, and human gene distributions, the problem of dimensionality reduction is often encountered: finding the low-dimensional structure hidden in high-dimensional data. Nonlinear dimensionality reduction facilitates the discovery of the intrinsic structure and relevance of the data and can make high-dimensional data visible in low dimensions. The isometric mapping algorithm (Isomap) is an important algorithm for nonlinear dimensionality reduction, which originates from the traditional dimensionality reduction algorithm MDS. The MDS algorithm is based on keeping the distances between samples in the low-dimensional space equal to the distances between samples in the original space; the distance used there is the Euclidean distance. The Isomap algorithm discards the Euclidean distance and instead approximates the geodesic distance along the manifold surface by computing the shortest paths between samples with the Floyd algorithm. Compared with earlier nonlinear dimensionality reduction algorithms, the Isomap algorithm can efficiently compute a globally optimal solution, and it can ensure that the recovered data manifold converges asymptotically to the real structure.

In the process of analyzing high-dimensional data, one faces the problem of the “curse of dimensionality” [

Traditional dimensionality reduction techniques are divided into two types: linear methods and nonlinear methods. Nonlinear methods are further divided into those preserving local features and those preserving global features. Local-feature-preserving methods are based on reconstruction weights, adjacency graphs, or tangent spaces. Global-feature-preserving methods are based on distance preservation, kernels, or neural networks. Distance-preserving methods are in turn divided into multidimensional scaling (MDS) and isometric mapping (Isomap), the latter based on geodesic distance.

The isometric mapping algorithm is a classical algorithm in manifold learning. The goal of manifold learning is to find low-dimensional structures embedded in high-dimensional data spaces and to give an efficient low-dimensional representation. Because manifold learning algorithms can exploit the local geometry of a dataset to reveal its intrinsic manifold structure, they can achieve efficient dimensionality reduction. In addition to the Isomap algorithm, well-known manifold learning algorithms include locally linear embedding, Laplacian eigenmaps, and locality preserving projection. These algorithms keep the topology of the original data unchanged and can better address the “curse of dimensionality” in data processing. This paper introduces the Isomap algorithm and compares it with the MDS algorithm to contrast the dimensionality reduction effects of the two. The content of this paper is as follows: the second chapter presents the principles of the Isomap algorithm and the MDS algorithm; the third chapter gives the experimental comparison and verification; the fourth chapter summarizes this article and its outlook for the future.

The basic principle of a dimensionality reduction algorithm is to analyze and process high-dimensional data in order to find meaningful low-dimensional structures hidden within it. The main idea of the Isomap algorithm is to approximate the geodesic distance between data points by local neighborhood distances, and to complete the dimensionality reduction by establishing an equivalence between the geodesic distances of the original data and the distances between the data points after reduction. The Isomap algorithm is derived from the linear dimensionality reduction algorithm MDS and inherits its main features, namely computational efficiency, global optimality, and asymptotic convergence. At the same time, it can more flexibly learn the nonlinear structure of the data [

Since the Isomap algorithm is an improvement on the multidimensional scaling (MDS) algorithm, it is necessary to introduce the MDS algorithm before introducing Isomap; understanding the MDS algorithm also makes the Isomap algorithm easier to understand.

The MDS algorithm is a traditional, distance-based dimensionality reduction method. Its goal is to keep the distances between sample points in the low-dimensional space after dimensionality reduction equal to the distances between the corresponding sample points in the original space [

$D = \begin{bmatrix} dist_{11} & \cdots & dist_{1m} \\ \vdots & \ddots & \vdots \\ dist_{m1} & \cdots & dist_{mm} \end{bmatrix}$ (1)
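As a minimal sketch (using NumPy; the function and variable names are my own, not from the paper), the distance matrix $D$ of Equation (1) can be computed from a sample matrix with one sample per row:

```python
import numpy as np

def euclidean_distance_matrix(X):
    """Pairwise Euclidean distance matrix; X holds one sample per row (m x n)."""
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 x_i^T x_j
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.sqrt(np.maximum(D2, 0.0))  # clip tiny negatives from round-off

# Usage sketch: three points in the plane
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D = euclidean_distance_matrix(X)  # D[0, 1] == 5.0, D[0, 2] == 10.0
```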

The goal of the MDS algorithm is to obtain the representation $Z = [z_1, z_2, \cdots, z_m]$ of the $m$ samples in the low-dimensional space, where $z_i$ is the sample point corresponding to the $i$-th original sample point after projection. MDS requires that the Euclidean distances between sample points in the original space be maintained in the low-dimensional space, so the Euclidean distance between any two samples in $Z$ is equal to their distance in the original space, that is

$\| z_i - z_j \| = dist_{ij}$ (2)

Let the inner product matrix of the dimension-reduced samples be denoted by $B$, with $B = Z^T Z$ and elements $b_{ij} = z_i^T z_j$; then matrix $B$ can be expressed as

$B = \begin{bmatrix} z_1^T z_1 & \cdots & z_1^T z_m \\ \vdots & \ddots & \vdots \\ z_m^T z_1 & \cdots & z_m^T z_m \end{bmatrix} = \begin{bmatrix} b_{11} & \cdots & b_{1m} \\ \vdots & \ddots & \vdots \\ b_{m1} & \cdots & b_{mm} \end{bmatrix}$ (3)

Squaring both sides of Equation (2) gives

$dist_{ij}^2 = \| z_i \|^2 + \| z_j \|^2 - 2 z_i^T z_j = b_{ii} + b_{jj} - 2 b_{ij}$ (4)

Let the dimension-reduced samples $Z$ be centered, that is, $\sum_{i=1}^m z_i = 0$. Then every row sum and column sum of matrix $B$ is zero, i.e., $\sum_{i=1}^m b_{ij} = 0$ and $\sum_{j=1}^m b_{ij} = 0$. Summing both sides of Equation (4) over $i$, over $j$, and over both indices, and then simplifying and merging like terms, we obtain

$\sum_{i=1}^m dist_{ij}^2 = \sum_{i=1}^m b_{ii} + \sum_{i=1}^m b_{jj} - 2\sum_{i=1}^m b_{ij} = \mathrm{tr}(B) + m b_{jj} - 0 = \mathrm{tr}(B) + m b_{jj}$ (5)

$\sum_{j=1}^m dist_{ij}^2 = \sum_{j=1}^m b_{ii} + \sum_{j=1}^m b_{jj} - 2\sum_{j=1}^m b_{ij} = m b_{ii} + \mathrm{tr}(B) - 0 = m b_{ii} + \mathrm{tr}(B)$ (6)

$\sum_{i=1}^m \sum_{j=1}^m dist_{ij}^2 = \sum_{i=1}^m \left( m b_{ii} + \mathrm{tr}(B) \right) = m \sum_{i=1}^m b_{ii} + m\,\mathrm{tr}(B) = m\,\mathrm{tr}(B) + m\,\mathrm{tr}(B) = 2m\,\mathrm{tr}(B)$ (7)

Rearranging Equations (5), (6), and (7) gives

$b_{ii} = \frac{1}{m}\left( \sum_{j=1}^m dist_{ij}^2 - \mathrm{tr}(B) \right)$ (8)

$b_{jj} = \frac{1}{m}\left( \sum_{i=1}^m dist_{ij}^2 - \mathrm{tr}(B) \right)$ (9)

$2\,\mathrm{tr}(B) = \frac{1}{m} \sum_{i=1}^m \sum_{j=1}^m dist_{ij}^2$ (10)

Then, since expressions written with summation symbols are long and cumbersome, the shorthand notations $dist_{i.}^2$, $dist_{j.}^2$, and $dist_{..}^2$ are introduced for the following averages

$dist_{i.}^2 = \frac{1}{m} \sum_{j=1}^m dist_{ij}^2$ (11)

$dist_{j.}^2 = \frac{1}{m} \sum_{i=1}^m dist_{ij}^2$ (12)

$dist_{..}^2 = \frac{1}{m^2} \sum_{i=1}^m \sum_{j=1}^m dist_{ij}^2$ (13)

Rearranging Equation (4) to isolate $b_{ij}$, substituting Equations (8) and (9), and then applying the shorthand of Equations (11)-(13), the following formula is obtained

$b_{ij} = -\frac{1}{2}\left( dist_{ij}^2 - dist_{i.}^2 - dist_{j.}^2 + dist_{..}^2 \right)$ (14)
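For clarity, the substitution chain behind Equation (14) can be written out explicitly. From Equation (4),

$b_{ij} = -\frac{1}{2}\left( dist_{ij}^2 - b_{ii} - b_{jj} \right)$

From Equations (8), (9), (11), and (12),

$b_{ii} = dist_{i.}^2 - \frac{1}{m}\mathrm{tr}(B), \qquad b_{jj} = dist_{j.}^2 - \frac{1}{m}\mathrm{tr}(B)$

and from Equations (10) and (13),

$\frac{2}{m}\mathrm{tr}(B) = \frac{1}{m^2} \sum_{i=1}^m \sum_{j=1}^m dist_{ij}^2 = dist_{..}^2$

Substituting the last two lines into the first yields $b_{ij} = -\frac{1}{2}\left( dist_{ij}^2 - dist_{i.}^2 - dist_{j.}^2 + dist_{..}^2 \right)$, which is exactly Equation (14).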

In this way, each element of the inner product matrix $B$ can be calculated from the Euclidean distance matrix $D$, so the whole inner product matrix $B$ is obtained. Since $B = Z^T Z$, performing an eigenvalue decomposition on matrix $B$, so that $BV = V\lambda$, gives

$B = Z^T Z = V \lambda V^T$ (15)

where $\lambda = \begin{bmatrix} \lambda_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \lambda_m \end{bmatrix}$ ($\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m$) and $V$ is the matrix whose columns are the eigenvectors corresponding to these eigenvalues.

For the dimensionality reduction to be effective, the distances after reduction need only be as close as possible to the distances in the original space, rather than strictly equal. Assuming the target dimension is $d$, the $d$ largest eigenvalues (taken in descending order) form a diagonal matrix $\lambda_d$. The final output of the MDS algorithm, that is, the low-dimensional representation of the samples in the original space, is $Z = \lambda_d^{1/2} V_d^T$, where $V_d$ is the matrix formed by the eigenvectors corresponding to the $d$ largest eigenvalues [
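Putting Equations (14) and (15) together, the MDS procedure described above can be sketched in a few lines of NumPy (a simplified illustration; the function and variable names are my own, and samples are returned as rows rather than columns):

```python
import numpy as np

def mds(D, d):
    """Classical MDS: distance matrix D (m x m) -> d-dimensional coordinates (m x d)."""
    D2 = D**2
    # Double centering, Equation (14): b_ij = -1/2 (dist_ij^2 - dist_i.^2 - dist_j.^2 + dist_..^2)
    row = D2.mean(axis=1, keepdims=True)   # dist_i.^2
    col = D2.mean(axis=0, keepdims=True)   # dist_j.^2
    B = -0.5 * (D2 - row - col + D2.mean())
    # Eigendecomposition, Equation (15); keep the d largest eigenvalues
    w, V = np.linalg.eigh(B)               # eigh returns ascending order
    idx = np.argsort(w)[::-1][:d]
    # Z = lambda_d^{1/2} V_d^T, here laid out with one sample per row
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Usage sketch: three collinear points at positions 0, 1, 3 are recovered in 1-D
D = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]])
Z = mds(D, 1)  # pairwise |z_i - z_j| reproduces D up to sign and shift
```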

The input to the MDS algorithm is the Euclidean distance matrix D, but the Euclidean distance is not applicable on a manifold. For example, consider the Earth's surface as a two-dimensional manifold in three-dimensional space, and suppose the distance between the North Pole and the South Pole is computed in three-dimensional space as the length of the straight line connecting the two points. This calculation is wrong: since it is impossible to drill a hole from the North Pole through to the South Pole, one must travel along the surface of the Earth. Of course, not just any path will do, because different paths have different lengths. Therefore, a new distance measure defined on the Earth's surface (the manifold) is needed. To correspond with the Euclidean notion of straight-line distance, the concept of the line segment is generalized: in Euclidean space, "the shortest path between two points is the line segment"; on a manifold, "the shortest curve between two points" plays the role of the line segment, and this shortest curve is usually called a "geodesic" [

The specific process of converting the Euclidean distance matrix into the geodesic distance matrix is as follows. First, given the Euclidean distance matrix (the distances between all sample points computed with the Euclidean distance) and a chosen neighborhood size k, for each sample point the Euclidean distances to its k nearest sample points are kept, and the distances to all remaining (farther) sample points are set to infinity. Second, this matrix is updated to the shortest-path matrix by the Floyd algorithm. The geodesic distance between sample points can be approximated by the shortest path, thus converting the Euclidean distance matrix into the geodesic distance matrix. Finally, the geodesic distance matrix is fed into the MDS algorithm to obtain the final dimensionality reduction result.

The Isomap algorithm flow is summarized as follows: 1) Calculate the Euclidean distance between each pair of sample points to obtain the Euclidean distance matrix; 2) Set the neighborhood size k; in the Euclidean distance matrix, set all distances other than those to the k nearest neighbors to infinity; 3) Update the above matrix to the shortest-path matrix by the Floyd algorithm; 4) Input the shortest-path matrix into the MDS algorithm; the output is the dimensionality reduction result of the Isomap algorithm.
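The four steps above can be sketched end-to-end in NumPy (a simplified illustration, not the authors' implementation; the classical-MDS step follows Equations (14)-(15), and all names are my own):

```python
import numpy as np

def isomap(X, k, d):
    """Isomap sketch: samples X (m x n), neighborhood size k, target dimension d."""
    m = X.shape[0]
    # 1) Euclidean distance matrix
    sq = np.sum(X**2, axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0))
    # 2) Keep only each point's k nearest neighbors; set the rest to infinity
    G = np.full((m, m), np.inf)
    for i in range(m):
        nn = np.argsort(D[i])[:k + 1]      # the point itself plus k neighbors
        G[i, nn] = D[i, nn]
    G = np.minimum(G, G.T)                  # symmetrize the neighborhood graph
    # 3) Floyd algorithm: shortest paths approximate geodesic distances
    for t in range(m):
        G = np.minimum(G, G[:, t][:, None] + G[t, :][None, :])
    # 4) Classical MDS on the geodesic distance matrix (Equations (14)-(15))
    D2 = G**2
    B = -0.5 * (D2 - D2.mean(axis=1, keepdims=True)
                - D2.mean(axis=0, keepdims=True) + D2.mean())
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:d]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Usage sketch: points on a quarter circle unroll to (approximately) a line
theta = np.linspace(0.0, np.pi / 2, 20)
X = np.c_[np.cos(theta), np.sin(theta)]
Z = isomap(X, k=2, d=1)  # span of Z is close to the arc length pi/2
```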

1) It is capable of processing high-dimensional data lying on nonlinear manifolds;

2) It performs global optimization;

3) Even if the input space is highly folded, distorted, or curved, Isomap can still find a globally optimal low-dimensional Euclidean representation;

4) Isomap can guarantee asymptotic recovery of the true structure.

1) It may be unstable and dependent on the topology of the data;

2) The guarantee of asymptotically recovering the geometry of the nonlinear manifold depends on the sample size N: as N increases, the sample points provide shortest-path distances closer to the true geodesic distances, but more computation time is required; if N is small, the geodesic distance estimates will be very imprecise.

Both the Isomap algorithm and the MDS algorithm are dimensionality reduction algorithms. The MDS algorithm is a linear dimensionality reduction algorithm suitable for Euclidean space, while the Isomap algorithm is a nonlinear dimensionality reduction algorithm suitable for data on manifolds in high-dimensional space.

The Isomap algorithm achieves its goal by modifying the MDS algorithm, which was originally designed for Euclidean space. The purpose of the MDS algorithm is to keep the distances between sample points in the low-dimensional space equal to the distances between sample points in the original space. MDS is designed for Euclidean space with the Euclidean distance, but if the data are distributed on a manifold, the Euclidean distance is not applicable and only the geodesic distance can be used. Therefore, the Isomap algorithm replaces the input of the MDS algorithm (the Euclidean distance matrix) with the geodesic distance matrix obtained by the shortest-path algorithm, thus solving the problem that the Euclidean distance is not applicable on a manifold; this is the biggest difference between the Isomap algorithm and the MDS algorithm.

In order to more intuitively compare the Isomap algorithm with the MDS algorithm, an S-shaped surface as shown in

The above Isomap algorithm sets the neighborhood number k to 15 to obtain the dimensionality reduction result of

From the above experimental results, it can be seen that in the process of dimensionality reduction using the Isomap algorithm, the selection of the neighborhood number k plays a key role. If the value of k is too small, the neighborhood graph will not be connected; if the value of k is too large, the Isomap algorithm will tend toward the MDS algorithm. Therefore, the choice of the neighborhood number k is crucial. For the problem of selecting k, an adaptive method was later proposed [
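The "graph not connected" failure mode for small k can be checked directly: after the Floyd step, any remaining infinite entry means some pair of points has no path between them. A small sketch (the data here, two well-separated pairs on a line, is a hypothetical example of my own):

```python
import numpy as np

def knn_graph_connected(X, k):
    """Return True if the symmetrized k-nearest-neighbor graph of X is connected."""
    m = X.shape[0]
    sq = np.sum(X**2, axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0))
    G = np.full((m, m), np.inf)
    for i in range(m):
        nn = np.argsort(D[i])[:k + 1]   # the point itself plus its k nearest neighbors
        G[i, nn] = D[i, nn]
    G = np.minimum(G, G.T)
    for t in range(m):                   # Floyd shortest paths
        G = np.minimum(G, G[:, t][:, None] + G[t, :][None, :])
    return bool(np.isfinite(G).all())    # any remaining inf => disconnected

X = np.array([[0.0], [1.0], [10.0], [11.0]])
print(knn_graph_connected(X, 1))   # False: two isolated pairs
print(knn_graph_connected(X, 2))   # True: the larger neighborhood bridges the gap
```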

The curse of dimensionality in high-dimensional data has drawn wide attention to dimensionality reduction. Traditional dimensionality reduction algorithms suited to Euclidean space (such as the MDS algorithm) are no longer adequate for high-dimensional manifold data. Manifold learning is a newer dimensionality reduction approach; its main goal is to effectively discover the low-dimensional manifold structure inherent in high-dimensional datasets and to give an effective low-dimensional representation. This paper mainly introduces a dimensionality reduction algorithm for manifolds, the Isomap algorithm, which starts from the perspective of maintaining the global structure. In addition, this paper compares it with the MDS algorithm through experiments. The experimental results show that, apart from the difficulty of selecting the neighborhood number k, the Isomap algorithm maintains the topology of high-dimensional data better than the MDS algorithm when reducing the dimensionality of a manifold; that is, its dimensionality reduction effect is better.

The authors declare no conflicts of interest regarding the publication of this paper.

Yang, H. and Li, H.M. (2019) Implementation of Manifold Learning Algorithm Isometric Mapping. Journal of Computer and Communications, 7, 11-19. https://doi.org/10.4236/jcc.2019.712002