A novel permutation-dependent Baire distance is introduced for multi-channel data. The optimal permutation is the one minimizing the sum of these pairwise distances. It is shown that, in most practical cases, the minimum is attained by a new gradient descent algorithm introduced in this article. The algorithm is of biquadratic time complexity: quadratic both in the number of channels and in the size of the data. The optimal permutation allows us to introduce a novel Baire-distance kernel Support Vector Machine (SVM). Applied to benchmark hyperspectral remote sensing data, this new SVM produces results comparable with the classical linear SVM, but with higher kernel target alignment.

The Baire distance was introduced into classification in order to produce clusters by grouping data into “bins” by [

In this paper we introduce a permutation-dependent Baire distance for data with

find the asymptotic minimum for

The Support Vector Machine (SVM) is a well-known technique for kernel-based classification. In kernel-based classification, the similarity between input data is modelled by kernel functions. These functions are employed to produce kernel matrices, which can be seen as similarity matrices of the input data in reproducing kernel Hilbert spaces. Via optimization of a Lagrangian minimization problem, a subset of input points is found which is used to produce a separating hyperplane between the data of the various classes. The final decision function depends only on the position of these data in the feature space and does not require estimation of first- or second-order statistics on the data. The user has considerable freedom in how to construct the kernel functions, which offers the option of producing kernel functions tailored to the data.

As an application of our theoretical result, we introduce the new class of Baire-distance kernels, which are functions of our parametrized Baire distance. For the asymptotically optimal permutation, the resulting Baire-distance SVM yields results comparable with the classical linear SVM on the AVIRIS Indian Pines dataset, a well-known hyperspectral remote sensing dataset. Furthermore, the kernel target alignment [

After a short review on the ultrametric parametrized Baire distance, it is shown how to find for

Let

where

word is defined as the number of letters from

Definition 2.1. The expression

is the

Later on, we will study the limiting case
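As a minimal illustration of the definition above, the Baire distance of two equal-length words can be sketched as follows. The exact base parameter of the paper's definition is elided in this extraction; the sketch assumes a hypothetical base `a > 1` so that the distance is `a**(-n)` for words agreeing on their first `n` letters:

```python
# Hedged sketch of the Baire distance for two equal-length words.
# The base parameter a > 1 is an assumption; the paper's symbol is elided.

def common_prefix_length(x, y):
    """Number of initial positions in which the words x and y agree."""
    n = 0
    for cx, cy in zip(x, y):
        if cx != cy:
            break
        n += 1
    return n

def baire_distance(x, y, a=2.0):
    """Baire distance a**(-n), with n the longest-common-prefix length.
    Words agreeing everywhere receive distance 0 (the limiting value)."""
    n = common_prefix_length(x, y)
    if n == len(x) == len(y):
        return 0.0
    return a ** (-n)

d = baire_distance("1101", "1100")  # agree on the first 3 letters -> 2**-3
```

Note that the sketch also exhibits the ultrametric (strict triangle) inequality: the distance of two words is bounded by the maximum of their distances to any third word.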

Remark 2.2. The metrics

The Baire distance is important for classification, because it is an ultrametric. In particular, the strict triangle inequality

holds true. This is shown to lead to efficient hierarchical classification with good classification results [

Data representation is often related to some choice of alphabet. For instance, the distinction “Low” and “High” leads to

Example 2.3. The simplest example of

The role of the parameter

Observe that

Given data

i.e. a word with letters from the alphabet

In order to determine a suitable permutation for the data, consider the average Baire distance. A high average Baire distance will arise if there is a large number of singletons and branching occurs high up in the hierarchy. On the other hand, if there are many common initial features, then the average Baire distance will be low. In that case, clusters tend to have a high density, and there are few singletons. From these considerations, it follows that the task is to find a permutation

is minimal, leading to the optimal Baire distance
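For very small channel counts, the minimizing permutation can be found by exhaustive search, which also serves as a ground truth when checking faster methods. The sketch below is illustrative: the data, the base `a = 2`, and the brute-force strategy are assumptions, not the paper's algorithm:

```python
# Illustrative brute force: for a small number of channels p, minimize the
# sum of pairwise Baire distances over all p! channel permutations.
# The base a = 2 and the toy data are hypothetical.
from itertools import permutations

def common_prefix_length(x, y):
    n = 0
    for cx, cy in zip(x, y):
        if cx != cy:
            break
        n += 1
    return n

def baire_distance(x, y, a=2.0):
    n = common_prefix_length(x, y)
    return 0.0 if n == len(x) else a ** (-n)

def sum_pairwise(data, perm, a=2.0):
    """Sum of pairwise Baire distances after reordering channels by perm."""
    permuted = [tuple(row[i] for i in perm) for row in data]
    return sum(baire_distance(permuted[i], permuted[j], a)
               for i in range(len(permuted))
               for j in range(i + 1, len(permuted)))

def optimal_permutation(data, a=2.0):
    p = len(data[0])
    return min(permutations(range(p)),
               key=lambda perm: sum_pairwise(data, perm, a))

# Channel 1 is constant over the data, so placing it first guarantees a
# common initial letter for every pair and lowers the average distance.
data = [(0, 1, 0), (0, 1, 1), (1, 1, 0)]
best = optimal_permutation(data)
```

As expected from the discussion above, the search puts the constant channel first, since shared initial features lower the average Baire distance.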

Let

where

where

Some first properties of

1.

2.

3.

4.

5.

These properties follow from Equation (2) above, and they imply some first properties of

An important observation is that

The following two examples list all values of

in the case

Example 2.4.

Example 2.5.

Let

The function

from Equation (1) is to be minimised, where

To

The counts

Observe that all edge weights are non-negative:

because

An injective path

where

Definition 2.6. A permutation

where

Lemma 2.7. If

where the path

Proof. Let

Assume that

from which the assertion follows for

The following is an immediate consequence:

Corollary 2.8. Let

The minimising

Corollary 2.9. Dijkstra’s shortest path algorithm on

The main problem with applying Corollary 2.9 is the size of

Algorithm 2.10. (Gradient descent) Input.

Step 0. Set

Step 1. Collect in

Step

Output. The subgraph of

This algorithm clearly terminates after

Lemma 2.11. Let

Proof. We may assume that there exists some

as otherwise

as otherwise

is a polynomial with real coefficients such that

An immediate consequence of the lemma is that gradient descent is asymptotically the method of choice:

Theorem 2.12. There exists a constant

Proof. Let

The competitiveness of the gradient descent method is manifest in the following remarks:

Remark 2.13. Algorithm 2.10 is of run-time complexity at most

Proof. In the first step, there are

Notice that the efficiency holds only for the case that the weights

Lemma 2.14. Let

We will write

the computation of which seems at first sight exponential in the dimension of

which counts all pairs

as this allows one to define a convenient way of computing the weight

Lemma 2.15. Let

Proof. This is an immediate consequence of the identity

which follows from Lemma 2.14.

Assume now that we are given for each pair

and its corresponding cardinality

together with the conventions

Then the identity

is immediate. Its usefulness is that the right-hand side can be computed more quickly than the left-hand side:

Lemma 2.16. The cost of

Proof. Take each
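The counting argument behind this complexity bound can be sketched in code. The idea (a hedged illustration, not the paper's exact weight formula) is that the number of data pairs coinciding in a given channel follows from the value counts alone: a value occurring `c` times contributes `c*(c-1)/2` coinciding pairs, so a single pass over the channel suffices:

```python
# Hedged sketch: count the unordered data pairs that coincide in one
# channel, in time linear in the number of data points, by tallying how
# often each value occurs. This mirrors the counting step of Lemma 2.16;
# the paper's exact weight definition is elided here.
from collections import Counter

def coinciding_pairs(values):
    """Number of unordered pairs (i, j), i < j, with values[i] == values[j]."""
    counts = Counter(values)
    return sum(c * (c - 1) // 2 for c in counts.values())

n_pairs = coinciding_pairs([0, 1, 0, 0, 1])  # three 0s -> 3 pairs, two 1s -> 1 pair
```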

Algorithm 2.17. Input.

Step 1. Find minimal edge

minimal. Set

Step

Output. Path

Theorem 2.18. Algorithm 2.17 has run-time complexity at most

Proof. The complexity in Step

Notice that the constant
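A greedy path construction in the spirit of Algorithm 2.17 can be sketched as follows. This is an illustration under stated assumptions: the paper's exact extension rule (whether one or both path endpoints may be extended) and its tie-breaking are elided here, so the sketch simply starts from a minimal-weight edge and always extends at the current endpoint:

```python
# Hedged sketch of a greedy path construction: Step 1 picks a
# minimal-weight edge; each further step extends the path's endpoint by
# the cheapest edge to a not-yet-visited node. Tie-breaking and the
# one-endpoint rule are simplifying assumptions.

def greedy_path(W):
    """W: symmetric list-of-lists of non-negative edge weights.
    Returns an injective path visiting all nodes, i.e. a permutation."""
    p = len(W)
    if p == 1:
        return [0]
    # Step 1: minimal edge (i, j) with i < j.
    i, j = min(((i, j) for i in range(p) for j in range(i + 1, p)),
               key=lambda e: W[e[0]][e[1]])
    path = [i, j]
    visited = set(path)
    # Step k: extend at the current endpoint by the cheapest unvisited node.
    while len(path) < p:
        end = path[-1]
        nxt = min((v for v in range(p) if v not in visited),
                  key=lambda v: W[end][v])
        path.append(nxt)
        visited.add(nxt)
    return path

W = [[0, 1, 4, 3],
     [1, 0, 2, 5],
     [4, 2, 0, 1],
     [3, 5, 1, 0]]
path = greedy_path(W)
```

Each of the `p - 1` extension steps scans at most `p` candidate nodes, which is consistent with a quadratic run-time bound in the number of channels.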

Within this section, the potential of integrating ultrametrics into a state-of-the-art classifier, the Support Vector Machine (SVM) as introduced by [

Kernel matrices are the representation of the similarity between the input data used for SVM classification. To integrate ultrametrics into SVM classification, the crucial step is therefore to create a new kernel function [

where

This new kernel function could be used for classification directly. However, one feature of kernel based classification is that multiple kernel functions can be combined to increase classification performance [

This multiple kernel also belongs to the new class of Baire-distance kernels and has the advantage of incorporating the similarity at different bit depths. It is compared against the standard linear kernel frequently used for SVM:

where the bracket
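The idea of combining per-depth similarities into one multiple kernel can be sketched as follows. The paper's exact kernel function is elided in this extraction; as a hypothetical stand-in we use, at depth `m`, the prefix-agreement indicator `k_m(x, y) = 1` if the words coincide in their first `m` letters (a valid kernel, since it is an equality indicator on the feature `x[:m]`), and sum over depths:

```python
# Illustrative "multiple" Baire-distance-style kernel: an unweighted sum
# of per-depth prefix-agreement kernels. The choice of k_m and the
# unweighted sum are assumptions, not the paper's exact construction.

def prefix_agree(x, y, m):
    """1 if words x and y coincide in their first m letters, else 0."""
    return 1 if x[:m] == y[:m] else 0

def multiple_kernel(data, max_depth):
    """Gram matrix of the summed per-depth prefix-agreement kernels."""
    n = len(data)
    return [[sum(prefix_agree(data[i], data[j], m)
                 for m in range(1, max_depth + 1))
             for j in range(n)] for i in range(n)]

K = multiple_kernel(["110", "111", "001"], max_depth=3)
```

Each summand is positive semidefinite, so the sum is again a valid kernel, and the resulting Gram matrix can be handed to any SVM implementation that accepts precomputed kernels.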

Within this section, a comparison on a standard benchmark dataset from hyperspectral remote sensing is presented, cf. also [

Although our implementation of Algorithm 2.17 is capable of processing all 220 features, only the first six principal components are considered. The reason is that there are two sources of coincidence: the first is coincidence due to spectral similarity of land cover classes (signal), the second is coincidence due to noise. For this work, only the coincidence of signal is relevant. Since the algorithm cannot distinguish between the two sources, only the first six principal components are considered. They explain 99.66% of the sum of eigenvalues and are therefore believed to contribute considerably to coincidence due to signal and only marginally to coincidence due to noise.

At first, the dataset is classified with a linear kernel SVM as given in Equation (18). A visual result can be seen in

The overall accuracy is the percentage of correctly classified pixels from the reference data. The

As can be seen, the two results resemble each other over most of the scene. However, the result produced with the linear kernel tends to confuse the brown crop classes in the north with green pasture classes. On the other hand, the linear kernel SVM better recognizes the street in the western part of the image.

The kernel target alignment between these kernels and the ideal kernel

was computed. The ideal kernel is defined via the label

where

denotes the usual scalar product between Gram matrices.

The kernel target alignment takes values in the interval
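The kernel target alignment used above can be sketched in plain Python. This follows the standard definition (Cristianini et al.): the Frobenius inner product of the kernel matrix with the ideal kernel `yy^T`, normalized by the Frobenius norms of both matrices; the labels `y` are assumed to lie in `{-1, +1}`:

```python
# Kernel target alignment between a Gram matrix K and the ideal kernel
# yy^T built from labels y in {-1, +1}, via the Frobenius inner product.
import math

def frobenius(A, B):
    """<A, B>_F: sum of entrywise products of two Gram matrices."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def kernel_target_alignment(K, y):
    ideal = [[yi * yj for yj in y] for yi in y]  # ideal kernel yy^T
    return frobenius(K, ideal) / math.sqrt(frobenius(K, K)
                                           * frobenius(ideal, ideal))

y = [1, 1, -1]
ideal = [[yi * yj for yj in y] for yi in y]
a_self = kernel_target_alignment(ideal, y)  # the ideal kernel aligns perfectly
```

By construction, the ideal kernel has alignment 1 with itself, which is the upper end of the alignment range.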

The producers’ accuracy shows what percentage of a particular ground class was correctly classified. The users’ accuracy is a measure of the reliability of an output map generated from a classification scheme: it tells what percentage of a mapped class truly corresponds to that class in the reference. Both are local (i.e. class-dependent) measures of performance.
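These two class-wise measures can be read off a confusion matrix. The sketch below assumes a hypothetical convention in which rows index the reference classes and columns the mapped (classified) classes, so `C[i][j]` counts pixels of reference class `i` mapped to class `j`:

```python
# Hedged sketch of producers' and users' accuracy from a confusion matrix
# C, assuming rows = reference classes and columns = mapped classes.

def producers_accuracy(C, i):
    """Correctly classified share of reference class i (completeness)."""
    row_total = sum(C[i])
    return C[i][i] / row_total if row_total else 0.0

def users_accuracy(C, j):
    """Share of pixels mapped to class j that truly belong to it."""
    col_total = sum(row[j] for row in C)
    return C[j][j] / col_total if col_total else 0.0

C = [[8, 2],
     [4, 6]]
pa0 = producers_accuracy(C, 0)  # 8 of 10 reference pixels found
ua0 = users_accuracy(C, 0)      # 8 of 12 mapped pixels correct
```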

Value | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 |
---|---|---|---|---|---|---|---|---|
pa(Kmult) | 0 | 46.6 | 12.6 | 1.2 | 17.5 | 82.9 | 0 | 99.1 |
pa(Klin) | 0 | 39.9 | 0.8 | 2.4 | 49.1 | 80.4 | 0 | 99.1 |
pa(Kmult) − pa(Klin) | 0 | 6.7 | 11.8 | −1.2 | −31.6 | 2.5 | 0 | 0 |
ua(Kmult) | 0 | 43.0 | 33.4 | 20.0 | 74.3 | 72.8 | 0 | 75.8 |
ua(Klin) | 0 | 38.9 | 45.4 | 28.5 | 57.1 | 78.5 | 0 | 84.5 |
ua(Kmult) − ua(Klin) | 0 | 4.1 | −12.0 | −8.5 | 17.2 | −5.7 | 0 | −8.7 |

Value | C9 | C10 | C11 | C12 | C13 | C14 | C15 | C16 |
---|---|---|---|---|---|---|---|---|
pa(Kmult) | 0 | 10.1 | 80.6 | 4.6 | 90.5 | 90.2 | 15.4 | 63.6 |
pa(Klin) | 0 | 0.1 | 88.0 | 1.1 | 91.8 | 86.5 | 15.0 | 84.8 |
pa(Kmult) − pa(Klin) | 0 | 10.0 | −7.4 | 3.5 | −1.3 | 3.7 | 0.4 | −21.2 |
ua(Kmult) | 0 | 38.9 | 46.3 | 32.2 | 54.0 | 68.8 | 45.5 | 93.3 |
ua(Klin) | 0 | 50.0 | 43.7 | 12.8 | 56.6 | 72.5 | 65.5 | 86.1 |
ua(Kmult) − ua(Klin) | 0 | −11.1 | 2.6 | 19.4 | −2.6 | −3.7 | −20.0 | 7.2 |

As was to be expected, each classification approach outperformed the other for some classes. The approach based on

Since the producers’ accuracy indicates what proportion of the pixels from the reference are found in the classification (completeness), while the users’ accuracy indicates what proportion of the pixels in one class are correct, it can be concluded that the proposed approach produces more complete results for many classes than the standard linear kernel approach. Of course, due to the low overall accuracy values obtained, the approach should be extended, e.g. by applying Gaussian functions over the similarity matrices.

Finding optimal Baire distances defined by permutations of

This work has grown out of a talk given at the International Conference on Classification (ICC 2011) and the discussions afterwards. The first author is funded by the Deutsche Forschungsgemeinschaft (DFG), and the second author by the Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR). Thanks to Fionn Murtagh, Roland Glantz, and Norbert Paul for valuable conversations, as well as to Fionn Murtagh and David Wishart for organising the International Conference on Classification (ICC 2011) in Saint Andrews, Scotland. The article processing charge was funded by the German Research Foundation (DFG) and the Albert Ludwigs University Freiburg in the funding programme Open Access Publishing.

Patrick Erik Bradley, Andreas Christian Braun (2015) Finding the Asymptotically Optimal Baire Distance for Multi-Channel Data. Applied Mathematics, 6, 484-495. doi: 10.4236/am.2015.63046