1. Introduction
The reproducing kernel approach has many applications in probability theory, statistics and, more recently, machine learning (see [1] [2] [3]). Kernels have been applied in many fields, such as pattern recognition, biology, medical diagnosis, chemistry and bioinformatics.
It has been well known for many years that data sets can be modeled by a family of points, with the similarity between them given by an inner product on the ambient space. The classification problem consists of separating these points into classes with respect to given properties. In the simplest situation, the points are linearly separable, in the sense that there exists a hyperplane separating the two classes.
Support vector machines (SVMs) have become a very powerful tool in machine learning, in particular in classification and regression problems ([1] [4] [5]). In a classification problem, we may consider only two classes of data, in which case we speak of binary classification. We may also encounter more than two classes of data; this is called a multi-class classification problem. In our case, we focus on binary classification and the application of kernel functions in SVM classification.
In the simplest situations, the linear SVM technique allows one to find the optimal hyperplane separating the points into two classes. Unfortunately, in concrete examples, these points are often not linearly separable. One can then express the similarity of points in terms of a positive definite kernel function k. This leads to a new technique, named non-linear SVM, which combines linear SVM and kernel tools.
Mathematically, this new approach of non-linear SVM is related to the Kolmogorov representation $k(x,y) = \langle \Phi(x), \Phi(y)\rangle_{H}$, where $H$ represents the feature space and $\Phi$ is the feature map (see [1]). Since separation expresses a degree of similarity between points in the same class, Vapnik used the kernel approach to translate the problem from the initial space to the feature space. The transfer between the initial space and the feature space is made by the feature map, and similarities are expressed by the inner product in the feature space, which is given by a kernel k. The crucial idea of kernel function methods is that non-linearly separable points can be transformed into linearly separable points in the feature space while preserving similarities, which is expressed by the relation $k(x,y) = \langle \Phi(x), \Phi(y)\rangle_{H}$.
Furthermore, the solution of the classification problem using non-linear SVM is given by a decision function which depends only on the kernel and the support vectors. Precisely, it takes the following form:
$$f(x) = \operatorname{sign}\Big( \sum_{i \in S} \alpha_i y_i\, k(x_i, x) + b \Big).$$
Since the choice of the kernel function is not canonical, the first problem of this approach is how to choose a suitable kernel function for given data points. The second problem is that the feature space is infinite dimensional, which causes a technical problem in designing a computational algorithm for such a solution. This problem is also related to the feature map, which is in general unknown. In our case, we use the Mercer decomposition theorem in order to obtain a polynomial-type kernel having a good separation property. This means that the infinite-dimensional feature space can be approximated by a finite-dimensional one, which reduces the dimensionality of the transformed data in the feature space.
The paper is organized as follows. In Section 2, we recall some known results on Reproducing Kernel Hilbert Spaces (RKHS in what follows), in particular the Mercer decomposition theorem 2.2 and the high power kernel theorem 2.3. Section 3 is devoted to the main result, in which we introduce the one-dimensional Legendre polynomial kernel and give its canonical decomposition in terms of Legendre orthogonal polynomials (see Theorem 3.1). Next, we deduce the high power Legendre polynomial kernel defined on the cube $[-1,1]^d$, for which the feature space is the tensor product Hilbert space of the RKHS associated with the one-dimensional kernel. In Section 4, we recall the theoretical foundation of linear and non-linear SVMs. In Section 5, we give some illustrative examples in order to evaluate the performance of the Legendre polynomial kernel in comparison with some predefined kernels in Python.
2. Preliminaries
In this section, we begin by describing the RKHS and its associated kernel. Then we give the decomposition of a given positive definite kernel on a measure space (see Theorem 2.2).
Definition 1. (Positive definite kernel).
Let $X$ be a nonempty set. A symmetric function $k : X \times X \to \mathbb{R}$ is called a positive definite kernel if, for all $n \in \mathbb{N}$, $x_1, \dots, x_n \in X$ and $c_1, \dots, c_n \in \mathbb{R}$,
$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j\, k(x_i, x_j) \ge 0, \qquad (2.1)$$
and for mutually distinct $x_1, \dots, x_n$, the equality holds only when all the $c_i$ are zero.
Clearly, every inner product is a positive definite kernel. Moreover, if $H$ is any Hilbert space, $X$ a nonempty set and $\Phi : X \to H$, then the function $k(x,y) = \langle \Phi(x), \Phi(y)\rangle_{H}$ is positive definite.
Now let $H$ be a Hilbert space of functions mapping from some non-empty set $X$ to $\mathbb{R}$, i.e., $H$ is considered as a subset of $\mathbb{R}^{X}$. We write the inner product on $H$ as $\langle f, g\rangle_{H}$, and the associated norm will be denoted by $\|f\|_{H}$. We may alternatively write the function $f$ as $f(\cdot)$, to indicate that it takes an argument in $X$.
Definition 2. (Reproducing kernel Hilbert space).
Let $H$ be a Hilbert space of $\mathbb{R}$-valued functions defined on a non-empty set $X$. We say that $H$ is a Reproducing Kernel Hilbert Space if there exists a kernel function $k : X \times X \to \mathbb{R}$ such that the following properties are satisfied:
1) $k(\cdot, x) \in H$, for all $x \in X$,
2) $\langle f, k(\cdot, x)\rangle_{H} = f(x)$, for all $x \in X$ and $f \in H$ (the reproducing property).
Remark 1. From the reproducing property, we note that
$$k(x, y) = \langle k(\cdot, x), k(\cdot, y)\rangle_{H}. \qquad (2.2)$$
Thus, necessarily, k is positive definite. Note also that the reproducing kernel k associated with $H$ is unique.
Theorem 2.1. (Moore-Aronszajn [6])
Let $k : X \times X \to \mathbb{R}$ be a positive definite kernel. There is a unique RKHS $H_k$ with reproducing kernel k. Moreover, if the space
$$H_0 = \operatorname{span}\{ k(\cdot, x) : x \in X \} \qquad (2.3)$$
is endowed with the inner product
$$\langle f, g\rangle = \sum_{i=1}^{n} \sum_{j=1}^{m} a_i b_j\, k(x_i, y_j), \qquad (2.4)$$
where $f = \sum_{i=1}^{n} a_i\, k(\cdot, x_i)$ and $g = \sum_{j=1}^{m} b_j\, k(\cdot, y_j)$, then $H_k$ is the completion of $H_0$ w.r.t. the inner product (2.4).
Now let us consider a compact metric space $X$, a finite Borel measure $\mu$ on it, and a positive definite continuous kernel k on $X \times X$. Let $T_k$ be the linear map defined on $L^2(X, \mu)$ by
$$T_k f(x) = \int_{X} k(x, y)\, f(y)\, d\mu(y).$$
Then we have the following results:
Theorem 2.2. (Mercer’s theorem [6])
1) $T_k$ is a compact, self-adjoint, positive definite operator from $L^2(X, \mu)$ to itself. It is called the integral operator corresponding to the kernel k.
2) There exists an orthonormal system $(e_j)_{j \in J}$ of eigenvectors of $T_k$ with eigenvalues $(\lambda_j)_{j \in J}$, where J is an at most countable set of indices corresponding to the strictly positive eigenvalues of $T_k$, such that
$$T_k e_j = \lambda_j e_j, \quad j \in J. \qquad (2.5)$$
3) For all $x, y \in X$,
$$k(x, y) = \sum_{j \in J} \varphi_j(x)\, \varphi_j(y), \qquad (2.6)$$
where $\varphi_j = \sqrt{\lambda_j}\, e_j$, and where the convergence is uniform on $X \times X$, and absolute for each pair $(x, y)$.
4) The Hilbert space
$$H = \Big\{ \sum_{j \in J} a_j \varphi_j \; : \; \sum_{j \in J} a_j^2 < \infty \Big\},$$
endowed with the inner product
$$\Big\langle \sum_{j \in J} a_j \varphi_j, \sum_{j \in J} b_j \varphi_j \Big\rangle_{H} = \sum_{j \in J} a_j b_j, \qquad (2.7)$$
is the RKHS associated with k.
Theorem 2.3. (Power of kernel [4] [7])
Let k be a kernel on $X \times X$. Then
$$k^{(d)}\big( (x_1, \dots, x_d), (y_1, \dots, y_d) \big) = \prod_{i=1}^{d} k(x_i, y_i)$$
is a kernel on $X^d \times X^d$. In addition, there is an isometric isomorphism between the RKHS of $k^{(d)}$ and the Hilbert space tensor product $H_k^{\otimes d}$.
Example 1. Here we give some of the most widely used kernels in SVM.
1) The linear kernel: $k(x, y) = \langle x, y\rangle$, $x, y \in \mathbb{R}^n$.
2) The polynomial kernel: $k(x, y) = (\langle x, y\rangle + c)^d$, $x, y \in \mathbb{R}^n$, $c \ge 0$, $d \in \mathbb{N}$.
3) The exponential kernel: $k(x, y) = \exp\big( -\frac{\|x - y\|}{2\sigma^2} \big)$, $\sigma > 0$.
4) The radial basis function (Rbf) kernel: $k(x, y) = \exp\big( -\frac{\|x - y\|^2}{2\sigma^2} \big)$, $\sigma > 0$. It is also called the Gaussian kernel.
5) The sigmoid kernel: $k(x, y) = \tanh(\gamma \langle x, y\rangle + c)$, $\gamma > 0$, $c \ge 0$.
3. Legendre Polynomial Kernel and High Power Legendre Polynomial Kernel
Let us consider the family of Legendre polynomials $(P_n)_{n \ge 0}$ defined by the following recurrence relation (see [8]):
$$(n+1)\, P_{n+1}(x) = (2n+1)\, x\, P_n(x) - n\, P_{n-1}(x), \quad n \ge 1, \qquad (3.1)$$
where
$$P_0(x) = 1, \qquad P_1(x) = x. \qquad (3.2)$$
It is well known [8] that they are orthogonal w.r.t. the inner product
$$\langle f, g\rangle = \int_{-1}^{1} f(x)\, g(x)\, dx.$$
Moreover, we have
$$\int_{-1}^{1} P_n(x)\, P_m(x)\, dx = \frac{2}{2n+1}\, \delta_{nm}. \qquad (3.3)$$
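The recurrence (3.1)-(3.2) and the orthogonality relation (3.3) can be checked numerically. The following sketch (our own illustration, assuming NumPy) evaluates $P_n$ by the three-term recurrence and verifies (3.3) with Gauss-Legendre quadrature, which is exact for polynomials of the degrees involved.

```python
import numpy as np

def legendre_eval(n, x):
    """Evaluate P_n(x) via the three-term recurrence (3.1)-(3.2)."""
    p_prev, p_curr = np.ones_like(x), x  # P_0 = 1, P_1 = x
    if n == 0:
        return p_prev
    for k in range(1, n):
        # (k+1) P_{k+1} = (2k+1) x P_k - k P_{k-1}
        p_prev, p_curr = p_curr, ((2*k + 1) * x * p_curr - k * p_prev) / (k + 1)
    return p_curr

# Check the orthogonality relation (3.3): a 20-point Gauss-Legendre rule
# integrates polynomials up to degree 39 exactly.
nodes, weights = np.polynomial.legendre.leggauss(20)
for n in range(5):
    for m in range(5):
        integral = np.sum(weights * legendre_eval(n, nodes) * legendre_eval(m, nodes))
        expected = 2.0 / (2*n + 1) if n == m else 0.0
        assert abs(integral - expected) < 1e-12
print("orthogonality relation (3.3) verified for n, m <= 4")
```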
In order to apply the Mercer theorem in SVM, we consider the Legendre polynomial kernel defined on $[-1,1] \times [-1,1]$ by
$$k_N(x, y) = \sum_{n=0}^{N} P_n(x)\, P_n(y). \qquad (3.4)$$
Theorem 3.1. Let
$$H_N = \operatorname{span}\{ P_0, P_1, \dots, P_N \} \qquad (3.5)$$
be the linear space generated by this family. Then:
1) $H_N$ is a Hilbert space endowed with the inner product
$$\Big\langle \sum_{n=0}^{N} a_n P_n, \sum_{n=0}^{N} b_n P_n \Big\rangle_{H_N} = \sum_{n=0}^{N} a_n b_n. \qquad (3.6)$$
2) The sequence $(P_n)_{0 \le n \le N}$ is an orthonormal basis of $H_N$.
3) $H_N$ is a RKHS with reproducing kernel $k_N$.
4) The feature map associated with $k_N$ is $\Phi(x) = k_N(\cdot, x) = \sum_{n=0}^{N} P_n(x)\, P_n$.
Proof. First: Clearly, (3.6) defines an inner product on $H_N$, for which $H_N$ becomes a Hilbert space, since it is finite dimensional.
Second: From the definition of the inner product (3.6), it is clear that $(P_n)_{0 \le n \le N}$ is an orthonormal system of $H_N$ whose cardinality coincides with the dimension of $H_N$. Thus it is an orthonormal basis of $H_N$.
Third: Let us consider the operator $T_{k_N}$ defined on $L^2([-1,1])$ by
$$T_{k_N} f(x) = \int_{-1}^{1} k_N(x, y)\, f(y)\, dy.$$
Setting $T_{k_N} = \iota \circ \iota^{*}$, where $\iota$ is the canonical injection of $H_N$ into $L^2([-1,1])$, we have:
1) $k_N$ is positive definite. In fact,
$$\sum_{i,j} c_i c_j\, k_N(x_i, x_j) = \sum_{n=0}^{N} \Big( \sum_{i} c_i\, P_n(x_i) \Big)^2 \ge 0.$$
2) $T_{k_N}$ is symmetric since $k_N$ is symmetric, so by Mercer theorem 2.2, $T_{k_N}$ is self-adjoint, compact and positive definite.
3) For all $0 \le m \le N$, we have
$$T_{k_N} P_m(x) = \sum_{n=0}^{N} P_n(x) \int_{-1}^{1} P_n(y)\, P_m(y)\, dy = \frac{2}{2m+1}\, P_m(x).$$
So the eigenvectors of $T_{k_N}$ corresponding to strictly positive eigenvalues are exactly the polynomials $P_0, \dots, P_N$, and the corresponding eigenvalues are $\lambda_m = \frac{2}{2m+1}$. Thus $(P_m)_{0 \le m \le N}$ is an orthonormal basis of $H_N$ formed by eigenvectors of $T_{k_N}$.
From Mercer theorem 2.2, the RKHS associated with $k_N$ is given by
$$H = \Big\{ \sum_{n=0}^{N} a_n \sqrt{\lambda_n}\, e_n \; : \; a_n \in \mathbb{R} \Big\},$$
where $e_n = \sqrt{\tfrac{2n+1}{2}}\, P_n$. Since $\sqrt{\lambda_n}\, e_n = P_n$ for every $n$, the space $H$ coincides with $H_N$ given by (3.5). Thus $H_N$ is the RKHS associated with the reproducing kernel $k_N$.
The reproducing property is immediate from the Mercer theorem, but it can also be checked by hand using (3.6): for $f = \sum_{n} a_n P_n$,
$$\langle f, k_N(\cdot, x)\rangle_{H_N} = \Big\langle \sum_{n} a_n P_n, \sum_{n} P_n(x)\, P_n \Big\rangle_{H_N} = \sum_{n} a_n\, P_n(x) = f(x).$$
Fourth: The feature map is given by $\Phi(x) = k_N(\cdot, x)$. Indeed, let $\Phi : [-1,1] \to H_N$ be given by $\Phi(x) = \sum_{n=0}^{N} P_n(x)\, P_n$. So from (3.6), we get
$$\langle \Phi(x), \Phi(y)\rangle_{H_N} = \sum_{n=0}^{N} P_n(x)\, P_n(y) = k_N(x, y). \qquad \square$$
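As a quick sanity check (our own illustration, assuming the expansion $k_N(x,y) = \sum_{n=0}^{N} P_n(x) P_n(y)$ of (3.4)), the kernel can be evaluated with NumPy's Legendre utilities and its Gram matrix checked for positive semidefiniteness, as Definition 1 requires.

```python
import numpy as np

def legendre_kernel(x, y, N=5):
    """Legendre polynomial kernel (3.4): sum_{n=0}^N P_n(x) P_n(y)."""
    # legval(x, c) evaluates sum_n c_n P_n(x); the coefficient vector
    # [0]*n + [1] selects the single polynomial P_n.
    px = np.array([np.polynomial.legendre.legval(x, [0]*n + [1]) for n in range(N + 1)])
    py = np.array([np.polynomial.legendre.legval(y, [0]*n + [1]) for n in range(N + 1)])
    return float(px @ py)

# The Gram matrix of a positive definite kernel has no negative eigenvalues.
pts = np.linspace(-1, 1, 9)
gram = np.array([[legendre_kernel(a, b) for b in pts] for a in pts])
assert np.min(np.linalg.eigvalsh(gram)) > -1e-10
print("Gram matrix of k_5 on 9 points is positive semidefinite")
```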
Now let us consider the set $Q_d = [-1,1]^d$. Define the high power Legendre polynomial kernel function $K_N : Q_d \times Q_d \to \mathbb{R}$ as follows:
$$K_N(x, y) = \prod_{i=1}^{d} k_N(x_i, y_i), \quad x = (x_1, \dots, x_d), \; y = (y_1, \dots, y_d). \qquad (3.7)$$
Let us introduce $H_N^{\otimes d} = H_N \otimes \cdots \otimes H_N$, called the d-times tensor product of $H_N$.
It is known that $H_N^{\otimes d}$ is a real Hilbert space endowed with the inner product
$$\langle u_1 \otimes \cdots \otimes u_d,\; v_1 \otimes \cdots \otimes v_d \rangle = \prod_{i=1}^{d} \langle u_i, v_i\rangle_{H_N}. \qquad (3.8)$$
For $x, y \in Q_d$, we have
$$K_N(x, y) = \prod_{i=1}^{d} k_N(x_i, y_i) = \prod_{i=1}^{d} \langle \Phi(x_i), \Phi(y_i)\rangle_{H_N}. \qquad (3.9)$$
Define now the function $\Psi : Q_d \to H_N^{\otimes d}$ given by
$$\Psi(x) = \Phi(x_1) \otimes \cdots \otimes \Phi(x_d). \qquad (3.10)$$
By substituting (3.10) in (3.9), we get
$$K_N(x, y) = \langle \Psi(x), \Psi(y)\rangle_{H_N^{\otimes d}}. \qquad (3.11)$$
Hence $K_N$ is a kernel, with feature map $\Psi$ and feature space $H_N^{\otimes d}$.
Now, applying Theorem 2.3 to the kernel $k_N$, we deduce the following result.
Corollary 3.1.1. The space $H_N^{\otimes d}$ is a RKHS with corresponding kernel $K_N$.
4. Application of Kernels in Classification Problem
Binary classification means that we are given data points belonging to two classes; one is denoted Class(1), the other Class(−1). To each point $x_i$ we associate a label $y_i \in \{-1, 1\}$ indicating to which class $x_i$ belongs. The points $(x_i, y_i)$ are called training points. The idea of SVM is to find a hyperplane separating the two classes and to construct a decision function f from which a new point x can be classified (x belongs to Class(1) or x belongs to Class(−1)).
4.1. Support Vector Machine: Linearly Separable Classification
When the training samples are linearly separable, we speak about linear classification (i.e., there exists a hyperplane separating the two classes). The idea of SVM is the following.
Let us consider the set of training points $\{(x_i, y_i)\}_{1 \le i \le m}$, where $x_i \in \mathbb{R}^n$, with the corresponding labels $y_i \in \{-1, 1\}$. The first step is to find a hyperplane in the space $\mathbb{R}^n$ separating the two classes. The second step is to construct the decision function f.
Support Vectors are the examples closest to the separating hyperplane, and the aim of SVM is to orientate this hyperplane in such a way as to be as far as possible from the closest members of both classes. For example, in 2-dimensional space this means that we can draw a line on a graph of $x_1$ versus $x_2$ separating the two classes. In higher dimensional space ($n > 2$), the line is replaced by an affine hyperplane (see Figure 1).
Step 1.
Recall that any affine hyperplane is described by the equation
$$w \cdot x + b = 0,$$
where “·” is the dot product in $\mathbb{R}^n$ and w is normal to the hyperplane. So we have to determine the appropriate values of w and b for the separating hyperplane. If we now just consider the points that lie closest to the separating hyperplane, i.e., the Support Vectors (shown in circles in the diagram), then the two planes $H_1$ and $H_2$ that these points lie on can be described by:
Figure 1. Separating hyperplane and marginal distance.
$$x_i \cdot w + b = +1 \quad \text{for } y_i = +1, \qquad (4.1)$$
$$x_i \cdot w + b = -1 \quad \text{for } y_i = -1. \qquad (4.2)$$
Referring to Figure 1, we define $d_1$ as being the distance from $H_1$ to the separating hyperplane, and $d_2$ the distance from $H_2$ to it. The hyperplane’s equidistance from $H_1$ and $H_2$ means that $d_1 = d_2$. The quantity $d_1 + d_2$ is known as the SVM’s margin. In order to orientate the hyperplane to be as far from the Support Vectors as possible, we need to maximize this margin. It is known from [1] that this margin is equal to $\frac{2}{\|w\|}$.
Now the problem is to find w and b such that the margin $\frac{2}{\|w\|}$ is maximal and, for all i,
$$y_i\, (w \cdot x_i + b) \ge 1. \qquad (4.3)$$
This is equivalent to the constrained optimization problem
$$\min_{w, b} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1, \; i = 1, \dots, m, \qquad (4.4)$$
which is in fact a quadratic programming optimization (QP-optimization). Using the Lagrange multipliers $\alpha = (\alpha_1, \dots, \alpha_m)$, where $\alpha_i \ge 0$, the problem (4.4) is equivalent to
$$\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j\, y_i y_j\, x_i \cdot x_j \quad \text{subject to} \quad \alpha_i \ge 0, \; \sum_{i=1}^{m} \alpha_i y_i = 0. \qquad (4.5)$$
This is a convex quadratic optimization problem, and we run a QP-solver which will return $\alpha$. Thus we can deduce w and b, which are given by [1]
$$w = \sum_{i=1}^{m} \alpha_i y_i x_i, \qquad b = \frac{1}{N_s} \sum_{s \in S} \big( y_s - w \cdot x_s \big), \qquad (4.6)$$
where S is the set of the support vectors (i.e., the indices i corresponding to $\alpha_i > 0$) and $N_s$ is the number of support vectors.
Step 2.
The second step is to create the decision function f which determines to which class a new point belongs. From [1], the decision function is given by
$$f(x) = \operatorname{sign}\Big( \sum_{i=1}^{m} \alpha_i y_i\, x_i \cdot x + b \Big). \qquad (4.7)$$
Thus, for a new data point x, $f(x) = +1$ means that x belongs to the first class and $f(x) = -1$ means that x belongs to the second class.
In practice, in order to use an SVM to solve a linearly separable binary classification problem, we need to:
1) Create the matrix H, where $H_{ij} = y_i y_j\, x_i \cdot x_j$.
2) Find $\alpha$ so that $\sum_{i=1}^{m} \alpha_i - \frac{1}{2}\, \alpha^{T} H \alpha$ is maximized, subject to the constraints $\alpha_i \ge 0$ for all i and $\sum_{i=1}^{m} \alpha_i y_i = 0$. This can be done using a QP-solver.
3) Calculate $w = \sum_{i=1}^{m} \alpha_i y_i x_i$.
4) Determine the set of Support Vectors S by finding the indices such that $\alpha_i > 0$.
5) Calculate $b = \frac{1}{N_s} \sum_{s \in S} ( y_s - w \cdot x_s )$.
6) Classify each new point x by evaluating $f(x) = \operatorname{sign}(w \cdot x + b)$.
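The six steps above can be sketched in Python. This is a minimal illustration on a hypothetical toy data set (not from the paper), using SciPy's generic SLSQP solver in place of a dedicated QP-solver.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable training set (hypothetical data).
X = np.array([[2.0, 2.0], [3.0, 1.5], [-2.0, -2.0], [-3.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

# Step 1: the matrix H with H_ij = y_i y_j x_i . x_j
H = np.outer(y, y) * (X @ X.T)

# Step 2: maximize sum(alpha) - 1/2 alpha^T H alpha  <=>  minimize its negative
res = minimize(
    lambda a: 0.5 * a @ H @ a - a.sum(),
    x0=np.zeros(m),
    method="SLSQP",
    bounds=[(0, None)] * m,                                # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum alpha_i y_i = 0
)
alpha = res.x

# Steps 3-5: recover w, the support vector set S and b
w = (alpha * y) @ X
S = np.where(alpha > 1e-6)[0]
b = np.mean(y[S] - X[S] @ w)

# Step 6: classify with f(x) = sign(w . x + b)
f = lambda x: np.sign(x @ w + b)
assert all(f(X[i]) == y[i] for i in range(m))
print("all training points correctly classified")
```

In production one would use a dedicated QP-solver or an SVM library rather than a general-purpose optimizer, but the structure of the six steps is the same.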
4.2. SVM for Data That Is Not Fully Linearly Separable
In order to extend the SVM methodology to handle data that is not fully linearly separable, we relax the constraints (4.3) slightly to allow for misclassified points. This is done by introducing positive slack variables $\xi_i$, $i = 1, \dots, m$:
$$x_i \cdot w + b \ge +1 - \xi_i \;\; \text{for } y_i = +1, \qquad x_i \cdot w + b \le -1 + \xi_i \;\; \text{for } y_i = -1, \qquad \xi_i \ge 0,$$
which can be combined into:
$$y_i\, (x_i \cdot w + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, m. \qquad (4.8)$$
In this soft margin SVM, data points on the incorrect side of the margin boundary have a penalty that increases with the distance from it. As we are trying to reduce the number of misclassifications, a sensible way to adapt the objective function of (4.4) is to find
$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{subject to (4.8)}, \qquad (4.9)$$
where the parameter C controls the trade-off between the slack variable penalty and the size of the margin.
Similarly to what has been done in the previous case, the Lagrange method leads to the following convex quadratic optimization problem:
$$\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j\, y_i y_j\, x_i \cdot x_j \quad \text{subject to} \quad 0 \le \alpha_i \le C, \; \sum_{i=1}^{m} \alpha_i y_i = 0. \qquad (4.10)$$
We run a QP-solver which will return $\alpha$. The values of w and b are calculated in the same way as in (4.6), though in this instance the set of Support Vectors used to calculate b is determined by finding the indices i for which $0 < \alpha_i < C$.
In practice, we need to:
1) Create the matrix H, where $H_{ij} = y_i y_j\, x_i \cdot x_j$.
2) Select a suitable value for the parameter C, which determines how significantly misclassifications should be treated.
3) Find $\alpha$ maximizing $\sum_{i=1}^{m} \alpha_i - \frac{1}{2}\, \alpha^{T} H \alpha$ subject to $0 \le \alpha_i \le C$ and $\sum_{i=1}^{m} \alpha_i y_i = 0$. This is done using a QP-solver.
4) Calculate $w = \sum_{i=1}^{m} \alpha_i y_i x_i$.
5) Determine the set of Support Vectors S by finding the indices such that $0 < \alpha_i < C$.
6) Calculate $b = \frac{1}{N_s} \sum_{s \in S} ( y_s - w \cdot x_s )$.
7) Classify each new point x by evaluating $f(x) = \operatorname{sign}(w \cdot x + b)$.
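In practice one rarely codes the QP by hand: the soft-margin problem (4.9)-(4.10) is exactly what scikit-learn's SVC solves. A minimal sketch on hypothetical toy data (the parameter C below plays the role of C in (4.9)):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data; the point near the origin sits close to the
# other class, which is the situation where slack variables matter.
X = np.array([[2.0, 2.0], [3.0, 1.0], [0.2, 0.1], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1, -1])

# kernel='linear' gives the soft-margin linear SVM; C is the slack penalty.
clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.coef_[0]       # the normal vector w of the separating hyperplane
b = clf.intercept_[0]  # the offset b
print("w =", w, "b =", b)
print(clf.predict([[2.5, 2.5], [-2.5, -2.5]]))
```

Decreasing C widens the margin at the price of more slack; increasing it approaches the hard-margin solution of the previous subsection.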
4.3. Non-Linear SVM
In the case when the data points are not linearly separable, i.e., there is no hyperplane separating the data into two classes, we have to modify the data in order to obtain linearly separable points. This is based on kernel functions. It is worth noting that, in the linearly separable case, the decision function requires only the dot products of the data points $x_i$ and of the input vector x with each $x_i$. In fact, when applying the SVM technique to linearly separable data, we started by creating the matrix H and the scalar b from the dot products of our input variables. This is an important observation for the Kernel Trick: the dot product will be replaced by a kernel, which is also a positive definite function. The idea is based on the choice of a kernel function k, and the trick is to map the data into a high-dimensional feature space $F$ via a transformation $\Phi$ related to k, in such a way that the transformed data are linearly separable; $\Phi$ is called the feature map. When the separating hyperplane is mapped back into the original space, it describes a surface.
Similarly to the previous section, we adopt the same separation procedure at the level of the feature space. This leads to the following steps:
1) Choose a kernel function k.
2) Create the matrix H, where $H_{ij} = y_i y_j\, k(x_i, x_j)$.
3) Choose how significantly misclassifications should be treated, by selecting a suitable value for the parameter C.
4) Determine $\alpha$ so that $\sum_{i=1}^{m} \alpha_i - \frac{1}{2}\, \alpha^{T} H \alpha$ is maximal under the constraints $0 \le \alpha_i \le C$ and $\sum_{i=1}^{m} \alpha_i y_i = 0$. This can be done by using a QP-solver.
5) Determine the set of Support Vectors S by finding the indices such that $0 < \alpha_i < C$.
6) Calculate $b = \frac{1}{N_s} \sum_{s \in S} \big( y_s - \sum_{i=1}^{m} \alpha_i y_i\, k(x_i, x_s) \big)$.
7) Classify each new point x by evaluating
$$f(x) = \operatorname{sign}\Big( \sum_{i=1}^{m} \alpha_i y_i\, k(x_i, x) + b \Big).$$
Note that, in general, the feature map $\Phi$ is unknown, so w is also unknown. But we do not need it! We need only the values of the kernel at the training points (i.e., $k(x_i, x_j)$ for all i, j) to determine b, and the values $k(x_i, x)$ to evaluate the decision function f at the input vector x.
5. Numerical Simulations
The data set consists of female patients of the Pima Indian population, at least twenty-one years old, taken from the UCI machine learning repository. This data set is originally owned by the National Institute of Diabetes and Digestive and Kidney Diseases. It contains a total of 768 instances classified into two classes, diabetic and non-diabetic, with eight different risk factors: Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree Function and Age. To diagnose diabetes in the Pima Indian population, the performance of all kernels is evaluated on parameters such as Accuracy, Precision, Recall score, F1 score and execution time.
Before giving experiment results, we would like to recall some characteristics of the confusion matrix (Table 1) and related metric evaluation.
5.1. Terminology and Derivations from a Confusion Matrix
where:
- (TP) is the number of True Positive cases (real positive cases and detected positive),
- (TN) is the number of True Negative cases (real negative cases and detected negative),
- (FP) is the number of False Positive cases (real negative cases and detected positive),
- (FN) is the number of False Negative cases (real positive cases and detected negative),
- (PP) is the number of Predicted Positive cases, PP = TP + FP,
- (PN) is the number of Predicted Negative cases, PN = TN + FN,
- (RP) is the number of Real Positive cases, RP = TP + FN,
- (RN) is the number of Real Negative cases, RN = TN + FP,
- The Total Population is given by TP + TN + FP + FN.
The following metrics are used in order to compare the performance of the Legendre polynomial kernel with the linear, rbf and polynomial kernels.
- Precision (or positive predictive value), PPV = TP/(TP + FP), has been used to determine the classifier’s ability to provide correct positive predictions of diabetes.
- Recall score (or true positive rate, or sensitivity), TP/(TP + FN), is used in our work to find the proportion of actual positive cases of diabetes correctly identified by the classifier used.
- Precision and recall score provide the F1 score, which therefore takes both into account. It is the harmonic mean of precision and sensitivity:
$$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$
- Accuracy, ACC = (TP + TN)/(TP + TN + FP + FN), is the ratio of true positives and true negatives to all positive and negative observations. In other words, accuracy tells us how often we can expect our machine learning model to correctly predict an outcome out of the total number of times it made predictions.
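The four metrics above can be computed directly from the confusion-matrix counts; the sketch below (our own, with made-up counts that are not from the paper's experiments) mirrors the formulas just stated.

```python
def metrics(tp, tn, fp, fn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

# Example with made-up counts:
p, r, f1, acc = metrics(tp=40, tn=45, fp=5, fn=10)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f} accuracy={acc:.3f}")
```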
5.2. Numerical Results
Since the Legendre polynomial kernel is defined on $[-1,1]$, a scaling procedure is needed in order to map the data into $[-1,1]$. For this, we suggest the feature-wise transformation
$$x \mapsto \frac{2\,(x - x_{\min})}{x_{\max} - x_{\min}} - 1,$$
where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values taken by the corresponding feature.
The confusion matrices corresponding to each kernel mentioned above are given in Figures 2-5, where the test size is 0.05 and the penalty coefficient is C = 0.01.
In order to show the powerful separation properties of the kernel defined by (3.4), our numerical simulations were carried out on the diabetes detection model mentioned above. In this example, we apply SVM with different kernels to the Pima Indian diabetes dataset in order to compare the performance of the Legendre polynomial kernel with the linear, rbf and polynomial kernels with respect to their Accuracy, Precision, Recall score, F1 score and Execution time (see Table 2).
The programs were implemented in Python 3.7 on a Windows 7 computer with 3 GB memory. The test size is equal to 0.2 and the penalty is taken to be C = 0.001.
5.3. Conclusion
From the comparative table above, it is clear that the Legendre polynomial kernel has a good separation property, with good precision and accuracy w.r.t. the classical predefined kernels in Python. Essentially, we have shown that it is not necessary to have an RKHS of infinite dimension in order to separate all the data in the feature space by a hyperplane. In fact, we can obtain a good classification with a moderate degree N; the more N increases, the better the separation we obtain. The only disadvantage is the time required to fit the model with the Legendre polynomial kernel. The idea used here can be generalized to an arbitrary sequence of orthogonal polynomials in order to obtain new kernel functions. The separation performance depends on the polynomial sequence, and it should be tested on further example datasets.
Acknowledgements
The authors gratefully acknowledge Qassim University, represented by the Deanship of Scientific Research, for the support for this research.