Legendre Polynomial Kernel: Application in SVM

In machine learning, the Support Vector Machine (SVM) is a classification method. For non-linearly separable data, kernel functions are a basic ingredient of the SVM technique. In this paper, we briefly recall some useful results on the decomposition of RKHS. Based on orthogonal polynomial theory and the Mercer theorem, we construct the high power Legendre polynomial kernel on the cube $[-1,1]^d$. After presenting the theoretical background of SVM, we evaluate the performance of this kernel on some illustrative examples in comparison with the Rbf, linear and polynomial kernels.


Introduction
The reproducing kernel approach has many applications in probability theory, in statistics and, more recently, in machine learning (see [1] [2] [3]). Reproducing kernels have been applied in many fields of the life sciences, such as pattern recognition, biology, medical diagnosis, chemistry and bio-informatics.
It has been well known for many years that data sets can be modeled by a family of points $X \subset \mathbb{R}^d$, and that the similarity between them is given by an inner product on $\mathbb{R}^d$. The classification problem consists of separating these points into classes with respect to given properties. In the simplest situation, the points are linearly separable, in the sense that there exists a hyperplane $H_s$ separating the two classes.
Support vector machines (SVMs) have become a very powerful tool in machine learning, in particular for classification and regression problems ([1] [4] [5]).
In a classification problem we may consider only two classes of data, in which case we speak of binary classification. We may also encounter more than two classes of data; this is called a multi-class classification problem. In this paper, we focus on binary classification and the application of kernel functions in SVM classification.
In the simplest situations, the linear SVM technique finds the optimal hyperplane separating the points into two classes. Unfortunately, in concrete examples these points are often not linearly separable. One can then express the similarity of points in terms of a positive definite kernel function $k$ on $\mathcal{X}$. This leads to a new technique, called non-linear SVM, which combines linear SVM with kernel tools.
Mathematically, this non-linear SVM approach is related to the Kolmogorov representation $(\mathcal{H}, \Phi)$, where $\mathcal{H}$ is the feature space and $\Phi$ is the feature map (see [1]). Since separation expresses a degree of similarity between points in the same class, Vapnik used the kernel approach to translate the problem from the initial space $\mathbb{R}^d$ to the feature space. The transfer between the initial space and the feature space is made through the feature map, and similarities are expressed by the inner product in the feature space, which is given by a kernel $k$:
$$k(x, y) = \langle \Phi(x), \Phi(y) \rangle_{\mathcal{H}}.$$
The crucial idea of kernel methods is that non-linearly separable points can be transformed into linearly separable points in the feature space while preserving similarities. Furthermore, the solution of the classification problem using non-linear SVM is given by a decision function which depends only on the kernel and the support vectors $x_i \in S$. Precisely, it takes the form
$$f(x) = \operatorname{sign}\Big( \sum_{x_i \in S} \lambda_i y_i \, k(x_i, x) + b \Big).$$
Since the choice of the kernel function is not canonical, the first problem with this approach is how to choose a suitable kernel for given data points. The second problem is that the feature space may be infinite-dimensional, which causes technical difficulties when designing computational algorithms; this problem is also related to the feature map $\Phi$, which is in general unknown. In our case, we use the Mercer decomposition theorem to obtain a polynomial-type kernel with a good separation property. This means that the infinite-dimensional feature space can be approximated by a finite-dimensional one, which reduces the dimensionality of the data in the feature space.
The paper is organized as follows. In Section 2 we recall some known results on Reproducing Kernel Hilbert Spaces (RKHS in what follows), in particular the Mercer decomposition theorem 2.2 and the high power kernel theorem 2.3. Section 3 is devoted to the main result, in which we introduce the one-dimensional Legendre polynomial kernel $K_n$ and give its canonical decomposition in terms of Legendre orthogonal polynomials (see Theorem 3.1); the RKHS of the associated high power kernel is the tensor product Hilbert space of the RKHS associated with $K_n$. In Section 4 we recall the theoretical foundation of linear and non-linear SVMs. In Section 5 we give some illustrative examples in order to evaluate the performance of the Legendre polynomial kernel in comparison with some kernels predefined in Python.

Preliminaries
In this section, we begin by describing the RKHS and its associated kernel. Then we give the decomposition of a given positive definite kernel on a measure space $\mathcal{X}$ (see Theorem 2.2).

Definition 1. (Positive definite kernel). Let $\mathcal{X}$ be a nonempty set. A symmetric function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called positive definite if, for all $n \in \mathbb{N}$, $a_1, \ldots, a_n \in \mathbb{R}$ and $x_1, \ldots, x_n \in \mathcal{X}$,
$$\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \, k(x_i, x_j) \geq 0,$$
and, for mutually distinct $x_i$, the equality holds only when all the $a_i$ are zero.
Clearly, every inner product is a positive definite kernel. Moreover, if $\mathcal{H}$ is any Hilbert space, $\mathcal{X}$ a nonempty set and $\Phi : \mathcal{X} \to \mathcal{H}$, then $k(x, y) = \langle \Phi(x), \Phi(y) \rangle_{\mathcal{H}}$ is a positive definite kernel. Now let $\mathcal{H}$ be a Hilbert space of functions mapping from some nonempty set $\mathcal{X}$ to $\mathbb{R}$, i.e., $\mathcal{H}$ is considered as a subset of $\mathbb{R}^{\mathcal{X}}$. We write the inner product on $\mathcal{H}$ as $\langle f, g \rangle_{\mathcal{H}}$, $f, g \in \mathcal{H}$, and the associated norm will be denoted by $\|f\|_{\mathcal{H}} = \langle f, f \rangle_{\mathcal{H}}^{1/2}$.
We may alternatively write the function $t \mapsto k(t, x)$ as $k(\cdot, x)$. The space $\mathcal{H}$ is called a reproducing kernel Hilbert space (RKHS) with reproducing kernel $k$ if $k(\cdot, x) \in \mathcal{H}$ for every $x \in \mathcal{X}$ and the reproducing property
$$f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}}$$
holds for each pair $(f, x) \in \mathcal{H} \times \mathcal{X}$; in this case we write $\mathcal{H} = \mathcal{H}_k$ for the RKHS associated with $k$.

Theorem 2.2. (Mercer decomposition). Let $k$ be a continuous positive definite kernel on a compact measure space $(\mathcal{X}, \mu)$. Then there exist an orthonormal family $\{e_j\}$ of $L^2(\mathcal{X}, \mu)$ and nonnegative numbers $\{\lambda_j\}$ such that
$$k(x, y) = \sum_{j} \lambda_j e_j(x) e_j(y),$$
where the convergence is absolute for each pair $(x, y)$ and uniform on $\mathcal{X} \times \mathcal{X}$.

Theorem 2.3. (Power of kernel [4] [7])
Let $k$ be a kernel on $\mathcal{X}$. Then, for every $m \in \mathbb{N}$, the power $K = k^m$ is also a kernel on $\mathcal{X}$. In addition, there is an isometric isomorphism between $\mathcal{H}_K$ and the Hilbert space tensor product $\mathcal{H}_k^{\otimes m}$.

Example 1. Here we give some of the kernels most commonly used in SVM:
1) The linear kernel: $k(x, y) = x \cdot y$.
2) The polynomial kernel: $k(x, y) = (x \cdot y + c)^p$, with $c \geq 0$ and $p \in \mathbb{N}$.
3) The exponential kernel: $k(x, y) = \exp\left( -\dfrac{\|x - y\|}{2\sigma^2} \right)$.
4) The radial basis function (Rbf): $k(x, y) = \exp\left( -\dfrac{\|x - y\|^2}{2\sigma^2} \right)$.
5) The sigmoid kernel: $k(x, y) = \tanh(\alpha \, x \cdot y + c)$.
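As a concrete illustration, these classical kernels can be sketched in NumPy as follows. This is a minimal sketch: the function names and the parameter choices (`c`, `p`, `sigma`, `alpha`) are illustrative defaults, not values prescribed by the paper.

```python
import numpy as np

# Illustrative implementations of the kernels listed in Example 1.
# Parameter values are hypothetical defaults chosen for demonstration.

def linear_kernel(x, y):
    return np.dot(x, y)

def polynomial_kernel(x, y, c=1.0, p=3):
    return (np.dot(x, y) + c) ** p

def exponential_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) / (2 * sigma ** 2))

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def sigmoid_kernel(x, y, alpha=0.01, c=0.0):
    return np.tanh(alpha * np.dot(x, y) + c)
```

Note that each of these functions is symmetric in its arguments, and the Rbf kernel always equals 1 when evaluated at a pair of identical points.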

Legendre Polynomial Kernel and High Power Legendre Polynomial Kernel
Let us consider the family of Legendre polynomials $\{L_n\}_{n \geq 0}$ defined by the recurrence relation (see [8])
$$L_0(x) = 1, \quad L_1(x) = x, \quad (n+1) L_{n+1}(x) = (2n+1) \, x \, L_n(x) - n L_{n-1}(x).$$
It is well known [8] that they are orthogonal w.r.t. the inner product
$$\langle f, g \rangle = \int_{-1}^{1} f(x) g(x) \, \mathrm{d}x.$$
In order to apply the Mercer theorem in SVM, we consider the Legendre polynomial kernel
$$K_n(x, y) = \sum_{k=0}^{n} L_k(x) L_k(y), \quad x, y \in [-1, 1]. \tag{3.1}$$

Theorem 3.1. Let
$$\mathcal{H}_n = \operatorname{span} \{ L_k, \ 0 \leq k \leq n \} \tag{3.5}$$
be the linear space generated by the family $\{L_k\}_{0 \leq k \leq n}$. Then
1) $\mathcal{H}_n$ is a Hilbert space endowed with the inner product
$$\Big\langle \sum_{k=0}^{n} a_k L_k, \ \sum_{k=0}^{n} b_k L_k \Big\rangle_{n} = \sum_{k=0}^{n} a_k b_k. \tag{3.6}$$
2) The sequence $\{L_k\}_{0 \leq k \leq n}$ is an orthonormal basis of $\mathcal{H}_n$.
3) $\mathcal{H}_n$ is a RKHS with reproducing kernel $K_n$.

4) The feature map associated with $K_n$ is given by
$$\Phi_n(x) = (L_0(x), L_1(x), \ldots, L_n(x)) \in \mathbb{R}^{n+1}.$$
Proof. First: Clearly, (3.6) defines an inner product on $\mathcal{H}_n$, for which $\mathcal{H}_n$ becomes a Hilbert space since it is finite-dimensional.
Second: From the definition of the inner product (3.6), it is clear that $\{L_k\}_{0 \leq k \leq n}$ is an orthonormal system of $\mathcal{H}_n$ whose cardinality coincides with the dimension of $\mathcal{H}_n$. Thus it is an orthonormal basis of $\mathcal{H}_n$. Third: Let us consider the integral operator $T_n$ defined on $\mathcal{H}_n$ by
$$T_n f(x) = \int_{-1}^{1} K_n(x, t) f(t) \, \mathrm{d}t,$$
which is self-adjoint, compact and positive definite.
For all $k = 0, 1, \ldots, n$, we have $T_n L_k = \frac{2}{2k+1} L_k$, so $\{L_k\}_{0 \leq k \leq n}$ is an orthonormal basis of $\mathcal{H}_n$ formed by eigenvectors of $T_n$. From Mercer theorem 2.2, the RKHS associated with $K_n$ is given by
$$\mathcal{H}_{K_n} = \Big\{ \sum_{j=0}^{n} \alpha_j L_j : \alpha_j \in \mathbb{R} \Big\},$$
so, for every sequence $(\alpha_j)$, the space $\mathcal{H}_{K_n}$ coincides with $\mathcal{H}_n$ given by (3.5). Thus $\mathcal{H}_n$ is the RKHS associated with the reproducing kernel $K_n$.
The reproducing property is immediate from the Mercer theorem, but it can also be checked by hand using (3.6): for $f = \sum_{k=0}^{n} a_k L_k$,
$$\langle f, K_n(\cdot, x) \rangle_{n} = \Big\langle \sum_{k=0}^{n} a_k L_k, \ \sum_{k=0}^{n} L_k(x) L_k \Big\rangle_{n} = \sum_{k=0}^{n} a_k L_k(x) = f(x).$$
The feature map is given by $\Phi_n(x) = (L_0(x), L_1(x), \ldots, L_n(x))$. Let us now introduce the high power Legendre polynomial kernel $K_n^m = (K_n)^m$: by Theorem 2.3, it is again a positive definite kernel, and its RKHS is isometrically isomorphic to the Hilbert space tensor product of $m$ copies of $\mathcal{H}_n$, which allows the construction to be carried over to the cube $[-1, 1]^d$.
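As a minimal sketch, the one-dimensional kernel $K_n$ of (3.1) can be evaluated with NumPy's Legendre utilities. The helper name `legendre_kernel` is hypothetical, and the code uses NumPy's standard (unnormalized) Legendre polynomials, which satisfy the recurrence above.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_kernel(x, y, n=20):
    """Sketch of the one-dimensional Legendre polynomial kernel
    K_n(x, y) = sum_{k=0}^{n} L_k(x) L_k(y) on [-1, 1]."""
    total = 0.0
    for k in range(n + 1):
        c = np.zeros(k + 1)
        c[k] = 1.0  # coefficient vector selecting the k-th Legendre polynomial
        total += legendre.legval(x, c) * legendre.legval(y, c)
    return total
```

For instance, with $n = 1$ the kernel reduces to $K_1(x, y) = 1 + xy$, since $L_0(x) = 1$ and $L_1(x) = x$; the kernel is symmetric in its arguments, as required of a positive definite kernel.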

Application of Kernels in Classification Problem
Binary classification means that, given data points $\{(x_i, y_i)\}_{i=1}^{N}$ with $x_i \in \mathbb{R}^d$ and labels $y_i \in \{-1, +1\}$, we want to construct a decision function assigning each new point to one of the two classes.

Support Vector Machine: Linearly Separable Classification
When the training samples are linearly separable, we speak of a linear classification (i.e., there exists a hyperplane separating the two classes). The idea of SVM is the following: let us consider the set of training points $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$ are the corresponding labels (see Figure 1).
Step 1. Recall that any affine hyperplane $\mathcal{H}_S$ is described by the equation
$$w \cdot x + b = 0,$$
where "$\cdot$" is the dot product in $\mathbb{R}^d$ and $w$ is normal to the hyperplane. So we have to determine the appropriate values of $w$ and $b$ for the hyperplane $\mathcal{H}_S$. If we now consider only the points that lie closest to the separating hyperplane, i.e., the Support Vectors (shown in circles in the diagram), then the two planes $\mathcal{H}_1$ and $\mathcal{H}_{-1}$ that these points lie on can be described by:
$$w \cdot x + b = 1 \quad \text{and} \quad w \cdot x + b = -1.$$
Referring to Figure 1, we define $\delta_1$ as the distance from $\mathcal{H}_1$ to the hyperplane $\mathcal{H}_S$ and $\delta_2$ as the distance from $\mathcal{H}_{-1}$ to $\mathcal{H}_S$, and we set $\delta = \delta_1 + \delta_2$. The quantity $\delta$ is known as the SVM's margin. In order to orientate the hyperplane to be as far from the Support Vectors as possible, we need to maximize this margin. It is known from [1] that this margin is equal to
$$\delta = \frac{2}{\|w\|}.$$
Now the problem is to find $w$ and $b$ such that the margin $\delta$ is maximal and
$$w \cdot x_i + b \geq 1 \ \text{ for } \ y_i = +1 \quad \text{and} \quad w \cdot x_i + b \leq -1 \ \text{ for } \ y_i = -1. \tag{4.3}$$
This is equivalent to the constrained optimization problem
$$\min_{w, b} \ \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \geq 1, \quad i = 1, \ldots, N, \tag{4.8}$$
which is in fact a quadratic programming optimization (Q.P. optimization). Using the Lagrange multipliers $\lambda = (\lambda_1, \ldots, \lambda_N)$, this leads to the dual problem
$$\max_{\lambda} \ \sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i,j=1}^{N} \lambda_i \lambda_j y_i y_j \, x_i \cdot x_j \quad \text{subject to} \quad \lambda_i \geq 0, \quad \sum_{i=1}^{N} \lambda_i y_i = 0.$$
This is a convex quadratic optimization problem, and we run a QP-solver which returns $\lambda$. We can then deduce $w$ and $b$, which are given by [1]
$$w = \sum_{s \in S} \lambda_s y_s x_s, \qquad b = \frac{1}{N_S} \sum_{s \in S} \Big( y_s - \sum_{m \in S} \lambda_m y_m \, x_m \cdot x_s \Big), \tag{4.6}$$
where $S$ is the set of the support vectors $x_s$ (i.e., the vectors with indices $i$ for which $\lambda_i > 0$) and $N_S$ is the number of support vectors.

Step 2.
The second step is to create the decision function $f$ which determines to which class a new point belongs. From [1], the decision function is given by
$$f(x) = \operatorname{sign}\Big( \sum_{s \in S} \lambda_s y_s \, x_s \cdot x + b \Big).$$
Thus, for a new data point $x$, $f(x) = 1$ means that $x$ belongs to the first class and $f(x) = -1$ means that $x$ belongs to the second class.
In practice, in order to use an SVM to solve a linearly separable binary classification problem, we need to:
1) Create the matrix $H$ with entries $H_{ij} = y_i y_j \, x_i \cdot x_j$.
2) Find $\lambda$ maximizing $\sum_{i} \lambda_i - \frac{1}{2} \lambda^{\top} H \lambda$ subject to $\lambda_i \geq 0$ and $\sum_{i} \lambda_i y_i = 0$.
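These steps can be sketched with scikit-learn, whose `SVC` class solves the dual QP internally. The toy data below is illustrative (not from the paper), and the large value of `C` approximates the hard-margin problem.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two classes mirrored about the origin.
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
              [-1.0, -1.0], [-2.0, -2.5], [-3.0, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# A large C approximates the hard-margin linear SVM.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

# w = sum over support vectors of lambda_s * y_s * x_s, as in (4.6).
w = clf.coef_[0]
b = clf.intercept_[0]
print("support vectors:", clf.support_vectors_)
print("decision for a new point:", clf.predict([[2.0, 2.0]]))
```

The fitted attributes `coef_` and `intercept_` correspond to $w$ and $b$, while `support_vectors_` gives the set $S$ used in the decision function.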

SVM for Data That Is Not Fully Linearly Separable
In order to extend the SVM methodology to handle data that is not fully linearly separable, we relax the constraints (4.3) slightly to allow for misclassified points. This is done by introducing positive slack variables $\xi_i \geq 0$, $i = 1, \ldots, N$:
$$w \cdot x_i + b \geq 1 - \xi_i \ \text{ for } \ y_i = +1 \quad \text{and} \quad w \cdot x_i + b \leq -1 + \xi_i \ \text{ for } \ y_i = -1,$$
which can be combined into
$$y_i (w \cdot x_i + b) \geq 1 - \xi_i, \qquad \xi_i \geq 0.$$
In this soft margin SVM, data points on the incorrect side of the margin boundary incur a penalty that increases with the distance from it. As we are trying to reduce the number of misclassifications, a sensible way to adapt our objective function (4.8) is to solve
$$\min_{w, b, \xi} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i,$$
where the parameter $C$ controls the trade-off between the slack variable penalty and the size of the margin. Similarly to the previous case, the Lagrange method leads to the convex quadratic optimization problem
$$\max_{\lambda} \ \sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i,j=1}^{N} \lambda_i \lambda_j y_i y_j \, x_i \cdot x_j \quad \text{subject to} \quad 0 \leq \lambda_i \leq C, \quad \sum_{i=1}^{N} \lambda_i y_i = 0.$$
We run a QP-solver which returns $\lambda$. The values of $w$ and $b$ are calculated in the same way as in (4.6), though in this instance the set of Support Vectors used to calculate $b$ is determined by the indices $i$ for which $0 < \lambda_i < C$. In practice, we need to:
1) Create the matrix $H$ and solve the QP problem above.
2) Select a suitable value for the parameter $C$, which determines how significantly misclassifications should be treated.
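The role of $C$ can be illustrated with a small sketch (assuming scikit-learn; the overlapping data is synthetic): a smaller $C$ tolerates more slack, so more points typically end up as support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian clouds: not fully linearly separable.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 1.0, (50, 2)),
               rng.normal(-1.0, 1.0, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)

# Smaller C -> wider soft margin -> typically more support vectors.
for C in (0.01, 1.0, 100.0):
    n_sv = SVC(kernel="linear", C=C).fit(X, y).n_support_.sum()
    print(f"C = {C:>6}: {n_sv} support vectors")
```

Running this loop shows the support-vector count shrinking as $C$ grows, which matches the box constraint $0 \leq \lambda_i \leq C$ in the dual problem above.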

Non-Linear SVM
In the case when the data points are not linearly separable, i.e., there is no hyperplane separating the data into two classes, we have to modify the data in order to obtain linearly separable points. This is based on kernel functions. It is worth noting that, in the case of linearly separable data, the decision function requires only the dot products of the data points $x_i$ with the input vector $x$. In fact, when applying the SVM technique to linearly separable data, we started by creating a matrix $H$ and the scalar $b$ from the dot products of our input variables: $x_i \cdot x_j$, $1 \leq i, j \leq N$.
This is an important observation for the Kernel Trick. Indeed, the dot product will be replaced by a kernel, which is also a positive definite function. The idea is based on the choice of a kernel function $k$, and the trick is to map the data into a high-dimensional feature space $\mathcal{H}$ via a transformation $\phi$ related to $k$, in such a way that the transformed data are linearly separable; $\phi : \mathcal{X} \to \mathcal{H}$ is called the feature map. When the separating hyperplane is mapped back into the original space, it describes a surface.
Similarly to the previous section, we adopt the same procedure of separation at the level of the feature space $\mathcal{H}$. This leads to the following steps.
1) Choose a kernel function k.
Note that, in general, the feature map $\phi$ is unknown, so $w$ is also unknown.
But we do not need it! We only need the values of the kernel at the training points (i.e., $k(x_i, x_j)$, $1 \leq i, j \leq N$).
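This is exactly how kernel SVMs are used in practice: scikit-learn's `SVC` accepts a callable that returns the Gram matrix of kernel values at the training points. The sketch below plugs in a Legendre-based Gram function; the coordinate-wise sum of one-dimensional kernels is an illustrative multivariate choice (a sum of positive definite kernels is positive definite), not necessarily the paper's construction.

```python
import numpy as np
from numpy.polynomial import legendre
from sklearn.svm import SVC

def legendre_gram(X, Y, n=5):
    """Gram matrix of the 1-D Legendre kernel applied coordinate-wise
    and summed over coordinates: K(x, y) = sum_j K_n(x_j, y_j)."""
    G = np.zeros((X.shape[0], Y.shape[0]))
    for k in range(n + 1):
        c = np.zeros(k + 1)
        c[k] = 1.0                    # select the k-th Legendre polynomial
        Lx = legendre.legval(X, c)    # shape (N, d), evaluated entrywise
        Ly = legendre.legval(Y, c)
        G += Lx @ Ly.T                # sum over coordinates and degrees
    return G

# Toy data in [-1, 1]^2 with a circular class boundary
# (not linearly separable in the original space).
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 0.7, 1, -1)

clf = SVC(kernel=legendre_gram).fit(X, y)
print("training accuracy:", clf.score(X, y))
```

Since the circular boundary is a quadratic (hence additive) function of the coordinates, the degree-5 Legendre features are rich enough for the transformed data to be separated well in the feature space.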

Numerical Simulations
A dataset of female patients of the Pima Indian population, aged at least twenty-one years, has been taken from the UCI machine learning repository. This dataset is originally owned by the National Institute of Diabetes and Digestive and Kidney Diseases. It contains 768 instances in total, classified into two classes, diabetic and non-diabetic, with eight different risk factors: Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree Function and Age. To diagnose diabetes in the Pima Indian population, the performance of all kernels is evaluated with respect to metrics such as Accuracy, Precision, Recall score, F1 score and execution time.
Before giving the experimental results, we recall some characteristics of the confusion matrix (Table 1) and the related evaluation metrics. Accuracy measures the proportion of correct predictions among both positive and negative observations; in other words, accuracy tells us how often we can expect the machine learning model to correctly predict an outcome out of the total number of predictions it made. The confusion matrices corresponding to each of the kernels mentioned above are given in Figures 2-5, where the test size is 0.05 and the penalty coefficient is C = 0.01.

In order to show the powerful separation properties of the kernel defined by (3.1), our numerical simulations were carried out on the diabetes detection model mentioned above. In this example we apply SVM with different kernels to the Pima Indian diabetes dataset in order to compare the performance of the Legendre polynomial kernel with the linear, Rbf and polynomial kernels with respect to their Accuracy, Precision, Recall score, F1 score and execution time (see Table 2).

Journal of Applied Mathematics and Physics

The programs were implemented in Python 3.7 on a Windows 7 computer with 3 GB of memory. The test size is equal to 0.2 and the penalty is taken to be C = 0.001.
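The metrics above are all computed from the entries of the confusion matrix. A short sketch with scikit-learn, on illustrative labels rather than the actual Pima Indians predictions:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# Illustrative true labels and predictions (1 = diabetic, 0 = non-diabetic).
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy :", accuracy_score(y_true, y_pred))   # (TP+TN)/(TP+TN+FP+FN)
print("precision:", precision_score(y_true, y_pred))  # TP/(TP+FP)
print("recall   :", recall_score(y_true, y_pred))     # TP/(TP+FN)
print("f1       :", f1_score(y_true, y_pred))         # harmonic mean of the two
```

Here the confusion matrix has TP = 4, TN = 4, FP = 1 and FN = 1, so all four metrics evaluate to 0.8.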

Conclusion
From the comparative table above, it is clear that the Legendre polynomial kernel has a good separation property, with good precision and accuracy w.r.t. the classical kernels predefined in Python. Essentially, we have shown that it is not necessary to have an infinite-dimensional RKHS in order to separate the whole dataset by a hyperplane in the feature space. In fact, we can obtain a good classification with a moderate degree $N = 20$; the more $N$ increases, the better the separation we obtain. The only disadvantage is the time required to fit the model with the Legendre polynomial kernel. The idea used here can be generalized to an arbitrary sequence of orthogonal polynomials in order to obtain new kernel functions. The performance of the separation property depends on the polynomial sequence and should be tested on further example datasets.