The paper establishes a data perturbation analysis theorem for the dual problem of the support vector classifier, from which the corresponding analysis of the primary problem can be performed through standard results. The theorem derives the partial derivatives of the optimal solution and the corresponding optimal decision function with respect to the data parameters, and thereby provides a basis for quantitative analysis of how data errors influence the optimal solution and the corresponding optimal decision function. The theorem thus lays the foundation for analyzing the stability and sensitivity of the support vector classifier.

Many methods of data mining exist, including machine learning, a major research direction of artificial intelligence. Statistical theory plays a fundamental role in machine learning. However, classical statistical theory is largely built on large-sample properties, while in practice we often face small samples, sometimes with a very limited number of observations due to resource or other constraints. Consequently, the performance of some large-sample methods may be unsatisfactory in real applications. Vapnik and his collaborators pioneered machine learning for finite samples in the 1960s and developed statistical learning theory [

Let the training data set be $T = \{(x_1, y_1), (x_2, y_2), \cdots, (x_m, y_m)\} \in (X \times Y)^m$, where $x_i \in X = \mathbb{R}^n$ is the training input and $y_i \in Y = \{-1, +1\}$ is the training output, $i = 1, 2, \cdots, m$. The essence of the SVC problem for data mining is to seek a real-valued function $f(x)$ on the training input space $\mathbb{R}^n$ such that the decision function $\operatorname{sgn} f(x)$ infers the category in $\{-1, +1\}$ of an arbitrary input $x \in X = \mathbb{R}^n$, where $\operatorname{sgn} f(x) = -1$ if $f(x) < 0$ and $\operatorname{sgn} f(x) = 1$ if $f(x) > 0$ [

The primary problem of the standard support vector classifier (SVC) is to

$$\min_{w \in H,\, b \in \mathbb{R},\, \psi \in \mathbb{R}^m} \ \tau(w, b, \psi) = \frac{1}{2} w^T w + \sum_{i=1}^{m} C_i \psi_i$$

subject to

$$g_i(w, b, \psi) = -y_i((x_i)^T w + b) - \psi_i + 1 \le 0, \quad i = 1, 2, \cdots, m,$$
$$g_{m+i}(w, b, \psi) = -\psi_i \le 0, \quad i = 1, 2, \cdots, m, \qquad (1)$$

where $T$ stands for transpose, $C_i > 0$, $i = 1, \cdots, m$ are penalty parameters, $\psi = (\psi_1, \cdots, \psi_m)^T$ is the vector of slack variables, $w$ consists of the slopes of the separating hyperplane, $b \in \mathbb{R}$ is its intercept, and $x_i = \phi(x_i)$.

The quadratic Wolfe dual problem of the above primary problem (1) is to

$$\min_{\alpha} \ W(\alpha) = \frac{1}{2} \alpha^T H \alpha - e^T \alpha$$

subject to

$$h(\alpha) = \alpha^T y = 0, \qquad C_i \ge h_i(\alpha) = \alpha_i \ge 0, \quad i = 1, 2, \cdots, m, \qquad (2)$$

where $\alpha = (\alpha_1, \cdots, \alpha_m)^T$, $y = (y_1, \cdots, y_m)^T$, $e = (1, \cdots, 1)^T$, $K(x_i, x_j) = (\phi(x_i))^T \phi(x_j)$ is the kernel function, and $H_{ij} = y_i y_j K(x_i, x_j)$.
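As a concrete illustration, the dual problem (2) can be solved numerically for a toy data set. The sketch below is not from the paper: the data, the penalty $C$, and the use of SciPy's SLSQP solver are all illustrative choices. It builds $H_{ij} = y_i y_j K(x_i, x_j)$ for a linear kernel and minimizes $W(\alpha)$ subject to $\alpha^T y = 0$ and $0 \le \alpha_i \le C_i$:

```python
# Illustrative sketch: solve the dual problem (2) for a tiny linearly
# separable data set with a linear kernel.  Data and C are arbitrary choices.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [1.5, 2.5], [0.0, 0.0], [0.5, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)
C = 10.0

K = X @ X.T                     # linear kernel: K(x_i, x_j) = x_i^T x_j
H = np.outer(y, y) * K          # H_ij = y_i y_j K(x_i, x_j)

def W(a):                       # dual objective: (1/2) a^T H a - e^T a
    return 0.5 * a @ H @ a - a.sum()

cons = [{"type": "eq", "fun": lambda a: a @ y}]  # h(alpha) = alpha^T y = 0
bnds = [(0.0, C)] * m                            # box: 0 <= alpha_i <= C_i

res = minimize(W, np.zeros(m), method="SLSQP", bounds=bnds, constraints=cons)
alpha = res.x

w = (alpha * y) @ X                              # w* = sum_i alpha_i* y_i x_i
sv = np.where((alpha > 1e-6) & (alpha < C - 1e-6))[0]
b = y[sv[0]] - X[sv[0]] @ w                      # b* from a margin support vector
```

For this separable toy set the recovered $(w, b)$ places every training point on the correct side with functional margin at least one, up to solver tolerance.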

The relationship between the optimal solution $\alpha_i^*$, $i = 1, \cdots, m$ of the dual problem (2) and the optimal solution $(w^*, b^*)$ of the primary problem (1) is given by

$$w^* = \sum_{i=1}^{m} \alpha_i^* y_i x_i, \qquad b^* = y_i - (\phi(x_i))^T w^* = y_i - \sum_{j=1}^{m} \alpha_j^* y_j K(x_j, x_i), \quad \text{for any } i \text{ with } \alpha_i^* \in (0, C_i).$$

The corresponding optimal decision function is $\operatorname{sgn} f(x)$, where $f(x) = \sum_{i=1}^{m} \alpha_i^* y_i K(x_i, x) + b^*$. Furthermore, each component $\alpha_i^*$ of the optimal solution $\alpha^*$ corresponds to a training input, and we call a training input $x_i$ a support vector if its corresponding $\alpha_i^* > 0$.
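These relations can also be seen in practice with scikit-learn, whose `SVC` exposes the dual quantities directly: `dual_coef_` holds $y_i \alpha_i^*$ for the support vectors, `support_vectors_` the corresponding inputs, and `intercept_` the value $b^*$. A minimal sketch with made-up data:

```python
# Illustrative sketch: read w*, b* and the decision function off a fitted SVC.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [1.5, 2.5], [0.0, 0.0], [0.5, -0.5]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)

# w* = sum_i alpha_i* y_i x_i, assembled from the support vectors only
w = (clf.dual_coef_ @ clf.support_vectors_).ravel()
b = clf.intercept_[0]

f = X @ w + b              # f(x) = (w*)^T x + b*
pred = np.sign(f)          # the optimal decision function sgn f(x)
```

For a linear kernel, `clf.decision_function(X)` and the hand-assembled $f(x)$ agree to numerical precision.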

In data mining, the training inputs $x_i$, $i = 1, 2, \cdots, m$ of the support vector classifier (SVC) model are approximations of the true values. When the approximate values are used to establish the SVC model, data errors inevitably affect the optimal solution and the corresponding optimal decision function. When an upper bound on the data errors is known, data perturbation analysis can be used to derive upper bounds on the resulting errors in the optimal solution and the corresponding optimal decision function.
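Before developing the analytic theory, the effect of a data perturbation can be illustrated by brute force: perturb one training input by a small amount, refit, and form a finite-difference quotient. The sketch below uses illustrative data and step size, and approximates the derivatives numerically rather than computing them analytically:

```python
# Illustrative sketch: finite-difference estimate of how (w*, b*) respond to a
# small perturbation of one training input.  Data, C and delta are arbitrary.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [1.5, 2.5], [0.0, 0.0], [0.5, -0.5]])
y = np.array([1, 1, -1, -1])

def fit_wb(X):
    clf = SVC(kernel="linear", C=10.0).fit(X, y)
    return (clf.dual_coef_ @ clf.support_vectors_).ravel(), clf.intercept_[0]

w0, b0 = fit_wb(X)

delta = 1e-4
Xp = X.copy()
Xp[0, 0] += delta          # perturb the first coordinate of x_1

w1, b1 = fit_wb(Xp)
dw = (w1 - w0) / delta     # finite-difference estimate of the derivative of w*
db = (b1 - b0) / delta     # ... and of b*
```

Large values of `dw` or `db` flag a solution that is sensitive to errors in that particular input coordinate.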

We are interested in the stability and sensitivity analysis of the optimal solution and its corresponding optimal decision function. Our analysis is based on Fiacco's theorem on the sensitivity analysis of a general class of nonlinear programming problems [

Suppose that $(w^*, b^*, \psi^*)$ is the optimal solution of the primary problem (1). Corresponding to this solution, we divide, for convenience, the training data $(x_1, y_1), (x_2, y_2), \cdots, (x_m, y_m)$ into categories A, B, and C as follows:

1) Category A: points satisfying $(x_i)^T w^* + b^* = 1$ and $y_i = +1$, or $(x_i)^T w^* + b^* = -1$ and $y_i = -1$. These points are denoted $(x_1, y_1), (x_2, y_2), \cdots, (x_t, y_t)$ for convenience. Denote $A = \{1, \cdots, t\}$.

2) Category B: points satisfying $(x_i)^T w^* + b^* > 1$ and $y_i = +1$, or $(x_i)^T w^* + b^* < -1$ and $y_i = -1$. These points are denoted $(x_{t+1}, y_{t+1}), (x_{t+2}, y_{t+2}), \cdots, (x_s, y_s)$ for convenience. Denote $B = \{t+1, \cdots, s\}$.

3) Category C: points satisfying $(x_i)^T w^* + b^* < 1$ and $y_i = +1$, or $(x_i)^T w^* + b^* > -1$ and $y_i = -1$. These points are denoted $(x_{s+1}, y_{s+1}), (x_{s+2}, y_{s+2}), \cdots, (x_m, y_m)$ for convenience. Denote $C = \{s+1, \cdots, m\}$.
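For a fitted linear SVC the three categories can be read off from the signed margins $y_i ((x_i)^T w^* + b^*)$. A small sketch follows; the data, penalty and tolerance are illustrative choices, and `A`, `B`, `Cc` are hypothetical variable names for the three index sets:

```python
# Illustrative sketch: sort training points into categories A, B, C by their
# signed margins y_i * f(x_i).  Data, C and the tolerance are arbitrary.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [1.5, 2.5], [0.0, 0.0], [0.5, -0.5], [1.2, 1.0]])
y = np.array([1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
margin = y * clf.decision_function(X)       # y_i ((x_i)^T w* + b*)

tol = 1e-3                                  # numerical tolerance for "on the margin"
A = np.where(np.abs(margin - 1) <= tol)[0]  # category A: on the margin
B = np.where(margin > 1 + tol)[0]           # category B: strictly outside
Cc = np.where(margin < 1 - tol)[0]          # category C: inside or misclassified
```

The three conditions are mutually exclusive and exhaustive, so every training index lands in exactly one category.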

We call those indices $i$ such that either $g_i(w, b, \psi) = 0$ or $g_{m+i}(w, b, \psi) = 0$ working constraint indices. Let us start with a lemma.

Lemma 1. Suppose that $(w^*, b^*, \psi^*)$ is the optimal solution of the primary problem (1). Then the set of all working constraint indices is given by

$$I(w^*, b^*, \psi^*) = \{1, \cdots, t, m+1, \cdots, m+t, m+t+1, \cdots, m+s, s+1, \cdots, m\} = A \cup \{m+i : i \in A\} \cup \{m+j : j \in B\} \cup C.$$

Proof. The proof is straightforward from the definition of a working constraint index and the observations that points in categories A and B have $\psi_i = 0$ while points in category C have $\psi_i > 0$. ∎
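The index bookkeeping in Lemma 1 can be sanity-checked with a tiny sketch; the values of $t$, $s$ and $m$ below are arbitrary illustrative choices:

```python
# Illustrative sketch: the working-index set of Lemma 1 for arbitrary t, s, m.
t, s, m = 2, 4, 6
A = set(range(1, t + 1))              # {1, ..., t}
B = set(range(t + 1, s + 1))          # {t+1, ..., s}
C = set(range(s + 1, m + 1))          # {s+1, ..., m}

# I = A ∪ {m+i : i ∈ A} ∪ {m+j : j ∈ B} ∪ C
I = A | {m + i for i in A} | {m + j for j in B} | C
```

Since $\{m+i : i \in A\} \cup \{m+j : j \in B\} = \{m+1, \cdots, m+s\}$, the union equals $\{1, \cdots, t\} \cup \{m+1, \cdots, m+s\} \cup \{s+1, \cdots, m\}$, as the lemma states.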

Next we offer a sufficient condition for the linear independence of the gradients indexed by the set $I(w^*, b^*, \psi^*)$.

Lemma 2. Let $\alpha^* = (\alpha_1^*, \cdots, \alpha_m^*)^T$ be the optimal solution of the dual problem (2). If there exists a component $\alpha_j^*$ of $\alpha^*$ such that $0 < \alpha_j^* < C_j$ for some $j \in \{1, 2, \cdots, m\}$, then the set of gradients $\left\{\left(\frac{\partial h}{\partial \alpha}\right)^T, \left(\frac{\partial h_i}{\partial \alpha}\right)^T, i \in I(w^*, b^*, \psi^*)\right\}$ is linearly independent.

Proof. For $i \in A$, suppose there exists $1 \le t_1 \le t$ such that $\alpha_i^* > 0$ for $i = 1, 2, \cdots, t_1$ but $\alpha_i^* = 0$ for $i = t_1 + 1, \cdots, t$. For $i \in B$, we always have $\alpha_i^* = 0$. For $i \in C$, we always have $\alpha_i^* = C_i$. The set of gradients $\left\{\left(\frac{\partial h}{\partial \alpha}\right)^T, \left(\frac{\partial h_i}{\partial \alpha}\right)^T, i \in I(w^*, b^*, \psi^*)\right\}$ becomes

$$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ \vdots \\ \vdots \\ \vdots \\ y_m \end{pmatrix},\ \begin{pmatrix} 0 \\ -1 \\ 0 \\ \vdots \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix},\ \begin{pmatrix} 0 \\ 0 \\ -1 \\ \vdots \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix},\ \cdots,\ \begin{pmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ -1 \\ 0 \\ \vdots \\ 0 \end{pmatrix},\ \begin{pmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix},\ \cdots,\ \begin{pmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}.$$

Since it is assumed that there is a support vector multiplier $\alpha_i^*$ such that $0 < \alpha_i^* < C_i$, the number of gradients in $\left\{\left(\frac{\partial h_i}{\partial \alpha}\right)^T, i \in I(w^*, b^*, \psi^*)\right\}$ must be smaller than $m$, the dimension of the vector $y = \left(\frac{\partial h}{\partial \alpha}\right)^T$. Therefore the vector $\left(\frac{\partial h}{\partial \alpha}\right)^T = y$, whose $m$ components are all either $-1$ or $+1$, cannot be expressed as a linear combination of the gradients in $\left\{\left(\frac{\partial h_i}{\partial \alpha}\right)^T, i \in I(w^*, b^*, \psi^*)\right\}$. Hence the set of constraint gradients is linearly independent. ∎

We use the lower-case letter $p$ to denote the training input variable and $p_0$ to denote the training input data. When the form of the feature mapping $\phi(x)$ is known, the following main theorem investigates how errors in the training input data affect the optimal solution and the corresponding optimal decision function of the SVC model.

Theorem 3. Suppose that $\alpha^* = (\alpha_1^*, \cdots, \alpha_m^*)^T$ is the optimal solution of the dual problem (2) when $p = p_0$, and the corresponding Lagrange multiplier is $b^*$. Furthermore, let $g^* = (g_1^*, \cdots, g_m^*)^T$ and $\psi^* = (\psi_1^*, \cdots, \psi_m^*)^T = (g_{m+1}^*, \cdots, g_{2m}^*)^T$. Suppose that the category A input vectors $x_1, x_2, \cdots, x_t$ are all support vectors with $0 < \alpha_i^* < C_i$, $i = 1, 2, \cdots, t$, and the set of vectors

$$\begin{pmatrix} y_1 \phi(x_1) \\ y_1 \end{pmatrix}, \begin{pmatrix} y_2 \phi(x_2) \\ y_2 \end{pmatrix}, \cdots, \begin{pmatrix} y_t \phi(x_t) \\ y_t \end{pmatrix}$$

corresponding to the points in category A is linearly independent. Then the following results hold.

1) The optimal solution $\alpha^* = (\alpha_1^*, \cdots, \alpha_m^*)^T$ is unique, and the corresponding multiplier $b^*$ as well as $g^* = (g_1^*, \cdots, g_m^*)^T$ and $\psi^* = (\psi_1^*, \cdots, \psi_m^*)^T = (g_{m+1}^*, \cdots, g_{2m}^*)^T$ are all unique.

2) There is a neighbourhood $N(p_0)$ of $p_0$ on which there is a unique continuously differentiable function $y(p) = (\alpha(p), b(p), g(p), \psi(p))$ such that

a) $y(p_0) = (\alpha^*, b^*, g^*, \psi^*)$,

b) $\alpha(p)$ is a feasible solution to the dual problem (2) for any $p \in N(p_0)$,

c) the partial derivatives of $y(p) = (\alpha(p), b(p), g(p), \psi(p))$ with respect to the data parameters satisfy

$$M(p) \begin{pmatrix} \left(\frac{\partial \alpha}{\partial p}\right)^T \\ \left(\frac{\partial b}{\partial p}\right)^T \\ \left(\frac{\partial g}{\partial p}\right)^T \\ \left(\frac{\partial \psi}{\partial p}\right)^T \end{pmatrix} = M_1(p),$$

where $M_1(p) = -\left[\frac{\partial(\nabla_\alpha L)}{\partial p}, g_1 \nabla_p \alpha_1, \cdots, g_m \nabla_p \alpha_m, \nabla_p(y^T \alpha)\right]^T$, $\nabla_\alpha L$ is the gradient of $L$ with respect to $\alpha$, $\nabla_p \alpha_i$, $i = 1, \cdots, m$ is the gradient of $\alpha_i$ with respect to $p$, $\nabla_p(y^T \alpha)$ is the gradient of $y^T \alpha$ with respect to $p$, $L$ is the Lagrange function of the dual problem (2), $M(p) = \begin{pmatrix} E & F \\ U & V \end{pmatrix}$, and

$$E = \begin{pmatrix} y_1^2 K(x_1, x_1) & y_1 y_2 K(x_1, x_2) & \cdots & y_1 y_m K(x_1, x_m) \\ y_2 y_1 K(x_2, x_1) & y_2^2 K(x_2, x_2) & \cdots & y_2 y_m K(x_2, x_m) \\ \vdots & \vdots & \ddots & \vdots \\ y_m y_1 K(x_m, x_1) & y_m y_2 K(x_m, x_2) & \cdots & y_m^2 K(x_m, x_m) \end{pmatrix},$$

$$F = \begin{pmatrix} -1 & 0 & \cdots & 0 & 1 & 0 & \cdots & 0 & y_1 \\ 0 & -1 & \cdots & 0 & 0 & 1 & \cdots & 0 & y_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & -1 & 0 & 0 & \cdots & 1 & y_m \end{pmatrix},$$

$$U = \begin{pmatrix} -g_1^* & 0 & \cdots & 0 \\ 0 & -g_2^* & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & -g_m^* \\ \psi_1^* & 0 & \cdots & 0 \\ 0 & \psi_2^* & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \psi_m^* \\ y_1 & y_2 & \cdots & y_m \end{pmatrix},$$

and

$$V = \begin{pmatrix} -\alpha_1 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 & 0 \\ 0 & -\alpha_2 & \cdots & 0 & 0 & 0 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & -\alpha_m & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cdots & 0 & \alpha_1 - C_1 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cdots & 0 & 0 & \alpha_2 - C_2 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 0 & 0 & 0 & \cdots & \alpha_m - C_m & 0 \\ 0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 & 0 \end{pmatrix}.$$

Proof.

Our results follow directly from Fiacco's theorem after checking the following three conditions required in it:

1) the Second Order Sufficiency Condition for $\alpha^*$;

2) Linear Independence of the Gradients over the Working Constraint Set;

3) the Strict Complementarity Property.

Proof of the Second Order Sufficiency Condition. Suppose that $\alpha^* = (\alpha_1^*, \cdots, \alpha_m^*)^T$ is the optimal solution to the dual problem. Then there exist multipliers $b^* \in \mathbb{R}$, $g^* = (g_1^*, \cdots, g_m^*)^T \in \mathbb{R}^m$ and $\psi^* = (\psi_1^*, \cdots, \psi_m^*)^T = (g_{m+1}^*, \cdots, g_{2m}^*)^T \in \mathbb{R}^m$ satisfying the Karush-Kuhn-Tucker conditions:

$$H\alpha^* - e + b^* y - g^* + \psi^* = 0,$$
$$g_i^* \ge 0 \text{ for any } i, \qquad \psi_i^* \ge 0 \text{ for any } i,$$
$$(g^*)^T \alpha^* = 0, \qquad (\psi^*)^T(\alpha^* - C) = 0,$$

where $H = (H_{ij})$ and $C = (C_1, \cdots, C_m)^T$.

The Hessian matrix of the Lagrangian function

$$L(\alpha^*, b^*, g^*, \psi^*) = \frac{1}{2}(\alpha^*)^T H \alpha^* - e^T \alpha^* + b^*\left((\alpha^*)^T y\right) - (g^*)^T \alpha^* + (\psi^*)^T(\alpha^* - C)$$

is the matrix $H$.

For $i \in A$, we have assumed that $\alpha_i^* > 0$. For $i \in B$, we have $g_i^* > 0$, $\alpha_i^* = 0$ and $\psi_i^* = 0$. For $i \in C$, we have $g_i^* = 0$, $\alpha_i^* = C_i$ and $\psi_i^* > 0$.

Define the set $Z = \{z : z^T y = 0;\ z_i \le 0,\ i = 1, \cdots, t;\ z_i = 0,\ i = t+1, \cdots, m\}$. Then for any $z = (z_1, \cdots, z_t, 0, \cdots, 0)^T \in Z$, $z \ne 0$, we have

$$z^T \nabla^2 L(\alpha^*, b^*, g^*, \psi^*) z = (z_1, \cdots, z_t, 0, \cdots, 0)\, H \begin{pmatrix} z_1 \\ \vdots \\ z_t \\ 0 \\ \vdots \\ 0 \end{pmatrix} = (z_1, \cdots, z_t)\, H^* \begin{pmatrix} z_1 \\ \vdots \\ z_t \end{pmatrix} = \left( \sum_{i=1}^{t} z_i y_i \phi(x_i) \right)^T \left( \sum_{i=1}^{t} z_i y_i \phi(x_i) \right),$$

where $\nabla^2 L(\alpha^*, b^*, g^*, \psi^*) = H$ is the Hessian matrix of $L(\alpha^*, b^*, g^*, \psi^*)$ and $H^*$ is the upper-left $t \times t$ sub-matrix of $H$.

Suppose that $z_1 y_1 \phi(x_1) + \cdots + z_t y_t \phi(x_t) = 0$. Because the set of vectors

$$\begin{pmatrix} y_1 \phi(x_1) \\ y_1 \end{pmatrix}, \begin{pmatrix} y_2 \phi(x_2) \\ y_2 \end{pmatrix}, \cdots, \begin{pmatrix} y_t \phi(x_t) \\ y_t \end{pmatrix}$$

corresponding to the points in category A is linearly independent and $z_1 y_1 + \cdots + z_t y_t = 0$, we must have $z_1 = \cdots = z_t = 0$. This contradicts the assumption that $z \ne 0$. Hence we must have $z_1 y_1 \phi(x_1) + \cdots + z_t y_t \phi(x_t) \ne 0$, which implies that $z^T \nabla^2 L(\alpha^*, b^*, g^*, \psi^*) z > 0$, and therefore the Second Order Sufficiency Condition is satisfied by $\alpha^*$.
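The key identity in this step, $z^T H z = \lVert \sum_{i} z_i y_i \phi(x_i) \rVert^2 \ge 0$, is easy to check numerically for a linear kernel, where $\phi$ is the identity; the random data below is an illustrative assumption:

```python
# Illustrative check of the quadratic-form identity used in the proof:
#   z^T H z = || sum_i z_i y_i phi(x_i) ||^2   (linear kernel, phi = identity)
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 3
X = rng.normal(size=(m, n))                 # rows are the inputs x_i
y = rng.choice([-1.0, 1.0], size=m)
z = rng.normal(size=m)

H = np.outer(y, y) * (X @ X.T)              # H_ij = y_i y_j x_i^T x_j
lhs = z @ H @ z
rhs = np.linalg.norm((z * y) @ X) ** 2      # || sum_i z_i y_i x_i ||^2
```

The two sides agree for any choice of $z$, which is exactly why $H$ is positive semi-definite and why strict positivity on $Z$ hinges on the linear independence assumed for category A.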

Proof of the Linear Independence of Gradients over the Working Constraint Set. This is Lemma 2.

Proof of the Strict Complementarity Property. We need to show that, for each $i$, $g_i^*$ and $\alpha_i^*$ do not vanish simultaneously, and $\psi_i^*$ and $\alpha_i^* - C_i$ do not vanish simultaneously. For $i \in A$, we have $0 < \alpha_i^* < C_i$ by assumption, so neither bound constraint is active. For $i \in B$, we have $\alpha_i^* = 0$ and multiplier $g_i^* > 0$. For $i \in C$, we have $\alpha_i^* = C_i$ and multiplier $\psi_i^* > 0$. Therefore the Strict Complementarity Property holds. ∎
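The block structure of $M(p)$ in Theorem 3 can be assembled and dimension-checked for a small problem. The sketch below uses placeholder values for $\alpha$, $g^*$, $\psi^*$ and $C$ (illustrative only, not a solved instance) and confirms that $M(p)$ is square of order $3m + 1$:

```python
# Illustrative sketch: assemble E, F, U, V of Theorem 3 and check dimensions.
# alpha, g, psi and C are placeholder values, not an actual optimal solution.
import numpy as np

m = 3
rng = np.random.default_rng(0)
X = rng.normal(size=(m, 2))
y = np.array([1.0, -1.0, 1.0])
alpha = np.array([0.5, 0.2, 0.3])
g = np.array([0.0, 0.1, 0.0])
psi = np.array([0.0, 0.0, 0.2])
C = np.full(m, 1.0)

K = X @ X.T                                             # linear kernel
E = np.outer(y, y) * K                                  # E_ij = y_i y_j K(x_i, x_j)
F = np.hstack([-np.eye(m), np.eye(m), y[:, None]])      # m x (2m+1)
U = np.vstack([np.diag(-g), np.diag(psi), y[None, :]])  # (2m+1) x m
V = np.zeros((2 * m + 1, 2 * m + 1))
V[:m, :m] = np.diag(-alpha)                             # -alpha_i block
V[m:2 * m, m:2 * m] = np.diag(alpha - C)                # (alpha_i - C_i) block

M = np.block([[E, F], [U, V]])                          # M(p), (3m+1) x (3m+1)
```

In a full implementation, the right-hand side $M_1(p)$ would be assembled from the derivatives of $\nabla_\alpha L$ with respect to the data, and the linear system in part 2c) solved for the sensitivities.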

The support vector classifier plays an important role in machine learning and data mining. Thanks to the standard results connecting the primary problem and its dual, analysis of the primary problem can be achieved by working on the dual. Our main result establishes the equation for the partial derivatives of the optimal solution and the corresponding optimal decision function with respect to the data parameters. Because the derivative measures the rate of change, and a large derivative implies a large rate of change and hence a sensitive and unstable solution, our main result provides the foundation for quantitative analysis based on these derivatives.

This work is funded with grants from the Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions CIT & TCD201404080 and New Start Academic Research Projects of Beijing Union University. Xikui Wang acknowledges research support from the Natural Sciences and Engineering Research Council (NSERC) of Canada and from the University of Manitoba.

The authors declare no conflicts of interest regarding the publication of this paper.

Cai, C. and Wang, X.K. (2018) Data Perturbation Analysis of the Support Vector Classifier Dual Model. Journal of Software Engineering and Applications, 11, 459-466. https://doi.org/10.4236/jsea.2018.1110027