
In this paper we study the problem of model selection for a linear programming-based support vector machine for regression. We propose a generalized method based on a quasi-Newton method that uses a globalization strategy and an inexact computation of first-order information. We explore the case of two-class, multi-class, and regression problems. Simulation results on standard datasets suggest that the algorithm achieves insignificant variability when measuring residual statistical properties.

Hyper-parameter estimation is currently one of the general open problems in SV learning [

Current efforts involve techniques such as k-fold cross validation [

In this research we use the k-fold cross-validation technique to estimate the true test generalization error. Along with this technique, we define error functions as measures of the training generalization error for both classification and regression problems. We propose to minimize the estimated true generalization error by adapting Newton's method with a line search. From the optimization point of view, the solution to the problem is non-trivial.

This article is organized as follows: In Section 2 we begin our discussion by defining a generalized method to minimize a set of error functions that will reduce the training generalization error of a generic pattern recognition problem. Then, in order to demonstrate the potential of this method, Section 3 discusses the usage of an LP-SVR formulation. The latter particularization is addressed in Section 4, where specific error functions are chosen to solve the problem. Section 5 discusses implementation details and other considerations of this research for this particular problem. The results are discussed in Section 6, and conclusions are drawn in Section 7.

Let us consider functions f_{i}(θ, T), for i = 1, …, m, such that f_{i} : R^{n} → R and each f_{i} is a real function representing some estimate of error; where θ ∈ R^{n} is a vector of parameters, and T = {(x_{j}, y_{j})}_{j=1}^{N} defines a training set given by N samples of the M-dimensional data vectors x_{j}, together with the desired output class values y_{j}. Then, let F be denoted as

That is, F(θ) represents m different measures of error, given model parameters θ and training data T. Here, we aim to make F(θ) = 0.

In this paper we will address the case when n = m, that is, when the number of model parameters to estimate, n, is equal to the number of error metrics, m, used to find such model parameters.

If F is differentiable, then it has a gradient, usually known as the Jacobian, given by J(θ) = [∇f_{1}(θ), …, ∇f_{n}(θ)]^{T},

where ∇f_{i}(θ) denotes the gradient of the i-th function, given by ∇f_{i}(θ) = (∂f_{i}/∂θ_{1}, …, ∂f_{i}/∂θ_{n})^{T}.

Since we want to find the vector of parameters θ^{*} that, given a training set, produces minimal error functions, such that F(θ^{*}) = 0, we can use Newton's method, assuming for now that F is continuously differentiable on R^{n}.

Newton's method has been used for many years; it is well known from basic calculus and is perhaps one of the most fundamental ideas in any numerical optimization course. This method can be summarized as shown in Algorithm 1.
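Algorithm 1 itself is not reproduced legibly here; as a hedged sketch (function names and tolerances are illustrative, not the paper's), the plain Newton iteration for solving F(θ) = 0 can be written as:

```python
import numpy as np

def newton(F, J, theta0, tol=1e-8, max_iter=50):
    """Solve F(theta) = 0 by Newton's method (illustrative sketch).

    F returns the vector of error functions; J returns the Jacobian.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        Fk = F(theta)
        if np.linalg.norm(Fk) < tol:
            break
        # Newton step: solve J(theta) d = -F(theta)
        d = np.linalg.solve(J(theta), -Fk)
        theta = theta + d
    return theta
```

For example, solving the toy system F(θ) = (θ_1^2 − 4, θ_2 − 1) from a nearby start converges to (2, 1) in a handful of iterations, reflecting the q-quadratic rate discussed next.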

Newton's method is known for its q-quadratic rate of convergence, finding a solution θ^{*} in very few iterations, provided such a solution exists.

This method is also known for one of its main disadvantages: it is a local method. Therefore, one needs to have in advance a vector of parameters that is close to an acceptable solution. To overcome this difficulty, we can establish a globalization strategy.

In our globalization strategy we use the following merit function: φ(θ) = (1/2)||F(θ)||^{2},

where ||·|| denotes the ℓ2-norm (a.k.a. the Euclidean norm). Then we define the following property.

Property 1. The Newton direction d is a descent direction for φ. That is, ∇φ(θ)^{T}d < 0 for the direction d in the system given by J(θ)d = −F(θ).

Proof. Let ∇φ(θ) be the derivative of the merit function (7), denoted as ∇φ(θ) = J(θ)^{T}F(θ).

Then, substituting (9) into (8) results in ∇φ(θ)^{T}d = F(θ)^{T}J(θ)d,

which, since J(θ)d = −F(θ), reduces to ∇φ(θ)^{T}d = −||F(θ)||^{2}.

Hence, ∇φ(θ)^{T}d < 0. ∎

Given that the merit function (7) indeed guarantees a descent direction at every iterate, we can establish the globalization strategy by defining the next property.

Property 2. If d is a descent direction of φ at θ, then there exists an α > 0 such that φ(θ + αd) < φ(θ).

The proof for this property is already given by Dennis et al. in 1996 [

The sufficient decrease condition φ(θ + αd) ≤ φ(θ) + βα∇φ(θ)^{T}d, combined with ∇φ(θ)^{T}d = −||F(θ)||^{2}, reduces to φ(θ + αd) ≤ φ(θ) − βα||F(θ)||^{2}, where β ∈ (0, 1) is a parameter controlling the speed of the line search. Typically β is small [

Using the line-search globalization strategy, we can modify Newton’s method to include a sufficient decrease condition (a.k.a. Armijo’s condition). The Globalized Newton method is as shown in Algorithm 2.
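Algorithm 2 can be sketched as follows, assuming the merit function φ(θ) = (1/2)||F(θ)||^{2} and a backtracking (halving) line search; the constant β = 1e-4 and all names are illustrative, not necessarily the paper's choices:

```python
import numpy as np

def globalized_newton(F, J, theta0, beta=1e-4, tol=1e-8, max_iter=50):
    """Newton's method with an Armijo backtracking line search on the
    merit function phi(theta) = 0.5 * ||F(theta)||^2 (sketch)."""
    theta = np.asarray(theta0, dtype=float)
    phi = lambda t: 0.5 * np.dot(F(t), F(t))
    for _ in range(max_iter):
        Fk = F(theta)
        norm2 = np.dot(Fk, Fk)
        if np.sqrt(norm2) < tol:
            break
        d = np.linalg.solve(J(theta), -Fk)
        alpha = 1.0
        # Armijo: phi(theta + a*d) <= phi(theta) - beta * a * ||F||^2
        while (phi(theta + alpha * d) > phi(theta) - beta * alpha * norm2
               and alpha > 1e-12):
            alpha *= 0.5   # backtrack
        theta = theta + alpha * d
    return theta
```

The only change with respect to the plain iteration is the inner loop, which shrinks the step until the sufficient decrease condition holds, so the merit function decreases at every iterate even from a poor starting point.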

Note that the algorithm's progress towards an acceptable solution is due to the new update step (15), which considers the new sufficient decrease condition. The following sections show how to find the parameters of the LP-SVR model.

In this paper we aim to find the parameters of a linear programming support vector regression (LP-SVR) approach that uses an infeasible primal-dual interior-point method to solve the optimization problem [

In order to describe the LP-SVR formulation we need to start with the ℓ1-SVR. The formulation of an ℓ1-SVR (i.e., norm-1 SVR) problem is as follows:

where α is the vector of Lagrange multipliers associated with the support vectors (SVs); the summation in the cost function accounts for the ε-insensitive training error, which forms a tube where the solution is allowed to be defined without penalization; C is a constant describing the trade-off between the training error and the penalizing term; the variable ξ is a nonnegative slack variable that describes the ε-insensitive loss function; y_{i} is the desired output in response to the input vector x_{i}; the variable b is a bias; K(·, ·) is any valid kernel function (see [11,12]). The parameter vector α and the bias b are the unknowns, and can take on any real value.
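Equation (16) did not survive extraction; a standard ℓ1-SVR formulation consistent with the description above (a hedged reconstruction, not necessarily the authors' exact statement) is:

```latex
\min_{\alpha,\, b,\, \xi,\, \xi^{*}} \;
  \|\alpha\|_{1} + C \sum_{i=1}^{N} \left(\xi_i + \xi_i^{*}\right)
\quad \text{s.t.} \quad
\begin{cases}
  y_i - \sum_{j=1}^{N} \alpha_j K(\mathbf{x}_i, \mathbf{x}_j) - b
      \le \varepsilon + \xi_i, \\[2pt]
  \sum_{j=1}^{N} \alpha_j K(\mathbf{x}_i, \mathbf{x}_j) + b - y_i
      \le \varepsilon + \xi_i^{*}, \\[2pt]
  \xi_i,\ \xi_i^{*} \ge 0, \qquad i = 1, \ldots, N,
\end{cases}
```

where the two inequality families encode the ε-insensitive tube from above and below, matching the terms enumerated in the paragraph above.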

Then, since the requirement of an LP-SVR is to have the unknowns greater than or equal to zero, we typically decompose such variables into their positive and negative parts. Therefore, we denote α = α^{+} − α^{−}, and b = b^{+} − b^{−}. Then, in order to pose the problem as a linear program in its canonical form, and in order to use an interior-point method solver, problem (16) must have no inequalities; thus, we need to add a slack variable u, which altogether results in the following problem

which is the formulation we used in the analysis presented in this paper, along with an interior-point solver and a radial-basis function (RBF) kernel with parameter γ.

Note that (17) allows us to define the following equalities:

which is an acceptable linear program of the form:

Note that this problem has variables and constraints.

This is a definition more appropriate than the one described by Lu et al. in late 2009 [

Since the LP-SVR definition suffers from an increase in dimensionality, it is suggested that large-scale applications use the approach presented in [

Typically, in computational intelligence methods and pattern recognition, we want to minimize the true test error, and the true test error has to be measured in some way. The measurement of the test error is in fact model-dependent. In the following paragraphs we choose error metrics particular to classification and regression problems for the LP-SVR model in (17).

In this paper we want to particularize F and estimate the model vector of parameters. The error functions we want to use for multi-class problems are two: a modified estimate of scaled error rate (ESER), and the balanced error rate (BER). The ESER metric is given by

where a scaling factor is used only to match the ESER to a desired range of values; f(x_{i}) denotes the outcome of the LP-SVR classifier when an input vector x_{i} is fed to the LP-SVR's input; and the function f_{1} is given by the following equation:

that approximates the unit step function. The step function is widely used, and in this case the quality of its approximation depends directly on the slope parameter, as shown in

In all of our experiments the slope parameter is fixed, in which case f_{1} takes values only within the interval (0, 1).
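The displayed form of f_{1} did not survive extraction; a common smooth approximation of the unit step consistent with the description is a logistic sigmoid, whose values lie in (0, 1). The slope parameter name and default below are assumptions, not the paper's:

```python
import math

def f1(z, slope=10.0):
    """Logistic sigmoid approximating the unit step; values in (0, 1).

    A larger slope gives a sharper approximation of the step; the
    parameter name and default are illustrative, not the paper's.
    """
    return 1.0 / (1.0 + math.exp(-slope * z))
```

With a large slope, f1 is close to 0 for negative arguments and close to 1 for positive ones, so summing f1 over misclassification margins approximates an error count.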

The ESER could become biased towards false positive counts, especially if we have a large number of unbalanced class examples. Therefore, we use the BER which is defined as follows:

where TP, FP, TN, and FN stand for “True Positive,” “False Positive,” “True Negative,” and “False Negative,” respectively. (The reader may find Tables 1 and 2 useful in understanding or visualizing the concepts above using multi-class confusion matrices.)
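The displayed BER formula is lost above; the standard definition consistent with the text (assumed here) averages the per-class error rates:

```python
def balanced_error_rate(tp, fp, tn, fn):
    """BER = 0.5 * (FN / (TP + FN) + FP / (TN + FP)).

    Averages the error rate on the positive class and on the negative
    class, so unbalanced class sizes do not bias the measure.
    """
    pos_err = fn / (tp + fn) if (tp + fn) else 0.0
    neg_err = fp / (tn + fp) if (tn + fp) else 0.0
    return 0.5 * (pos_err + neg_err)
```

For instance, a classifier that is perfect on one large class but wrong on half of a small class still receives a BER of 0.25, which is the balancing behavior motivating the metric.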

Whenever there are equal numbers of positive and negative examples, which is the case when TN + FP = FN + TP (see [

On the other hand, for a two-class approach it is more convenient to use the area under the receiver operating characteristic (ROC) curve, as well as the BER metric. It is well known that maximizing the area under the ROC curve (AUC) leads to better classifiers, and it is therefore desirable to find ways to maximize the AUC during the training step of supervised classifiers. The AUC is estimated by adding successive areas of trapezoids. For a complete treatment of the ROC concept and the AUC algorithm, please consult [

The area under the ROC curve, AUC, is computed using Algorithms 1, 2, and 3 from [
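The cited algorithms are not reproduced here; a minimal trapezoidal AUC estimate over ranked scores, in the spirit of the summation of trapezoids described above, can be sketched as follows (ties between scores are not specially handled in this sketch):

```python
def auc_trapezoid(scores, labels):
    """Area under the ROC curve via trapezoids, sweeping the decision
    threshold down the sorted scores; labels are 0/1."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    P = sum(labels)             # number of positives
    N = len(labels) - P         # number of negatives
    tp = fp = 0
    area = 0.0
    prev_fpr = prev_tpr = 0.0
    for _, y in pairs:
        if y == 1:
            tp += 1
        else:
            fp += 1
        fpr, tpr = fp / N, tp / P
        # add the trapezoid between consecutive ROC points
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_fpr, prev_tpr = fpr, tpr
    return area
```

A perfectly separating scorer yields an AUC of 1.0, while ranking every negative above every positive yields 0.0.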

For regression problems, the error metrics have to be different. The next paragraphs explain functions that can be used.

In regression we want to use a different measure of error. The error functions we want to use for regression are two: the sum of squared errors (SSE), and a metric based on the statistical properties of the residual error. The SSE metric is given by

where f(x_{i}) is the actual output of the LP-SVR when the input vector x_{i} is presented at its input.

The second metric is based on the statistical properties of the residual error given by the difference e_{i} = y_{i} − f(x_{i}). From estimation theory it is known that if the residual error has an expected value equal to zero and unit variance, we have achieved the least-squares solution to the regression problem, whether linear or non-linear. Furthermore, it is understood that as the variance of the residual error approaches zero, the regression problem is better solved. Let us denote the expected value of the residual error as

and the variance of the residual error as follows

from where it is desired that both quantities approach zero. Hence, the second error metric is defined as:

where the squared-mean term plays the role of the absolute value of the mean, since it is easier to handle in optimization problems.
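The displayed metrics were lost in extraction; a sketch consistent with the descriptions above computes the SSE together with a residual-statistics metric formed by the squared residual mean plus the residual variance (this particular combination is an assumption, as is every name below):

```python
import numpy as np

def regression_errors(y, y_hat):
    """SSE and a residual-statistics metric for a regression fit.

    The second value combines the squared mean and the variance of the
    residuals; both vanish when the residuals are centered and tight.
    """
    e = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    sse = float(np.sum(e ** 2))
    mu = float(np.mean(e))
    var = float(np.var(e))
    residual_metric = mu ** 2 + var
    return sse, residual_metric
```

A perfect fit drives both values to zero, which is exactly the condition F(θ) = 0 sought by the Newton iteration of Section 2.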

In this section we follow the development presented in Section 2 and couple it with the metrics of Section 4, to find the model parameters of the formulation in Section 3.

Particularizing (1) for the cases presented in Sections 3 and 4, the formulation of F simply becomes

where clearly n = m. The typical challenge is to compute the Jacobian matrix, since not all the error functions are differentiable, e.g., (25) or (26). The classical approaches are to estimate J via a finite-difference approximation or a secant approximation. For convenience, we used the finite-difference approximation. In this case, the approximate Jacobian corresponds to a finite-difference derivative approximation that solves (2) using (3), where

allowing the perturbation step to be sufficiently small, as appropriate.
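A forward-difference Jacobian approximation of the kind described can be sketched as follows (the step size h is illustrative):

```python
import numpy as np

def fd_jacobian(F, theta, h=1e-6):
    """Forward-difference Jacobian approximation:
    J[i, j] ~ (f_i(theta + h*e_j) - f_i(theta)) / h."""
    theta = np.asarray(theta, dtype=float)
    F0 = np.asarray(F(theta))
    J = np.empty((F0.size, theta.size))
    for j in range(theta.size):
        perturbed = theta.copy()
        perturbed[j] += h            # perturb one coordinate at a time
        J[:, j] = (np.asarray(F(perturbed)) - F0) / h
    return J
```

Note that each column costs one extra evaluation of F; since here each evaluation of F involves cross validation (Section 5.3), the Jacobian is the dominant cost of every outer iteration.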

Although the globalization strategy presented in Section 2 prevents Newton's method from going far away from a solution, and guarantees a decrease of the error at each iteration, it does not guarantee a global minimum, since the problem is not convex. As a consequence, we needed to implement a common approach to find the initial vector of parameters: varying the parameters over predefined intervals and observing which pair of parameters produces the minimum error. Although this approach is very effective at finding a good starting point, it requires a loop of iterations, which is sometimes very costly depending on the application.
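This coarse search can be sketched as a grid search; the parameter names C and gamma (the trade-off constant and the RBF parameter) and the example grids are assumptions, and cv_error stands for the cross-validated error estimate of Section 5.3:

```python
import itertools

def grid_search(cv_error, C_grid, gamma_grid):
    """Return the (C, gamma) pair minimizing a cross-validated error.

    cv_error is a user-supplied callable evaluating one configuration;
    the exhaustive loop makes the cost |C_grid| * |gamma_grid| calls.
    """
    best, best_err = None, float("inf")
    for C, gamma in itertools.product(C_grid, gamma_grid):
        err = cv_error(C, gamma)
        if err < best_err:
            best_err, best = err, (C, gamma)
    return best, best_err
```

The returned pair is then used as the initial vector of parameters for the globalized Newton iteration.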

Estimating the true test error given a training set is not trivial. Two popular approaches to this problem exist: k-fold and leave-one-out cross validation.

In our implementation and experiments we have used the former approach, which led us to define the following rule to find a “good” number of partitions in T, that is:

where N_{max} represents the maximum number of constraints in (22) that is computationally tractable; the function ⌈·⌉ represents a round-up operation; and k is the number of partitions in T. In the case of a large-scale implementation, N_{max} is directly set to the maximum working-set size parameter.
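Since the displayed rule is lost, one plausible reading (an assumption) is k = ⌈N / N_{max}⌉, i.e., just enough partitions that each fold stays within the tractable constraint budget:

```python
import math

def num_partitions(N, N_max):
    """One plausible reading of the partition rule: the smallest k
    such that N / k does not exceed the tractable budget N_max."""
    return math.ceil(N / N_max)
```

For example, with N = 1000 training points and a budget of 300 tractable constraints, this rule selects k = 4 partitions.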

The partitions of the set T are denoted as S = {S_{1}, S_{2}, …, S_{k}}, where S_{j} denotes the j-th partition and contains the indexes of those training data points in T assigned to it. Therefore, we say that the partitions are disjoint, S_{i} ∩ S_{j} = ∅ for i ≠ j, and that together they cover the whole training set, S_{1} ∪ S_{2} ∪ ⋯ ∪ S_{k} = {1, 2, …, N}.

It is understood that the main idea behind this method is to partition the training set into k groups of data (ideally of equal size), then train the classifier with k − 1 of them and use the remaining data as a validation set. The process is repeated for all the partitions and the error is averaged as follows

where e_{j} is the error obtained for the j-th partition, and the average is an estimate of the true test error.
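The averaging just described can be sketched as follows; train_and_eval is a placeholder (an assumption) for training the LP-SVR on one fold split and returning its validation error:

```python
import numpy as np

def kfold_error(train_and_eval, N, k):
    """Average validation error over k folds of the indexes 0..N-1.

    train_and_eval(train_idx, val_idx) trains on one subset and
    returns the validation error for the held-out subset.
    """
    folds = np.array_split(np.arange(N), k)
    errs = []
    for j in range(k):
        val = folds[j]
        train = np.concatenate([folds[i] for i in range(k) if i != j])
        errs.append(train_and_eval(train, val))
    return float(np.mean(errs))
```

Each of the error functions of Section 4 can be evaluated this way, which is why every evaluation of F inside the Newton iteration carries a full cross-validation cost.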

The complete algorithm, considering all the refinements and particularizations for the particular case study of the LP-SVR (shown as Algorithm 3), requires as input the cross-validation indexes S, as well as the training set from which the LP-SVR parameters producing the minimum error will be estimated.

Then, the algorithm proceeds using the approximation to the true Jacobian (31)-(33) shown in Section 5.1. However, note that every single function evaluation in (31) requires cross validation, as explained in Section 5.3. As a consequence, computing the Jacobian implies internal cross validation. The remaining steps are the linear-system solution, Armijo's condition, and the update.

The linear system in Step 5.4 requires special attention, especially in a large-scale setting. If this is the case, one possible approach is to use any well-known direct method such as LU factorization [

where λ_{min} is the minimum eigenvalue of the Jacobian, ε is a constant sufficiently small that it cannot be interpreted as zero, and I is the identity matrix of the same size as the Jacobian. In (39) we typically chose a very small ε.
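Since the displayed shift (39) is lost, the following is a hedged sketch of an eigenvalue-based regularization consistent with the description, shifting the Jacobian only when its minimum eigenvalue makes the system (near-)singular:

```python
import numpy as np

def regularize_jacobian(J, eps=1e-8):
    """Shift J by (|lambda_min| + eps) * I when lambda_min <= 0, so the
    linear system J d = -F becomes solvable (sketch; the exact shift
    used in the paper's (39) is an assumption)."""
    lam_min = float(np.min(np.real(np.linalg.eigvals(J))))
    if lam_min <= 0.0:
        J = J + (abs(lam_min) + eps) * np.eye(J.shape[0])
    return J
```

A well-conditioned Jacobian passes through unchanged, so the modification only activates where the plain Newton step would fail.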

The stopping criteria used in this algorithm includes three conditions.

First, a condition that monitors whether the algorithm has reached an exact solution, that is,

where, ideally, the error functions evaluate exactly to zero at the solution.

Second, the norm of the objective function is monitored, which measures the distance to zero (or to a certain threshold) from an approximate solution at iteration t. That is,

where the threshold is some small value, ideally close to zero.

Third, we set a condition that measures the change between solutions at each iterate, as follows

where the tolerance is typically set to a very small value.

Condition (42) states an early-stopping criterion for when the algorithm shows no variability in its updates at each iteration. However, it may happen that the algorithm is indeed changing the solution at each iterate, yet this represents no significant progress towards a solution. In such a case, another classical early-stopping criterion is used: a maximum number of iterations. The criterion is simply

where the bound is the maximum number of iterations permitted.
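The three conditions plus the iteration budget can be combined as in this sketch (the threshold names tau, delta, and t_max are illustrative, not the paper's):

```python
import numpy as np

def should_stop(F_t, theta_t, theta_prev, t,
                tau=1e-8, delta=1e-10, t_max=100):
    """Combined stopping test for the globalized Newton iteration:
    exact solution, small objective norm, negligible parameter change,
    or iteration budget reached."""
    if np.all(F_t == 0):                                  # exact solution
        return True
    if np.linalg.norm(F_t) < tau:                         # near a solution
        return True
    if np.linalg.norm(theta_t - theta_prev) < delta:      # no progress
        return True
    return t >= t_max                                     # budget reached
```

Checking the conditions in this order makes the cheap exact test fire first and leaves the iteration cap as the fallback safeguard.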

To show the effectiveness and efficiency of the proposed model-selection method, we performed simulations over different datasets. A summary of the properties of these datasets is shown in

First, let us consider the results shown in

A second part of our simulation involves using a testing set. For this purpose, let us define

as the testing set, containing the samples available for testing. The testing set has never been shown to the LP-SVR model before.

The simulation results in

Moreover, in

Although the statistical properties of the residuals on the testing set demonstrate that the approach behaves acceptably, the reader must be aware that this approach has some characteristic properties that may lead to unexpected results. First, the algorithm works with an approximation to first-order information, which in the worst case may also be singular. Second, the algorithm is not convergent to a global minimum; however, a “good” initial point is obtained as explained in Section 5.2. Third, the globalization strategy may become computationally expensive if the first-order information leads far away from the solution. A good way to reduce the computational expense of finding a step size that produces a sufficient decrease at each iterate can be found in textbooks [9,17]; in these, the most common state-of-the-art approach is to decrease the step size following a backtracking pattern. This approach has been demonstrated to be efficient and is widely used in the community. However, further research must be conducted on the three aspects mentioned above.

Furthermore, different or additional error functions may be studied, as well as the case when more LP-SVR parameters are being estimated. Moreover, the concepts discussed in this paper can also be applied to other support vector (SV)-based learning machines with little or no modification.

An algorithm for LP-SVR model selection has been discussed in this paper. We proposed a quasi-Newton method for function minimization that uses a globalization strategy and an inexact computation of first-order information; the Jacobian is computed via finite-difference techniques. We have explored the cases of two-class and multi-class problems, as well as regression.

Simulation results suggest that the algorithm achieves insignificant variability when measuring residual statistical properties. These simulations included mostly standard benchmark datasets from real-life applications, along with a few synthetic datasets.

This paper discussed a particularization of the generalized method introduced at the beginning of the paper; this method can be used to train other types of SV-based formulations. This research significantly advances the natural problem of model selection in most of today's SV-based classifiers, whose hyper-parameters lie outside the problem formulation.

The author P. R. P. performed part of this work while at NASA Goddard Space Flight Center as part of the Graduate Student Summer Program (GSSP 2009) under the supervision of Dr. James C. Tilton. This work was supported in part by the National Council for Science and Technology (CONACyT), Mexico, under Grant 193324/ 303732 and mentored by Dr. Greg Hamerly who is with the department of Computer Science at Baylor University. Finally, the authors acknowledge the support of the Large-Scale Multispectral Multidimensional Analysis (LSMMA) Laboratory (www.lsmmalab.com).