Support Vector Machines for Regression : A Succinct Review of Large-Scale and Linear Programming Formulations

Support Vector-based learning methods are an important part of Computational Intelligence techniques. Recent efforts have been dealing with the problem of learning from very large datasets. This paper reviews the most commonly used formulations of support vector machines for regression (SVRs) aiming to emphasize its usability on large-scale applications. We review the general concept of support vector machines (SVMs), address the state-of-the-art on training methods SVMs, and explain the fundamental principle of SVRs. The most common learning methods for SVRs are introduced and linear programming-based SVR formulations are explained emphasizing its suitability for large-scale learning. Finally, this paper also discusses some open problems and current trends.


Introduction
Statistical pattern recognition methods based on optimization methods have shifted the paradigm in the computational intelligence area.The reason is that most learning algorithms proposed in the past 20 years focused on heuristics or on analogies with natural learning systems, such as evolutionary algorithms (EA) or neural networks (NN).EAs and NNs were mostly the result of biological findings and experimental tuning of parameters.However, the underlying reasons for their performance was not fully understood.Most of the work was based on designing heuristics to avoid local minima trapping in the training (design) process.In contrast, recent statistical pattern recognition algorithms overcome many of these disadvantages by using known theories in the mathematical sciences, e.g., Mercer kernels [1], Hilbert spaces [2], and Kernel expansion [3].
In the last decade, a theoretical basis for learning machines has been developed to explicitly maximize their performance [4].The combination of learning machines with kernel functions has led to novel pattern recognition algorithms characterized by their formulation as convex optimization problems.Convex optimization problems have the advantage of being free from local minima.In particular, it is now possible to model non-linear pattern recognition systems under the convenience of a linear representation using kernel functions [3].
Support vector machines (SVMs) are arguably the best known example among kernel learning algorithms.Since their introduction in 1992 [5], SVMs have been studied, generalized, and applied to a number of problems.Currently, SVMs hold records in performance benchmarks for handwritten digit recognition [6], text categorization [7], information retrieval [8], and time-series prediction [9] and are commonly used in the analysis of DNA micro-array data [10].Due to its novelty and potential, there are many areas of improvement, particularly in the area of numerical methods that helps with the problem posing [11] (e.g., vectorial calculus and algebra) and optimization [12] (e.g., quadratic and linear programming).
In 1995, Vapnik et al. [13] and later Smola et al. [14] explored and developed the SVM approach for regretssion problems.This approach is commonly known as Support Vector Regression (SVR), which increased the application range of SVMs since SVRs can also perform multi-class pattern recognition [15].This type of machine is typically formulated using quadratic optimization under the umbrella of convex optimization.In return, the solution is always global and is easy to compute for very small problems.This paper is devoted to introduce, familiarize, and motivate the reader to the field of SVR, ultimately, for large-scale formulations posed as linear programs.
The rest of the paper is organized as follows.The state-of-the-art on training SVMs is summarized in Section 2, emphasizing the number of the training samples used.Section 3 defines the fundamental principle of SVRs.The two major learning methods for SVRs are introduced in Section 4, while in Section 5 it is introduced the linear programming-based SVR formulations, aiming to emphasize formulations suitable for large-scale learning.Finally, open problems, current trends, and conclusions are drawn in Section 6.

Large-Scale Support Vector Regression Training
The original training of an SVM [16] was only valid for small data sets and was designed to find the solution of a QP problem using a constrained conjugate gradient algorithm.The size of the problem is directly proportional to the number of training vectors, making its solution computationally intractable even for moderate size training sets.
Later Vapnik et al. [16] improved this method by considering only those variables associated to support vectors lying on the boundaries, whose gradient take them out of the feasible region.This modification allowed computational tractability of a limited number of largescale problems (less than 5000 samples).
In 1997, Osuna et al. [17] proposed a decomposition method that achieves optimality by solving a sequence of smaller sub-problems from randomly selected samples out of a training dataset.The algorithm selects the support vector coefficients that are active on either side of the optimal hyperplane (in a two class problem), then proceeds iteratively until the completion of the problem size.This algorithm performs satisfactory in applications of up to 110,000 data points.
Later in 1999, Joachims introduced the concept of "shrinking" under the SVM training context and named it "SVMlight" [18].The idea is to get rid of points that have less probability of becoming support vectors, thereby saving time in the optimization problem solution.Joachims' method demonstrated to be faster than Osuna's method.The author reported results for as many as 49,749 data points.
Also in 1999, Platt extended Osuna's work with an algorithm called "sequential minimal optimization" (SMO) [19].This is a famous algorithm implemented in several commercial and open-source software applications.The key idea behind Platt's method is to decompose the large QP problem into a series of very small QP sub-problems.These sub-problems are as small as they can be solvewithout a QP solver, thus much faster.Platt reported as many as 60,000 data points.
Collobert et al. [20] reported in 2001 an adaptation of Joachims' method for SVR problems named "SVMT orch" (see Section 4).The authors show the implementation of reduction algorithms and give mathematical proof of convergence.The authors also reported using as many as 60,000 data points.
Rifkin presented the "SVMFu" algorithm in 2002 [21].This work reviews and synthesizes Osuna's, Platt's, and Joachims' work.The key idea is to divide the large problem into sub-problems such that their Hessian matrices fit within memory limitations.In QP, Hessian matrices are useful since they describe the second partial derivatives of an n-dimensional function.Rifkin reports as many as 60,000 data points.Mangasarian et al. in 2002 [22], explored the very first linear programming (LP) approach to kernel-based regression.The authors proposed a linear programming chunking (LPC) algorithm with guaranteed termination.The method demonstrated efficiency, but slow convergence solving linear programs with commercial optimization software.The authors report solving linear programs with as many as 16,000 data points in a total of 19 days.
In 2005, Drineas et al. [23] developed an algorithm to approximate a low rank kernel matrix.The overall computational time is reduced.The authors report an interesting derivation of their ideas based on the Nystron method.
Continuing with QP-based methods, in 2006, Hush et al. [24] proposed an algorithm that produces approximate solutions without compromising accuracy in a QP SVM problem.The authors propose a first stage where the algorithm provides an approximate solution for the QP dual, and a second stage that maps the QP-dual problem to the QP-primal problem, based on the duality theorem.The authors report over a 100,000 samples data-set.
In 2006, Sra [25] proposed an algorithm that explicitly takes advantage of the LP formulation for SVMs to produce an efficient training method for large-scale problems.The method converts the LP problem into a QP problem, applies Bregman [26] and Hildreth [27] decomposition, and then the QP problem is solved.The author reports solving a problem with as many as 20,000,000 samples.
For some time, no new LP approaches for SVR were developed until 2009, when Lu et al. [28] reported an SVR method based on a novel LP formulation.The authors present an alternative of the typical kernels and use wavelet-based kernels instead.The paper shows an application to nonlinear dynamical systems identification, but the authors do not elaborate on the data-set size.
In 2010, by the time this paper was already in progress, Torii et al. [29] published work on a novel LP formu-lation along with three decomposition methods for training an LP-SVM approach.The authors report as many as 60,000 data points.
The development found up to date does not report any advances of sub-problem-based SVR approaches entirely motivated and aimed for large-scale LP-SVR formulations since Mangasarian et al. [22].Even the idea from Sra [25] that seems to work for up to 20,000,000 samples, fails to take advantage of the efficiency of LP solvers and still works for only a limited number of samples.
In spite of all these advances in SVMs and SVRs, training schemes for large-scale applications are still required.It is of high importance to address the problem of large-scale training as technology advances and large amounts of data is being stored for further analysis.For now, let us review briefly the concept of SVR.

Support Vector Regression Fundamental Principle
To begin with, let us consider the linear regression case where the dependency of a scalar observable on a regressor is denoted as follows: where the parameter vector and the bias are the unknowns.The problem is to estimate and given training samples where the vector elements i are assumed to be statistically independent and identically distributed (iid).The problem formulated by Vapnik is aimed to minimize, on the variables and , the structural risk functional where the variable i y is the estimator output produced in response to the input i , that is where is a prescribed parameter (see Figure 1). The structural functional (2) can be expressed as an optimization problem in its primal form as follows [30] where the summation in the cost function accounts for the -insensitive training error, which forms a tube where the solution is allowed to be defined without penalization, as shown in Figure 2 for the linear case, and in Figure 3 for the non-linear regression case.The constant describes the tradeoff between the training error and the penalizing term Next, we define the dual problem using the Lagrange multipliers method, where the following Lagrangian function is defined: , where , , , , , , , , , , 0, Then, the dual problem using the method of Lagrange multipliers is stated as follows: , , , , , 0, , , , , , , , 0 , , , , , , ,  (7) and the following problem is the expanded version of problem (7) above: , , , , , 0 However, it is possible to remove three constraints by noticing from (6a) that one can solve for as follows: and also one can solve for both  and   from (6c) and (6d) as which also yields the boundary condition 0   , . If we substitute the Equalities (9a)-(9c) into the objective function, and perform some analytic operations, we arrive to the following well known reduced dual problem [30]: , .
Problems ( 4) and (10) solve the linear regression prob-lems, and for the non-linear regression case one just introduces a kernel function formulation.For the primal case, the sole modification is on the restrictions which are redefined as follows:   , , for .For the dual problem the objective function is redefined as This formulation is used to implement learning methods discussed next.

Support Vector Regression Learning Methods
Support Vector Machines for Regression (SVRs) share the same advantages and disadvantages as SVMs.That means, SVMs training methods can be extended to SVRs whenever they are not directly dependent on the particularities of the SVM problem.Since most SVM training methods depend on the particular mathematical formulation of SVM, very few of them can be extended to SVR.Collobert et al. reported in 2001 an adaptation of Joachims' SVM method for SVR problems [20].The authors start with the primal problem in (4), and then they reformulate the dual (10) with a vector-matrix notation in order to have the following Quadratic Programming (QP) minimization problem: where is a kernel matrix.K Next, the authors perform the same decomposition proposed by Osuna [17] defining the working set and the fixed set , where the size of the working set is    .Also, the authors extend Joachims' idea of the steepest descent direction to select the working set at each iteration of the SVR dual problem.That is, the problem described in [18] is reformulated as follows: where , , , , , , and the size of the working set is given the by for all 0 i p  .Finally, the authors give proof of convergence following Platt;s SMO idea selecting a working set size of 2 samples.The problems with this implementation are inherited from Joachims' and augmented due to the complexity of the SVR definition.Particularly, the issues with this algorithm are the inherent QP solution process and the additional steepest descent process.Furthermore, they introduced the concept of shrinking, that is, to permanently remove from the working set the bounded support vectors after a number of iterations.Although the heuristic of shrinking dramatically speeds up the training process, convergence is not guaranteed anymore.However, Collobert's et al. shrinking method is still the most popular large-scale SVR training strategy.

Linear Programming Support Vector Regression
Linear Programing (LP) problems can be stated in the standard form (15) where is a vector containing the unknowns, and are vectors of known parameters, and is a matrix of known coefficients associated with through a linear relationship.However, there is also the canonical form  (16) In fact, there is currently no agreement as to which of the previous problems is the canonical or standard form.In this paper, problem (16) will be regarded as the LP primal in its canonical form.A λ s c s (17) where is a vector of dual variables defined over λ m  and s is a slack vector variable in .The solution to the primal problem is denoted as , and the solution to the dual problem is denoted as  . The duality theorem states that , which means that the solution also solves the dual, and the solution of the dual also solves the primal [31].
The idea of Linear Programming Support Vector Regression (LP-SVR) is to pose the SVR problem (4) as a linear program either in its canonical (16) or standard (15) forms.Before we continue with linear programming and SVRs, we need to inform the reader of the complexity of the typical SVR problems to justify further implementations with interior point methods.The following paragraphs will inform the reader in regard to the computational expenses on SVR learning methods.

SVR Learning
Let and be the number of variables and constraints respectively in an SVR-based optimization problem.To solve such optimization problems, there are three known strategies.First, if the problem can fit within system memory, the strategy is to directly use interior point methods which have memory consumption of . Second, if the problem is of medium size, active-set methods are proven to be more appropriate [32].They need memory, where SVs n is characterized by the number of support vectors of the problem.That is, active-set methods deal with the complete set of support vectors only.Third, for very large datasets, one currently uses working set methods (i.e., decomposition methods).Among these are Collobert's [20] or Mangasarian's [33].These methods have memory consumption of . The training strategies are summarized in Figure 4. Now, let us review the most common LP-SVR formulations that will provide the reader with the general knowledge on LP-SVRs.

v-LPR
Smolaetal in 1999 [34], first introduced a linear programming approach for SVR.Their formulation uses the following  -SVR model: where can be used to control the number of errors in  .this way, the value of  is automatically estimated.The In  -SVR comes rom a special type of SVM called f  -SVM [32]. he authors named their formulation as " T  -LPR", as in "nu-linear programming regression".Smola et al. poses problem (18) as a linear program in the following form: where , and the matrices The dual problem is derived in order to use a primal-dual interior point solver with a predictor-corrector strategy.
In doing this, they also derive the corresponding Karush-Kuhn-Tucker (KKT) conditions.Smola et al. [34] suggest that the parameter  is proportional to the number of support vectors.Although the main contribution is to show the efficiency of an SVR using LP, a critique is that the authors eliminated one parameter by adding another, i.e.,  instead of Moreover, the problems with LP-SVR models is that the program dimension is very large, i.e., 4 . 2 N  .As a consequence, the authors report performing regression with training sets consisting of only 50 and 80 data points.

LP-SVR with Modified LPC
Mangasarian et al. [22] in 2002 pioneered the linear programming formulations for large-scale SVRs.The authors analyzed the  -LPR [34] method and provided means for training with large-scale datasets.The SVR formulation they use is the following: where   is a positive parameter that drives the tolerance error : if  1   the errors are highly penalized.If one take a closer look to problems ( 18) and ( 21), one notices that In fact, the authors show that problems ( 18) and ( 21) are equivalent.The significance of this formulation is the use of a modified version of the Linear Programming Chunking (LPC) algorithm [33].The LPC algorithm was originally utilized to solve the problem with a subset of the constraints.However, Mangasarian et al. also proposed a modified version that constraint the variables as well.This modification promotes a smaller version of the problem, hence faster computations in terms of memory allocation.

Wavelet Kernel-Based LP-SVR
In 2009, Lu et al. [28] reported an SVR method based entirely in the LP primal in its standard form.The authors present an alternative to the typical kernels and use wavelet-based kernels instead.The paper shows an application to nonlinear dynamical systems identification [28].The authors start by changing the 2 2 λ -norm by the 1 λ -norm, and also use the notion of the kernel functions for nonlinear problems; and they reformulate problem (4) as follows: where  comes from the Representer Theorem nota- tion [35].Then, the authors arrive at the following LP problem in its primal standard form:   ; that is, the decomposition of the positive i   and the negative i   part of i  .Then, the authors show that this formulation performs around 25 times faster than the QP-SVR.No details were given with respect to the solver used, and no proof of convergence was addressed.The authors also show that in measuring the root mean squared error (RMSE), the QP-SVR performs better than their LP-SVR approach.However, they also report that this formulation requires fewer support vectors.Hence, a sparser solution is obtained, having the advantage of a reduced number of support vectors, which also represents computationally efficient solutions.

1 λ -Norm LP-SVR
In 2010, Zhang et al. [12] performed a study in regard to the sparseness of 1 λ -norm-based SVMs and SVRs.The authors reported the following formulation of an 1 λnorm SVR: , 2, , , where    (typically small: ) is used to avoid possible infinite optimal solutions, and u i is a slack variable.Then, defining 11 1 10   (23) the authors proceeded to solve the following linear programming variation: The evident problem with this formulation is the size of the problem: variables and constraints.Therefore, the author only reports experiments for small size problems; e.g., 351, 297, and 345 data points, using an RBF kernel.The poor formulation makes the size problem more evident since the authors used the simplex method.As a consequence very large training times will be experienced.One obvious critique is the inclusion of 6N  2 2N  , making the set of hyper-parameters   , , , C    to be chosen more heuristically.

Large-Scale LP-SVR
Recently, in 2011, Rivas et al. [36][37] attempted to solve the difficulties of previous formulations, particularly those that may hinder the learning phase.Similarly to previous formulations, the authors [36][37] assumed that the slack variables , i i    can be expressed as simply Then, introduced a slack variable to avoid inequalities in the SVR formulation.As a consequence of these assumptions, the following optimization problem was proposed: Then problem (27) was posed as a linear programming problem by defining the following equalities: where In comparison with the  -LPR formulation by Smola et al. [34], problem (27) 1) uses the canonical formulation, 2) computes , and implicitly, 3) does not compute implicitly, 4) does not require the parameter 5)   promotes efficiency in the sense of using only one  , and 6) is a lower dimensional problem.
In comparison with Mangasarian et al. [22], problem (27) 1) uses the canonical formulation, 2) computes implicitly, 3) does not compute implicitly, and 4) does not require the parameter b   .By 3) and 4) the experimenter has more control of the sparseness of the solution [12].In this case sparseness means fewer number of support vectors.
Similarly, problem (27)  λ -norm LP-SVR formulation by Zhang et al. [12] problem (27) does not require parameter  and is more efficient in several ways: 1) uses only one  , 2) avoids penalization of , 3) reduces computational efforts by forcing positivity in which reduces the LP problem size by , and 4) is a smaller problem.Finally, in order to find the solution to the problem, Rivas et al. [36][37] use an interior point methods-based solver to find the variables that satisfy the following KKT conditions: for 1, 2, , , i n  Λ z s where the equality i i implies that one of both variables must be zero.This equality is typically referred to as the complementarity condition.Note that the KKT conditions depend on the variables , and if the set of solutions

Conclusions, Current Trends, and Open Problems
In this paper we have explained the efforts being made towards a solution of large-scale learning using SVRs.
We have reviewed the most commonly used formulations of SVRs emphasizing its usability on large-scale applications.We reviewed the general concept of SVMs, addressed the state-of-the-art on training methods SVMs, and explained the fundamental principle of SVRs.The two most common learning methods for SVRs were introduced and linear programming-based SVR formulations are explained emphasizing its suitability for large-scale learning.The Support Vector learning field is still very active.As a consequence, the treatment of large-scale SVM remains being of interest to researchers in the field (see reference [38]).However, a number of open issues that have to be addressed still persist.Although recently, SVR algorithmic development seems to be at a more stable stage, one issue is to find whether linear programming SVR approaches will lead to more satisfactory results [12,14].The problem of empirical tuning [3,14] of SVR hyper-parameters (e.g., , C  ), has to be devised to make SVRs less dependent on the skill of the experimenter [14,39].Optimization techniques developed within the context of SVRs are also required in order to improve treatment of large datasets [14].This may be done in combination with reduced set methods for speeding up the training phase for large datasets.This topic is of huge importance as machine learning applications demand algorithms that are capable of dealing with datasets that are at least larger than 1 million samples [14].Other problems still considered open include the following: more data dependent generalization bounds, efficient training algorithms, and automatic kernel selection procedures.The authors would like to kindly invite highly motivated researchers to contribute in these areas as they can lead to a settlement on large-scale SV-based learning methods. :

Figure 1 .
Figure 1.Illustration of the -insensitive loss function.

Figure 2 .Figure 3 .
Figure 2. Linear regression: An -insensitive tube, fitted to the data points shown as circles. By introducing the Lagrange multipliers   , λ s into the primal problem above, one obtains the following dual s.t.

Figure 4 .
Figure 4. Depending on the problem size, one may find three different strategies for SVR training.
all the conditions, the problem is said to be solved.The set   primal-dual solution.