A Fast Algorithm for Training Large Scale Support Vector Machines

Abstract

The manuscript presents an augmented Lagrangian fast projected gradient method (ALFPGM) with an improved working set selection scheme, pWSS, a decomposition-based algorithm for training support vector classification machines (SVM). The manuscript describes the ALFPGM algorithm, provides numerical results for training SVMs on large data sets, and compares the training times of ALFPGM and the Sequential Minimal Optimization (SMO) algorithm from the Scikit-learn library. The numerical results demonstrate that ALFPGM with the improved working set selection scheme can train SVMs with tens of thousands of training examples in a fraction of the training time of some widely adopted SVM tools.

Aregbesola, M. and Griva, I. (2022) A Fast Algorithm for Training Large Scale Support Vector Machines. Journal of Computer and Communications, 10, 1-15. doi: 10.4236/jcc.2022.1012001.

1. Introduction

A support vector machine (SVM) [1] [2] is a supervised machine learning technique used for classification and regression [3]. SVMs were initially designed for binary classification and have been extended to multiclass classification, which can be implemented by combining multiple binary classifiers using pairwise coupling or one-class-against-the-rest methods [4]. The main advantage of support vector machines is their ability to achieve accurate generalization and their foundation on a well-developed learning theory [2].

Training a dual soft margin SVM requires solving a large scale convex quadratic problem with a linear constraint and simple bounds on the variables. To solve large scale SVM problems, decomposition methods such as sequential minimal optimization (SMO) [5] gained popularity due to their efficiency and low memory requirements. At each iteration, SMO keeps all variables except two fixed and solves a small two-dimensional subproblem. In an effort to investigate the performance of decomposition-based algorithms with larger working sets, a previous manuscript [6] presented an augmented Lagrangian fast projected gradient method (ALFPGM) with working set selection for finding the solution to the SVM problem. The ALFPGM used larger working sets and trained SVMs with up to 19020 training examples. The current study extends the previous results by presenting an updated working set selection method, pWSS, and testing the improved algorithm on larger data sets with up to tens of thousands of training examples.

The paper is organized as follows. The next section describes the SVM training problem, the ALFPGM and SMO algorithms. Section 3 describes a new working set selection principle responsible for an efficient implementation of ALFPGM. Section 4 provides some details of the calculations. Section 5 provides numerical results. Section 6 contains concluding remarks and Section 7 discusses some future directions of further improving the efficiency of the developed algorithm.

2. SVM Problem

To train the SVM, one needs to find $\alpha^* = (\alpha_1^*, \dots, \alpha_m^*)^T$ that solves the following optimization problem:

$$\begin{aligned}
\underset{\alpha}{\text{minimize}}\quad & -\sum_{i=1}^{m}\alpha_i + \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j K(x_i, x_j)\\
\text{subject to}\quad & \sum_{i=1}^{m}\alpha_i y_i = 0,\qquad 0 \le \alpha_i \le C,\; i = 1,\dots,m.
\end{aligned}\tag{1}$$

Here $\{(x_1, y_1), \dots, (x_m, y_m)\}$ is a set of $m$ labeled data points, where $x_i \in \mathbb{R}^n$ is a vector of features and $y_i \in \{-1, 1\}$ is the label indicating the class to which $x_i$ belongs. The matrix $Q$ with the elements $q_{ij} = y_i y_j K(x_i, x_j)$ is positive semi-definite but usually dense.
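As a small illustration of this definition, the following NumPy sketch builds $Q$ for a toy data set with an RBF kernel; the data, the kernel choice, and the value of gamma are illustrative and are not taken from the manuscript.

```python
import numpy as np

def rbf(xi, xj, gamma=0.5):
    """Single RBF kernel value K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    d = xi - xj
    return np.exp(-gamma * np.dot(d, d))

# Toy data: m = 4 examples, n = 2 features, labels in {-1, +1}.
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

# q_ij = y_i * y_j * K(x_i, x_j): symmetric, positive semidefinite, dense.
Q = np.array([[y[i] * y[j] * rbf(X[i], X[j]) for j in range(m)]
              for i in range(m)])
print(np.allclose(Q, Q.T), np.linalg.eigvalsh(Q).min() >= -1e-10)
```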

The sequential minimal optimization (SMO) algorithm developed by Platt [7] is one of the most efficient and widely used algorithms to solve (1). It is a low memory usage algorithm that iteratively solves two-dimensional QP subproblems. The QP subproblems are solved analytically. Many popular SVM solvers are based on SMO.

There have been several attempts to further speed up SMO [8] [9] [10], including attempts to parallelize it [10]. However, the selection of the working set in SMO depends on the violating pairs, which change at each iteration, making it challenging to develop a parallelized SMO algorithm [6]. Other attempts have focused on parallelizing the selection of larger working sets of more than one pair. The approach presented here follows the latter path.

2.1. SVM Decomposition

Matrix Q is dense, and it is inefficient and often memory prohibitive to populate and store Q in operating memory for large training data. Therefore, decomposition methods that consider only a small subset of variables per iteration [11] [12] are used to train an SVM with a large number of training examples.

Let B be the subset of l selected variables, called the working set. Since each iteration involves only l rows and columns of the matrix Q, the decomposition methods use the operating memory economically [11].

The algorithm repeats the select-working-set-then-optimize process until the global optimality conditions are satisfied. While B denotes the working set with l variables, N denotes the non-working set with $(m - l)$ variables. Then $\alpha$, $y$ and $Q$ can be correspondingly written as:

$$\alpha = \begin{bmatrix}\alpha_B\\ \alpha_N\end{bmatrix}, \qquad y = \begin{bmatrix}y_B\\ y_N\end{bmatrix}, \qquad Q = \begin{bmatrix}Q_{BB} & Q_{BN}\\ Q_{NB} & Q_{NN}\end{bmatrix}.$$

The optimization subproblem can be rewritten as

$$\begin{aligned}
\underset{\alpha_B}{\text{minimize}}\quad & \frac{1}{2}\begin{bmatrix}\alpha_B^T & \alpha_N^T\end{bmatrix}\begin{bmatrix}Q_{BB} & Q_{BN}\\ Q_{NB} & Q_{NN}\end{bmatrix}\begin{bmatrix}\alpha_B\\ \alpha_N\end{bmatrix} - \begin{bmatrix}e_B^T & e_N^T\end{bmatrix}\begin{bmatrix}\alpha_B\\ \alpha_N\end{bmatrix}\\
\text{subject to}\quad & \alpha_B^T y_B + \alpha_N^T y_N = 0,\\
& 0 \le \alpha_B \le C,
\end{aligned}\tag{2}$$

with $\alpha_N$ fixed. The problem (2) can be rewritten as

$$\begin{aligned}
\underset{\alpha_B}{\text{minimize}}\quad & \frac{1}{2}\alpha_B^T Q_{BB}\alpha_B + \alpha_B^T\chi\\
\text{subject to}\quad & \alpha_B^T y_B + G_k = 0,\\
& 0 \le \alpha_B \le C,
\end{aligned}\tag{3}$$

where $\chi = Q_{BN}\alpha_N - e_B = q - e_B$ and $G_k = \alpha_N^T y_N$.

The reduced problem (3) can be solved much faster than the original problem (1). The resulting algorithm is shown as Algorithm 1.
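To make the decomposition bookkeeping concrete, the sketch below forms the data of the reduced problem (3) for a given working set B. It is only an illustration of the formulas above, not the manuscript's Algorithm 1, and the function name and array layout are assumptions.

```python
import numpy as np

def reduced_subproblem_data(Q, y, alpha, B):
    """Return (Q_BB, chi, G_k) for working set B, with alpha_N held fixed.

    chi = Q_BN @ alpha_N - e_B and G_k = alpha_N^T y_N, as in problem (3).
    """
    m = len(y)
    N = np.setdiff1d(np.arange(m), B)      # non-working set indices
    Q_BB = Q[np.ix_(B, B)]
    Q_BN = Q[np.ix_(B, N)]
    chi = Q_BN @ alpha[N] - np.ones(len(B))
    G_k = alpha[N] @ y[N]
    return Q_BB, chi, G_k

# Usage sketch, with Q and y as in the previous snippet, all alphas zero,
# and an arbitrary two-element working set B = [0, 2]:
# Q_BB, chi, G_k = reduced_subproblem_data(Q, y, np.zeros(len(y)), np.array([0, 2]))
```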

2.2. Augmented Lagrangian—Fast Projected Gradient Method for SVM

A relatively simple and efficient algorithm capable of training medium-size SVMs with up to a few tens of thousands of data points was proposed in [13]. The algorithm takes advantage of two methods: a fast projected gradient method for solving a minimization problem with simple bounds on the variables [14], and an augmented Lagrangian method [15] [16] employed to satisfy the only linear equality constraint. The method projects the primal variables onto the box-like set $0 \le \alpha_i \le C$, $i \in B$.

Using the following definitions

$$f(\alpha) = \sum_{i \in B}\alpha_i(q_i - 1) + \frac{1}{2}\sum_{i,j \in B}\alpha_i\alpha_j y_i y_j K(x_i, x_j), \qquad g(\alpha) = \sum_{i \in B} y_i\alpha_i + G_k \tag{4}$$

and the bounded set:

$$Box = \{\alpha \in \mathbb{R}^{l} : 0 \le \alpha_i \le C,\; i \in B\}, \tag{5}$$

the optimization problem (3) can be rewritten as follows:

$$\begin{aligned}
\underset{\alpha}{\text{minimize}}\quad & f(\alpha)\\
\text{subject to}\quad & g(\alpha) = 0, \quad \alpha \in Box.
\end{aligned}\tag{6}$$

The augmented Lagrangian is defined as

$$L_\mu(\alpha, \lambda) = f(\alpha) - \lambda g(\alpha) + \frac{\mu}{2}\, g(\alpha)^2, \tag{7}$$

where $\lambda$ is the Lagrange multiplier that corresponds to the only equality constraint and $\mu > 0$ is the scaling parameter.

The augmented Lagrangian method is a sequence of inexact minimizations of $L_\mu(\alpha, \lambda)$ in $\alpha$ on the $Box$ set:

$$\hat{\alpha} \equiv \alpha(\lambda) = \arg\min_{\alpha \in Box} L_\mu(\alpha, \lambda), \tag{8}$$

followed by updating the Lagrange multiplier λ :

$$\hat{\lambda} = \lambda - \mu\, g(\hat{\alpha}). \tag{9}$$

The stopping criterion for (8) uses the following function, which measures the violation of the first-order optimality conditions of (8):

$$\nu(\alpha, \lambda) = \max_{1 \le i \le m}\nu_i(\alpha, \lambda), \tag{10}$$

where

$$\nu_i(\alpha, \lambda) = \begin{cases}
\left|\nabla_\alpha L_\mu(\alpha, \lambda)_i\right|, & 0 < \alpha_i < C,\\
\max\{0,\, -\nabla_\alpha L_\mu(\alpha, \lambda)_i\}, & \alpha_i = 0,\\
\max\{0,\, \nabla_\alpha L_\mu(\alpha, \lambda)_i\}, & \alpha_i = C.
\end{cases}\tag{11}$$

The minimization (8) (the inner loop) is performed by the fast projected gradient method (FPGM). The inner loop exits when $\nu(\alpha, \lambda) < \varepsilon$. The final stopping criterion for the augmented Lagrangian method uses $\mu(\alpha, \lambda) = \max\{\nu(\alpha, \lambda), |g(\alpha)|\}$, which measures the violation of the optimality conditions for (6).

Algorithm 2 describes the ALFPGM (see [6] for more detail). The convergence of Algorithm 2 is established in [17].
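For readers who want to experiment, a compact sketch of the scheme (7)-(11) applied to the reduced problem (3) follows. It is a minimal illustration under the formulas stated above, not the authors' Algorithm 2: the inner loop uses a standard Nesterov-type (FISTA) momentum with step size $1/L$, the Lipschitz constant is taken from the exact largest eigenvalue rather than the power-method estimate of Section 4.2, and all parameter defaults are assumptions.

```python
import numpy as np

def alfpgm_subproblem(Q_BB, chi, y_B, G_k, C, mu=0.1,
                      eps=1e-2, max_outer=50, max_inner=500):
    """Minimize 0.5 a^T Q_BB a + chi^T a  s.t.  y_B^T a + G_k = 0, 0 <= a <= C.

    Augmented Lagrangian outer loop (9) with a fast projected gradient
    inner loop for (8); a sketch of (7)-(11), not the paper's Algorithm 2.
    """
    l = len(chi)
    a = np.zeros(l)
    lam = 0.0
    # Lipschitz upper bound for grad L_mu (exact lambda_1 here; the paper
    # estimates lambda_1 with a few power iterations, cf. Section 4.2).
    L = np.linalg.eigvalsh(Q_BB)[-1] + l * mu

    def grad(a, lam):
        g = y_B @ a + G_k                       # equality constraint value
        return Q_BB @ a + chi - (lam - mu * g) * y_B

    def violation(a, lam):                      # nu(a, lam) from (10)-(11)
        d = grad(a, lam)
        v = np.abs(d)
        v[a <= 0.0] = np.maximum(0.0, -d[a <= 0.0])
        v[a >= C] = np.maximum(0.0, d[a >= C])
        return v.max()

    for _ in range(max_outer):
        z, t = a.copy(), 1.0                    # FPGM (Nesterov) inner loop
        for _ in range(max_inner):
            a_new = np.clip(z - grad(z, lam) / L, 0.0, C)   # project on Box
            t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
            z = a_new + ((t - 1.0) / t_new) * (a_new - a)
            a, t = a_new, t_new
            if violation(a, lam) < eps:
                break
        g = y_B @ a + G_k
        lam -= mu * g                           # multiplier update (9)
        if max(violation(a, lam), abs(g)) < eps:
            break
    return a, lam
```

On a small working set this reproduces the structure of the method: an inexact box-constrained minimization followed by the multiplier update (9), with (10)-(11) serving as the inner stopping test.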

3. PWSS Working Set Selection

The flow chart describing pWSS is shown in Figure 1. An important issue for any decomposition method is how to select the working set B. First, a working set selection (WSS) method, WSS2nd, that uses second-order information [18] is described.

Consider the sets:

$$I_{up}(\alpha) \equiv \{\, t \mid y_t = 1,\ \alpha_t < C \ \text{or}\ y_t = -1,\ \alpha_t > 0 \,\}, \qquad I_{low}(\alpha) \equiv \{\, t \mid y_t = -1,\ \alpha_t < C \ \text{or}\ y_t = 1,\ \alpha_t > 0 \,\}.$$

Then let us define

$$q_t(\alpha) = -y_t \nabla f(\alpha)_t.$$

In WSS2nd, one selects

$$i \in \arg\max_{t \in I_{up}(\alpha)}\{\, q_t(\alpha) \,\}, \tag{14}$$

Figure 1. pWSS Working set selection.

and

$$j \in \arg\min_{t \in I_{low}(\alpha)}\left\{ -\frac{b_{it}^2}{\hat{a}_{it}} : q_t(\alpha) < q_i(\alpha) \right\}, \tag{15}$$

with

$$a_{ij} = K_{ii} + K_{jj} - 2K_{ij}, \qquad \hat{a}_{ij} = \begin{cases} a_{ij}, & \text{if } a_{ij} > 0,\\ \tau, & \text{otherwise},\end{cases} \qquad b_{ij} = q_i(\alpha) - q_j(\alpha) > 0,$$

where τ is a small positive number.
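A direct NumPy transcription of the rule (14)-(15) is given below. It assumes the full gradient $\nabla f(\alpha)$ and the kernel matrix K are available as arrays and illustrates only the selection rule itself, not the parallel pWSS implementation; the function name and defaults are assumptions.

```python
import numpy as np

def wss2nd_pair(alpha, y, grad, K, C, tau=1e-12):
    """Select one working pair (i, j) by the second-order rule (14)-(15)."""
    q = -y * grad                                   # q_t(alpha) = -y_t * grad f(alpha)_t
    I_up  = np.where(((y ==  1) & (alpha < C)) | ((y == -1) & (alpha > 0)))[0]
    I_low = np.where(((y == -1) & (alpha < C)) | ((y ==  1) & (alpha > 0)))[0]
    if I_up.size == 0 or I_low.size == 0:
        return None
    i = I_up[np.argmax(q[I_up])]                    # (14)

    t = I_low[q[I_low] < q[i]]                      # candidates for j in (15)
    if t.size == 0:
        return None                                 # no violating pair left
    a_it = K[i, i] + K[t, t] - 2.0 * K[i, t]        # second-order terms
    a_hat = np.where(a_it > 0.0, a_it, tau)
    b_it = q[i] - q[t]
    j = t[np.argmin(-(b_it ** 2) / a_hat)]          # (15)
    return int(i), int(j)
```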

3.1. Limiting the Search Space for Iup and Ilow

One of the challenges of the previously developed dual decomposition WSS2nd is that a particular data point can be selected into the working set multiple times across decomposition iterations, often increasing the SVM training time. The previously developed algorithm [6] is improved by limiting the search space of Iup and Ilow in pWSS, as described below.

3.2. MinMaxLimiter Algorithm

The MinMaxLimiter algorithm is proposed with the following changes to pWSS (a sketch of the resulting bookkeeping follows the list):

• Elements of α that no longer change after many iterations are eliminated: $|\alpha_t - \alpha_{t-1}| < \delta_\alpha$, where $\alpha_{t-1}, \alpha_t \in B$ are two consecutive iterates and $\delta_\alpha$ is a user-defined threshold.

• Another introduced parameter is minAlphaOpt, the minimum number of times a data index is used in the working set. After this threshold, if the value of $\alpha_i$, $i = 1, \dots, m$, is 0 or C, then that data point index is no longer considered.

• The last introduced parameter is maxAlphaOpt, the maximum number of times a training example is used in the decomposition rounds.

Setting $minAlphaOpt = \infty$ and $maxAlphaOpt = \infty$ results in considering every α in every iteration.
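A possible realization of this bookkeeping is sketched below; the thresholds are the ones introduced above, while the function name, the counter array, and the exact combination of the three rules are assumptions made for illustration.

```python
import numpy as np

def eligible_indices(alpha, alpha_prev, times_selected, C,
                     delta_alpha=1e-6, min_alpha_opt=5, max_alpha_opt=20):
    """MinMaxLimiter sketch: boolean mask of indices still considered by pWSS.

    An index is dropped when (i) its alpha stopped changing between two
    consecutive iterates, (ii) it was selected at least min_alpha_opt times
    and its alpha sits at a bound (0 or C), or (iii) it was selected more
    than max_alpha_opt times.
    """
    stalled = np.abs(alpha - alpha_prev) < delta_alpha
    at_bound = (alpha == 0.0) | (alpha == C)
    drop = stalled \
        | (at_bound & (times_selected >= min_alpha_opt)) \
        | (times_selected > max_alpha_opt)
    return ~drop
```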

3.3. Optimizing j Search Using MinAlphaCheck

Another modification to WSS2nd is the introduction of the parameter minAlphaCheck, which is used to reduce the search space of j in (15) to the first minAlphaCheck indices in the sorted Ilow set. The experimental results show that this approach converges and is faster than WSS2nd.
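The truncation can be sketched as follows: only the first minAlphaCheck entries of Ilow, sorted by $q_t(\alpha)$, are kept as candidates for j in (15). The sort key and direction, as well as the default value, are assumptions made for illustration.

```python
import numpy as np

def truncated_j_candidates(q, I_low, i, min_alpha_check=512):
    """Restrict the j search of (15) to the most promising I_low entries.

    I_low is sorted by ascending q_t(alpha) (assumed sort key), and only the
    first min_alpha_check indices with q_t(alpha) < q_i(alpha) are kept.
    """
    order = I_low[np.argsort(q[I_low])]
    cand = order[q[order] < q[i]]
    return cand[:min_alpha_check]
```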

4. Implementation

4.1. Kernel Matrix Computation

The cache technique is used to store previously computed kernel rows so that they can be accessed when the same kernel value needs to be recomputed. The amount of cached kernel entries is a selectable parameter. The same Least Recently Used (LRU) strategy as the SMO implementation in Scikit-learn (Sklearn) was used, which provides a fair basis for comparing the results of the SVM training methods. The ineffectiveness of the LRU strategy has been discussed previously [8] [10]; however, that is beyond the scope of this work.
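A minimal kernel-row cache with LRU eviction, in the spirit of the strategy described above, can be sketched with Python's OrderedDict; the class name, capacity parameter, and row-computation callback are illustrative and are not the implementation used in the manuscript.

```python
from collections import OrderedDict

class KernelRowCache:
    """LRU cache for rows of the kernel matrix (a sketch, not the paper's code)."""

    def __init__(self, compute_row, max_rows=1024):
        self.compute_row = compute_row      # callable: row index -> kernel row
        self.max_rows = max_rows
        self._rows = OrderedDict()

    def row(self, i):
        if i in self._rows:
            self._rows.move_to_end(i)       # mark as most recently used
            return self._rows[i]
        if len(self._rows) >= self.max_rows:
            self._rows.popitem(last=False)  # evict the least recently used row
        self._rows[i] = self.compute_row(i)
        return self._rows[i]

# Usage sketch: cache = KernelRowCache(compute_row=some_row_fn); K_i = cache.row(i)
```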

To compute the kernel matrix, Intel MKL is used. The kernel matrix can be computed efficiently using $\|x_i\|^2$ and the elements of the matrix $XX^T$, where $X$ is the matrix whose rows are the data vectors $x_i$. Intel MKL's cblas_dgemv is used to compute $f = \frac{1}{2}\alpha^T Q\alpha - e^T\alpha$.
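The computation described above can be illustrated in NumPy, whose dense matrix products are typically backed by an optimized BLAS such as Intel MKL; the matrix products below stand in for the manuscript's MKL gemm/gemv calls, and the RBF kernel is an assumption.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2) via squared norms and X @ X.T.

    ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 x_i . x_j, so one dense
    matrix product (BLAS gemm) does most of the work.
    """
    sq = np.einsum('ij,ij->i', X, X)           # ||x_i||^2 for every row of X
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.maximum(d2, 0.0, out=d2)                # clamp tiny negative round-off
    return np.exp(-gamma * d2)

def dual_objective(alpha, Q):
    """f = 0.5 * alpha^T Q alpha - e^T alpha, as in Section 4.1 (gemv-style)."""
    return 0.5 * alpha @ (Q @ alpha) - alpha.sum()
```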

4.2. Computing the Lipschitz Constant L

The Fast Projected Gradient Method (FPGM) requires estimation of the Lipschitz constant L. Since L μ is of quadratic form with respect to α ,

$$L = \left\|\nabla^2_{\alpha\alpha} L_\mu\right\| = \left\|Q_B + \mu\, y_B y_B^T\right\|,$$

where the matrix norm is the spectral norm, i.e., the largest singular value of the matrix.

To estimate L, first $\|Q_B\|$ was estimated. Since $Q_B$ is symmetric and positive semidefinite, $\|Q_B\| = \lambda_1$, the largest eigenvalue of $Q_B$.

The largest eigenvalue $\lambda_1$ can be efficiently computed using the power method. After estimating $\lambda_1$ with a few power iterations, L is estimated from the upper bound:

$$L \le \lambda_1 + m\mu. \tag{16}$$
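A sketch of this estimate is given below; the number of power iterations and the random initialization are illustrative choices rather than the manuscript's settings.

```python
import numpy as np

def lipschitz_estimate(Q_B, mu, n_iter=20, seed=0):
    """Estimate L <= lambda_1(Q_B) + m * mu, with lambda_1 from a few power iterations."""
    rng = np.random.default_rng(seed)
    m = Q_B.shape[0]
    v = rng.standard_normal(m)
    v /= np.linalg.norm(v)
    lam1 = 0.0
    for _ in range(n_iter):
        w = Q_B @ v
        lam1 = v @ w                   # Rayleigh quotient estimate of lambda_1
        nw = np.linalg.norm(w)
        if nw == 0.0:                  # Q_B v = 0: largest eigenvalue estimate is 0
            break
        v = w / nw
    return lam1 + m * mu               # upper bound as in (16)
```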

5. Experimental Results

For testing, 11 binary classification training data sets (shown in Table 1) were selected, with the number of training examples ranging from 18201 to 98528, from the University of California, Irvine (UCI) Machine Learning Repository [19] and https://www.openml.org/. One of the tests, Test 2, was the largest test used in the previous work [6]. All simulations were done on a desktop with dual Intel Xeon 2.50 GHz 12-core processors sharing 64 GB of memory. C = 100 was used for all tests. The data was normalized, and a radial basis function kernel was used with the values of $\gamma$ shown in Table 1. The scaling parameter used for all the experiments is $\mu = 0.1$, as suggested in [13]. The parameters in pWSS were taken as $\delta_\alpha = 10^{-6}$, $minAlphaOpt = 5$, $maxAlphaOpt = 20$. The solution accuracy was set to $\varepsilon = 10^{-2}$. The SVM training times of ALFPGM were compared with results obtained using the SMO implementation in Scikit-learn (Sklearn) [20]. The same data inputs were used for SMO and ALFPGM.
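For reference, a minimal Scikit-learn configuration consistent with the settings described above (C = 100, RBF kernel, normalized data) is sketched below; the file name, the normalization choice, and the gamma value are placeholders rather than the exact experimental setup.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Placeholder data loading: X is (m, n) features, y holds labels in {-1, +1}.
data = np.loadtxt("dataset.csv", delimiter=",")   # hypothetical file
X, y = data[:, :-1], data[:, -1]

X = MinMaxScaler().fit_transform(X)               # normalization choice is assumed

# SMO-based training as implemented in Scikit-learn (libsvm under the hood).
clf = SVC(C=100.0, kernel="rbf", gamma=0.5, cache_size=1000)  # gamma per Table 1
clf.fit(X, y)
print("training error:", 1.0 - clf.score(X, y),
      "support vectors:", clf.n_support_.sum())
```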

Table 1. Data used for testing.

ALFPGM—Investigating Different Numbers of Pairs p

The optimal number of pairs p needed for training with ALFPGM is determined with numerical experiments using Tests 2, 6 and 10, which are representative of small, medium, and large training sets. In all cases, the pWSS method was employed for pair selection. The results, presented in Figures 2-4, suggest that for ALFPGM the optimum number of pairs is $50 \le p \le 100$.

There are several factors that come into play when varying the number of pairs p.

1) The value of p determines the amount of time it takes to select the working set: the larger p is, the longer the working set selection takes.

2) The smaller p is, the larger the number of decompositions needed to solve the SVM problem.

3) The larger p is, the longer it takes to solve the SVM subproblems.

Figure 2. ALFPGM—Classification time for different numbers of pairs p (Test 2).

Figure 3. ALFPGM—Classification time for different numbers of pairs p (Test 6).

Figure 4. ALFPGM—Classification time for different numbers of pairs p (Test 10).

4) The larger p is, the greater the probability that the Hessian matrix of the SVM subproblem is degenerate.

Table 2 provides numerical results for training SVMs with ALFPGM and SMO. For each testing data set, the table shows the number of training examples and the number of features. Then, for both the ALFPGM and SMO algorithms, the table shows the training errors (the fractions of misclassified training examples), the training times, the number of performed decompositions (working set selections), the number of support vectors, and the optimal objective function values. To calculate training times, each simulation scenario was averaged over 5 runs. The results are also visualized in Figures 5-7. The training sets are arranged in order of increasing number of training examples. The figures demonstrate that in most cases ALFPGM outperformed SMO in training time for a similar classification error.

Figure 6 shows that the training errors are similar, demonstrating that both algorithms produced similar results. The same conclusion can be drawn from the similar values of the optimal objective functions for both algorithms in Table 2. Finally, Figure 8 shows the normalized training times, i.e., the training times divided by the number of training examples. As one can see, ALFPGM produces a smaller variance in the normalized times than the SMO algorithm.

Figure 5. Comparison of ALFPGM and SMO training times.

Figure 6. Comparison of ALFPGM and SMO training classification errors.

Figure 7. Comparison of ALFPGM and SMO results.

Table 2. ALFPGM and SMO comparison.

Figure 8. Comparison of ALFPGM and SMO normalized training times.

6. Conclusions

The manuscript presents the results on training support vector machines using an augmented Lagrangian fast projected gradient method (ALFPGM) for large data sets. In particular, the ALFPGM is adapted for training using high performance computing techniques. Highly optimized linear algebra routines were used in the implementation to speed up some of the computations.

A working set decomposition scheme for selecting p pairs is developed and optimized. Since working set selection takes a large portion of the overall training time, finding an optimal selection of pairs is the key to using multiple pairs in the decomposition. Numerical results indicate that the optimal number of pairs in the working set for ALFPGM is 50 - 100. It is shown that in some examples the selection of pairs can be truncated early without compromising the overall results while significantly reducing training times.

Finally, the comparison of the training performance of ALFPGM with that of SMO from the Scikit-learn library demonstrates that the training times of ALFPGM are consistently smaller than those of SMO for the tested large training sets.

7. Future Work

The results demonstrate that the choice of the number of pairs p is important, as it determines the number of decompositions. Even though the numerical experiments determined a range of values for the optimal number of pairs p, the pWSS can be further improved for multiple pair selection, which would further reduce the total training time.

One possible direction for achieving that improvement is to determine an optimal dependence of the number of pairs p on the size of the training set. Selecting optimal values of minAlphaCheck and minAlphaOpt used in pWSS also needs further exploration. Finding optimal values of the ALFPGM parameters based on the input data can further improve the efficiency of ALFPGM.

The use of graphics processing units (GPUs), which are becoming less expensive and more capable, should be revisited in the future. The most time consuming parts of the algorithm, such as computing the kernel similarity measures, can be performed on the GPU.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

References

[1] Vapnik, V. (1982) Estimation of Dependences Based on Empirical Data (Springer Series in Statistics). Springer-Verlag, New York.
[2] Vapnik, V.N. (1995) The Nature of Statistical Learning Theory. Springer-Verlag, New York.
https://doi.org/10.1007/978-1-4757-2440-0
[3] Smola, A.J. and Scholkopf, B. (2004) A Tutorial on Support Vector Regression. Statistics and Computing, 14, 199-222.
https://doi.org/10.1023/B:STCO.0000035301.49549.88
[4] Hastie, T. and Tibshirani, R. (1998) Classification by Pairwise Coupling. In: Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems 10, MIT Press, Cambridge, 507-513.
http://dl.acm.org/citation.cfm?id=302528.302744
[5] Keerthi, S. and Gilbert, E. (2002) Convergence of a Generalized SMO Algorithm for SVM Classifier Design. Machine Learning, 46, 351-360.
https://doi.org/10.1023/A:1012431217818
[6] Aregbesola, M. and Griva, I. (2021) Augmented Lagrangian—Fast Projected Gradient Algorithm with Working Set Selection for Training Support Vector Machines. Journal of Applied and Numerical Optimization, 3, 3-20.
http://jano.biemdas.com/archives/1224
[7] Platt, J. (1998) Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Microsoft Corporation, Tech. Rep.
[8] Lin, C. (2001) On the Convergence of the Decomposition Method for Support Vector Machines. IEEE Transactions on Neural Networks, 12, 1288-1298.
https://doi.org/10.1109/72.963765
[9] Keerthi, S.S., Shevade, S.K., Bhattacharyya, C. and Murthy, K.R.K. (2001) Improvements to Platt’s SMO Algorithm for SVM Classifier Design. Neural Computation, 13, 637-649.
https://doi.org/10.1162/089976601300014493
[10] Wei, W., Li, C. and Guo, J. (2018) Improved Parallel Algorithms for Sequential Minimal Optimization of Classification Problems. 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), Exeter, 28-30 June 2018, 6-13.
https://doi.org/10.1109/HPCC/SmartCity/DSS.2018.00033
[11] Chen, P.-H., Fan, R.-E. and Lin, C.-J. (2006) A Study on SMO-Type Decomposition Methods for Support Vector Machines. IEEE Transactions on Neural Networks, 17, 893-908.
https://doi.org/10.1109/TNN.2006.875973
[12] Zhang, Q., Wang, J., Lu, A., Wang, S. and Ma, J. (2018) An Improved SMO Algorithm for Financial Credit Risk Assessment—Evidence from China’s Banking. Neurocomputing, 272, 314-325.
http://www.sciencedirect.com/science/article/pii/S0925231217312328
[13] Bloom, V., Griva, I. and Quijada, F. (2016) Fast Projected Gradient Method for Support Vector Machines. Optimization and Engineering, 17, 651-662.
https://doi.org/10.1007/s11081-016-9328-z
[14] Polyak, R.A. (2015) Projected Gradient Method for Non-Negative Least Square. Infinite Products of Operators and Their Applications. A Research Workshop of the Israel Science Foundation, Haifa, 21-24 May 2012, 167-179.
https://doi.org/10.1090/conm/636/12735
[15] Hestenes, M.R. (1969) Multiplier and Gradient Methods. Journal of Optimization Theory and Applications, 4, 303-320.
https://doi.org/10.1007/BF00927673
[16] Powell, M.J.D. (1978) Algorithms for Nonlinear Constraints That Use Lagrangian Functions. Mathematical Programming, 14, 224-248.
https://doi.org/10.1007/BF01588967
[17] Griva, I. (2018) Convergence Analysis of Augmented Lagrangian—Fast Projected Gradient Method for Convex Quadratic Problems. Pure and Applied Functional Analysis, 3, 417-428.
[18] Fan, R.-E., Chen, P.-H. and Lin, C.-J. (2005) Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6, 1889-1918.
http://dl.acm.org/citation.cfm?id=1046920.1194907
[19] Dua, D. and Graff, C. (2017) UCI Machine Learning Repository.
http://archive.ics.uci.edu/ml
[20] Pedregosa, F., Varoquaux, G., Gramfort, A., et al. (2011) Scikit-Learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
