1. Introduction
The support vector machine (SVM) is a supervised learning method mainly used for classification problems [1]. It separates data points of two different classes by finding an optimal hyperplane that maximizes the margin between them. SVM plays an important role in machine learning because of its excellent performance and generalization ability. Its applications span many domains, including text classification, image recognition, bioinformatics, and financial risk assessment.
The performance of SVM largely depends on the choice of its loss function, and different loss functions suit different data distributions and noise conditions. The SVM with the classic hinge loss (HSVM) [2] maximizes the distance between the nearest points of different classes, but the resulting model is easily affected by noise. To overcome this limitation, the SVM with the pinball loss (PSVM) [3] was proposed; it also penalizes correctly classified points and is insensitive to noise during resampling. However, it sacrifices sparsity and lacks flexibility in adjusting the parameters that control the loss value. To overcome these shortcomings, the SVM with the generalized pinball loss (GPSVM) [4] was introduced. GPSVM reduces noise sensitivity while retaining sparsity and flexibility of parameter adjustment in the derived solution.
The above-mentioned loss functions are convex but not differentiable, so the corresponding unconstrained minimization problems are not differentiable either. Therefore, although some first-order and second-order optimization algorithms [5] are theoretically effective and have solid theoretical support, they cannot be applied directly to such problems. To overcome the limitation of the traditional hinge loss, researchers introduced the smooth hinge loss (SHSVM) [6] to improve model performance. Similarly, to address the shortcomings of the pinball loss, a smooth pinball loss (SPSVM) [7] was proposed. Along the same lines, the smooth generalized pinball loss (SGPSVM) [8] was given. These smooth loss functions are differentiable, so the corresponding models can be solved directly with first-order and second-order optimization techniques.
Although progress has been made in the design of loss functions, the above-mentioned losses are all unbounded, which poses significant challenges when the data contain outliers. Outliers are extreme values that differ significantly from the rest of the dataset; they can be caused by sampling errors or labeling errors. In classification tasks, it is critical to handle these outliers properly, since they can seriously degrade model performance. Many SVM models are particularly sensitive to outliers because their loss functions may grow without bound on such points. This observation has led to the development of bounded loss functions, such as the truncated hinge loss in RSVM [9]. By truncating the loss function, large values are reset to a constant, so the impact of outliers is kept under control and model performance is enhanced.
Recently, Zhu et al. [10] proposed a bounded Huberized truncated pinball loss function, which reduces the effect of noise in the training samples. Since quartic functions are more complex than quadratic functions, they may improve classifier performance, especially when the data follow a nonlinear distribution. Based on this idea, we modify the Huberized truncated pinball loss by replacing its quadratic smoothing term with a quartic one. The newly proposed quartic truncated pinball loss is bounded and differentiable. The Bregman modified second APG method (BAPGs) of Wang et al. [11] performs well on such problems, so we use BAPGs to solve the support vector machine problem with the quartic truncated pinball loss (QTPSVM).
The structure of this paper is as follows. In Section 2, some classical SVM models are reviewed. In Section 3, we introduce the QTPSVM model. The application of the BAPGs algorithm to QTPSVM is discussed in Section 4. Numerical experiments on real data are conducted in Section 5 to verify the effectiveness of QTPSVM. The last section summarizes the paper.
2. Review on Some Support Vector Machine Models
In the SVM model for a binary classification problem, we denote the training dataset as $T=\{(x_i,y_i)\}_{i=1}^{m}$, where $x_i\in\mathbb{R}^{n}$ is an $n$-dimensional predictor variable, $y_i\in\{-1,+1\}$ is its label, and $m$ is the number of samples. The key to solving this problem is to identify a separating hyperplane $w^{\top}x+b=0$, with $w\in\mathbb{R}^{n}$ and $b\in\mathbb{R}$, with the objective of maximizing the margin while controlling the training error. The SVM classification can be expressed as an unconstrained optimization problem:
$$\min_{w,b}\ \sum_{i=1}^{m}L\bigl(y_i(w^{\top}x_i+b)\bigr)+\lambda P(w),$$
where $L(\cdot)$ denotes the loss function, $P(w)$ is the penalty (regularization) term, and $\lambda>0$ is the regularization parameter used to balance the loss and the penalty. The decision function $f(x)=\operatorname{sign}(w^{\top}x+b)$ is used to determine the class of a point $x$: $x$ is classified as positive if the decision function is positive, and negative otherwise.
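To make the formulation concrete, the short Python sketch below evaluates the regularized objective and the decision function for a generic loss; the function names and the use of the squared $\ell_2$ penalty here are illustrative assumptions, not part of the model specification above.

```python
import numpy as np

def svm_objective(w, b, X, y, loss, lam):
    """Regularized empirical risk: sum of losses on the margins plus lam * P(w).

    X: (m, n) data matrix, y: (m,) labels in {-1, +1},
    loss: vectorized function of the margin u = y * (X @ w + b).
    Here P(w) is taken as the squared l2 norm purely for illustration.
    """
    margins = y * (X @ w + b)
    return np.sum(loss(margins)) + lam * 0.5 * np.dot(w, w)

def decide(w, b, X):
    """Decision function: sign of w^T x + b for each row of X."""
    return np.sign(X @ w + b)
```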
There are several kinds of penalty terms commonly used in support vector machine problems (a small numerical sketch follows this list):
1) The $\ell_2$-norm penalty $P(w)=\frac{1}{2}\|w\|_2^{2}$. This is a traditional penalty term for various machine learning problems, but it does not produce sparse solutions.
2) The $\ell_1$-norm penalty $P(w)=\|w\|_1$. It promotes sparsity well, but it is limited by the number of samples and may over-prune highly correlated features [12].
3) The elastic net penalty combines the $\ell_1$ norm and the $\ell_2$ norm [13], integrating the advantages of both terms and overcoming their drawbacks. The commonly used forms are
$$P(w)=\lambda_1\|w\|_1+\frac{\lambda_2}{2}\|w\|_2^{2},\tag{2.1}$$
and
$$P(w)=\lambda\Bigl(\alpha\|w\|_1+\frac{1-\alpha}{2}\|w\|_2^{2}\Bigr).\tag{2.2}$$
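As a quick illustration, the sketch below computes the two elastic net forms (2.1) and (2.2) for a small vector; the function names are ours and the parameter values are arbitrary examples.

```python
import numpy as np

def elastic_net_two_params(w, lam1, lam2):
    """Form (2.1): lam1 * ||w||_1 + (lam2 / 2) * ||w||_2^2."""
    return lam1 * np.sum(np.abs(w)) + 0.5 * lam2 * np.dot(w, w)

def elastic_net_mixed(w, lam, alpha):
    """Form (2.2): lam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)."""
    return lam * (alpha * np.sum(np.abs(w)) + 0.5 * (1.0 - alpha) * np.dot(w, w))

w = np.array([0.5, -1.0, 0.0, 2.0])
print(elastic_net_two_params(w, lam1=0.1, lam2=0.05))  # 0.1*3.5 + 0.025*5.25
print(elastic_net_mixed(w, lam=0.1, alpha=0.7))
```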
In SVM, the most popular and widely used loss function is the hinge loss (a rectified-linear-unit-type function), denoted by $L_{\mathrm{hinge}}$ and defined as follows:
$$L_{\mathrm{hinge}}(u)=\max\{0,\,1-u\},\qquad u=y(w^{\top}x+b).$$
Another commonly used one is the pinball loss function
$$L_{\tau}(u)=\begin{cases}1-u, & u\le 1,\\ \tau(u-1), & u>1,\end{cases}$$
where $\tau\ge 0$ is a preset parameter. It is obvious that $L_{\mathrm{hinge}}$ is a special case of $L_{\tau}$ with $\tau=0$. The classification models obtained by combining the above two loss functions with the $\ell_2$-norm penalty are referred to as HSVM [2] and PSVM [3], respectively.
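A minimal Python sketch of the two losses above, written for vectorized margins; the function names are illustrative.

```python
import numpy as np

def hinge_loss(u):
    """Hinge loss: max(0, 1 - u), applied elementwise to margins u."""
    return np.maximum(0.0, 1.0 - u)

def pinball_loss(u, tau):
    """Pinball loss: 1 - u for u <= 1, tau * (u - 1) for u > 1.

    With tau = 0 this reduces to the hinge loss.
    """
    return np.where(u <= 1.0, 1.0 - u, tau * (u - 1.0))

u = np.array([-0.5, 0.8, 1.0, 2.0])
print(hinge_loss(u))            # [1.5, 0.2, 0.0, 0.0]
print(pinball_loss(u, 0.3))     # [1.5, 0.2, 0.0, 0.3]
```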
Both the hinge loss and the pinball loss are nondifferentiable (at a single point) convex functions. In applications, the non-differentiability restricts the choice of algorithms, while the unbounded convex growth makes the loss values of outliers large, which greatly degrades the model.
To overcome this weakness of convexity, truncation is a common idea: large loss values are redefined as a constant. For example, the truncated hinge loss function in RSVM [9] is
$$L_{c}(u)=\min\bigl\{\max\{0,\,1-u\},\,c\bigr\},$$
i.e., the hinge loss capped at a prescribed constant $c>0$. Such an operation makes the model more stable when dealing with noisy data. However, truncation makes the loss function nonconvex, which places higher demands on the optimization algorithm.
An important method for handling non-differentiability is to locally correct the function values near the non-differentiable points so that the entire function becomes differentiable. Huberization is a common correction approach that uses quadratic pieces to smooth the non-differentiable points. For example, the Huberized hinge loss used in HHSVM [14] is
$$L_{\delta}(u)=\begin{cases}1-u-\dfrac{\delta}{2}, & u<1-\delta,\\[4pt] \dfrac{(1-u)^{2}}{2\delta}, & 1-\delta\le u\le 1,\\[4pt] 0, & u>1,\end{cases}$$
where $\delta>0$ is a pre-specified constant.
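The sketch below implements the truncated hinge and Huberized hinge losses as reconstructed above; the parameterization with the cap $c$ and the smoothing width $\delta$ is our reading of the standard definitions, so treat it as an assumption.

```python
import numpy as np

def truncated_hinge_loss(u, c):
    """Hinge loss capped at the constant c, so outliers contribute at most c."""
    return np.minimum(np.maximum(0.0, 1.0 - u), c)

def huberized_hinge_loss(u, delta):
    """Hinge loss with the kink at u = 1 replaced by a quadratic piece of width delta."""
    return np.where(
        u > 1.0, 0.0,
        np.where(u >= 1.0 - delta,
                 (1.0 - u) ** 2 / (2.0 * delta),
                 1.0 - u - delta / 2.0),
    )

u = np.linspace(-2.0, 2.0, 9)
print(truncated_hinge_loss(u, c=2.0))
print(huberized_hinge_loss(u, delta=0.5))
```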
Based on the above methods, Zhu et al. [10] proposed the HTPSVM classification model, which minimizes the sum of the Huberized truncated pinball losses over the training samples plus the elastic net penalty, where the two coefficients of the elastic net are the regularization parameters. Its loss function is obtained by first truncating the pinball loss at a prescribed level and then applying Huberization, replacing the kinks with quadratic pieces; the truncation level, the pinball parameter and the smoothing width are pre-specified constants. Experiments validate that HTPSVM not only has better noise resistance than other methods, but also demonstrates improved time efficiency when the problem is solved with the APG algorithm.
3. Robust Support Vector Machine Classifier with the Quartic Truncated Pinball Loss (QTPSVM)
Inspired by the Huberization method and HTPSVM, we replace the quadratic smoothing pieces in the HTPSVM loss with quartic ones to obtain the quartic truncated pinball loss function, in which the truncation level, the pinball parameter and the smoothing width are pre-specified constants. We note that the resulting loss is bounded, continuous and differentiable, but not convex.
The SVM classifier with the quartic truncated pinball loss (QTPSVM) based on the elastic net penalty is defined as
$$\min_{w,b}\ \sum_{i=1}^{m}L_{\mathrm{qtp}}\bigl(y_i(w^{\top}x_i+b)\bigr)+\lambda_1\|w\|_1+\frac{\lambda_2}{2}\|w\|_2^{2},\tag{3.1}$$
where $L_{\mathrm{qtp}}$ denotes the quartic truncated pinball loss and $\lambda_1,\lambda_2>0$ are regularization parameters. The advantage of QTPSVM over HTPSVM lies in the behaviour of the smoothing pieces: whereas the derivative of a quadratic piece changes at a uniform rate, the derivative of a quartic piece changes very slowly inside each segment and more rapidly near its boundaries, with relatively smaller fluctuations in the function values. This makes the impact of misclassified points near the separating hyperplane on both the function value and the derivative smaller, thereby enhancing the noise resistance of the model. However, quartic functions place higher demands on the algorithm, so we need a suitable method to solve this problem, namely the BAPGs algorithm introduced below.
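Concretely, the derivative behaviour mentioned above can be seen by comparing a quadratic smoothing piece $q(t)=t^{2}$ with a quartic one $p(t)=t^{4}$ on a small segment around a kink (a generic comparison, not the exact form of the QTPSVM smoothing pieces):
$$q'(t)=2t,\qquad p'(t)=4t^{3},\qquad q'(0.1)=0.2,\qquad p'(0.1)=0.004,$$
so near the kink the quartic piece and its derivative are almost flat, while close to the segment boundary its derivative grows quickly.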
4. The Bregman Modified Second APG Method for Solving the QTPSVM Problem
The quartic truncated pinball loss is nonconvex. Moreover, many algorithms for nonconvex optimization problems may not perform well on quartic functions because of their reliance on the (Euclidean) proximal operator. To solve (3.1), we therefore adopt the Bregman modified second accelerated proximal gradient (BAPGs) method, a variant of the modified second accelerated proximal gradient (APGs) method in which the proximal operator is replaced by a minimization involving a Bregman distance.
4.1. The BAPGs Method
The BAPGs method is used to solve optimization problems of the following form:
$$\min_{x\in\mathbb{R}^{n}}\ F(x)=g_{1}(x)+g_{2}(x)+h(x),\tag{4.1}$$
where $g_{1}$ is a differentiable function on $\mathbb{R}^{n}$, $g_{2}$ is proper, convex and lower semicontinuous, and $h$ is proper and convex.
Inspired by Nesterov's second acceleration method [15], Lin and Liu proposed a modified second accelerated proximal gradient (APGs) method for this type of problem. At each iteration, it depends on the resolution of a subproblem of the form
$$\min_{x}\ \bigl\langle\nabla g_{1}(y^{k})+\xi^{k},\,x\bigr\rangle+h(x)+\frac{1}{2\alpha_{k}}\|x-y^{k}\|^{2},\tag{4.2}$$
where $y^{k}$ is the current extrapolated point, $\alpha_{k}>0$ is a step size, and $\xi^{k}$ is a subgradient of $g_{2}$ at $y^{k}$. Ren, Liu and Wang proved convergence properties of this method when $g_{1}$ is nonconvex [16]. To ensure global convergence of the iteration, both in [17] and [16], $g_{1}$ needs to be $L$-smooth, i.e.,
$$\|\nabla g_{1}(x)-\nabla g_{1}(y)\|\le L\|x-y\|\quad\text{for all } x,y\in\mathbb{R}^{n}.$$
Based on the good performance of APGs on many problems of the form (4.1), and on the improvements obtained by introducing Bregman distances into existing algorithms, Wang, Liu and Ren combined the Bregman distance with APGs to construct a new algorithm [11].
Definition (Bregman distance [18]). Suppose that $\phi:\mathbb{R}^{n}\to\mathbb{R}$ is a proper, continuously differentiable, convex function. The following function $D_{\phi}$ is called the Bregman distance corresponding to $\phi$:
$$D_{\phi}(x,y)=\phi(x)-\phi(y)-\bigl\langle\nabla\phi(y),\,x-y\bigr\rangle.$$
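For example, the classical kernel $\phi(x)=\frac{1}{2}\|x\|^{2}$ gives
$$D_{\phi}(x,y)=\tfrac12\|x\|^{2}-\tfrac12\|y\|^{2}-\langle y,\,x-y\rangle=\tfrac12\|x-y\|^{2},$$
so the usual quadratic proximal term is recovered as a special case. A kernel with an additional quartic term, such as $\phi(x)=\frac{1}{4}\|x\|^{4}+\frac{1}{2}\|x\|^{2}$ (a standard choice in the literature on problems with quartic growth, mentioned here only as an illustration), yields a Bregman distance that is better matched to objectives with quartic behaviour.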
The BAPGs algorithm replaces the quadratic proximal term in the APGs subproblem (4.2) with the Bregman distance $D_{\phi}$ generated by a suitably chosen kernel $\phi$, while keeping the extrapolation scheme of APGs; an illustrative sketch of one such iteration is given below.
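The following Python fragment is a schematic, simplified sketch of one iteration of this kind of method (extrapolation, linearization of $g_{1}$ and $g_{2}$, and a Bregman subproblem); it is our illustration of the general mechanism under assumed names and update rules, not the authors' exact BAPGs listing.

```python
def bregman_apg_step(x, x_prev, k, grad_g1, subgrad_g2, solve_bregman_subproblem):
    """One schematic accelerated Bregman proximal step.

    grad_g1(y): gradient of the differentiable part g1 at y.
    subgrad_g2(y): a subgradient of the convex part g2 at y.
    solve_bregman_subproblem(z, y): returns
        argmin_x  <z, x> + h(x) + D_phi(x, y),
    i.e. the subproblem in which the quadratic proximal term of (4.2)
    is replaced by the Bregman distance D_phi.
    """
    # Nesterov-type extrapolation (a common choice of momentum coefficient).
    beta = (k - 1.0) / (k + 2.0)
    y = x + beta * (x - x_prev)
    # Linearize g1 and g2 at the extrapolated point.
    z = grad_g1(y) + subgrad_g2(y)
    # Bregman proximal subproblem keeps h and the kernel-generated distance.
    x_new = solve_bregman_subproblem(z, y)
    return x_new, x
```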
4.2. Settings for the BAPGs Method on QTPSVM
In order to apply BAPGs to the QTPSVM problem, we first organize problem (3.1) into the form of (4.1) (throughout the details of the algorithm we use a single variable to denote the pair $(w,b)$): the differentiable part $g_{1}$ collects the quartic truncated pinball losses, and the two terms of the elastic net penalty play the roles of $g_{2}$ and $h$. According to the common setting of similar problems [19], the Bregman kernel $\phi$ is taken to contain a quartic term in addition to the usual quadratic one. In theory, the parameter of this kernel should ensure that the associated auxiliary function is convex, namely that the kernel dominates the curvature of $g_{1}$ at every point. Precisely calculating the admissible range of this parameter is very difficult. However, in the actual algorithm we only need convexity near the optimal point, and in the experiments the gradients and second derivatives involved do not vary significantly near the optimal point. Therefore, we approximately estimate the parameter at the initial point (if the initial point is zero, we use a default value), adjust it after $\bar{N}$ iterations (where $\bar{N}$ is introduced next), and then restart the algorithm with the new parameter from the original initial point.
For the parameter sequence of the algorithm, we fix its initial value, use one update rule for the first $\bar{N}$ iterations and another rule afterwards, where $\bar{N}$ is a positive integer. The selection of $\bar{N}$ is flexibly determined in the algorithm according to the judgment method in [16].
Finally, the following result is used to obtain the closed-form solution of the subproblem at each iteration.

Proposition. For $\mu>0$, let $S_{\mu}$ be the soft-thresholding operator, i.e., $\bigl(S_{\mu}(v)\bigr)_{i}=\operatorname{sign}(v_{i})\max\{|v_{i}|-\mu,\,0\}$. Let
$$x^{*}=\arg\min_{x}\Bigl\{\mu\|x\|_{1}+\frac{a}{2}\|x\|^{2}+\frac{c}{4}\|x\|^{4}-\langle z,\,x\rangle\Bigr\},$$
where $a$ and $c$ are positive constants. We have $x^{*}=0$ when $S_{\mu}(z)=0$, and
$$x^{*}=\frac{S_{\mu}(z)}{a+ct}$$
otherwise, where $t>0$ is such that $t\,(a+ct)^{2}=\|S_{\mu}(z)\|^{2}$.

Proof. By the first-order optimality condition for convex problems, we have $z\in\mu\,\partial\|x^{*}\|_{1}+\bigl(a+c\|x^{*}\|^{2}\bigr)x^{*}$. Let $t=\|x^{*}\|^{2}$; we have $(a+ct)\,x^{*}=S_{\mu}(z)$, thus $x^{*}=S_{\mu}(z)/(a+ct)$. Noting that $\|x^{*}\|^{2}=t$, so that $t\,(a+ct)^{2}=\|S_{\mu}(z)\|^{2}$, the conclusion is proved.
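Assuming the subproblem takes the $\ell_{1}$ + quadratic + quartic form stated in the Proposition above, a minimal implementation of the closed-form solution looks as follows; the scalar equation for $t$ is solved by bisection, and all function and variable names are ours.

```python
import numpy as np

def soft_threshold(v, mu):
    """Soft-thresholding operator: sign(v_i) * max(|v_i| - mu, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - mu, 0.0)

def quartic_l1_prox(z, mu, a, c, tol=1e-12):
    """Solve argmin_x  mu*||x||_1 + (a/2)*||x||^2 + (c/4)*||x||^4 - <z, x>.

    Following the Proposition: x* = S_mu(z) / (a + c*t), where t >= 0 solves
    t * (a + c*t)^2 = ||S_mu(z)||^2 (a strictly increasing equation in t).
    """
    s = soft_threshold(z, mu)
    rhs = float(np.dot(s, s))
    if rhs == 0.0:
        return np.zeros_like(z)
    # Bisection on the increasing function t -> t*(a + c*t)^2 - rhs.
    lo, hi = 0.0, 1.0
    while hi * (a + c * hi) ** 2 < rhs:
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if mid * (a + c * mid) ** 2 < rhs:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    return s / (a + c * t)
```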
5. Numerical Experiments
To evaluate the performance of QTPSVM, we compare it with HTPSVM on real datasets. HTPSVM is solved by the accelerated proximal gradient algorithm (APG) as in [20]. The experimental datasets are all taken from the UCI Machine Learning Repository [21]. Both models start from the zero vector in all experiments. The algorithm stops when two relative-change conditions, measured on the iterates and on the objective values, both fall below a prescribed stopping tolerance.
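As an illustration only (the exact criteria and tolerance used in the paper are not reproduced here), a typical implementation of such a two-part relative-change stopping test is:

```python
import numpy as np

def should_stop(x_new, x_old, f_new, f_old, eps):
    """Stop when both the iterate change and the objective change are small
    relative to the current magnitudes (a typical choice, not necessarily
    the exact criterion used in the paper)."""
    dx = np.linalg.norm(x_new - x_old) / max(1.0, np.linalg.norm(x_new))
    df = abs(f_new - f_old) / max(1.0, abs(f_new))
    return dx < eps and df < eps
```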
As is well known, the choice of parameters plays a crucial role in the effectiveness of machine learning methods [22]-[24]. The parameters of the APG are set as in [10]. For QTPSVM and HTPSVM, the loss parameters are chosen as in [10], and two of them are fixed because the algorithm appears insensitive to them [10]. In HTPSVM, the regularization parameters $\lambda_1$ and $\lambda_2$ are set according to the original paper, that is, they are tuned through 10-fold cross-validation over a prescribed grid of candidate combinations. In QTPSVM, 10-fold cross-validation over the same grid showed that one parameter combination performs best and most efficiently on most datasets; other combinations are not only less effective in most cases but also more time-consuming, so we fix this combination. After determining the parameter combination, each dataset is randomly divided into training and test sets 50 times; for each division, 80% of the dataset is used for training and the remaining 20% for testing.
All numerical experiments were conducted on an Intel(R) Core(TM) i5-8265U processor with 8.00 GB memory. The operating system is Windows 10, and the version of Matlab used is R2021a.
We compare QTPSVM with HTPSVM. To further study the robustness of QTPSVM, we randomly flip the labels of the training set at levels of 0%, 5%, and 10%. The classification accuracy on the test set (averaged over the 50 repetitions) is recorded in Table 1, the average running time of each method (in seconds) in Table 2, and, to further evaluate the classifiers, the area under the ROC curve (AUC) in Table 3. From Tables 1-3, we observe that the average running time of QTPSVM is less than that of HTPSVM on most datasets. In terms of classification performance, QTPSVM attains higher accuracy and AUC than HTPSVM on many datasets, and this improvement remains stable under the different levels of label flipping.
Table 1. Classification accuracy on real data sets (%).
| Dataset | QTPSVM (0%) | HTPSVM (0%) | QTPSVM (5%) | HTPSVM (5%) | QTPSVM (10%) | HTPSVM (10%) |
| --- | --- | --- | --- | --- | --- | --- |
| Rocks | 77.02 | 75.17 | 78.05 | 76.34 | 76.59 | 74.83 |
| Bupa | 60.14 | 57.28 | 57.01 | 54.41 | 59.74 | 56.78 |
| Fourclass | 75.28 | 75.64 | 75.57 | 76.23 | 75.29 | 75.77 |
| German | 75.59 | 74.42 | 75.67 | 74.33 | 75.56 | 74.63 |
| Glass | 92.62 | 91.67 | 91.14 | 90.05 | 91.95 | 90.81 |
| Haperman | 73.41 | 72.30 | 73.28 | 72.69 | 74.49 | 73.80 |
| Heart | 84.81 | 85.52 | 83.78 | 84.41 | 84.59 | 84.93 |
| Hepatitis | 85.57 | 85.79 | 84.21 | 84.64 | 83.29 | 83.50 |
| Liver | 67.07 | 66.32 | 65.30 | 65.36 | 64.61 | 63.62 |
| Raisin | 85.51 | 84.38 | 84.01 | 83.17 | 85.84 | 84.86 |
| Tic | 66.92 | 65.31 | 67.35 | 65.21 | 67.78 | 66.31 |
Table 2. Computation time on real data sets (in seconds).
| Dataset | QTPSVM (0%) | HTPSVM (0%) | QTPSVM (5%) | HTPSVM (5%) | QTPSVM (10%) | HTPSVM (10%) |
| --- | --- | --- | --- | --- | --- | --- |
| Rocks | 0.0278 | 0.5550 | 0.0215 | 0.6410 | 0.0226 | 0.4948 |
| Bupa | 0.0150 | 0.4181 | 0.0137 | 0.5011 | 0.0131 | 0.5011 |
| Fourclass | 0.0281 | 1.3818 | 0.0283 | 1.5580 | 0.0265 | 1.5142 |
| German | 0.0331 | 0.6596 | 0.0320 | 0.6879 | 0.0285 | 0.8484 |
| Glass | 0.0150 | 0.4931 | 0.0151 | 0.4548 | 0.0131 | 0.5007 |
| Haperman | 0.0180 | 0.6065 | 0.0181 | 0.5975 | 0.0163 | 0.6039 |
| Heart | 0.0145 | 0.0100 | 0.0164 | 0.0081 | 0.0136 | 0.0153 |
| Hepatitis | 0.0165 | 0.0355 | 0.0163 | 0.0366 | 0.0148 | 0.0666 |
| Liver | 0.0176 | 0.5612 | 0.01195 | 0.4868 | 0.0176 | 0.6065 |
| Raisin | 0.0330 | 0.0204 | 0.0309 | 0.0180 | 0.0299 | 0.0176 |
| Tic | 0.0310 | 1.6001 | 0.0294 | 1.6136 | 0.0303 | 1.5903 |
Table 3. AUC of the two classification models on real datasets.
| Dataset | QTPSVM (0%) | HTPSVM (0%) | QTPSVM (5%) | HTPSVM (5%) | QTPSVM (10%) | HTPSVM (10%) |
| --- | --- | --- | --- | --- | --- | --- |
| Rocks | 0.8539 | 0.8079 | 0.8721 | 0.8142 | 0.8440 | 0.8012 |
| Bupa | 0.6328 | 0.6106 | 0.6375 | 0.5969 | 0.6378 | 0.6229 |
| Fourclass | 0.8308 | 0.8299 | 0.8341 | 0.7789 | 0.7795 | 0.7755 |
| German | 0.7899 | 0.7843 | 0.7849 | 0.7789 | 0.7795 | 0.7755 |
| Glass | 0.9706 | 0.9745 | 0.9637 | 0.9652 | 0.9608 | 0.9677 |
| Haperman | 0.6881 | 0.6863 | 0.6774 | 0.6796 | 0.6850 | 0.6713 |
| Heart | 0.9183 | 0.9176 | 0.9042 | 0.9054 | 0.9114 | 0.9104 |
| Hepatitis | 0.8954 | 0.8979 | 0.8845 | 0.8861 | 0.8674 | 0.8663 |
| Liver | 0.6679 | 0.6635 | 0.6533 | 0.6487 | 0.6557 | 0.6523 |
| Raisin | 0.9258 | 0.9266 | 0.9171 | 0.9182 | 0.9278 | 0.9285 |
| Tic | 0.6172 | 0.6130 | 0.6265 | 0.6194 | 0.6116 | 0.6064 |
6. Conclusion
In this paper we proposed a new robust support vector machine classification model, QTPSVM, based on a quartic truncated pinball loss function, and solved it with the BAPGs method. The comparison of the numerical results of QTPSVM and HTPSVM on real data shows that the new classifier is effective and robust to noise. However, QTPSVM involves multiple parameters, and parameter selection requires cross-validation tuning, which increases the complexity and computational cost of model training. Therefore, QTPSVM may not be the most efficient choice for very large-scale datasets.
Use of Generative-AI Tools Declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.