Using Non-Additive Measure for Optimization-Based Nonlinear Classification

Over the past few decades, numerous optimization-based methods have been proposed for solving the classification problem in data mining. Classic optimization-based methods do not consider attribute interactions toward classification. Thus, a novel learning machine is needed to provide a better understanding of the nature of classification when the interactions among the contributions from various attributes cannot be ignored. These interactions can be described by a non-additive measure, while the Choquet integral can serve as the mathematical tool to aggregate the values of attributes with the corresponding values of the non-additive measure. As the main part of this research, a new nonlinear classification method with non-additive measures is proposed. Experimental results show that applying non-additive measures to the classic optimization-based models improves classification robustness and accuracy compared with some popular classification methods. In addition, motivated by the well-known Support Vector Machine approach, we transform the primal optimization-based nonlinear classification model with the signed non-additive measure into its dual form by applying Lagrangian optimization theory and Wolfe's dual programming theory. As a result, the 2^n - 1 parameters of the signed non-additive measure can now be approximated with m (the number of records) Lagrangian multipliers by applying the necessary conditions for the primal classification problem to be optimal. This method of parameter approximation is a breakthrough for solving a non-additive measure in practice when only a relatively small number of training cases is available (m < 2^n - 1). Furthermore, kernel-based learning enables the nonlinear classifiers to achieve better classification accuracy. This research produces practically deliverable nonlinear models with the non-additive measure for the classification problem in data mining when interactions among attributes are considered.


Introduction
Classic optimization-based methods formulate classification problems by modeling data with standard optimization techniques using objectives and constraints. Mathematical programming provides a general solution to the optimization problem. For example, references [1,2] proposed two classification models based on reducing misclassification by minimizing the overlaps or maximizing the distances of data points in a linear system. A method named Multiple Criteria Linear Programming (MCLP) [3,4] was initiated to compromise the objectives of the models in [1] and [2] simultaneously, and achieved a better data separation in a linear system. Alternatively, a quadratic model can be used to deal with linearly inseparable situations [5]. The key idea of these approaches is to separate data that are in different classes while pulling together data that are in the same class. Initiated by [6], another well-known optimization-based classification method is the Support Vector Machine (SVM), which mathematically constructs hyperplanes from support vectors. Furthermore, SVM separates data nonlinearly by introducing so-called nonlinear kernel functions.
Although these optimization-based methods separate data linearly or nonlinearly, they do not consider contributions from the interaction among attributes. In this paper, we use a nonadditive measure to model data with interactions and propose new nonlinear classification models. Nonlinear integrals can be used as tools to aggregate unknown parameters in the non-additive measure and values of attributes. As one of nonlinear integrals, the Choquet integral [7] is chosen as the aggregation tool for data modeling for classification problem. In addition, we investigate the direction of constructing nonlinear objectives by developing kernel functions in nonlinear classification models, a technique taken by SVM.
The rest of this paper is organized as follows. In Section 2, an overview of classic optimization-based classification methods is provided. Section 3 reviews definitions of non-additive measures and the Choquet integral. In Section 4, a new optimization-based classification model with a non-additive measure is proposed. Section 5 describes the Lagrangian optimization approach to solving the issue of limited training samples with the proposed nonlinear classification model. Section 6 shows the performance of the proposed models in experimental results. Finally, Section 7 provides conclusions from this research.

Preliminary
In this section, we provide an overview of classic optimization-based classification methods.
Consider a dataset consisting of n attributes and m records. Let X = {x_1, x_2, ···, x_n} denote the set of feature attributes and y the class label, where y_j ∈ {-1, 1} for a two-class dataset. The values of the attributes x_1, x_2, ···, x_n for the j-th record are denoted by f_j, j = 1, ···, m; note that f_j can be regarded as a vector. In addition, y_j is the corresponding class label of the j-th record. The mathematical programming or optimization-based approach has been widely used in many applications. In particular, numerous mathematical programming methods based on optimization techniques have been proposed for solving the classification problem [1-3,6]. In classification, the concept of classes is generally expressed by comparing the weighted sum wf_j with a critical value b, where w, f_j, and b represent the attribute weights, the attribute values, and the classification critical value, respectively. For a dataset with two classes, the decision function is defined as: y_j = 1 (wf_j ≥ b) if the j-th record belongs to class 1, and y_j = -1 (wf_j < b) if the j-th record belongs to class 2.
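As a minimal sketch of this decision rule (the weights w and critical value b below are hypothetical, not values produced by any model in this paper):

```python
import numpy as np

def linear_decision(w, f, b):
    """Return +1 (class 1) if the weighted sum w.f reaches the
    critical value b, otherwise -1 (class 2)."""
    return 1 if np.dot(w, f) >= b else -1

# Hypothetical weights and critical value for a 3-attribute record.
w = np.array([0.4, 0.1, 0.5])
b = 0.5
print(linear_decision(w, np.array([1.0, 1.0, 1.0]), b))  # weighted sum 1.0  -> 1
print(linear_decision(w, np.array([0.2, 0.1, 0.1]), b))  # weighted sum 0.14 -> -1
```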
The two linear classification methods of [1,2] are based on the idea of reducing misclassification by minimizing the overlaps or maximizing the sum of distances in a linear system. One approach maximizes the sum of minimum distances (MMD) of the data from the critical value. The other separates the data by minimizing the sum of deviations (MSD), the overlapped distances between classes, from the critical value. These two classic linear classification models can be described in a standard optimization form:

(MMD) Maximize Σ_j β_j subject to y_j(wf_j - b) ≥ β_j, β_j ≥ 0, j = 1, ···, m;

(MSD) Minimize Σ_j α_j subject to y_j(wf_j - b) ≥ -α_j, α_j ≥ 0, j = 1, ···, m,
where α_j denotes the degree of overlapping between the two classes and β_j denotes the distance from the observation to the critical classification value b. The weights w are optimized by linear programming, a typical optimization technique. The critical value b is given as a non-zero real constant.
The above two linear classification models convey the basic idea of data separation: pull the data away from the boundary (maximize the sum of β_j in MMD) or make the data-overlapping area as small as possible (minimize the sum of α_j in MSD). However, these approaches present some optimization difficulties. For example, the MMD model cannot be optimized as stated, because β_j can grow without bound when the goal is to maximize the sum of β_j. Thus, in implementations of the MMD model, β_j is bounded as β_j ≤ β*, where β* is a given positive constant. The MMD classification model is only able to classify linearly separable datasets. Similarly, the α_j in the MSD model has to be bounded below by a very small positive value α*, as α_j ≥ α*.
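As an illustrative sketch (not the authors' implementation), one reading of the MSD model (minimize Σ α_j subject to y_j(w·f_j - b) ≥ -α_j, α_j ≥ 0, with b fixed) can be handed to an off-the-shelf LP solver; the toy records, labels, and b = 1 are assumptions:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical two-class records f_j (rows) and labels y_j.
F = np.array([[2.0, 2.0], [3.0, 1.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, n = F.shape
b = 1.0  # critical value, fixed in advance as in the MSD model

# Variables x = (w_1..w_n, alpha_1..alpha_m); minimize the total overlap.
c = np.concatenate([np.zeros(n), np.ones(m)])

# y_j (w.f_j - b) >= -alpha_j  rewritten as  -y_j f_j.w - alpha_j <= -y_j b
A_ub = np.hstack([-y[:, None] * F, -np.eye(m)])
b_ub = -y * b

bounds = [(None, None)] * n + [(0, None)] * m  # w free, alpha_j >= 0
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(res.status, res.fun)  # this toy set is separable, so the overlap is 0
```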
Efforts have been made to improve optimization-based linear classification to better deal with linearly inseparable data. For example, the MCLP approach was initiated by compromising the two objectives of MMD and MSD simultaneously and achieved better classification within a linear system [3]. The MCLP model compromises the objectives as [3]:

(MCLP) Minimize Σ_j α_j - Σ_j β_j subject to y_j(wf_j - b) = β_j - α_j, 0 ≤ β_j ≤ β*, α_j ≥ 0, j = 1, ···, m,

where β* is a positive constant restricting the upper bound of β_j. Another direction for improving optimization-based classification is to develop nonlinear models by constructing nonlinear objectives, such as the Multiple Criteria Quadratic Programming (MCQP) nonlinear optimization classification [5].
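A sketch of one reading of the MCLP compromise (minimize Σ α_j - Σ β_j subject to y_j(w·f_j - b) = β_j - α_j, with β_j capped by β*); the toy data, b = 1, and β* = 2 are assumptions, not values from the paper:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical two-class records f_j and labels y_j (illustrative only).
F = np.array([[2.0, 2.0], [3.0, 1.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, n = F.shape
b, beta_star = 1.0, 2.0

# Variables x = (w, alpha, beta); objective: sum(alpha) - sum(beta).
c = np.concatenate([np.zeros(n), np.ones(m), -np.ones(m)])

# Compromise constraint y_j (w.f_j - b) = beta_j - alpha_j rewritten as
# y_j f_j.w + alpha_j - beta_j = y_j b.
A_eq = np.hstack([y[:, None] * F, np.eye(m), -np.eye(m)])
b_eq = y * b

bounds = [(None, None)] * n + [(0, None)] * m + [(0, beta_star)] * m
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(res.status, round(res.fun, 4))
```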

Non-Additive Measures
A common characteristic of the methods described above is that modeling is based on the assumption that the contribution of all attributes toward classification is the sum of the contributions of each individual attribute. None of these methods considers the interactions among attributes toward classification, which may provide a better understanding of the nature of classification and achieve more satisfactory results. In addition, since attributes are not completely isolated from each other, a model that represents their interactions can describe the underlying phenomenon of applications such as classification more adequately. Such a model should have the potential for increased robustness, defined as the ability to maintain effective performance on both training and testing results across a diversity of datasets. In particular, a classification model is said to be robust when the performance of its testing results is not significantly distant from that of its training results.
The theory of non-additive measures can achieve increased robustness and better performance in classification. The basics of non-additive measures and nonlinear integrals are briefly reviewed in the rest of this section.

Definition of Non-Additive Measures
The attribute interactions can be represented by a non-additive measure. The concept of the non-additive measure (also referred to as the fuzzy measure) was initiated in the 1950s and has been well developed since the 1970s [7-9].
Let the finite set X = {x_1, ···, x_n} denote the attributes in a multidimensional dataset. A non-additive measure μ is a set function defined on the power set P(X) with μ(∅) = 0 that is not required to satisfy additivity, i.e., μ(A ∪ B) may differ from μ(A) + μ(B) for disjoint A, B ⊆ X; when μ is also allowed to take negative values, it is called a signed non-additive measure [8]. The 2^n - 1 values of the non-additive measure μ on the nonempty subsets of X are unknown parameters. The signed non-additive measure is adopted in this paper to develop the optimization-based nonlinear classification models.
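A concrete way to picture this (the measure values below are illustrative, not fitted from any dataset): store μ by subset bitmask and read the interaction off as the departure from additivity.

```python
# A signed non-additive measure on X = {x1, x2}, stored by subset bitmask:
# bit i set  <=>  x_{i+1} belongs to the subset.  Values are illustrative only.
mu = {
    0b00: 0.0,   # mu(empty set) = 0 by definition
    0b01: 0.2,   # mu({x1})
    0b10: 0.3,   # mu({x2})
    0b11: 1.0,   # mu({x1, x2}) != mu({x1}) + mu({x2}): the measure is non-additive
}

# The "interaction" between x1 and x2 is the departure from additivity:
interaction = mu[0b11] - (mu[0b01] + mu[0b10])
print(interaction)  # 0.5 > 0: the attributes jointly contribute more than separately
```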

Choquet Integral
Nonlinear integrals are used as data aggregation tools to integrate the values of attributes with respect to a non-additive measure. Among nonlinear integrals, the Choquet integral is more appropriate for applications such as classification because it conveys important information about the interactions among attributes [10].
Let f(x_1), ···, f(x_n) be the values of the attributes for a record, where n is the number of attributes in the dataset. The Choquet integral may be calculated as [11]:

(C)∫ f dμ = Σ_{j=1}^{2^n - 1} z_j μ_j,

where μ_j is the value of μ on the subset of X encoded by j. Letting j_n j_{n-1} ··· j_1 represent the binary form of j (so that x_i belongs to the subset exactly when j_i = 1, i.e., when the fractional part of j/2^i lies in [1/2, 1)), the coefficient z_j in the formula is determined as

z_j = min_{i: j_i = 1} f(x_i) - max_{i: j_i = 0} f(x_i) if this difference is positive, and z_j = 0 otherwise,

where the maximum operation on the empty set is taken to be zero.
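The 2^n - 1 term formula above can be sketched directly in code (the two-attribute measure values are the illustrative ones used earlier; for simplicity the sketch assumes non-negative attribute values):

```python
def choquet_integral(f, mu):
    """Choquet integral of attribute values f = (f(x1), ..., f(xn)) with
    respect to a (signed) non-additive measure mu, given as a dict keyed
    by subset bitmask (bit i set <=> x_{i+1} in the subset).

    Uses the 2^n - 1 term formula: (C)int f dmu = sum_j z_j * mu_j, where
    z_j = min over the subset of f minus max over its complement when that
    difference is positive (max over the empty set taken as 0), else 0.
    """
    n = len(f)
    total = 0.0
    for j in range(1, 2 ** n):
        inside = [f[i] for i in range(n) if (j >> i) & 1]
        outside = [f[i] for i in range(n) if not (j >> i) & 1]
        z = min(inside) - (max(outside) if outside else 0.0)
        if z > 0:
            total += z * mu[j]
    return total

# Illustrative measure on two attributes:
mu = {0b01: 0.2, 0b10: 0.5, 0b11: 1.0}
print(choquet_integral([1.0, 3.0], mu))  # 2.0
```

For f(x_1) = 1, f(x_2) = 3 this agrees with the classic sorted-values form: 1·μ(X) + (3 - 1)·μ({x_2}) = 1 + 1 = 2.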

Optimization-Based Nonlinear Classifiers with Non-Additive Measures
The idea of using a non-additive measure for the classification problem is not new. In the fuzzy measure community, non-additive measures have been utilized to model attribute interactions for data separation purposes. For example, reference [12] used the Choquet integral with respect to a non-additive measure for statistical pattern classification based on possibility theory; an optimization-based classification model with a non-additive measure was proposed later [13]. Reference [14] proposed k-interactive (k = 2) classification with feature selection based on a pattern-matching algorithm similar to [12]. Classification can also be achieved by directly separating the data using the weighted Choquet integral projection [15] or a penalized signed fuzzy measure [16]. A detailed discussion of the geometric meaning of the contributions from feature attributes in nonlinear classification can be found in [17].
There are limitations to the above methods, notably: (a) impracticality: due to the complexity of the non-additive measure, the methods were only applicable to datasets with a small number of attributes, generally fewer than 5; (b) limited performance: the classification accuracy was not promising compared with other popular methods [13], due to the lack of better learning algorithms for determining the unknown parameters of a non-additive measure. For instance, the classification model in [15] with the Choquet integral has an infinite number of solutions, and the proposed method can only determine one of them. To address these limitations, the current research intends to provide a more practical and powerful solution for nonlinear classification with a non-additive measure.
In addition, early studies of non-additive measures for classification also show limitations in classification accuracy and scalability. For example, although the classification model in [12] is well developed in theory (similar to a Bayesian classifier), the classification did not show any benefit from using a non-additive measure, and it even had difficulty obtaining good results on the small Iris dataset, a benchmark dataset from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/). An optimization-based nonlinear classification model [13] with a non-additive measure was later proposed and studied. The results show that it performs even worse than a linear classifier on the Iris dataset and is only competitive with the fuzzy k-NN classifier on other datasets. The research in [13] suggests that a better non-additive measure identification algorithm is needed.
One improvement for nonlinear optimization-based classifiers with a non-additive measure lies in the optimization of the critical value for classification.
In the MCLP model, the classification critical value b is not optimized but arbitrarily chosen. A better method to determine b could be updating b with the average of the lowest and largest predicted scores [15] during the learning iterations. Alternatively, the critical value b in MCLP can be replaced with the soft margin b ± 1, similar to SVM, which constructs a separation belt instead of a single cutting line. With this technique, the model is guaranteed to produce a unique solution, because the goal of the optimization is to find the cutting line closest to the misclassified data points on both sides. The MCLP model can thus be extended to a linear-programming-solvable problem with optimized b and the signed non-additive measure, model M1:

(M1) Minimize Σ_{j=1}^m α_j
subject to y_j((C)∫ f_j dμ - b) ≥ 1 - α_j, α_j ≥ 0, j = 1, ···, m,

where (C)∫ f_j dμ is the Choquet integral of the j-th record with respect to the signed non-additive measure μ, and both the 2^n - 1 values of μ and the critical value b are decision variables.
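Because the Choquet integral is linear in the measure values once each record f_j is decomposed into its 2^n - 1 coefficients z_j, such a model (minimize total slack Σ α_j subject to y_j(z_j·μ - b) ≥ 1 - α_j, with μ and b as variables) stays a linear program. A sketch with hypothetical two-attribute data:

```python
import numpy as np
from scipy.optimize import linprog

def z_vector(f):
    """Decompose a record f into the 2^n - 1 coefficients z_j so that the
    Choquet integral becomes the linear form z . mu (bitmask ordering)."""
    n = len(f)
    z = np.zeros(2 ** n - 1)
    for j in range(1, 2 ** n):
        inside = [f[i] for i in range(n) if (j >> i) & 1]
        outside = [f[i] for i in range(n) if not (j >> i) & 1]
        d = min(inside) - (max(outside) if outside else 0.0)
        z[j - 1] = d if d > 0 else 0.0
    return z

# Hypothetical 2-attribute training data (illustrative only).
F = np.array([[2.0, 2.0], [3.0, 1.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
Z = np.array([z_vector(f) for f in F])          # m x (2^n - 1)
m, p = Z.shape

# Variables: (mu_1..mu_p, b, alpha_1..alpha_m); minimize total slack.
c = np.concatenate([np.zeros(p + 1), np.ones(m)])
# y_j (z_j.mu - b) >= 1 - alpha_j  ->  -y_j z_j.mu + y_j b - alpha_j <= -1
A_ub = np.hstack([-y[:, None] * Z, y[:, None], -np.eye(m)])
b_ub = -np.ones(m)
bounds = [(None, None)] * (p + 1) + [(0, None)] * m  # mu, b free; alpha >= 0
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(res.status, round(res.fun, 6))
```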

Nonlinear Classification with the Signed Non-Additive Measure by Lagrangian Optimization
As mentioned, it is hard to optimize the non-additive optimization-based classification models when there are not enough observations (m < 2^n - 1). Existing approaches such as the hierarchical Choquet integral [18] and the k-interactive measure [14] ignore some values of the non-additive measure μ to some extent. As a solution, Lagrangian optimization theory can be incorporated to transform the model into a practically solvable form with the best approximation of the parameters of the non-additive measure μ. The Karush-Kuhn-Tucker (KKT) conditions [19] applied in the Lagrangian optimization process are the necessary conditions that guarantee an optimization-based classification model reaches its optimum. To develop a nonlinear classifier that can deal with this situation, a quadratic non-additive optimization-based model is constructed and transformed.

Lagrangian Theory for Optimization
Lagrangian theory provides the necessary conditions for a given nonlinear optimization problem to reach an optimal solution. The KKT conditions in the Lagrangian optimization provide the necessary conditions for the proposed classification model to have an optimal solution. Generally, an optimization problem can be presented as follows [20].

Definition 3. Given functions f, g_i, i = 1, ···, k, and h_i, i = 1, ···, m, defined on a domain Ω, the primal optimization problem is:

Minimize f(w), w ∈ Ω, subject to g_i(w) ≤ 0, i = 1, ···, k, and h_i(w) = 0, i = 1, ···, m.

The generalized Lagrangian function corresponding to Definition 3 is

L(w, λ, ν) = f(w) + Σ_{i=1}^k λ_i g_i(w) + Σ_{i=1}^m ν_i h_i(w).

For this primal problem, the KKT conditions at a point w* are:

∂L(w*, λ*, ν*)/∂w = 0; λ_i* ≥ 0, i = 1, ···, k; and λ_i* g_i(w*) = 0, i = 1, ···, k.

The first two conditions are necessary for the optimization problem to reach its optimum; the third is called the KKT complementary condition. These conditions are also sufficient only if L, as a function of w, is convex. In this research, since the convexity of the primal problem is yet to be proved, only the necessary conditions are considered. Furthermore, since the constraints of the primal classification problem contain no equality conditions, only the multipliers λ_i ≥ 0 for the inequality constraints appear in its Lagrangian. In summary, a necessary condition for a normal point w* to be a minimum of f(w) subject to g_i(w) ≤ 0, i = 1, ···, k, is that the above conditions hold for some values of λ* [20].

Quadratic Non-Additive Optimization-Based Classification
We extend model M1 to a quadratic programming form, rewritten as model M2:

(M2) Minimize (1/2) μᵀμ + C Σ_{j=1}^m α_j
subject to y_j((C)∫ f_j dμ - b) ≥ 1 - α_j, α_j ≥ 0, j = 1, ···, m,

where μ denotes the vector of the 2^n - 1 values of the signed non-additive measure and the quadratic term (1/2) μᵀμ is a constructed objective for modeling purposes. The constant C is normally set to be very large to minimize the impact from the constructed objective.

Nonlinear Classifier with the Non-Additive Measure
The optimization problem M2 can be transformed into its corresponding dual problem. Similar to the optimization process of Support Vector Machines [20], the primal Lagrangian is formed first:

L(μ, b, α, λ) = (1/2) μᵀμ + C Σ_{j=1}^m α_j - Σ_{j=1}^m λ_j [y_j(z_jᵀμ - b) - 1 + α_j],

where the λ_j ≥ 0 are the Lagrangian multipliers and C is a given, relatively large positive constant. Here z_j is the coefficient vector obtained from the general calculation of the Choquet integral, so that (C)∫ f_j dμ = z_jᵀμ, and ⟨z_i, z_j⟩ denotes the inner product of z_i and z_j. Eliminating the multipliers of the constraints α_j ≥ 0 in the standard way yields the upper bound λ_j ≤ C.

The primal problem can then be transformed into its dual problem according to Wolfe's dual programming theory, as the following shows:

(M3) Maximize Σ_{i=1}^m λ_i - (1/2) Σ_{i=1}^m Σ_{j=1}^m λ_i λ_j y_i y_j ⟨z_i, z_j⟩
subject to Σ_{i=1}^m λ_i y_i = 0, 0 ≤ λ_i ≤ C, i = 1, ···, m.

Model M3 can be regarded as a general optimization-based nonlinear classifier with the signed non-additive measure. In addition, the inner product can be further replaced with kernel functions to deliver more accurate classification.
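A sketch of solving such a dual (maximize Σ λ_i - (1/2) Σ Σ λ_i λ_j y_i y_j ⟨z_i, z_j⟩ subject to Σ λ_i y_i = 0, 0 ≤ λ_i ≤ C) with a general-purpose solver; the Choquet coefficient vectors Z, labels, and C are hypothetical toy values:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical Choquet coefficient vectors z_j (rows), labels y, bound C.
Z = np.array([[0.0, 0.0, 2.0], [2.0, 0.0, 1.0],
              [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 1e5
m = len(y)
G = (Z @ Z.T) * np.outer(y, y)   # G_ij = y_i y_j <z_i, z_j>

def neg_dual(lam):
    # Negated dual objective: scipy minimizes, while M3 maximizes.
    return 0.5 * lam @ G @ lam - lam.sum()

res = minimize(neg_dual, x0=np.zeros(m), method="SLSQP",
               bounds=[(0.0, C)] * m,
               constraints=[{"type": "eq", "fun": lambda lam: lam @ y}])
lam = res.x
# KKT stationarity recovers the measure vector: mu = sum_i lam_i y_i z_i
mu_vec = (lam * y) @ Z
print(res.success, np.round(lam, 3), np.round(mu_vec, 3))
```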
It is observed that, for constructing an optimal separation in a feature space, one does not need to consider the feature space in explicit form, but only has to calculate the inner products between the support vectors and the vectors of the feature space [21]. Thus, the inner product operation can be replaced with a kernel function K, a function that corresponds to an inner product in the expanded feature space. Nonlinear kernel functions are able to map the data into a higher-dimensional space to achieve better classification.
Three well-known kernel functions have been adopted for Model M3: the linear, polynomial, and RBF kernel functions.
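For reference, these three kernels can be written as follows (the degree, offset, and gamma values are common defaults chosen for illustration, not the settings used in the experiments):

```python
import numpy as np

def linear_kernel(u, v):
    return np.dot(u, v)

def polynomial_kernel(u, v, degree=3, coef0=1.0):
    return (np.dot(u, v) + coef0) ** degree

def rbf_kernel(u, v, gamma=0.5):
    # exp(-gamma * ||u - v||^2)
    return np.exp(-gamma * np.dot(u - v, u - v))

u, v = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(linear_kernel(u, v))       # 0.0
print(polynomial_kernel(u, v))   # (0 + 1)^3 = 1.0
print(rbf_kernel(u, v))          # exp(-0.5 * 2) = e^-1 ~= 0.3679
```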
The dual model M3 can be solved iteratively: by optimizing a minimal subset of Lagrange multipliers at each step, the objective function is decreased and convergence is guaranteed according to Osuna's theorem [23].

In conclusion, we point out that the applied KKT conditions are the necessary conditions for the classification model to reach its optimum. Model M2 is transformed into its dual form M3 during the Lagrangian optimization to deal with the case of learning from a small training dataset (m < 2^n - 1). Through this compromised solution, the 2^n - 1 parameters of the signed non-additive measure can now be approximated by m Lagrangian multipliers.

Applications

The proposed Model M3 has been applied to both artificial and UCI machine learning datasets for classification and compared with the performance of other methods. Two artificial two-class datasets were randomly generated according to the definition of the Choquet integral: one with two dimensions (2D) and the other with three (3D). Five-fold cross-validation is used for classification evaluation. Model M3 is also compared with other popular classification methods, such as SVMs, Decision Tree (J48), Logistic Regression, and Naive Bayes. The average classification accuracy in percentage over the testing sets of all 5 folds is summarized in Table 1. M3 performs best on both artificial datasets when nonlinear kernel methods are used, although the performance of the different kernel methods varies. The results confirm the theoretical assumption that models with a non-additive measure can deal with attribute interactions, because the datasets were created based on features of the Choquet integral.

To better understand this nonlinear classification, we visually show how Model M2 (the primal problem) perfectly classifies the two-dimensional artificial dataset in Figure 1. The example was taken from the fold-1 training set of the two-dimensional artificial dataset. This training set contains 160 data points: 85 in class 1 and 75 in class 2. Model M2 creates a three-dimensional decision space (x_1, x_2, y), where x_1, x_2 are the attributes of the two-dimensional dataset and y is the decision score of M2. The model classifies data as class 1 when y > b, and otherwise as class 2. Figure 1 presents one solution from the cross validation. In Figure 1(a), the data points belonging to the two classes are represented by asterisks and dots, respectively. The data points shown at the bottom of the figure depict the original 2D data, which are apparently not linearly separable. After applying the Choquet integral to create a third dimension y, the corresponding 3D data points are located in two different 2D planes and are now linearly separable. Figure 1(b) shows the same dataset from a different perspective. The linearly inseparable two-dimensional dataset (x_1, x_2) is lifted into a hyperspace (x_1, x_2, y) by M2 and can then be easily classified by the decision boundary y = -14 (the value of the critical value b).

In addition, the data cannot be perfectly classified by the linear model MSD, as Figure 2 shows. After applying the MSD model, the corresponding 3D data points are still located in one flat surface in the three-dimensional space, and the two classes cannot be linearly separated. In the MSD model, the critical value b is set to 1, and the classification model separates the data by the decision function y = w_1 f_1 + w_2 f_2 (y > 1 indicates class 1 and y < 1 class 2), with the solution w_1 = 0.92, w_2 = 0.70.
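This lifting effect can be reproduced on a tiny scale: with a hypothetical signed measure (values chosen by hand, not fitted by M2), the Choquet score separates the XOR pattern, which no linear decision function can.

```python
# The 2^n - 1 term Choquet formula; a hand-picked signed measure lifts the
# XOR pattern (not linearly separable in 2D) onto separable score levels.
def choquet_integral(f, mu):
    n = len(f)
    total = 0.0
    for j in range(1, 2 ** n):
        inside = [f[i] for i in range(n) if (j >> i) & 1]
        outside = [f[i] for i in range(n) if not (j >> i) & 1]
        z = min(inside) - (max(outside) if outside else 0.0)
        if z > 0:
            total += z * mu[j]
    return total

mu = {0b01: -1.0, 0b10: -1.0, 0b11: 1.0}   # hypothetical signed measure
points = {(0.0, 0.0): 1, (1.0, 1.0): 1, (0.0, 1.0): -1, (1.0, 0.0): -1}
b = -0.5                                    # critical value for this toy case
for f, label in points.items():
    score = choquet_integral(list(f), mu)
    predicted = 1 if score > b else -1
    print(f, score, predicted == label)     # all four classified correctly
```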

Classification of UCI Datasets
The UCI Pima Indian Diabetes and Australian Credit Approval datasets were classified with model M3. The Australian dataset contains two classes (approved or not), 14 attributes, and 690 instances. Both datasets were transformed into [-1, 1] with z-score normalization, and 5-fold cross-validation was conducted for the application. The constant C was set to 100,000 for all the experiments. Table 2 summarizes the results, compared with the SVM classifier with the RBF kernel.
The above results show that M3 outperformed SVM with the RBF kernel on the Australian credit dataset, which indicates that the model is more robust when the dataset has more feature attributes, in the sense that the testing performance is not significantly worse than the training performance. Our experience also shows that the use of Lagrangian optimization makes it feasible to solve for the non-additive measure when the number of attributes is as high as 14. The use of kernel functions also ensured the classification accuracy of the nonlinear model with the signed non-additive measure.

Conclusion
We have proposed a new classification approach based on optimization models in which attribute interactions are considered. The theory of non-additive measures was utilized to model the data with interactions. Traditionally, nonlinear integrals are the aggregation tools for non-additive measures, and the Choquet integral is well suited for data modeling. We have demonstrated the value of using a non-additive measure in optimization-based classification and proposed a more efficient nonlinear model, M3, which can classify data by solving for a smaller number of parameters: the 2^n - 1 parameters of the signed non-additive measure can now be approximated by m Lagrangian multipliers. The optimality of the dual model M3 is guaranteed by the KKT conditions, the necessary conditions for the nonlinear programming problem to be optimal. This method of parameter approximation is useful when the training set has a limited number of samples; the proposed approach is thus suitable for classification applications where the training sample is small compared with the number of attributes. The experiment on the artificial dataset demonstrated the geometric meaning and theoretical underpinnings of the nonlinear classification models. Applications on UCI datasets have shown that this nonlinear approach increases model robustness, as the classification accuracy is stable and the accuracy of the testing results is close to that of the training results. We are now applying our approach to various data mining applications.