Data Fusion with Optimized Block Kernels in LS-SVM for Protein Classification

In this work, we developed a method to efficiently optimize the kernel function for combined data from various sources whose corresponding kernels are already available. The vectorization of the combined data is achieved by a weighted concatenation of the existing data vectors. This induces a kernel matrix composed of the existing kernels as blocks along the main diagonal, weighted according to the corresponding subspaces spanned by the data. The induced block kernel matrix is optimized in the framework of least-squares support vector machines (LS-SVM) at the same time as the LS-SVM is trained, by solving an extended set of linear equations rather than a quadratically constrained quadratic program as in a previous method. The method is tested on a benchmark dataset, and the performance improves significantly, from the highest ROC score of 0.84 using an individual data source to a ROC score of 0.92 with data fusion.


Introduction
Bioinformatics studies often involve analyzing large amounts of data from various sources. Data fusion, in other words, how to combine various data sources in a meaningful way, is crucial to the success of extracting and selecting useful information and features for classification and prediction. Recent advances in kernel-based methods have made them a tool of choice for many bioinformatics tasks. Although the latest developments show that kernel-based methods can be amenable to combining data in straightforward ways, optimized data fusion in a kernel-based framework remains challenging.
In [1], a statistical framework is presented for genomic data fusion. Specifically, the method is based on the algebra of kernels [2] to form a linear combination of individual kernels that characterize pairwise relationships of proteins from different data sources, such as sequence similarity, hydropathy profile, and protein interactions. These data sources contain different and thus partly independent and complementary information about proteins, and combining them is expected to further enhance the total information. Kernel methods offer a very convenient way to resolve one key issue in data fusion: how to deal with heterogeneous data in various formats. As pointed out in [1], despite the various formats (expression data as vectors or time series, sequence data as strings over a 20-letter alphabet, and protein-protein interactions expressed as graphs), evaluating the kernel on all pairs of data points yields a symmetric, positive semi-definite matrix known as the kernel matrix or the Gram matrix. Intuitively, a kernel matrix can be regarded as a matrix of generalized similarity measures among the data points. Ref. [1] shows that a linear combination of kernel matrices, each derived from a different data source, offers an effective way for data fusion, formalizing the meta-learning task for the optimal weights as a quadratically constrained quadratic programming problem. Like Ref. [1], Ref. [3] uses weighted averaging to combine multiple kernels but develops faster algorithms relying on quadratically constrained linear programming. Ref. [4] treats a mix of base kernels as transformation learning from a mixture of transformations and solves the resulting non-convex problem with a semidefinite relaxation for an approximate global solution.
In this work, we developed an alternative approach to data fusion by forming an integrated kernel as a weighted direct sum of the individual kernels in the framework of the Least-Squares Support Vector Machine (LS-SVM), with the advantage that model training and weight optimization are combined into solving a single set of linear equations. Tested on a benchmark dataset of transmembrane proteins, our method improves the classification performance significantly over individual kernels, up to a ROC score of 0.92, which is comparable to what is reported in [1], while removing the constraint that all individual kernel matrices have the same dimension.

Method
As mentioned in the introduction, the work in [1] bases its method on the fact that basic algebraic operations such as addition, multiplication and exponentiation preserve the key property of positive semi-definiteness for kernels [2]. Therefore, for a given set of kernels $K_1, K_2, \ldots, K_m$, the linear combination
$$K = \sum_{i=1}^{m} \mu_i K_i \qquad (1)$$
also forms a kernel.
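As a small numerical illustration of this closure property, the sketch below combines two Gram matrices computed on the same examples and checks that the result is still symmetric and positive semi-definite. It assumes nonnegative weights; the helper names `rbf_gram` and `combine_kernels`, the random data, and the weight values are all hypothetical and are not taken from [1].

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Gram matrix of the radial basis kernel exp(-gamma * ||x - z||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def combine_kernels(kernels, mu):
    """Linear combination sum_i mu_i * K_i of equally sized Gram matrices."""
    return sum(m * K for m, K in zip(mu, kernels))

# Hypothetical data: 30 examples described by 4 attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))

K1 = X @ X.T              # linear kernel
K2 = rbf_gram(X, 0.5)     # radial basis kernel
K = combine_kernels([K1, K2], [0.3, 0.7])

# Smallest eigenvalue; nonnegative up to rounding for a valid kernel matrix.
min_eig = np.linalg.eigvalsh(K).min()
```

Note that this form of combination requires every $K_i$ to be an n × n matrix over the same n examples.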
The authors in [1] show that this kernel can be optimized with respect to $\mu_i$ under additional trace and positive semi-definiteness constraints by solving
$$\min_{\mu_i} \max_{\alpha} \; 2\alpha^T \mathbf{1} - \alpha^T \mathrm{diag}(y) \Big( \sum_{i=1}^{m} \mu_i K_i \Big) \mathrm{diag}(y)\, \alpha \qquad (2)$$
subject to $0 \le \alpha \le C$, $\alpha^T y = 0$, $\mathrm{trace}\big(\sum_{i=1}^{m} \mu_i K_i\big) = c$, and $\sum_{i=1}^{m} \mu_i K_i \succeq 0$.
In this work, we develop an alternative approach by forming an integrated kernel as a weighted direct sum of the individual kernels in the framework of the Least-Squares Support Vector Machine, with the advantage that model training and weight optimization are combined into solving a single set of linear equations. Another benefit is that, unlike Equation (1), the direct sum does not require all individual kernels to have the same dimension.
Suppose there are n examples with a binary classification, $(x_k, y_k)$ for $k = 1, \ldots, n$, where $y_k$, which can be +1 or −1, is the label for example k, and $x_k$ is an $\bar{m}$-dimensional vector of attributes characterizing the example. The support vector machine (SVM) method solves the classification problem with a linear model
$$f(x) = \sum_{i=1}^{\bar{m}} w_i x_i + b = w^T x + b, \qquad (3)$$
where $w_i$ are the weights and b is the bias; x is classified as the sign of $f(x)$. In least-squares SVMs [5], the weights and bias are fixed by optimizing the margin subject to equality constraints for the training examples:
$$\min_{w, b, e} \; J(w, e) = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{k=1}^{n} e_k^2 \quad \text{subject to} \quad y_k \left[ w^T x_k + b \right] = 1 - e_k, \; k = 1, \ldots, n, \qquad (4)$$
where $e_k$ is the slack variable and γ is a parameter balancing the contributions of the "margin" term and the "error" term in Equation (4).
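The optimization can be solved by introducing the following Lagrangian
$$L(w, b, e; \alpha) = J(w, e) - \sum_{k=1}^{n} \alpha_k \left\{ y_k \left[ w^T x_k + b \right] - 1 + e_k \right\},$$
where $\alpha_k$ are Lagrange multipliers. The conditions for optimality can be derived from the stationarity of the Lagrangian as follows:
$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{k=1}^{n} \alpha_k y_k x_k, \qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{k=1}^{n} \alpha_k y_k = 0,$$
$$\frac{\partial L}{\partial e_k} = 0 \;\Rightarrow\; \alpha_k = \gamma e_k, \qquad \frac{\partial L}{\partial \alpha_k} = 0 \;\Rightarrow\; y_k \left[ w^T x_k + b \right] - 1 + e_k = 0.$$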
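For a fixed kernel, eliminating w and the slack variables from the optimality conditions leaves the standard LS-SVM linear system in the bias b and the multipliers α: the first row enforces Σ α_k y_k = 0, and the remaining rows read (Ω + I/γ) α + b y = 1 with Ω_kl = y_k y_l K(x_k, x_l). The sketch below solves that standard system only; it is not the extended system that also optimizes the data fusion weights. The function names, the toy data, and the value of γ are hypothetical.

```python
import numpy as np

def lssvm_train(K, y, gamma=1.0):
    """Solve the standard LS-SVM classifier system for (alpha, b)."""
    n = len(y)
    omega = np.outer(y, y) * K          # Omega_kl = y_k y_l K(x_k, x_l)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y                        # sum_k alpha_k y_k = 0
    A[1:, 0] = y
    A[1:, 1:] = omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]              # alpha, b

def lssvm_decision(K_eval_train, y_train, alpha, b):
    """Decision values f(x) = sum_k alpha_k y_k K(x, x_k) + b."""
    return K_eval_train @ (alpha * y_train) + b

# Hypothetical toy problem: two Gaussian clusters with a linear kernel.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, size=(20, 2)),
               rng.normal(-2.0, 1.0, size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
K = X @ X.T
alpha, b = lssvm_train(K, y, gamma=10.0)
pred = np.sign(lssvm_decision(K, y, alpha, b))
```

Because the constraints are equalities, training reduces to one linear solve instead of a quadratic program, which is the property the block kernel method builds on.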
Now suppose the vector for example k is a weighted direct sum of m vectors characterizing the example from m different data sources:
$$x_k = \beta_1 x_k^{(1)} \oplus \beta_2 x_k^{(2)} \oplus \cdots \oplus \beta_m x_k^{(m)},$$
where $\beta_i$, for i = 1 to m, are the weights. Note that these m vectors do not have to have the same dimension. Let $d_i$, for i = 1 to m, be the dimensions of these m vector spaces, so that $\bar{m} = \sum_{i=1}^{m} d_i$. The dot product in the direct sum vector space is thus induced as the direct product
$$x_k \cdot x_l = \sum_{i=1}^{m} \beta_i^2 \, x_k^{(i)} \cdot x_l^{(i)}.$$
We then replace the dot product in each of the m vector spaces with its corresponding kernel function $K_i$, and we introduce the weights for summation of the individual kernels. Therefore, the final kernel matrix K is composed of the kernel matrices from the individual sub vector spaces in diagonal blocks, as vector components from different data sources do not mix with one another in the direct product. A schematic illustration of the block kernel is shown in Figure 1. It is worth noting that although the direct sum, as a way of data integration, is frequently used in the form of concatenating vectors from various data sources, a kernel defined directly on the total vector space differs from the block kernel: such a kernel may include non-zero values in the off-diagonal blocks, which indicate how similar the vectors from different data sources are to one another. The block kernel introduced here, in contrast, does not prescribe how to directly compare data from different sources for integration. Plugging the above two equations back into the Lagrangian yields a set of linear equations.
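The block structure described above can be sketched by placing each weighted kernel matrix on the main diagonal of a larger matrix and leaving the off-diagonal blocks at zero. This is only an illustration of the layout: the helper `block_kernel`, the random matrices, and the weight values are hypothetical, and applying each weight directly as a multiplier of its block is an assumption rather than the exact weighting used in the method.

```python
import numpy as np

def block_kernel(kernels, weights):
    """Weighted direct sum: block-diagonal matrix with blocks w_i * K_i."""
    sizes = [K.shape[0] for K in kernels]
    total = sum(sizes)
    out = np.zeros((total, total))
    start = 0
    for K, w, s in zip(kernels, weights, sizes):
        out[start:start + s, start:start + s] = w * K
        start += s
    return out

# Two hypothetical kernel matrices of different dimension (3x3 and 4x4).
rng = np.random.default_rng(1)
A = rng.normal(size=(3, 5))
B = rng.normal(size=(4, 2))
K1 = A @ A.T
K2 = B @ B.T

K = block_kernel([K1, K2], [0.25, 0.75])

# Smallest eigenvalue; nonnegative up to rounding when all weights are >= 0.
min_eig = np.linalg.eigvalsh(K).min()
```

Since the blocks never overlap, the individual matrices may have different sizes, and the result stays positive semi-definite whenever every block is positive semi-definite and every weight is nonnegative.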
These linear equations are solved using standard procedures such as QR decomposition; the solution optimizes both the weights in the data fusion kernel and the α's, which together give rise to the maximum margin in the support vector machine. Note that, in Craig and Liao (2007) [6], an adaptive kernel is learned from a weighted dot product in which each component of the vector is individually weighted. Here, instead, all components from the same sub vector space receive the same weight.
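Solving a square linear system by QR decomposition can be sketched as follows: factor A = QR, then solve the triangular system Rx = Qᵀb by back substitution. The system used here is a hypothetical, well-conditioned random matrix, not the extended LS-SVM system itself, and the function names are illustrative.

```python
import numpy as np

def back_substitute(R, c):
    """Solve R x = c for an upper-triangular, nonsingular R."""
    n = R.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (c[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

def solve_qr(A, b):
    """Solve A x = b through the factorization A = Q R."""
    Q, R = np.linalg.qr(A)
    return back_substitute(R, Q.T @ b)

# Hypothetical 5x5 system; the added diagonal keeps it well conditioned.
rng = np.random.default_rng(2)
A = rng.normal(size=(5, 5)) + 5.0 * np.eye(5)
b = rng.normal(size=5)
x = solve_qr(A, b)
```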

Results
The method is tested with a benchmark dataset as used in [1], primarily for the sake of convenient comparison. The dataset comprises proteins from the MIPS Comprehensive Yeast Genome Database (CYGD) [7]. The CYGD assigns 1125 yeast proteins to particular complexes, of which 138 participate in the ribosome. The remaining approximately 5000 yeast proteins are unlabeled. Similarly, CYGD assigns subcellular locations to 2318 yeast proteins, of which 497 belong to various membrane protein classes, leaving 4000 yeast proteins with uncertain location.
The data sources include sequence similarity from BLAST, sequence similarity from Smith-Waterman, Pfam domains, hydropathy profile with FFT, PPI with a linear kernel, PPI with a diffusion kernel, and gene expression with a radial basis kernel. The individual kernels, which are centrally normalized by a procedure used in [1], are listed in Table 1. The sequence-based kernel matrices are generated using the BLAST [8] and Smith-Waterman (SW) [9] pairwise sequence comparison algorithms, as first described by Liao and Noble [10]. Both algorithms use gap opening and extension penalties of 11 and 1, and the BLOSUM 62 matrix. Because matrices of BLAST or Smith-Waterman scores are not necessarily positive semi-definite, we represent each protein as a vector of scores against all other proteins. The similarity between proteins is then computed as the inner product between the score vectors. The Gram matrix thus obtained for a set of n proteins is provably a valid kernel matrix [11]. The Pfam kernel matrix $K_{Pfam}$ is defined similarly to $K_B$ and $K_{SW}$, but with the pairwise similarity scores replaced by expectation values derived from hidden Markov models (HMMs) in the Pfam database [12]. Details about these and the other kernels can be found in [1].
Each data source is first used individually to train an LS-SVM with its corresponding kernel function and then used in data fusion mode as described above, namely, by forming a block kernel matrix. All trained models are tested with a ten-fold cross-validation scheme. The performance is measured by the receiver operating characteristic (ROC) score, which is the normalized area under a curve that plots the number of true positives as a function of the number of false positives predicted by the trained LS-SVM as a moving cutoff score scans from −1 to +1 [13]. The ROC score is 1 for a perfect performance, whereas a random predictor, which uniformly mixes up positives and negatives, is expected to get a ROC score of 0.5.
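The evaluation protocol can be sketched with two small helpers. The ROC score is computed through its equivalent pairwise form, the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half; this equals the normalized area under the ROC curve. The function names, the fold assignment by random permutation, and the example scores are hypothetical.

```python
import numpy as np

def roc_score(y_true, scores):
    """Normalized area under the ROC curve for labels in {+1, -1}."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == -1]
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (len(pos) * len(neg))

def kfold_indices(n, k=10, seed=0):
    """Split the indices 0..n-1 into k disjoint, shuffled test folds."""
    perm = np.random.default_rng(seed).permutation(n)
    return np.array_split(perm, k)

# Hypothetical labels and classifier outputs.
labels = np.array([1, 1, -1, -1])
outputs = np.array([0.9, 0.1, 0.5, -0.3])
score = roc_score(labels, outputs)   # 3 of 4 pairs ranked correctly: 0.75
folds = kfold_indices(25, k=10)
```

In a ten-fold run, each fold serves once as the test set while the model is trained on the remaining nine, and the ROC score is computed on the held-out predictions.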
Table 2 shows the ROC scores for classifying the membrane protein category using the various data sources and the corresponding kernels, individually versus when all are combined by data fusion (ALL). Data fusion increases the performance, achieving a ROC score of 0.917, a significant jump from the best ROC score of 0.835 obtained with a single data source (Pfam domains). This performance is very close to the best ROC score of 0.926 reported in [1]. Note that the ROC score varies across individual data sources, and some of the scores are significantly lower than their counterparts in [1]. While the exact causes of these discrepancies are not known, one possibility is that the individual kernels are fine-tuned for regular SVMs, which use a margin defined differently from that of least-squares SVMs.
Given the poor ROC scores from individual data sources, it is even more remarkable how well the data fusion kernel performs.

Conclusion
We developed a method for combining data from various sources in the framework of least-squares support vector machines. The method allows for weighting the various data sources for optimized learning with an induced block kernel matrix. By formulating the induced kernel as weighted by the corresponding subspaces, we can optimize the weights at the same time as the LS-SVM is trained, by solving an extended set of linear equations. The results on a set of benchmark data show a significant improvement in classification performance from the integration, and are comparable to those from a similar approach based on quadratically constrained quadratic programming as a special case of semi-definite programming.

Figure 1. Schematic illustration of the block kernel induced from the direct sum of sub vector spaces.