Prediction of Peptides Binding to Major Histocompatibility Class II Molecules Using Machine Learning Methods

In daily life, we are frequently attacked by infectious organisms such as bacteria and viruses. Major Histocompatibility Complex (MHC) molecules have an essential role in T-cell activation and in initiating an adaptive immune response. The development of methods for predicting MHC-peptide binding is important for vaccine design and immunotherapy. In this study, we try to predict the binding between peptides and MHC class II molecules. A Support Vector Machine (SVM) and a Multi-Layer Perceptron (MLP) are used for classification. These classifiers classify the data based on pseudo amino acid compositions extracted from the PseAAC server. Since the dataset used in this work is imbalanced, we apply a pre-processing step that over-samples the minority class to overcome this problem. The results show that using the concept of pseudo amino acid composition and applying an over-sampling method increases the performance of the predictor. Furthermore, the results demonstrate that combining the concept of PseAAC with an SVM is a successful method for predicting peptides binding to MHC class II molecules.


Introduction
Major Histocompatibility Complex (MHC) molecules play a significant role in graft rejection and T-cell activation. Binding between an antigenic peptide and an MHC molecule is a necessary prerequisite for the recognition of antigens by T cells and for initiating an adaptive immune response [1]. However, not all peptides can bind to MHC molecules; only some of them can. Predicting which peptides can bind to MHC molecules is important for understanding the immune system response. A peptide that can bind to an MHC molecule and causes an immune response is called a T-cell epitope.
Antigen processing and presentation take place via the MHC class I and MHC class II pathways. It is clear that the development of machine learning methods to predict epitopes can reduce the number of high-cost assays needed to identify T-cell epitopes. This prediction is important for vaccine design and immunotherapy for diseases such as cancer [2].
In this study, we use two machine learning methods to predict the binding between peptides and MHC class II molecules and apply these methods to the HLA-DRB1*0301 data.
For machine learning approaches, we need to extract features from amino acid sequences. Different computational methods have been introduced for this purpose. One of them is Composition-Transition-Distribution (CTD). In this method, in order to apply machine learning, peptides with different lengths are mapped to fixed-length representations [3]. Another proposed method is the k-spectrum kernel. If the similarity between two sequences is high, the sequences have a large k-spectrum kernel value; this means that they share many common k-mer subsequences [4]. In another work, the Local Alignment (LA) kernel was suggested for prediction. In this method, local alignment with gaps is applied to the sequences to obtain a score, which is used to measure the similarity between the sequences [5].
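To make the k-spectrum idea concrete, here is a minimal Python sketch (not the implementation of [4]): two sequences are scored by the dot product of their k-mer count vectors, so the value is large exactly when they share many k-mer subsequences. The example peptide is arbitrary.

```python
from collections import Counter

def kmer_counts(seq, k=3):
    """Count all overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def k_spectrum_kernel(s1, s2, k=3):
    """Dot product of k-mer count vectors: large when many k-mers are shared."""
    c1, c2 = kmer_counts(s1, k), kmer_counts(s2, k)
    return sum(c1[m] * c2[m] for m in c1)

# A sequence shares every k-mer with itself; unrelated sequences share none.
print(k_spectrum_kernel("PKYVKQNTLKLAT", "PKYVKQNTLKLAT", k=3))  # 11
print(k_spectrum_kernel("PKYVKQNTLKLAT", "GGGGGGGG", k=3))       # 0
```

In kernel-based SVMs this similarity score is used directly in place of an explicit feature mapping.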
In this work, we calculate the pseudo amino acid compositions [6] for the peptides using the PseAAC server and then classify the peptides based on these extracted features. To deal with the imbalanced dataset problem, we apply a preprocessing step that balances the class distribution; for this purpose, we use the Synthetic Minority Over-sampling Technique (SMOTE). A Multi-Layer Perceptron (MLP) and a Support Vector Machine (SVM) are used for the classification task, and the results are compared with previous results in [7]. In order to implement these methods, the Weka machine learning workbench is used (www.cs.waikato.ac.nz). Figure 1 shows the main steps of our approach.
The remainder of the paper is organized as follows: Section II describes the techniques used in our approach, Section III contains brief information about the datasets used in this study, Section IV presents the results and describes the evaluation parameters, and Section V concludes the paper.

Generating Chou's PseAAC
To extract features from protein sequences without losing much of the important information hidden in them, Chou's PseAAC was proposed as a replacement for the simple amino acid composition (AAC), the frequency of each amino acid within a protein, for representing a protein sample. For a summary of its recent developments and applications, such as how to use the concept of Chou's PseAAC to incorporate functional domain information, GO (gene ontology) information, and sequential evolution information, among many others, see a recent comprehensive review [8]. PseAAC is a flexible web server for generating various kinds of protein pseudo amino acid composition, available at http://chou.med.harvard.edu/bioinf/PseAAC. The PseAAC of a given protein sample is represented by a set of more than 20 discrete factors, where the first 20 factors represent the components of its conventional AAC and the additional factors incorporate some of its sequence order information via various modes. Typically, these additional factors are a series of rank-different correlation factors along the protein chain, but they can also be any combination of other factors as long as they reflect some sort of sequence order effect one way or another. Three different types of parameters are often used to generate the various kinds of PseAAC: quantitative characters of amino acids, the weight factor, and the rank of correlation.
The following six AA characters are supported by the PseAAC server to calculate the correlations between amino acids at different positions along the protein chain: (1) hydrophobicity, (2) hydrophilicity, (3) side chain mass, (4) pK1 (alpha-COOH), (5) pK2 (NH3) and (6) pI. The user can select any character or combination of characters as part of the input. The weight factor is designed to let the user put weight on the additional PseAA components with respect to the conventional AA components; any value in the range 0.05 to 0.70 can be selected. The counted rank (or tier) of the correlation along the protein sequence is represented by λ [9]. Calculations by the PseAAC server for all six characters and their binary and ternary combinations have been considered (Table 1).
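The type 1 (parallel-correlation) PseAAC used in this work can be sketched as follows. The hydrophobicity values below are the Kyte-Doolittle scale, used purely for illustration, and the normalization conventions may differ in detail from those of the PseAAC server; the point is the structure of the descriptor: 20 amino acid frequencies plus λ sequence-order correlation factors, jointly normalized with weight w.

```python
import math

# Illustrative hydrophobicity scale (Kyte-Doolittle); any of the six
# supported characters, or a combination, could be used instead.
HYDRO = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
         "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
         "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
         "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}
AAS = sorted(HYDRO)

def _normalize(scale):
    """Standardize a property scale to zero mean, unit s.d. over the 20 AAs."""
    vals = list(scale.values())
    mu = sum(vals) / 20
    sd = math.sqrt(sum((v - mu) ** 2 for v in vals) / 20)
    return {a: (v - mu) / sd for a, v in scale.items()}

def pseaac_type1(seq, lam=1, w=0.05, scale=HYDRO):
    """Type 1 (parallel-correlation) PseAAC: 20 + lam components."""
    h = _normalize(scale)
    L = len(seq)
    freqs = [seq.count(a) / L for a in AAS]
    # Rank-j correlation factors along the chain, j = 1..lam
    thetas = [sum((h[seq[i]] - h[seq[i + j]]) ** 2 for i in range(L - j)) / (L - j)
              for j in range(1, lam + 1)]
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]

vec = pseaac_type1("PKYVKQNTLKLAT", lam=1, w=0.05)
print(len(vec))  # 21 components: 20 AAC terms + 1 correlation factor
```

With λ = 1 and w = 0.05, as in this study, each peptide becomes a 21-dimensional feature vector whose components sum to 1.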

SVM
SVMs, an algorithm for the classification of both linearly and nonlinearly separable data, map the original data into a higher dimension, where a hyperplane can be found as a discriminant function that separates the data using certain instances called support vectors. This discriminant function is represented as a linear function in feature space of the form f(x) = w^T ϕ(x) for some weight vector w ∈ F. Given a training set of instance-label pairs (x_i, y_i), i = 1, 2, …, l, where x_i ∈ R^n and y_i ∈ {1, −1}, the input samples x_i are mapped into a higher-dimensional feature space ϕ(x_i) so that a nonlinearly separable problem can be solved. The classical maximum-margin SVM classifier aims to find a hyperplane of the form w^T ϕ(x) + b = 0 that separates the patterns of the two classes.
In the case of noisy data, to avoid poor generalization on unseen data, a vector of slack variables Ξ = (ξ_1, ξ_2, …, ξ_l)^T should be taken into account. The problem can then be written as:

min_{w,b,ξ} (1/2)‖w‖^2 + C Σ_{i=1}^{l} ξ_i, subject to y_i (w^T ϕ(x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, …, l.     (1)

The solution then yields the soft-margin classifier. By introducing a set of Lagrange multipliers α_i and setting the derivatives of the Lagrangian function equal to zero, we obtain the dual problem:

max_α Σ_{i=1}^{l} α_i − (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} α_i α_j y_i y_j K(x_i, x_j), subject to 0 ≤ α_i ≤ C, Σ_{i=1}^{l} α_i y_i = 0,     (2)

where K(x_i, x_j) = ϕ(x_i)·ϕ(x_j), termed the kernel matrix, implicitly maps the input data into the high-dimensional feature space through a kernel function. In this paper, we focus on the RBF kernel:

K(x_i, x_j) = ϕ(x_i)·ϕ(x_j) = exp(−‖x_i − x_j‖^2 / (2σ^2)).     (3)

For this study, the publicly available LIBSVM software with the radial basis function as the kernel is used [9].
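As a minimal sketch, the RBF kernel of equation (3) can be computed directly (LIBSVM evaluates this internally; the example vectors are arbitrary):

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """RBF kernel: exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))

print(rbf_kernel([1.0, 0.0], [1.0, 0.0]))  # 1.0 for identical inputs
print(rbf_kernel([1.0, 0.0], [0.0, 0.0]))  # exp(-0.5) ≈ 0.6065
```

The bandwidth σ controls how quickly the similarity decays with distance; identical inputs always have kernel value 1.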

MLP
The MLP is a feed-forward artificial neural network model. The network consists of multiple layers of nodes, with each layer fully connected to the next one. Each node, except for the input nodes, is a neuron with a nonlinear activation function. MLPs use back-propagation for training the network.
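A toy forward pass illustrates this structure; the weights below are arbitrary placeholders, since in practice back-propagation tunes them on the training feature vectors:

```python
import math

def sigmoid(z):
    """Nonlinear activation used by every non-input neuron here."""
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forward(x, weights):
    """Forward pass through fully connected layers with sigmoid activations."""
    a = x
    for W, b in weights:  # one (weight matrix, bias vector) per layer
        a = [sigmoid(sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i)
             for row, b_i in zip(W, b)]
    return a

# Toy network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
weights = [([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
           ([[1.0, 1.0]], [-1.0])]
out = mlp_forward([0.5, 0.2], weights)
print(out)  # one output activation, here ≈ 0.5
```

Training repeatedly runs this forward pass, compares the output with the class label, and propagates the error backwards to adjust each weight.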

SMOTE
Using an imbalanced dataset usually yields a biased classifier, so that the accuracy on the majority class is higher than on the minority class. Many methods have been proposed to solve this problem. One well-known method used to balance the class distribution is SMOTE [10,11].
SMOTE creates synthetic data in order to over-sample the minority class. In this method, the k-nearest neighbors of each instance in the minority class are considered, and some instances are randomly selected from them according to the over-sampling rate. The determination of the k-nearest neighbors is based on Euclidean distance. If a_i is an instance from the minority class and â_i is one of its k-nearest neighbors, the synthetic sample added to the minority class is obtained using the following relation:

s_i = a_i + β (â_i − a_i),     (4)

where β is a random number in (0, 1).
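Relation (4) can be sketched directly: the function below produces one synthetic sample on the line segment between a minority instance and one of its neighbors (the neighbor search itself is omitted, and the example vectors are arbitrary):

```python
import random

def smote_sample(a, a_hat, rng=random.Random(0)):
    """One synthetic minority sample: s = a + beta * (a_hat - a),
    with beta drawn uniformly from (0, 1)."""
    beta = rng.random()
    return [ai + beta * (hi - ai) for ai, hi in zip(a, a_hat)]

a = [1.0, 2.0]       # minority-class instance
a_hat = [3.0, 4.0]   # one of its k-nearest minority neighbors
s = smote_sample(a, a_hat)
# s lies between a and a_hat, component-wise
print(all(min(x, y) <= si <= max(x, y) for x, y, si in zip(a, a_hat, s)))
```

Repeating this for randomly chosen neighbors until the desired over-sampling rate is reached balances the class distribution before training.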

Dataset
The datasets in this study were obtained from the IEDB (www.immuneepitope.org) for HLA-DRB1*0301, an MHC class II molecule. In order to eliminate redundant sequences, three approaches were applied separately to binders and nonbinders, so that the estimated performance of the prediction methods is more realistic. Among these datasets, UPDS contains the unique peptides. SRDS1 was obtained from the UPDS dataset by applying a similarity reduction approach ensuring that no two peptides share a common 9-mer subsequence. SRDS2 was obtained by filtering the binders and nonbinders in SRDS1 so that the sequence identity between any pair of peptides is under 80%. Another dataset used in this study, SRDS3, was extracted from UPDS by applying the similarity reduction method proposed by Raghava [12]. The pseudo amino acid compositions for these peptides were then calculated using the PseAAC server. For this study, type 1 PseAAC, also called the parallel correlation type, with λ = 1 and weight factor = 0.05, is applied.
The numbers of binders and nonbinders in the datasets are shown in Table 2.

Result
Considering the datasets described above, 5-fold cross validation is used to examine the efficiency of the predictor. In 5-fold cross validation, the dataset is randomly divided into 5 subsets of equal size. Each time, 4 subsets are used for training and 1 subset is used for testing, so training and testing are performed 5 times. Finally, the average performance is calculated using the definition of accuracy:

Accuracy = (TP + TN) / (TP + TN + FP + FN).     (5)
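The 5-fold split and the standard accuracy computation can be sketched as follows (sample counts here are illustrative only, not results from the paper):

```python
import random

def five_fold_indices(n, seed=0):
    """Randomly partition n sample indices into 5 (nearly) equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::5] for i in range(5)]

def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

folds = five_fold_indices(100)
print([len(f) for f in folds])  # five folds of 20 samples each
print(accuracy(40, 45, 10, 5))  # 0.85
```

Each fold serves as the test set exactly once, and the five fold-level accuracies are averaged to give the reported performance.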

Figure 1. Block diagram of the proposed approach.