Hilbert Huang transform for predicting proteins subcellular location

Apoptosis proteins have a central role in the development and homeostasis of an organism. These proteins are very important for understanding the mechanism of programmed cell death, and their function is related to their types. The apoptosis proteins are categorized into the following four types: (1) cytoplasmic proteins; (2) plasma membrane-bound proteins; (3) mitochondrial inner and outer proteins; (4) other proteins. A novel method, the Hilbert-Huang transform, is applied together with a support vector machine to predict the type of a given apoptosis protein. High success rates were obtained in the re-substitution test (98/98 = 100%) and the jackknife test (91/98 = 92.9%).


Feng Shi, Qiu-Jian Chen & Na-na Li
School of Science, Huazhong Agricultural University, Wuhan, Hubei, China. Correspondence should be addressed to Feng Shi (shifeng@mail.hzau.edu.cn).

1. INTRODUCTION
Apoptosis, or programmed cell death, is a fundamental process controlling normal tissue homeostasis by regulating a balance between cell proliferation and death [1]. This process entails the autolytic degradation of cellular components, is characterized by blebbing of cell membranes, shrinkage of cell volumes, and condensation of nuclei [2], and is currently an area of intense investigation. Cell death and renewal are responsible for maintaining the proper turnover of cells, which ensures a constant controlled flux of fresh cells. Programmed cell death and cell proliferation are tightly coupled. When apoptosis malfunctions, a variety of formidable diseases can ensue: blocking apoptosis is associated with cancer and autoimmune disease, whereas unwanted apoptosis can possibly lead to ischemic damage or neurodegenerative disease [3]. Apoptosis is considered to have a key role in these devastating diseases and, in principle, provides many targets for therapeutic intervention [4]. To understand the [...] promising, indicating that the subcellular location of apoptosis proteins is predictable to a considerably accurate extent if a good vector representation of the protein can be established. It is expected that, with continuous improvement of vector representation methods by incorporating amino acid properties, and by using more powerful mathematical methods, some theoretical prediction method might eventually become a useful tool in this area, because the function of an apoptosis protein is closely related to its subcellular location. The present study was initiated in an attempt to address this problem.

Chou and Elrod carried out extensive research on predicting subcellular location, mainly based on the amino acid composition. Subsequently, in order to take into account the sequence-order effects and improve the prediction quality, Chou further incorporated the quasi-sequence-order effect [5] and introduced the concept of "pseudo-amino-acid composition" [9]. For example, Chou [10] classified membrane proteins into five different types and proposed a covariant discriminant algorithm to predict the types of membrane proteins. Recently, Cai et al. [11] applied a neural network to this problem. To improve the prediction quality, Chou [5] proposed a new method in which the covariant discriminant algorithm was augmented to incorporate the quasi-sequence-order effect. This method uses the amino acid composition and the sequence-order-coupling numbers (reflecting the sequence-order effect) to improve the prediction quality. Feng [12] proposed a new representation, the unified attribute vector, in which each protein is represented by a 20-D vector in Hilbert space with unified [...], so that all proteins have their representative points on the surface of the 20-D globe. The representative points of proteins in the same family, or with higher sequence identity, are closer on the surface. With this simple modification of the usage of the amino acid composition, the overall predictive accuracy could be improved by 3% to 5% for different databases [12]. Recently, a series of new powerful approaches have been developed by Chou and his co-workers [13]. Encouraged by the great successes of the previous investigators in the area, here we would like to use a different strategy, the support vector machine, to approach this very important but also very difficult problem, in the hope that our approach can play a complementary role to the existing methods.

2. HILBERT HUANG TRANSFORM
The HHT consists of two parts: empirical mode decomposition (EMD) and Hilbert spectral analysis (HSA). This method is potentially viable for nonlinear and nonstationary data analysis, especially for time-frequency-energy representations. It has been tested and validated exhaustively, but only empirically. In all the cases studied, the HHT gave results much sharper than those from any of the traditional analysis methods in time-frequency-energy representations. Additionally, the HHT revealed true physical meanings in many of the data examined. Powerful as it is, the method is entirely empirical. In order to make the method more robust and rigorous, many outstanding mathematical problems related to the HHT method need to be resolved. In this section, a brief introduction to the methodology of the HHT is given. Readers interested in the complete details should consult [14].

2.1. The empirical mode decomposition method (the sifting process)
In this method any time series, including nonlinear and nonstationary series, can be decomposed into a finite number of intrinsic mode functions (IMFs) through the empirical mode decomposition (EMD) process. An IMF is a function that must satisfy two conditions: (1) the numbers of extrema and zero-crossings differ by at most one; and (2) the mean of the upper envelope (linked by local maxima) and the lower envelope (linked by local minima) is zero at every point.

The EMD process is as follows. According to the Hilbert-Huang transform (HHT) [14], once the extrema of a time series $x(t)$ are identified, all the local maxima and all the local minima are connected by two special lines to form the upper and lower envelopes, respectively. Their mean is designated as $m_1$, and the difference between $x(t)$ and $m_1$ is

$$h_1 = x(t) - m_1.$$

If $h_1$ is not an IMF, $h_1$ is treated as the data and undergoes the procedure above, giving

$$h_{11} = h_1 - m_{11}.$$

This sifting procedure is repeated $k$ times until $h_{1k}$ is an IMF, that is

$$h_{1k} = h_{1(k-1)} - m_{1k},$$

and thus the first IMF component is obtained, i.e.

$$\mathrm{IMF}_1 = h_{1k}.$$

Then $\mathrm{IMF}_1$ is separated from the original time series by

$$r_1 = x(t) - \mathrm{IMF}_1.$$

The residue $r_1$ is treated as the new data and subjected to the same sifting process. This procedure is repeated on all the subsequent $r_j$, i.e.

$$r_1 - \mathrm{IMF}_2 = r_2,\quad r_2 - \mathrm{IMF}_3 = r_3,\quad \ldots,\quad r_{n-1} - \mathrm{IMF}_n = r_n.$$

So the result is:

$$x(t) = \sum_{j=1}^{n} \mathrm{IMF}_j + r_n.$$

2.2. Hilbert transform
Having obtained the intrinsic mode function components (denoted as $c_j$), one will have no difficulty in applying the Hilbert transform to each IMF component,

$$\mathcal{H}[c_j(t)] = \frac{1}{\pi}\,\mathrm{PV}\!\int_{-\infty}^{\infty} \frac{c_j(\tau)}{t-\tau}\,d\tau,$$

in which PV indicates the principal value of the singular integral. With the Hilbert transform, the analytic signal is defined as

$$z_j(t) = c_j(t) + i\,\mathcal{H}[c_j(t)] = a_j(t)\,e^{i\theta_j(t)}.$$

Here, $a_j(t)$ is the instantaneous amplitude and $\theta_j(t)$ is the phase function,

$$a_j(t) = \sqrt{c_j^2(t) + \mathcal{H}[c_j(t)]^2},\qquad \theta_j(t) = \arctan\frac{\mathcal{H}[c_j(t)]}{c_j(t)},$$

and the instantaneous frequency is simply

$$\omega_j(t) = \frac{d\theta_j(t)}{dt}.$$

With the Hilbert spectrum $H(\omega, t)$ defined, we can also define the marginal spectrum $h(\omega)$ as

$$h(\omega) = \int_0^T H(\omega, t)\,dt,$$

where $T$ is the length of the data span. The marginal spectrum offers a measure of the total amplitude (or energy) contribution from each frequency value. This spectrum represents the accumulated amplitude over the entire data span in a probabilistic sense.

The combination of the empirical mode decomposition and the Hilbert spectral analysis is known as the "Hilbert-Huang transform" (HHT) for short. Empirically, all tests indicate that HHT is a superior tool for time-frequency analysis of nonlinear and nonstationary data. It is based on an adaptive basis, and the frequency is defined through the Hilbert transform. Consequently, there is no need for spurious harmonics to represent nonlinear waveform deformations, as in any of the a priori basis methods, and there is no uncertainty principle limitation on time or frequency resolution from the convolution pairs, which are also based on an a priori basis.

A comparative summary of Fourier, wavelet and HHT analyses is given in Table 1. This table shows that the HHT is indeed a powerful method for analyzing data from nonlinear and nonstationary processes: it is based on an adaptive basis; the frequency is derived by differentiation rather than convolution and is therefore not limited by the uncertainty principle; and it is applicable to nonlinear and nonstationary data and presents the results in time-frequency-energy space for feature extraction.

The support vector machine (SVM) is a type of learning machine based on statistical learning theory. A complete description of the theory of SVMs for pattern recognition is given in Vapnik's book [15]. SVMs have been used in a range of bioinformatics problems, including protein fold recognition [16], protein-protein interaction prediction [17], prediction of protein subcellular location [17,18], protein secondary structure prediction, T-cell epitope prediction, and classification of protein quaternary structure [19].

In this paper, we apply Vapnik's support vector machine to predict the types of apoptosis proteins. We have used OSU_SVM, a Matlab SVM toolbox (http://www.ece.osu.edu/~maj/osu_svm), which is an implementation of SVM for pattern recognition problems.

3. TRAINING AND PREDICTION
According to their subcellular location [12], apoptosis proteins are classified into the following four types: (1) type I: cytoplasmic proteins; (2) type II: plasma membrane-bound proteins; (3) type III: mitochondrial inner and outer proteins; (4) type IV: other proteins (see Table 2).

In this research, we first translate every amino acid sequence s into a numerical sequence f using the hydrophobicity index, and then decompose it into a finite number of intrinsic mode functions (IMFs) through the empirical mode decomposition (EMD) process. We select only the 2nd to 4th components (IMF2, IMF3, IMF4), because the first IMF reflects only the random component and the last reflects only the trend component of the numerical sequence f. Applying the Hilbert transform to each selected IMF component, we obtain the instantaneous amplitude $a_i(t)$, from which the energy value $e_i$ ($i = 2, 3, 4$) is computed. Next, the energy ratio of each component is obtained. Finally, every protein is represented as a point or a vector in a 23-D space. The first 20 components of the vector are the occurrence frequencies of the 20 amino acids in the protein concerned, and the last three components are its energy ratios multiplied by a weight; here the weight is set to 0.2.

The computations were carried out on a PC. For the SVM, the width of the Gaussian RBFs is selected as that which minimizes an estimate of the VC-dimension. After training, the hyperplane output by the SVM is obtained. The SVM method applies to two-class problems. In this paper, for the four-class problem, we have used a simple and effective method, the "one-against-others" method [16], to convert it into two-class problems. We first test the self-consistency and the leave-one-out cross-validation (jackknife test) of the method, followed by testing the method by prediction of an independent dataset. As a result, the rates of self-consistency and cross-validation of the prediction were quite high.

In addition to the prediction algorithm, we also need to construct a training data set to complete the establishment of a statistical prediction method. To realize this, based on the SWISS-PROT data bank, 98 apoptosis proteins (the data were taken from Zhou [7]) were classified into the following four subcellular locations: (1) cytoplasmic, (2) plasma membrane-bound, (3) mitochondrial, and (4) other (Table 2).

4. RESULTS AND DISCUSSION
By means of the SVM algorithm described in the last section, a statistical prediction was performed for the 98 apoptosis proteins listed in Table 2. The prediction was conducted by two different approaches, the re-substitution test and the jackknife test. The results are given in the results table.

4.1. Re-substitution test
The so-called re-substitution test is an examination of the self-consistency of a prediction method [7]. When the re-substitution test was performed for the current study, the type of each apoptosis protein in a data set was in turn identified using the rule parameters derived from the same data set, the so-called training data set. As shown in the results table, the overall success rate thus obtained for the 98 apoptosis proteins in Table 2 was 100%, indicating an excellent self-consistency.

However, during the process of the re-substitution test, the rule parameters derived from the training data set include the information of the query protein later plugged back in the test. This will certainly underestimate the error and enhance the success rate, because the same proteins are used to derive the rule parameters and to test themselves. Nevertheless, the re-substitution test is absolutely necessary because it reflects the self-consistency of a prediction method, especially for its algorithm part. A prediction algorithm certainly cannot be deemed a good one if its self-consistency is poor. In other words, the re-substitution test is necessary but not sufficient for evaluating a prediction method. As a complement, a cross-validation test on an independent testing data set is needed because it can reflect the effectiveness of a prediction method in practical application. This is especially important for checking the validity of a training data set, that is, whether it contains sufficient information to reflect all the important features concerned so as to yield a high success rate in application.

4.2. Jackknife test
As is well known, the independent data set test, the subsampling test, and the jackknife test are the three methods often used for cross-validation in statistical prediction. Among these three, however, the jackknife test is deemed the most effective and objective one (see [...] for a comprehensive discussion about this). During jackknifing, each protein in the data set is in turn singled out as a tested protein and all the rule parameters are calculated based on the remaining proteins. In other words, the subcellular location of each apoptosis protein is identified by the rule parameters derived using all the other apoptosis proteins except the one that is being identified.

During the process of [...]
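As a rough illustration of the 23-D representation from Section 3 and the jackknife test of Section 4.2, the following Python sketch builds a feature vector from a sequence and runs a leave-one-out evaluation. It is a minimal sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, the three energy ratios for IMF2 to IMF4 are taken as precomputed inputs (the EMD and Hilbert steps are not reproduced), and a simple nearest-centroid rule stands in for the SVM trained with OSU_SVM.

```python
from collections import Counter, defaultdict
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
WEIGHT = 0.2  # weight applied to the energy ratios, as stated in Section 3


def feature_vector(sequence, energy_ratios):
    """Return a 23-D vector: 20 amino acid frequencies + 3 weighted energy ratios.

    Assumption: energy_ratios holds the ratios for IMF2, IMF3 and IMF4,
    computed elsewhere (the EMD/Hilbert step is not part of this sketch).
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    if len(energy_ratios) != 3:
        raise ValueError("expected three energy ratios (IMF2, IMF3, IMF4)")
    counts = Counter(sequence)
    composition = [counts[aa] / len(sequence) for aa in AMINO_ACIDS]
    return composition + [WEIGHT * r for r in energy_ratios]


def nearest_centroid_predict(train_x, train_y, query):
    """Stand-in classifier (NOT an SVM): label of the closest class centroid."""
    groups = defaultdict(list)
    for x, y in zip(train_x, train_y):
        groups[y].append(x)
    best_label, best_dist = None, math.inf
    for label, rows in groups.items():
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = math.dist(centroid, query)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def jackknife_accuracy(xs, ys, predict=nearest_centroid_predict):
    """Leave each sample out in turn, train on the rest, and test on it."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)
```

Any classifier with the same `(train_x, train_y, query)` signature, for instance a one-against-others SVM, could be passed as `predict`. The returned value is the fraction of left-out proteins whose type is recovered, which is how a rate such as 91/98 = 92.9% is obtained.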

Table 1.
Comparative summary of Fourier, wavelet and HHT analyses.

Table 2.
List of the accession numbers for the 98 apoptosis proteins classified into four categories according to their subcellular locations (Type I: 43 cytoplasmic proteins; Type II: 30 plasma membrane-bound proteins; Type III: 13 mitochondrial inner and outer proteins; Type IV: 12 other proteins).
a. Derived from the SWISS-PROT data bank [7].
b. Of the 12 other apoptosis proteins, five are located in nucleus, two in endoplasmic reticulum, one in microtubule, and one in lysosome [7].