Prediction of hydrophobic regions effectively in transmembrane proteins using digital filter

The hydrophobic effect is the major factor that drives a protein molecule towards folding and contributes to a great degree to the stability of protein structures. Knowledge of hydrophobic regions and their prediction is therefore of great help in understanding the structure and function of a protein. Determination of membrane buried regions is a computationally intensive task in bioinformatics. Several prediction methods have been reported, but they have deficiencies in prediction accuracy and adaptability. Proteins that are found embedded in cellular membranes, called membrane proteins, are of particular importance because they form targets for over 60% of drugs on the market, and 20%-30% of all the proteins in any organism are membrane proteins. Transmembrane proteins thus play an important role in the life activity of cells, and prediction of membrane buried segments in them is of particular importance. In this paper we propose signal processing algorithms based on a digital filter for prediction of hydrophobic regions in transmembrane proteins and find improved prediction efficiency compared with existing methods. Hydrophobic regions are extracted by assigning a physico-chemical parameter, such as hydrophobicity or the hydration energy index, to each amino acid residue and subjecting the resulting numerical representation of the protein to a digital low pass filter. The proposed method is validated on transmembrane proteins from the Orientation of Proteins in Membranes (OPM) dataset using various prediction measures and is found to give better prediction accuracy than the existing methods.


INTRODUCTION
Proteins are important functional molecules in living organisms. Every protein assumes a specific shape and performs a specific function. A key characteristic of a protein is the three-dimensional structure into which its linear chain folds, which is referred to as the tertiary structure. This structure gives rise to the electrochemical interaction domains of the protein and gives it the ability to interact with other proteins and ligands to carry out specific functions [1]. Among these, proteins that are found embedded in cellular membranes, called membrane proteins, are of particular importance because they form targets for over 60% of drugs on the market. 20%-30% of all the proteins in any organism are membrane proteins [2]. Knowledge of the segments of transmembrane proteins, the bends in helices and the membrane buried regions helps in the study of tertiary structure. Understanding the structure of a protein helps in understanding the role played by that protein.
It is widely known that the amino acid sequences of proteins carry all the information needed to form their three-dimensional structures [3]. Thus, protein structure can theoretically be predicted from amino acid sequences alone. Structural or biological information such as secondary structure [4,5], kinks [6,7] and hydrophobic regions [8] is derived by assigning a physico-chemical index to an amino acid sequence. Knowledge of hydrophobic regions and their prediction is of great help in understanding the structure and function of the protein.
Several approaches have been taken to predict these regions. The transmembrane structure and the membrane buried region are shown in Figure 1.
In the sliding window averaging technique, the physico-chemical values of the residues inside the window are summed and the result is assigned to the middle position of the window. The window then slides across the sequence [9]. However, the segments found in this way do not always correspond to the structure. As extracting structural information from amino acid sequences alone is difficult, various prediction methods have been developed using evolutionary information and neural networks [10].
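The sliding window averaging described above can be illustrated with a minimal sketch. The function name, the use of the mean rather than the raw sum, and the truncation of the window at the sequence ends are assumptions made for this illustration and are not taken from [9].

```python
def sliding_window_average(values, window=9):
    """Smooth a per-residue profile with a centred moving average.

    Near the ends of the sequence the window is truncated, so every
    residue still receives a value (an assumption of this sketch).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - half)
        stop = min(len(values), i + half + 1)
        frame = values[start:stop]
        # The averaged value is assigned to the middle position i.
        smoothed.append(sum(frame) / len(frame))
    return smoothed


profile = [1.0, 2.0, 3.0, 4.0, 5.0]
print(sliding_window_average(profile, window=3))
```

A wider window gives a smoother profile at the cost of blurring the boundaries of short segments, which is the trade-off discussed for the window size.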
The span of the window was investigated in [11], where it was found to reflect the interior and exterior portions of proteins. Hydrophobicity profiles obtained with the shortest window sizes were noisy, and sizes of less than seven residues produced unsatisfactory results. On the other hand, long spans tended to lose structural segments. The best compromise between the hydrophobic region and the position of amino acid residues was obtained with a window size of nine for globular proteins. The problem with this technique is the noise that remains in the smoothed profiles, which makes it particularly difficult to find segments in the case of globular proteins.
Recently, signal processing methods have come to play a major role in predicting hydrophobic regions. Fourier analysis has been applied to predict secondary structures from a sequential dotted hydrophobicity index [12]. The utility of the Fourier transform lies in its ability to analyze a signal in the time domain for its frequency content. The transform works by first translating a function in the time domain into a function in the frequency domain. The signal can then be analyzed for its frequency content because the Fourier coefficients of the transformed function represent the contribution of each sine and cosine function at each frequency. Although Fourier analysis is useful for acquiring structural information, this method tends to cause positional error.
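As an illustration of how Fourier coefficients expose the frequency content of a numerical sequence, the following sketch computes a direct discrete Fourier transform in plain Python. The test signal, with a period of four samples, is a made-up example and not protein data; the function name is hypothetical.

```python
import cmath


def dft_magnitudes(signal):
    """Return |X[k]| for k = 0..N-1 using the direct DFT definition."""
    n = len(signal)
    magnitudes = []
    for k in range(n):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        magnitudes.append(abs(coeff))
    return magnitudes


# Synthetic signal that repeats every 4 samples (16 samples in total).
signal = [1.0, 0.0, -1.0, 0.0] * 4
mags = dft_magnitudes(signal)

# Search the non-redundant half of the spectrum for the dominant bin.
half = mags[: len(signal) // 2 + 1]
peak_bin = max(range(len(half)), key=half.__getitem__)
print(peak_bin, len(signal) / peak_bin)  # dominant bin and its period in samples
```

The spectrum tells us which periodicity is present but not where along the sequence it occurs, which is consistent with the positional error noted above.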
Wavelets are mathematical functions that divide data into different frequency components. This approach has advantages over traditional Fourier methods in analyzing data where the signal contains discontinuities or high frequency noise. The recent use of the wavelet transform, both continuous and discrete, in the bioinformatics field is promising [13]. The Continuous Wavelet Transform (CWT) allows a one-dimensional signal to be viewed in a more discriminative two-dimensional time-scale representation. The CWT is calculated by continuously shifting a continuously scalable wavelet over the signal. In the discrete wavelet transform (DWT), a subset of scales and positions is chosen, at which the correlation between the signal and the shifted and dilated waveforms is calculated. Consequently, the signal is decomposed into several groups of coefficients, each containing signal features corresponding to a group of frequencies. Small scales refer to compressed wavelets, characterized by rapid variations and appropriate for extracting high frequency features of the signal. An important attribute of wavelet methods is that, due to the limited duration of every wavelet, local variations of the signal are better extracted and information on the location of these local features is retained in the constituent waveforms. The DWT has been applied to hydrophobicity signals in order to predict hydrophobic cores in proteins [14,15]. Protein sequence similarity has also been studied using the DWT of a signal associated with the average energy states of all valence electrons of each amino acid [16]. The wavelet transform has been applied to transmembrane structure prediction [17]. The DWT has been used to decompose the amino acid sequences of TM proteins into a series of structures in different layers, and the location of TMHs is then predicted from the information in the amino acid sequence at different scales [18]. A method based on the discrete wavelet transform has been developed to predict the number and location of TMHs in membrane proteins [19].
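To make the idea of decomposing a signal into coefficient groups concrete, here is a minimal single-level DWT sketch. The Haar wavelet is chosen purely for simplicity; it is an assumption of this example, and the cited works [14-19] may use other wavelets and several decomposition levels.

```python
import math


def haar_dwt(signal):
    """One level of the Haar DWT.

    Returns (approximation, detail): the approximation holds the low
    frequency content and the detail holds the high frequency content.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    return approx, detail


approx, detail = haar_dwt([4.0, 6.0, 10.0, 12.0])
print(approx)
print(detail)
```

Each coefficient is tied to a specific pair of input positions, which is how location information is retained alongside the frequency split.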
The existing methods have their limitations in terms of accuracy. As mentioned above, numerous attempts have been made by researchers to define the relation between interior and exterior regions directly from the amino acid sequence. However, it is difficult to separate interior and exterior positions of amino acid residues by assigning a hydrophobicity threshold, because of the unacceptable noise level. Hence there is a need to develop advanced algorithms for faster and more accurate prediction of hydrophobic regions. This motivates the development of a novel approach based on digital filtering to effectively predict these regions in transmembrane α-helices.
The rest of the paper is organised as follows. Section 2 deals with the proposed method for prediction of hydrophobic regions in transmembrane α-helices, focusing on the development of signal processing algorithms based on a digital filter. Section 3 discusses the simulation results of the proposed method on a standard data set in terms of prediction measures. Section 4 presents the conclusions of this paper.

PROPOSED METHOD FOR PREDICTION OF HYDROPHOBIC REGIONS
The signal processing approach plays a major role in predicting membrane buried regions in amino acid sequences and in separating variations in the signal from background noise. In this paper the membrane buried regions in transmembrane proteins are determined using a digital filter. Methods for hydrophobic region prediction using only amino acid sequences have been reported previously [20,21], and it was shown that hydrophobicity tended to be low in the loop region. These studies suggested that hydrophobic residues were buried in the core of the protein and could be predicted using a hydrophobicity index. The minimal hydrophobicity profile corresponded to the loop region, and the turn region could be predicted effectively. Highly hydrophobic regions tended to form an α-helix. Thus a hydrophobicity plot based on a hydrophobicity index is useful for predicting hydrophobic regions in amino acid sequences. We made use of a digital filter to extract the low frequencies and detected the hydrophobic regions more effectively.
In this section a technique for identification of hydrophobic regions in transmembrane helices using a digital filter is described. Since hydrophobic regions are observed at low frequencies, a suitable digital low pass filter can pass the low frequency components, and thus the hydrophobic regions can be extracted. A widely used family of low pass filters is the set of Butterworth filters [22,23]. Butterworth filters are characterized by a magnitude response that is maximally flat in the passband and monotonic overall. A low pass Butterworth filter of order n has the following magnitude response.
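For reference, the standard textbook form of the magnitude response of an order-n low pass Butterworth filter, written in terms of the frequency F and the 3-dB cutoff frequency F_c, is given below. The notation A(F) for the magnitude response is an assumption of this sketch; [22,23] may use a different symbol.

```latex
A(F) = \left| H(F) \right| = \frac{1}{\sqrt{1 + \left( F / F_c \right)^{2n}}}
```

At F = F_c the expression evaluates to 1/\sqrt{2} for every order n, which is why F_c is referred to as the 3-dB cutoff frequency.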
The cutoff frequency ω_n is the frequency at which the magnitude response of the filter is $1/\sqrt{2}$. For the butter filter design function, the normalized cutoff frequency ω_n must be a number between 0 and 1, where 1 corresponds to the Nyquist frequency, π radians per sample. A Butterworth low pass filter of order 2 with cutoff frequency ω_n = 0.4 was found to select the hydrophobic regions of the protein sequences effectively. The pole-zero plot and frequency response of the second order low pass Butterworth filter are shown in Figures 2 and 3 respectively. The filter passes the low frequency components of a signal and attenuates its high frequency components.
The Butterworth low pass filter gives rise to patterns that are distinct between the interior and exterior locations. To analyze a protein sequence for prediction of hydrophobic regions, it is first transformed into a numerical signal based on the hydrophobicity indices of the residues along the sequence. The numerical hydrophobicity indices of the 20 amino acid residues, obtained from Hyperchempro 8.0 software of HyperCube Inc., USA (Table 1), are assigned to the protein sequence. The resulting numerical sequence is passed through the proposed low pass filter and plotted. The peaks observed in the plot indicate the hydrophobic regions.
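The cutoff definition can be checked numerically. The sketch below derives second order digital Butterworth coefficients through the bilinear transform with frequency pre-warping, which is a common design route and is assumed here rather than taken from the paper, and then evaluates the magnitude response at the normalized cutoff 0.4. The function names are hypothetical.

```python
import cmath
import math


def butter2_lowpass(wn):
    """Second order Butterworth low pass coefficients (b, a).

    wn is the normalized cutoff in (0, 1), where 1 is the Nyquist frequency.
    """
    if not 0.0 < wn < 1.0:
        raise ValueError("wn must lie strictly between 0 and 1")
    k = math.tan(math.pi * wn / 2.0)  # pre-warped analogue cutoff
    norm = 1.0 / (1.0 + math.sqrt(2.0) * k + k * k)
    b0 = k * k * norm
    b = [b0, 2.0 * b0, b0]
    a = [1.0, 2.0 * (k * k - 1.0) * norm, (1.0 - math.sqrt(2.0) * k + k * k) * norm]
    return b, a


def magnitude(b, a, omega):
    """Magnitude of H(e^{j omega}) for omega in radians per sample."""
    z1 = cmath.exp(-1j * omega)  # this is z^-1
    num = b[0] + b[1] * z1 + b[2] * z1 * z1
    den = a[0] + a[1] * z1 + a[2] * z1 * z1
    return abs(num / den)


b, a = butter2_lowpass(0.4)
gain_dc = magnitude(b, a, 0.0)
gain_cutoff = magnitude(b, a, 0.4 * math.pi)
print(b, a)
print(gain_dc, gain_cutoff)
```

The gain at zero frequency is one and the gain at the cutoff is 1/√2, in line with the definition above.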
The filter output y[n] is plotted for observation of peaks at low frequency regions.
A step-by-step procedure of the proposed method for prediction of hydrophobic regions is as follows:
1) Convert the protein sequence into a numerical sequence using the hydrophobicity index of each amino acid residue;
2) Pass the resulting numerical sequence through the proposed low pass filter, which selects the low frequencies;
3) Plot the magnitude response and determine the threshold so as to observe the peaks where the low frequency regions are dominant;
4) Locate the hydrophobic regions by locating the energy peaks in the filtered signal.
The sliding window averaging technique, frequency domain analysis and wavelet analysis are then compared with the proposed method on the basis of the corresponding results.
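The four steps can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the Kyte-Doolittle hydropathy scale stands in for the Hyperchempro indices of Table 1, which are not reproduced here; the filter coefficients come from a bilinear-transform design; the threshold of 0.0 and all function names are hypothetical.

```python
import math

# Kyte-Doolittle hydropathy values, used only as a stand-in for Table 1.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def butter2_lowpass(wn):
    """Second order Butterworth low pass (b, a); wn in (0, 1), 1 = Nyquist."""
    k = math.tan(math.pi * wn / 2.0)
    norm = 1.0 / (1.0 + math.sqrt(2.0) * k + k * k)
    b0 = k * k * norm
    return ([b0, 2.0 * b0, b0],
            [1.0, 2.0 * (k * k - 1.0) * norm, (1.0 - math.sqrt(2.0) * k + k * k) * norm])


def lowpass(x, b, a):
    """Apply the causal difference equation to obtain the output y[n]."""
    y = []
    for n in range(len(x)):
        acc = b[0] * x[n]
        if n >= 1:
            acc += b[1] * x[n - 1] - a[1] * y[n - 1]
        if n >= 2:
            acc += b[2] * x[n - 2] - a[2] * y[n - 2]
        y.append(acc)
    return y


def regions_above(y, threshold):
    """Return half-open (start, end) index ranges where y exceeds threshold."""
    regions, start = [], None
    for i, value in enumerate(y):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(y)))
    return regions


def predict_hydrophobic_regions(sequence, wn=0.4, threshold=0.0):
    signal = [KYTE_DOOLITTLE[res] for res in sequence]  # step 1
    b, a = butter2_lowpass(wn)
    filtered = lowpass(signal, b, a)                     # step 2
    return regions_above(filtered, threshold), filtered  # steps 3 and 4


toy_sequence = "D" * 10 + "L" * 20 + "D" * 10
regions, filtered = predict_hydrophobic_regions(toy_sequence)
print(regions)
```

On this toy sequence the hydrophobic stretch of leucines is recovered as a single region. Because the filter is causal, the detected boundaries lag slightly behind the true ones; a zero-phase (forward and backward) pass is one way to remove that lag if it matters.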

Hydration Energy Index
In this section, we discuss a novel numerical representation of the protein sequence, generated from a physico-chemical property of amino acid residues called hydration energy, to detect the membrane buried regions in transmembrane proteins. Various physico-chemical properties, namely hydration energy, dipole moment, electron ion interaction pseudopotential (EIIP), polarizability, refractivity, molar surface area and molar volume, are frequently used for the quantitative structure activity relationship (QSAR) of molecules. Of these properties, the numerical sequence based on the hydration energy in particular is found to produce sharp peaks at membrane buried regions when used in digital filtering. Hydration energy reflects the hydrophilicity (or hydrophobicity) of molecules and is therefore directly related to the problem at hand. The hydration energy indices of the amino acids are obtained from Hyperchempro 8.0 software of HyperCube Inc., USA (Table 1). The transmembrane protein sequence is first transformed into a numerical signal by assigning hydration energy indices to the residues along the sequence. The resulting numerical sequence is subjected to the proposed low pass filter and plotted. The peaks observed in the plot indicate the membrane buried regions.

RESULT AND DISCUSSION
We have used the proposed digital low pass filter to detect the hydrophobic regions using numerical representations based on the hydrophobicity indices and hydration energy of amino acid residues. The list of transmembrane proteins and their coordinate files were obtained from the Orientation of Proteins in Membranes (OPM) database at the College of Pharmacy, University of Michigan (http://www.phar.umich.edu). Transmembrane proteins from the OPM data sets are used as a benchmark for this purpose. In a good number of cases the proposed method performed well. The performance of the various methods can be analyzed using prediction measures such as accuracy (A), precision (P) and recall (R), which are defined in terms of four parameters: true positive (t_p), false positive (f_p), true negative (t_n) and false negative (f_n) (Table 2). t_p denotes the number of residues that are actually buried and are also predicted to be buried, f_p the number that are actually exposed but are predicted to be buried, t_n the number that are actually exposed and also predicted to be exposed, and f_n the number that are actually buried but are predicted to be exposed.

Accuracy
The accuracy of prediction of hydrophobic regions in an amino acid sequence is defined as the percentage of buried regions correctly predicted out of the total buried and exposed regions present. It is computed as follows:

$$A = \frac{\text{Number of correct buried predictions}}{\text{Total number of buried and exposed}}$$

Precision JBiSE
It is defined as the percentage of buried regions correctly predicted to be of one class out of the total buried regions predicted to be of that class. Precision is computed as:

$$P = \frac{\text{Number of correctly predicted buried}}{\text{Total number of buried predicted}}$$

Recall
It is defined as the percentage of the buried regions belonging to a class that are predicted to be of that class. Recall is computed as:

$$R = \frac{\text{Number of correctly predicted buried}}{\text{Total number of actual buried}}$$

We attempted to predict hydrophobic regions in transmembrane proteins using a digital low pass filter. The sliding window averaging technique and wavelet analysis were then compared with the proposed method on the basis of the corresponding results.
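The prediction measures defined in this section can be computed from per-residue buried/exposed labels as in the sketch below. Precision and recall follow the definitions given in the text. For accuracy, the sketch uses the conventional form (t_p + t_n) divided by the total number of residues, which is an assumption, since the verbal definition in the text can be read differently. The function names and the toy labels are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count (tp, fp, tn, fn), where True means 'buried'."""
    if len(actual) != len(predicted):
        raise ValueError("label lists must have the same length")
    tp = fp = tn = fn = 0
    for is_buried, called_buried in zip(actual, predicted):
        if is_buried and called_buried:
            tp += 1
        elif not is_buried and called_buried:
            fp += 1
        elif not is_buried and not called_buried:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def prediction_measures(actual, predicted):
    """Return (accuracy, precision, recall) as fractions in [0, 1]."""
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0   # conventional form (assumed)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


actual = [True, True, True, False, False, False, True, False]
predicted = [True, True, False, False, False, True, True, False]
print(confusion_counts(actual, predicted))
print(prediction_measures(actual, predicted))
```

Multiplying each returned fraction by 100 gives the percentages used in the text.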
The models as well as the sequences of the transmembrane proteins are obtained from the PDBTM database. When a sequence file in FASTA format is submitted to the TMHMM prediction server, the sections for transmembrane regions (TM), residues buried (rbu) and residues exposed outside (rex) are identified. For the transmembrane protein Bovine Cytochrome BC1 Complex with Stigmatellin bound (PDB Id: 2a06), the proposed method detected all membrane buried regions, as shown in Figure 4. The figure shows the hydrophobicity plot of the original sequence, the first order filter response and the second order filter response. The hydrophobicity profile smoothed by extracting the low frequencies with the second order digital low pass filter shows the buried residues as sharp peaks that remain in the low frequency content. The portions above the threshold indicate the membrane buried regions, which are the embedded part of the transmembrane helix (TMH). In the profile obtained with the sliding window averaging technique, on the other hand, it was hard to find segments corresponding to buried residues. Table 3 summarizes the prediction accuracy for hydrophobic regions obtained by the various methods, namely the sliding window averaging technique, wavelet analysis and the proposed filter method. Table 4 shows the average prediction accuracy for hydrophobic regions across the 88 dataset entries for the various methods. The prediction accuracy of the proposed method is found to be the highest in all tested cases. Thus the proposed method shows improved accuracy in predicting hydrophobic regions.
The profile obtained by assigning a hydrophobicity index to an amino acid sequence is inherently noisy. Various methods have been proposed to eliminate noise from the raw profiles. Although the sliding window averaging technique is in wide use, the precise extent of the hydrophobic region is difficult to determine because of the averaging calculation. We extracted the low frequencies from the raw data using a digital filter and investigated the relationship between the low frequencies of proteins. The efficiency of filtering analysis for interior/exterior prediction indicated the detection of segments related to a

Figure 1. Definition of transmembrane structures. [A;D] is a transmembrane helix (TMH) and [B;C] represents the transmembrane segment (TMS), which is the embedded part of the TMH.
In the magnitude response of the Butterworth filter, F_c is called the 3-dB cutoff frequency of the filter. The magnitude responses of Butterworth filters of different orders are shown in Figure 3. As the order increases, the magnitude response comes closer and closer to the ideal low pass characteristic. The transfer function of the low pass Butterworth filter of order n with radian cutoff frequency ω_c = 2πF_c can be expressed as follows.
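For reference, one standard way of writing the analogue transfer function of an order-n low pass Butterworth filter is the all-pole form below, with the n poles s_k placed on the left half of a circle of radius ω_c. The symbols H_a(s) and s_k are assumptions of this sketch; [22,23] may present an equivalent expression in a different notation.

```latex
H_a(s) = \frac{\omega_c^{\,n}}{\prod_{k=1}^{n} \left( s - s_k \right)},
\qquad
s_k = \omega_c \, e^{\, j \pi (2k + n - 1) / (2n)}, \quad k = 1, \dots, n
```

For n = 2 the poles lie at angles 3\pi/4 and 5\pi/4, which gives H_a(s) = \omega_c^2 / (s^2 + \sqrt{2}\,\omega_c s + \omega_c^2).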

Figure 2. Pole-zero plot of Butterworth low pass filter of order n = 2.
Figure 3.

Figure 4. Hydrophobicity plot of Bovine Cytochrome BC1 with Stigmatellin bound using (a) raw sequence; (b) low pass filter of 1st order; (c) low pass filter of 2nd order. rbu and rex are the residues buried within the TM region and the residues exposed, separated by the threshold.

Table 1. Physico-chemical properties of amino acids.

Table 2. Contingency table for evaluation metrics.