Using Neural Networks to Predict Secondary Structure for Protein Folding

Protein Secondary Structure Prediction (PSSP) is considered one of the major challenges in bioinformatics, and many solutions have been proposed in an effort to achieve more accurate predictions. The goal of this paper is to develop and implement an intelligent system that predicts the secondary structure of a protein from its primary amino acid sequence by using five Neural Network (NN) models. These models are the Feed Forward Neural Network (FNN), Learning Vector Quantization (LVQ), Probabilistic Neural Network (PNN), Convolutional Neural Network (CNN), and CNN Fine Tuning for PSSP. Two datasets are used to evaluate our approaches. The first contains 114 protein samples, and the second contains 1845 protein samples.


Introduction
Bioinformatics involves the technology that uses computers for storage, retrieval, manipulation, and distribution of information related to biological macromolecules such as DNA, RNA, and proteins. The use of computers is absolutely essential in mining genomes for information gathering and knowledge building [1].
Protein structure prediction methods fall under bioinformatics, a broad field that combines many other fields and disciplines such as biology, biochemistry, information technology, statistics, and mathematics [2].
Proteins have four types of structure: primary, secondary, tertiary, and quaternary. The primary structure is a sequence built from 20 different types of amino acids, and it provides the foundation for all the other types of structure. The secondary structure refers to the arrangement of connections within the amino acid groups that forms three structural classes (H, E, and C) [3].
PSSP provides a significant first step toward tertiary structure prediction, and it also offers information about protein activity, relationships, and function.
Protein folding, or the prediction of the tertiary structure from the linear sequence, is an unsolved and ubiquitous problem that invites research from many fields of study, including computer science, molecular biology, biochemistry, and others.
Protein secondary structure is also used in a variety of scientific areas, including proteome and gene annotation. Therefore, PSSP remains an active area of research and an integral part of protein analysis [4].
In this research, we propose five NN models for PSSP: FNN, LVQ, PNN, CNN, and CNN Fine Tuning. The main objective of this work is to improve the prediction accuracy ($Q_3$). The implementation results show that the proposed CNN Fine Tuning model performs better than the other models and looks promising for problems with characteristics similar to PSSP, achieving a prediction accuracy of $Q_3$ = 90.31%.

Related Materials
In this section, we describe the datasets and the measure of prediction accuracy.

Data Set Description
The first dataset is obtained from MATLAB (MathWorks) [5] and from the thesis [3].

Measures of Prediction Accuracy
We use a single measure to evaluate the prediction accuracy of the implemented NN models. The three-state accuracy ($Q_3$) is defined as the percentage of residues that are predicted correctly:

$$Q_3 = \frac{N_H + N_E + N_C}{N_T} \times 100$$

where $N_H$, $N_E$, and $N_C$ are the numbers of correctly predicted residues of type H, E, and C, respectively, and $N_T$ is the total number of residues in the dataset. $Q_3$ is a concise and useful measure for comparing different prediction methods [6].
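The three-state accuracy defined above can be computed directly from two aligned strings of state labels. The following is a minimal sketch in Python; the function name `q3_accuracy` and the toy label strings are hypothetical and are not taken from the datasets used in this paper.

```python
def q3_accuracy(observed: str, predicted: str) -> float:
    """Three-state accuracy: percentage of residues predicted correctly."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    # Count correct predictions per class (N_H, N_E, N_C in the definition).
    correct = {"H": 0, "E": 0, "C": 0}
    for obs, pred in zip(observed, predicted):
        if obs not in correct:
            raise ValueError(f"unexpected state label: {obs!r}")
        if obs == pred:
            correct[obs] += 1
    # N_T is the total number of residues.
    return 100.0 * sum(correct.values()) / len(observed)


# Toy example (hypothetical labels): 8 of 10 residues match.
example_q3 = q3_accuracy("HHHHEEECCC", "HHHCEEECCH")
```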

Methodology
In our work, we use five different neural network structures, including the Feed Forward NN, the Learning Vector Quantization NN, and the Probabilistic NN. For each structure, a sliding window of size 17 moves through the protein sequence, and the output of the network is obtained for the residue in the middle of the window. As a result, the input layer has 17 × 20 = 340 neurons and the output layer has 3 neurons for each NN structure.
During training, the network receives the input vectors along with the expected output vectors. When making predictions, it returns output vectors representing the likelihood of each residue being in class H, E, or C (Figure 1).
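The sliding-window encoding can be sketched as follows. This is an illustrative Python version, not the MATLAB code used in this work: the residue ordering in `AMINO_ACIDS`, the function name `encode_windows`, and the choice to leave window positions beyond the sequence ends as all zeros are assumptions, since the text does not specify them.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (ordering is an assumption)
WINDOW = 17


def encode_windows(sequence: str, window: int = WINDOW) -> np.ndarray:
    """Return one 17 x 20 = 340-dimensional one-hot vector per residue."""
    half = window // 2
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    n_types = len(AMINO_ACIDS)
    features = np.zeros((len(sequence), window * n_types))
    for center in range(len(sequence)):
        for offset in range(-half, half + 1):
            pos = center + offset
            # Assumption: positions outside the sequence stay all-zero.
            if 0 <= pos < len(sequence) and sequence[pos] in index:
                slot = offset + half
                features[center, slot * n_types + index[sequence[pos]]] = 1.0
    return features


# Each row is the input vector for the residue at the window center.
X = encode_windows("ACDEFGHIKLMNPQRSTVWY")
```

Each row of `X` would be paired with the observed state (H, E, or C) of the central residue to form one training example.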

Feed Forward Neural Network
The first NN structure used is the Feed Forward NN, with one input layer, two hidden layers of 10 neurons each, and one output layer, as shown in Figure 2, which illustrates the FNN structure implemented in MATLAB (R2015a).
In an FNN, the processing units in each hidden layer are fully connected to the units in the previous layer but are not connected to units in the same layer. The outputs of each unit are connected only to the units of the next layer; therefore, there is no feedback in the system [7].
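The forward pass of such a network, with layer sizes 340, 10, 10, and 3, can be sketched in Python with NumPy. The tanh hidden activation, the softmax output, and the random initial weights are assumptions made for illustration; the text does not state which transfer functions or initialization the MATLAB model used.

```python
import numpy as np

LAYER_SIZES = [340, 10, 10, 3]  # input, two hidden layers, output


def init_params(sizes, rng):
    """One (weights, bias) pair per pair of consecutive layers."""
    return [
        (rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(x, params):
    """Propagate a batch of inputs layer by layer; no feedback connections."""
    a = x
    for w, b in params[:-1]:
        a = np.tanh(a @ w + b)  # assumed hidden activation
    w, b = params[-1]
    logits = a @ w + b
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)  # assumed softmax output


rng = np.random.default_rng(0)
params = init_params(LAYER_SIZES, rng)
probs = forward(np.zeros((5, 340)), params)  # 5 dummy input vectors
```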

Probabilistic Neural Networks (PNN)
The second NN structure used is the Probabilistic NN (PNN). A PNN is an implementation of a statistical algorithm called kernel discriminant analysis, in which the operations are organized into a multilayered feed-forward network with four layers: an input layer, a pattern layer, a summation layer, and an output layer [8], as shown in Figure 3.
It is usually much faster to train a PNN than a multilayer perceptron network. One disadvantage of PNN models compared with other networks, however, is the large number of neurons in the hidden (pattern) layer, because there is one neuron for each training case [9]. Our implemented PNN structure includes an input layer with 340 neurons, a pattern layer with 14,151 neurons (one neuron for each amino acid residue), and three neurons in each of the summation and output layers, as shown in Figure 4, which illustrates the PNN structure implemented in MATLAB (R2015a).
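The four-layer organization of a PNN can be sketched as a Gaussian-kernel classifier. The Python code below is a minimal illustration and not the MATLAB implementation: the function name `pnn_predict`, the smoothing parameter `sigma`, and the toy data are hypothetical, and the sketch assumes every class appears at least once in the training set.

```python
import numpy as np


def pnn_predict(train_x, train_y, query_x, n_classes=3, sigma=1.0):
    """Classify query vectors with a probabilistic neural network."""
    # Pattern layer: one Gaussian unit per training vector.
    sq_dist = ((query_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    activations = np.exp(-sq_dist / (2.0 * sigma ** 2))
    # Summation layer: average the pattern activations of each class.
    scores = np.stack(
        [activations[:, train_y == c].mean(axis=1) for c in range(n_classes)],
        axis=1,
    )
    # Output layer: choose the class with the largest summed activation.
    return scores.argmax(axis=1)


# Toy data (hypothetical): three well-separated 2-D clusters.
train_x = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.], [10., 0.], [10., 1.]])
train_y = np.array([0, 0, 1, 1, 2, 2])
query_x = np.array([[0., 0.5], [5., 5.5], [10., 0.5]])
pnn_labels = pnn_predict(train_x, train_y, query_x)
```

Because the pattern layer stores every training vector, "training" amounts to copying the data, which is why a PNN is fast to train but grows with the size of the training set.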

Learning Vector Quantization (LVQ)
The third implemented NN structure is LVQ, which has two layers: a competitive layer and a linear layer [11], as shown in Figure 5.

Convolutional Neural Network (CNN)
A CNN is a multilayer perceptron designed specifically to recognize two-dimensional shapes with a high degree of invariance to translation, scaling, skewing, and other forms of distortion. Figure 7 shows the architectural layout of a convolutional network made up of an input layer, four hidden layers, and an output layer. This network is designed to perform image processing (e.g., recognition of handwritten characters) [12].
In our work, we implement two different CNN structures. The first has six layers: the input layer (input: 21 × 15), the filter layer (filter: 21 × 4 × 30), the convolutional layer (convolution: 30 × 12), the pooling layer (pooling: 30 × 6), the classifier layer (softmax classification), and finally the output layer.
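The layer dimensions listed above are consistent with a one-dimensional convolution along the window: 30 filters of width 4 over 15 positions give 15 − 4 + 1 = 12 outputs per filter, and pooling by a factor of 2 gives 6. The Python sketch below walks through these shapes. The ReLU nonlinearity, max pooling with width 2, the dense layer before the softmax, and the random weights are all assumptions for illustration; the text gives only the layer sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((21, 15))                # input: 21 features x 15 window positions
filters = rng.normal(size=(30, 21, 4))  # 30 filters of size 21 x 4 (stored filter-first)
dense_w = rng.normal(size=(30 * 6, 3))  # assumed dense layer feeding the softmax


def conv1d_valid(x, filters):
    """Slide each 21 x 4 filter along the window without padding."""
    n_filters, _, width = filters.shape
    n_out = x.shape[1] - width + 1
    out = np.empty((n_filters, n_out))
    for t in range(n_out):
        patch = x[:, t:t + width]
        out[:, t] = np.tensordot(filters, patch, axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0)  # assumed ReLU


def max_pool(feature_maps, size=2):
    """Non-overlapping max pooling along the position axis."""
    n_filters, length = feature_maps.shape
    trimmed = feature_maps[:, : length - length % size]
    return trimmed.reshape(n_filters, -1, size).max(axis=2)


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


conv = conv1d_valid(x, filters)                       # shape (30, 12)
pooled = max_pool(conv)                               # shape (30, 6)
class_probs = softmax(pooled.reshape(-1) @ dense_w)   # scores for H, E, C
```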

Results and Discussion
In this section, we present the results of the five implemented NN models. In addition to the three-state accuracy $Q_3$ described previously, we use $Q_H$, $Q_E$, and $Q_C$, the percentages of correctly predicted residues observed in classes H, E, and C, respectively, as shown in Figure 9.
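The per-class measures can be computed by restricting the count to residues observed in one class. The Python sketch below is illustrative; the function name `per_class_accuracy` and the toy labels are hypothetical. The example uses a predictor that outputs C for every residue, which scores perfectly on C and zero on the other two classes.

```python
def per_class_accuracy(observed: str, predicted: str, classes: str = "HEC") -> dict:
    """Percentage of residues observed in each class that are predicted correctly."""
    result = {}
    for c in classes:
        total = sum(1 for obs in observed if obs == c)
        hits = sum(1 for obs, pred in zip(observed, predicted) if obs == c and pred == c)
        # A class with no observed residues has no defined accuracy.
        result[c] = 100.0 * hits / total if total else float("nan")
    return result


# Hypothetical predictor that labels every residue as coil (C).
scores = per_class_accuracy("HHHHEEECCC", "CCCCCCCCCC")
```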

Conclusion
The LVQ model predicts only the coil class (C): it predicts that class completely correctly (100%) and the other classes (H, E) completely wrongly (0%), owing to its limited capability to classify a problem as complex as PSSP.
LVQ is a method for training competitive layers in a supervised manner. The competitive layer learns to classify input vectors in much the same way as the competitive layers of Self-Organizing Feature Maps. The linear layer transforms the competitive layer's classes into target classifications defined by the user. The classes learned by the competitive layer are referred to as subclasses, and the classes of the linear layer as target classes [10]. Our implemented LVQ structure includes an input layer with 340 neurons, a competitive layer with 10 neurons, and three neurons in each of the linear and output layers, as shown in Figure 6, which illustrates the LVQ structure implemented in MATLAB (R2015a).
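The interplay between the competitive layer and the linear layer can be sketched with the basic LVQ1 update rule. The Python code below is a minimal illustration and not the MATLAB implementation; the learning rate, the number of epochs, the initial prototypes, and the toy data are hypothetical, and the use of LVQ1 specifically is an assumption.

```python
import numpy as np


def lvq1_train(x, y, prototypes, proto_labels, lr=0.1, epochs=20):
    """Move the winning prototype toward or away from each training vector."""
    w = prototypes.astype(float).copy()
    for _ in range(epochs):
        for xi, yi in zip(x, y):
            # Competitive layer: the closest prototype (subclass) wins.
            winner = np.argmin(((w - xi) ** 2).sum(axis=1))
            # Attract the winner if its target class is correct, otherwise repel it.
            direction = 1.0 if proto_labels[winner] == yi else -1.0
            w[winner] += direction * lr * (xi - w[winner])
    return w


def lvq_predict(x, w, proto_labels):
    """Linear layer: map each winning subclass to its target class."""
    dists = ((x[:, None, :] - w[None, :, :]) ** 2).sum(axis=2)
    return proto_labels[dists.argmin(axis=1)]


# Toy data (hypothetical): two clusters, one prototype per target class.
x = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
y = np.array([0, 0, 1, 1])
proto_labels = np.array([0, 1])
trained = lvq1_train(x, y, np.array([[1., 1.], [4., 4.]]), proto_labels)
lvq_labels = lvq_predict(x, trained, proto_labels)
```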

Figure 8 describes the general structure of the CNN classifier, which receives several input vectors and returns predicted output vectors. The second CNN structure we implemented is a CNN fine-tuning approach, which tunes the parameters of the whole model as a step toward more accurate prediction of the protein secondary structure. It replaces the softmax activation function used in the first CNN structure (Figure 8) with a sigmoid activation function and applies back propagation in each epoch to tune the parameters.
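The difference between the two output activations can be seen on a single vector of class scores. The Python sketch below uses hypothetical scores and is only an illustration of the two functions, not of the fine-tuning procedure itself: softmax normalizes the three outputs so that they sum to one, whereas the sigmoid squashes each output independently.

```python
import numpy as np


def softmax(z):
    """Normalize scores into a distribution over the classes."""
    e = np.exp(z - z.max())
    return e / e.sum()


def sigmoid(z):
    """Squash each score independently into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


logits = np.array([2.0, 0.5, -1.0])  # hypothetical scores for H, E, C
p_soft = softmax(logits)             # outputs compete; they sum to 1
p_sig = sigmoid(logits)              # outputs are independent; sum is not fixed
```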

Figure 9 compares the five implemented NN models in terms of the average $Q_H$, $Q_E$, $Q_C$, and $Q_3$. The Feed Forward NN gives a higher prediction accuracy $Q_3$ than PNN and LVQ, and its predictions are more balanced across the three secondary structures, with smaller differences between the prediction accuracies of H, E, and C. In the PNN structure, there is a large difference between the prediction accuracy of H and E on one side and C on the other. This is caused by the class imbalance problem in protein secondary structure datasets, which leads classifiers to give more importance to the majority class (C). The LVQ structure could be trained to predict only coil (C), so it predicts that class completely correctly (100%) and the other classes (H, E) completely wrongly (0%). Finally, the results show that CNN Non Fine Tuning and CNN Fine Tuning obtain higher prediction accuracy than all the other implemented structures, achieving 61.694% for the Non Fine Tuning model and 90.31% for the Fine Tuning model.

Figure 9. Final comparison of prediction accuracy for all models.