
Because a complete DNA sequence contains a large amount of data (usually billions of nucleotides), it is challenging to determine the function of each sequence segment. Several powerful predictive models for DNA sequence function have been proposed, including the CNN (convolutional neural network), the RNN (recurrent neural network), and the LSTM [1] (long short-term memory). However, each of them has flaws; for example, a plain RNN can hardly retain long-term memory. Here, we build on one such model, DanQ, which combines a CNN with an LSTM. We extend DanQ into an improved model and apply it to predict the function of DNA sequences more efficiently. In the original DanQ model, the convolution layer captures regulatory motifs and the recurrent layer captures long-term dependencies between the motifs, so that the model learns a regulatory grammar and improves prediction accuracy. Compared with other models, DanQ improves substantially on several metrics. For regulatory markers, DanQ achieves improvements of more than 50% in the area under the precision-recall curve.

Previously, people raised some deep learning models to solve the prediction of DNA sequence’s [

Bi-directional long short-term memory (BLSTM) [

We use a random dropout rate, which makes the trained network structure more flexible and lets the network converge quickly while preserving training accuracy. Iterating the training in this way favors networks that train fast and reach a high upper limit, which greatly accelerates training while keeping the accuracy unchanged.

Therefore, combining the two models above, we chose DanQ. This work elaborates the two main elements, the DanQ model and random dropout; trains both the original DanQ and the improved version with random dropout added; and analyzes the results in terms of accuracy and speed. We then evaluate the feasibility of the improvement, look for the optimal range of data size, and draw our conclusions.

In our research, we pursue the following aims:

1) Become familiar with CNN and BLSTM, which were proposed earlier.

2) Combine the two models to form “DanQ”.

3) Train this new model proficiently and apply it to predict the function of DNA sequences.

4) Modify the DanQ model to make it more efficient.

DanQ

It is a hybrid framework that combines a CNN and a BLSTM. The one-hot encoded input is first passed through a convolution layer and a max pooling layer, which extract and condense motif features; the result is then fed to the BLSTM layer and finally to the last two layers: a dense layer of rectified linear units and a multi-task sigmoid output, $f(x) = (1 + e^{-x})^{-1}$.
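The input encoding and the output nonlinearity described above can be illustrated with a minimal sketch. The column order A, C, G, T and the treatment of unknown bases as all-zero rows are assumptions made for illustration; the text does not specify them.

```python
import math

# Assumption: channel order A, C, G, T (the text does not fix one).
ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of 4 values; bases outside ACGT become all zeros."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]

def sigmoid(x):
    """f(x) = (1 + e^(-x))^(-1), applied independently to each output unit."""
    return 1.0 / (1.0 + math.exp(-x))

encoded = one_hot("ACGTN")                      # 5 rows, 4 channels each
scores = [sigmoid(z) for z in (-2.0, 0.0, 2.0)]  # each value lies in (0, 1)
```

Because the sigmoid is applied to each output unit separately, every chromatin feature receives its own probability, which is what makes the output layer multi-task rather than a single softmax over classes.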

The DanQ model (see in

row for each convolution kernel and a column for each position in the input (minus the width of the kernel) is produced by a convolution layer with rectifier activation. Max pooling reduces the size of the output matrix along the spatial axis while retaining the number of channels. The subsequent BLSTM layer considers the orientations and spatial distances between the motifs. The BLSTM outputs are flattened into a layer that serves as input to a fully connected layer of rectified linear units. The final layer applies a sigmoid nonlinear transformation to produce a vector that serves as probability predictions of the epigenetic marks, to be compared with the true target vector via a loss function.
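The shapes described above (one row per kernel, one column per valid input position, then pooling along the position axis) can be checked on a toy input. This is a pure-Python sketch with two hand-written kernels of width 3; the kernel values, the sequence, and the pool size are illustrative assumptions and not the parameters used in this work.

```python
def conv1d_relu(x, kernels):
    """Valid 1D convolution over a one-hot sequence followed by ReLU.

    x:       list of L rows, each holding 4 channel values
    kernels: list of K kernels, each a list of W rows of 4 weights
    returns: K rows, each with L - W + 1 activations
    """
    out = []
    for kernel in kernels:
        width = len(kernel)
        row = []
        for start in range(len(x) - width + 1):
            total = sum(kernel[i][c] * x[start + i][c]
                        for i in range(width)
                        for c in range(len(x[0])))
            row.append(max(0.0, total))  # rectifier activation
        out.append(row)
    return out

def max_pool(rows, size):
    """Non-overlapping max pooling along positions; the channel count is unchanged."""
    return [[max(row[i:i + size]) for i in range(0, len(row) - size + 1, size)]
            for row in rows]

# Toy input: one-hot "ACGTAC" with channel order A, C, G, T (assumed).
x = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0],
     [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]]
kernels = [
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],  # responds to "ACG"
    [[0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]],  # responds to "TAC"
]
conv_out = conv1d_relu(x, kernels)  # 2 rows x (6 - 3 + 1) = 4 columns
pooled = max_pool(conv_out, 2)      # 2 rows x 2 columns
```

Each kernel fires only where its motif occurs, and pooling keeps the strongest response in each window, so the recurrent layer sees a shorter sequence of motif activations.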

Because DanQ is simply a combination of ordinary neural networks, it shares a problem common to other deep learning models: overfitting. Large neural networks are slow to use, and overfitting makes the input information even harder to handle. Dropout can be added to address this problem (see in

The functionality and data of the DeepSEA framework also apply to DanQ. Namely, the human reference genome GRCh37 was divided into non-overlapping 200 bp bins. The targets are computed by intersecting these bins with 919 ChIP-seq and DNase-seq peak sets from the uniformly processed ENCODE and Roadmap Epigenomics data releases, so that a binary target vector of length 919 is generated for each sample [
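The binning and labeling step can be sketched as follows. The function name, the interval format, and in particular the overlap threshold `min_overlap_fraction` are hypothetical choices for illustration; the text states only that bins are 200 bp wide and that targets come from intersecting bins with peaks.

```python
BIN_SIZE = 200  # bin width in bp, as stated in the text

def bin_targets(region_length, peak_sets, min_overlap_fraction=0.5):
    """Return one binary target vector per non-overlapping 200 bp bin.

    peak_sets: one list of half-open (start, end) intervals per chromatin feature;
               peaks within a feature are assumed not to overlap each other.
    min_overlap_fraction: hypothetical labeling threshold, not taken from the text.
    """
    n_bins = region_length // BIN_SIZE  # a trailing partial bin is dropped
    targets = []
    for b in range(n_bins):
        bin_start, bin_end = b * BIN_SIZE, (b + 1) * BIN_SIZE
        labels = []
        for peaks in peak_sets:
            covered = sum(max(0, min(bin_end, end) - max(bin_start, start))
                          for start, end in peaks)
            labels.append(1 if covered > min_overlap_fraction * BIN_SIZE else 0)
        targets.append(labels)
    return targets

# Toy example with 2 features instead of 919.
toy_peaks = [[(150, 380)], [(590, 600)]]
toy_targets = bin_targets(600, toy_peaks)
```

With 919 peak sets in `peak_sets`, each inner list would have length 919, matching the target vectors described above.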

To evaluate performance on the test set, we compute the predicted probability for each sequence as the average of the predictions for the forward and reverse-complement sequence pair.
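The strand-averaging rule can be written as a small helper. Here `predict` is a hypothetical callable standing in for the trained model, and `toy_predict` is a made-up scorer used only to make the sketch runnable.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def strand_averaged_prediction(predict, seq):
    """Average per-feature probabilities over the forward and reverse-complement strands."""
    forward = predict(seq)
    reverse = predict(reverse_complement(seq))
    return [(f + r) / 2.0 for f, r in zip(forward, reverse)]

def toy_predict(seq):
    # Stand-in for a trained model: two fake "probabilities" from base frequencies.
    return [seq.count("A") / len(seq), seq.count("G") / len(seq)]

averaged = strand_averaged_prediction(toy_predict, "AAGC")
```

Averaging over both strands makes the final prediction independent of which strand of the double helix happened to be recorded in the input.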

DanQ Model

For additional details on the architecture and related parameters used in this research, see the Supplement. The model includes dropout, which randomly sets a proportion of the neuron activations from the max pooling and BLSTM layers to 0 at each training step, so as to regularize the DanQ model. In the altered algorithm, the dropout rate is chosen at random to speed up convergence.

To make training much faster, we reduce the size of the dataset from 4,400,000 samples to 40,000. We will also train on the complete dataset afterwards to guard against inaccuracy.
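One way to draw such a reduced dataset is uniform sampling without replacement, sketched below. The text does not say how the 40,000 samples were selected, so the sampling scheme, the function name, and the fixed seed are all assumptions.

```python
import random

def subsample_indices(n_total, n_keep, seed=0):
    """Pick n_keep distinct sample indices uniformly at random (assumed scheme)."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return sorted(rng.sample(range(n_total), n_keep))

subset = subsample_indices(4_400_000, 40_000)
```

The resulting indices can then be used to slice the full training arrays; this keeps under 1% of the data, which is why a later run on the complete dataset is still needed.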

Improved Method

In the original DanQ, the dropout rate is set to 0.5 for the LSTM (long short-term memory) layer and 0.2 for the max pooling layer. We replace them with two random numbers: one between 0.1 and 0.3 for the max pooling layer and one between 0.4 and 0.6 for the LSTM layer. This change makes the neural network's structure more flexible.
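The random choice of dropout rates can be sketched as below. The two ranges come from the text; drawing the rates uniformly, drawing them once per call, and using inverted dropout (rescaling the surviving activations) are assumptions, and all function names are hypothetical.

```python
import random

POOL_RATE_RANGE = (0.1, 0.3)  # max pooling layer, from the text
LSTM_RATE_RANGE = (0.4, 0.6)  # LSTM layer, from the text

def sample_dropout_rates(rng):
    """Draw one rate per layer; a uniform distribution is an assumption."""
    return rng.uniform(*POOL_RATE_RANGE), rng.uniform(*LSTM_RATE_RANGE)

def apply_dropout(activations, rate, rng):
    """Inverted dropout: zero each activation with probability `rate`,
    and divide the survivors by (1 - rate) to keep the expected value."""
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(0)
pool_rate, lstm_rate = sample_dropout_rates(rng)
dropped = apply_dropout([1.0] * 1000, lstm_rate, rng)
```

Both ranges are centered on the original fixed rates (0.2 and 0.5), so on average the amount of regularization stays the same while individual runs vary.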

The Website of Dataset

https://genome.cshlp.org/content/21/3/447/suppl/DC1.

By training the original DanQ model, we find that, by using both convolutional and recurrent layers, DanQ is a practical and effective model with high accuracy. To improve it, we add random dropout to the model and compare the result with the original to determine whether this modification really improves DanQ. When the network is trained with a larger batch size, the random dropout rate greatly improves the training speed. Specifically, training with random dropout took 20 epochs on average over 14 runs. By contrast, the original DanQ took 27 epochs on average over 6 runs, with a loss less than 0.3% lower than that of random dropout. Based on these results, we reach the preliminary conclusion that a random dropout rate improves DanQ's training speed at large batch sizes.

Meanwhile, random dropout and the original DanQ differ little when applied with a smaller batch size. Both take 12 epochs on average, with similar accuracy and loss. However, each epoch with the smaller batch size costs three times more than an epoch with the larger batch size to reach similar loss and accuracy. Hence, the total training time is still longer. This batch size may lie outside the suitable range for random dropout (see in
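The epoch counts reported in the two paragraphs above can be compared with a short calculation. Reading "three times more" as a factor of 3 per epoch is an assumption, and the cost unit (one large-batch epoch) is introduced here only for illustration.

```python
# Large batch size: average epochs to converge, from the text.
random_dropout_epochs = 20
original_epochs = 27
epoch_reduction = (original_epochs - random_dropout_epochs) / original_epochs

# Small batch size: 12 epochs for both variants, from the text.
small_batch_epochs = 12
small_batch_epoch_cost = 3.0  # assumption: one small-batch epoch = 3 large-batch epochs
small_batch_total = small_batch_epochs * small_batch_epoch_cost

print(f"epoch reduction at large batch size: {epoch_reduction:.1%}")
print(f"small-batch cost in large-batch epoch units: {small_batch_total:.0f}")
```

Under this reading, random dropout needs about a quarter fewer epochs at the large batch size, and the small-batch runs cost the equivalent of 36 large-batch epochs, more than either large-batch configuration.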

We have trained the DanQ model with our code implementation and shown that it is practical. Moreover, we train with random dropout and compare the result with that of the original DanQ, finding that random dropout can improve DanQ's training speed. This means that it is possible to make the DanQ model analyze DNA sequences more effectively. However, although the modified model is practical when the batch size is small, it costs more when the batch size is large. We therefore still need to find the most suitable range for random dropout to make it feasible. In addition, we will look for other, better ways to modify the DanQ model.

1) Although the average training time of the random dropout algorithm is shorter than that of the original, its spread is still unpredictable. The longest random dropout run took 38 epochs, far more than usual. We are trying to improve the algorithm to make it more stable.

2) Enlarge the data to all 4,400,000 samples.

3) Apply some “tricks” to the network. For example, if some regions are too hard for the network, it can study them more times than the others.

4) Find the best range for the dropout rate.

The authors wish to thank Professor Manolis Kellis of the Massachusetts Institute of Technology, TA Ying Zhang in Chongqing, and TA Zihao Zhang of Tongji University for providing the dataset from previous articles and for guidance on the code implementation and the editing of this article.

The authors declare no conflicts of interest regarding the publication of this paper.

Li, D.F. and Huang, X. (2020) An Improved Deep Learning Model for Predicting DNA Sequence Function. Intelligent Information Management, 12, 36-42. https://doi.org/10.4236/iim.2020.121003