Gene finding by integrating gene finders

Gene finding, the accurate annotation of genomic DNA, has become one of the central topics in biological research. Although various computational methods (gene finders) have been proposed and developed, they all have their own limitations in gene findings. In this paper, we introduce an integrating gene finder, which combines the results of several existing gene finders together, to improve the accuracy of gene finding. Four integration schemes, based on majority voting, are developed for the analysis of two datasets – the basic dataset and the testing dataset. The basic dataset consists of 1500 DNA sequences and the testing dataset consists of 103 DNA sequences. It is demonstrated that a simple integration (a simple voting for each nucleotide) can significantly improve the finding performance, and removing confusing gene finders, caused by poor performance or redundant results, is important for a further improvement of the integration. The best prediction results are obtained using weighted majority voting, aided by the mRMR (Minimum Redundancy Maximum Relevance) (Peng, 2005) method for the gene finder selection. The prediction accuracies are 84.16% and 90.06% for the basic dataset and testing dataset respectively, which are better than any individual gene finding software in our research.


INTRODUCTION
Genomes are sequenced very rapidly in recent years due to the advance of the sequencing technique.However, the interpretation of the sequences, including the accurate annotation of genomic DNA, has proved to be a much more difficult task.Although experimentation can be performed to detect genes, the process is lengthy and tedious.The most important and widely used methods today for gene finding are the computational ones.However, lack of accuracy is the main problem for gene prediction.These gene finders have various mechanisms for finding genes: some are based on evidence-based gene finding, i.e., using existing known genes or protein sequences to search and find the unknown genes, e.g.computational gene finders Genie [1] and GeneParser3 [2]; some are using the intrinsic genetic signals such as the splice sides, start and stop codons for the gene finding, e.g.Genie, Genmark [3], GeneID [4], and Genescan [5].As far as the algorithms are concerned, some use artificial neural network such as the Genie, many use the Hidden Markov Models (HMM) such as Genie, Genscan and Fgenesh [6], some use dynamic programming such as the Fgenesh and Genie, and some are aided by Blast [7] such as the Twinscan software [8].Though there is severe algorithm and strategy overlap between the gene finders, there is also huge discrepancy between them in the algorithms and the detailed operations.In the paper, we introduce a voting strategy to integrate the results of the gene finders to strengthen the gene finding and prediction capability.
The Condorcet Jury Theorem proved that the judgments of a committee are superior to those of individuals [9].Such a theorem also holds true for the prediction and classification algorithms since many investigations have found that integrating decisions from hybrid decision (classification) algorithms can significantly improve the prediction, classification and recognition performance.These investigations include character and handwriting recognition [10][11][12][13][14], image analysis and segmentation [11,15,16], automated credit card slip processing [17], speaker identification [18], and other applications [19][20][21].All these multi-predictor integrations make use of all the candidate predictors for their applications.Selecting all candidate predictors is simple and straightforward.However, similar predictors may strengthen each other and dominate the decisions, e.g. if the same algorithm is used twice, the decision will be biased towards that algorithm.Similarly, if too many algorithms from the same kind are involved in the decision, the decision will be biased towards that kind of algorithms.To avoid this problem, redundant gene finders (redundancy overtakes the complementary capability) should be removed from the integration.mRMR method [22] is originally developed for feature selection and analysis, which is transferred into the selection of gene finders in the research.On the other hand, better performed gene finder should be weighted higher, and vice versa.Therefore, we first select complementing gene finders through mRMR method, then use majority voting with weighting to integrate the filtered gene finders.The weight of each gene finder is set to be the AC values (defined in Material and Method), which is used to represent the finding accuracy.
In the research, we choose eight famous and classical gene-finding softwares from genefinding.org(http:// www.genefinding.org/software.html,see below for details).In order to compare simple integration system with the further refined integration system, four schemes are implemented, which are Simple Majority Voting with Simple Software Selection (SMV_SSS), Weighted Majority Voting with Simple Software Selection (WMV_ SSS), Simple Majority Voting with mRMR Software Selection (SMV_MSS) and Weighted Majority Voting with mRMR Software Selection (WMV_MSS).In our research, WMV_MSS performed best.Its prediction accuracies reached 84.16% and 90.06% for the basic dataset and testing dataset.The weight of the gene finders is set to be the gene finding accuracy defined below.Simple software selection purely takes the software with higher accuracy first without considering the redundancy, while mRMR software selection takes consideration of the software redundancy as well as the accuracy.These schemes are introduced in detail below.

Data Preparation
The datasets we study consist of two parts: the basic dataset and the testing dataset.The basic dataset is taken from The University of Maryland Center for Bioinformatics and Computational Biology (http://www.cbcb.umd.edu/research/genefinding.shtml#genedata), with 1500 sequences from human beings.767 sequences of them are translated using the forward chains, while the other 733 are translated using the reverse chains.The DNA sequences in the HMR195 dataset (http://www.cs.ubc.ca/~rogic/evaluation/dataset.html) are taken as the testing dataset.The testing dataset contains 195 gene sequences, among which 103 are from human, 82 are from the mice, and the rest 10 sequences are from the rats.All these 195 sequences are translated forward, and entered the Gen-Bank after August 1997.The basic dataset is used for testing and obtaining the TAC values (refer to "Accuracy Test" below) for each gene finder, while the testing dataset is used only for testing the integration.Because the TAC values are fed back for the integration through weighting in some integration schemes, the gene finding results might be biased when using the basic dataset.However, since the finding accuracy is rather stable especially with a large dataset, the bias is neglectable.For scrutiny, a testing dataset is independently used for testing by taking the gene finder accuracy from the basic dataset.

Data Encoding and Decoding
The predicted protein sequences need to be encoded with digital numbers so that the predicted protein sequences can be integrated.Because the protein sequences are translated by the DNA sequences, we code the DNA sequences so that the predicted protein sequence can be uniquely decoded back by the coded DNA sequences.In the coding method, number 0, 1, 2 and 3 are used to represent the state of every nucleotide in a sequence.Number 0 means the nucleotide is in the noncoding region, while number 1, 2 and 3 means this nucleotide is in a codon: number 1 means that the nucleotide is in the first position of the codon, number 2 means it is in the second position of the codon and number 3 indicates the last position.The encoded sequence can be easily decoded and translated back into a protein sequence: firstly, nucleotides marked with 0 are removed, then every codon (three consecutive nucleotides) marked with "1, 2, 3" are translated into amino acid.In summary, every DNA sequence is coded according to the predicted se-quences from every software, then the eight coded sequences from the eight software are integrated using the four integration schemes described below, finally the encoded integrated sequences could be decoded to obtain the final translated protein sequences.

Accuracy Test
In our study, we used ClustalW2 [25] to align the real coded protein sequences with its predicted coded protein sequences.Base on this alignment, the AC values, representing the accuracy of the gene finding, can be calculated as follows.
Assume the number of amino acids matched in the alignment is n, the length of real protein coded by this sequence is r l , the length of predicted protein is p l , and length of the longest protein that is possibly translated from a DNA with such a length is M l .The approximate correlation [26] is used to represent the gene finding accuracy in the research, which can be estimated as follows: The average conditional probability (ACP) is defined as: The Approximate Correlation (AC) of the gene finding is calculated as: AC = 2 (ACP -0.5).AC = -1, 0, 1 indicate a total negative, a random, and a perfect finding.
Assume the number of sequences in the data set is N, the total accuracy of prediction of the software of integration scheme is:

mRMR (Minimum Redundancy Maximum Relevance)
Maximum Relevance, Minimum Redundancy method (mRMR) [22] is originally developed by Peng for microarray data processing.mRMR method requires the input data to be numeric vectors-each vector is taken as a mRMR feature.mRMR ranks each feature according to both its relevance to the target (highly related to the prediction accuracy) and the redundancy between the features.A "good" feature is characterized by maximum relevance with the target variable and minimum redundancy within the features.Both relevance and redundancy are defined by mutual information (MI), which estimates how much one vector is related to another.MI is defined as follows: , l o g p x y I x y p x y dxdy p x p y   (1) x and y are two vectors; ( , ) p x y is the joint probabilistic density; ( ) p x and ( ) p y are the marginal probabilistic densities.
Let Ω denote the whole vector set.The already-selected vector set with m vectors is denoted by s  , and the to-be-selected vector set with n vectors is denoted by t  .
Relevance D of a feature f in t  with a target variable c can be computed by Eq.2.

 
Redundancy R of a feature f in t  with all the features in s  can be computed by Eq.3 To maximize relevance and minimize redundancy, mRMR function is obtained by integrating Eq.2 and Eq.3: where i f is vector from the best performed gene finder, and { , ,..., , ,..., } by excluding only i f .Eq.4 is used to obtain one vector by another in totally 1 n  rounds, resulting a vector list with the selection order ' ' ' ' 0 1 1 , ,..., ,...,  where h denotes at which round the feature is selected.
In this research, mRMR method is used to rank and select the 8 softwares.The predicted results are coded by number 0,1,2,3, as is described above, and become nu-meric vectors.The real coded protein sequence is regarded as the target vector, and each gene finder results in a numeric vector.From these vectors the 8 softwares can be ranked by the mRMR method.How to use these ranked softwares for the integration is described in the following sections.

Algorithm of Integration Scheme
Four integration schemes are introduced in the section.

SMV_SSS and SMV_MSS SMV_SSS (Simple Majority
Voting with Simple Software Selection), like its name, is the simplest integration scheme in the research.The coded real sequence and the coded predicted sequences are the input vectors of SMV_SSS.Therefore, there are 9 vectors, among which, 8 vectors come from the 8 softwares (called as software vectors), and the remaining 1 vector is the coded real protein sequence (called as real vector).AC (described above) can be calculated for each software vector.Firstly, a software list is obtained simply according to the AC values ' ' ' ' 0 1 1 , ,..., ,...,  where the higher position the software is in, the higher AC value the software obtains.Next, we initialize the software set as and then add software within the software list one by one to the initial software set, resulting a series of software set   , the state of each nucleotide (coded by 0, 1, 2 or 3), gaining the majority votes from the software in the software set, is set to be the state of the nucleotide.This can be formulated as follows: j p denotes the state of a nucleotide n predicted by software ' j f , then a counting function ( ) s X a can be defined as follows: where a and s are the state numbers 0, 1, 2 or 3.
The total voting counts for a state s is defined as The integrated state for nucleotide n using software set i S is set to be The process of SMV_MSS (Simple Majority Voting with mRMR Software Selection) is exactly the same as the SMV_SSS except that the algorithm list is provided by mRMR method (please refer to 2.5).Unlike SMV_SSS, SMV_MSS not only consider the software performance, but also the redundancy between the softwares.Thus the next software added into the software set is optimized by both software performance and software redundancy.

WMV_SSS and WMV_MSS WMV_SSS (Weighted Majority Voting with Simple
Software Selection) is similar to SMV_SSS except that the software is weighted by the AC value rather than being weighted equally.The total voting count in Eq.5 becomes , that is, each vote is weighted by the AC value.The rest is exactly the same as the SMV_SSS.
As for the WMV_MSS, the readers can easily deduce how it processes.WMV_MSS is the most refined integration system because it weights the software with the AC values after the software set is optimized by the mRMR method.

An Example for the Integration Methods
This section demonstrates how to encode and integrate the prediction of sequence AB010281 in HMR195 dataset with individual predictors, and finally obtain the amino acid sequence using our integration methods.Encoding the prediction results of all the 8 softwares, we get the state matrix of this DNA sequence, where each vector of the matrix is the coded protein sequence from a gene finder (see supplemental material 3).
Let us first demonstrate how to integrate the results using SMV_SSS.The votes for each nucleotide are first counted for each software set.For instant, the predicted states of the 28 th nucleotide are: where each state comes from each of the 8 softwares, respectively.If all 8 softwares are selected, we can see that 1 appears 5 times and 0 appears 3 times.So state 1 gets the majority votes and the integrated state of the 28 th nucleotide is set to be 1.In this way, we could get the whole state sequence using SMV_SSS scheme when all softwares are included.Similarly, the votes can be counted when 3-8 softwares involve.Notice that the AC values for the 8 softwares are 0.8000 (Fgenesh), 0.7437 (Fgenes), 0.6487 (HMMgene), 0.6224 (Genie), 0.8008 (Genscan), 0.7579 (Geneid), 0.7158 (Augustus), 0.7702 (Twinscan), and the software is added with descending AC value.For the WMV_SSS, each software is weighted by its AC value.Again, let us take the 28 th nucleotide as an instance: the score of state 1 is 3.5448, and the score of 0 is 2.3147 when all 8 softwares are included.So the integration state of this nucleotide is 1.In this way, we can get the whole state sequence using WMV_SSS scheme (see Supplemental Material 3).The mRMR software selection schemes process in the same way as the simple software selection schemes, except that the software is added in a different order (refer to Table 2 for the list).
With the integrated state sequence and the DNA sequence, we can get the predicted mRNA sequence.First remove all the nucleotides with state 0, then, remove the nucleotides with an incomplete reading frame.For instance, after removing all 0s in the SMV_SSS integration state sequence with 8 softwares, we can see a state fragment "…12312123…" corresponding with the DNA fragment "…GACAGATG".Obviously, the reading frame "AG" is incomplete, and they are removed.In this way, we can get the final predicted mRNA sequence (see Supplemental Material 3).The mRNA sequence is translated into a protein sequence by translating each codon into an amino acid.

Prediction Results the 8 Gene-Finding Softwares
The basic dataset (1500 sequences) and the testing dataset (103 sequences) are input into the 8 gene-finding softwares (described above), and produce the predicted protein sequences.The TAC values (defined above) are calculated to rate the performance of the gene-finding softwares.These TAC values are shown in Table 1.The TAC values, obtained by the basic dataset, will be used as the weight of the corresponding software when integrating by schemes WMV_SSS and WMV_MSS.

Result of mRMR
The feature vectors input into mRMR method consist of all the predicted results from the 8 software for the 1500 sequences-each sequence contains thousands of nucleotides-and the coded real protein sequence.By joining the coded sequences together, it results in nine 1405899dimensional vectors.Since the complementation capability of each gene finder is the main focus, the dimension is deleted using the command "uniq" in Linux if the related nucleotide in that dimension has the same state number for all the 9 vectors.After the deletion, 94700dimensional vectors remain as the input for the mRMR method (see Supplemental Materials 4 for details).
The mRMR program used in this contribution is downloaded from website http://research.janelia.org/peng/proj/mRMR/.As all of the input vectors are integer vectors, we specify the parameter 0 t  in the mRMR program to tackle the integral calculation.mRMR method outputs the list of the software (see Supplemental Material 5 for the parameter setting of mRMR program).The mRMR list is shown in Table 2.This list is used in the integration of the softwares as is described above.

Results of SMV_SSS, WMV_SSS, SMV_MSS and WMV_MSS
Seven TAC values are obtained for each of the integration schemes since each scheme integrates two to eight gene finders.The best TAC value of the Seven TAC values is taken as the TAC value of the integration scheme.Table 3 shows all the integration results.WMV_MSS performs best both in the basic dataset and the testing dataset since its TAC rates are 84.16% and 90.06% when six software are integrated.Figure 1 plots all the seven TAC values for the four integration schemes.

DISCUSSIONS
From Table 3, we can see that nearly all integrated gene finders are better than the best single gene finder, except that when four software are integrated using SMV_SSS on the testing dataset.Therefore, integrated gene finders are significantly better than single gene finder in gene finding.This indicates that gene finders are complementing each other for gene finding.
From Figure 1(a), WMV_MSS performs consistently better than other integration schemes on processing the basic dataset, and in Figure 1(b), in majority cases WMV_MSS performs better than other integration schemes.Therefore, WMV_MSS is the best integrator among the four.SMV_MSS is the second best integrator for the basic dataset since in Figure 1(a) it performs consistently better than the rest two integration schemes.As for the testing dataset in Figure 1(b), it is hard to tell whether SMV_MSS or WMV_SSS is better.We cannot tell whether WMV_SSS or SMV_SSS is better for the basic dataset from Figure 1.And SMV_MSS performs better than SMV_SSS consistently in Figure 1(a), and also in the majority cases in Figure 1(b).
Therefore, mRMR software selection does improve the integration results since WMV_MSS is better than WMV_SSS and SMV_MSS is better than SMV_SSS in the integration.Weighting contributes slightly towards the integration since WMV_MSS is slightly better than SMV_MSS in both basic and testing dataset, and WMV_SSS is slightly better than SMV_SSS in processing the testing dataset.In SMV_MSS and WMV_ MSS, when the last two softwares Twinscan and Augustus are added into the integration, the prediction performance is worse (see Table 3 and Figure 1).Twinscan is the 3 rd best software.However, this software can be treated as the combination of Genscan and BLAST (see Supplemental Material I for further detail).Therefore, Genscan and other softwares must have covered the prediction capacity provided by Twinscan, causing the redundancy of Twinscan.For Augustus, it is in the 5 th position in the software list according to the TAC values, and its algorithms are similar to some of the other softwares.So Augustus is also removed by the mRMR criteria.

CONCLUSIONS
We introduce some integrating schemes for some gene finder softwares in this investigation.The results indicate that these gene finders are able to complement each other and a simple combination of them performs significantly better than the best individual gene finder.mRMR (minimum redundancy and maximum relevance) method is proved to be very important for the further improvement of the prediction performance.Assigning higher weights to better performed gene finders can improve the integration results slightly, but not as significant as the mRMR software selection.A further improvement is expected if more complementing gene finders are included in the fusion, which should be investigated in a future research.

Figure 1 .
Figure 1.The performance of 4 integration schemes in (a) training set and (b) test set.It plots all the seven TAC values for the four integration schemes.

Table 1 .
The performance of 8 gene-finding software.

Table 2 .
The rank of 8 gene-finding software in mRMR.

Table 3 .
The performance of four predictors.