Analysis and prediction of exon, intron, intergenic region and splice sites for A. thaliana and C. elegans genomes

Although a great deal of research has been undertaken on the annotation of gene structure, predictive techniques are still not fully developed. In this paper, based on the base composition of sequences and the conservation of nucleotides at exon/intron splice sites, a least increment of diversity algorithm (LIDA) is developed for studying and predicting coding exons, introns and intergenic regions, and for distinguishing three kinds of coding exons. First, taking the 64 trinucleotide compositions and 120 position parameters of the four bases as informational parameters, coding exon, intron and intergenic sequences are predicted. The overall prediction accuracies are 91.1% and 88.4% for the A. thaliana and C. elegans genomes, respectively. Subsequently, based on the position frequencies of the four bases in regions near intron/coding exon boundaries and near the translation initiation and termination sites, 12 position parameters are selected as the diversity source, and three kinds of coding exons are predicted using LIDA. The prediction success rates are higher than 80%. These results can be used in sequence annotation.


INTRODUCTION
With the completion of genome sequencing projects, more and more effort is being put into understanding the functional elements encoded in a genome [1,2,3,4,5,6]. Annotation of gene structure in eukaryotic genomes currently involves both computational and experimental approaches [7,8,9,10]. Driven by this explosion of genome data and the need to analyze draft data quickly, gene-finding programs have also proliferated, particularly those designed for specific organisms [11,12,13,14,15]. However, their accuracy is still far from satisfactory [16].
Gene prediction methods can generally be classified as composition-based or similarity-based. Composition-based methods, also called ab initio gene-finding methods, have two important aspects: the type of information used and the algorithm. Most types of information measure codon usage bias, base compositional bias between codon positions, splice site signals, or periodicity in base occurrence. Several sophisticated algorithms that infer the presence of a gene feature from signal and content information have been devised, including GenScan [17], Fgenes [18], Genie [19] and MZEF [20]. Although some satisfactory results were obtained with this software, a considerable proportion of missing or incorrect exons and over-predictions was found when an experimentally validated dataset of genomic sequences was used [21]. Moreover, most ab initio gene prediction programs rely on a large number of parameters. For example, GeneMark requires 12,288 parameters [22], which leads to unreliable predictions for small genomes [23]. Similarity-based methods such as Genewise [24] and Procrustes [25] predict a gene by relying on homologous sequences. These methods show high sensitivity and specificity for genes whose sequences are closely related to the known input sequence, but some species-specific genes are likely to be missed [7]. To improve prediction, programs that combine protein sequence similarity with ab initio gene-finding algorithms, such as GenomeScan [26], were proposed. Despite great progress, experimental evaluation highlighted errors in the various predictions and indicated that both types of gene prediction programs are currently unable to determine whole gene structures consistently [27].
Although programs for splice site and gene structure recognition have reached a high level of performance on internal coding exons, standard splice sites might not be sufficient for defining introns in genomes [28]. Prediction of splice sites in non-coding regions of genes is one of the most challenging aspects of gene structure recognition. Distinguishing intergenic regions from introns should be very useful for understanding the features of noncoding and regulatory regions. In addition, finding first exons remains a challenge, except where true full-length mRNA sequences are available. Unfortunately, most of the available mRNA sequences are incomplete at their 5' ends and do not provide information about first exons. Recognizing exons, introns and intergenic DNA at the same time is therefore very helpful for gene recognition. In particular, distinguishing introns from intergenic sequences has been difficult for previous algorithms.
In this paper, our goal is to provide a new computational method for predicting gene structure based on the least increment of diversity algorithm (LIDA). The diversity measure was first introduced and employed in biological classification [29]. It is an information description on a state space, a measure of the whole uncertainty and total information of a system, derived from information theory. To compare the similarity of two sources, one defines the increment of diversity (ID) as the difference between the diversity measure of the mixed system and the total diversity measure of the two systems. It can be proved that the higher the similarity of two sources, the smaller the ID. So the increment of diversity of two sources is essentially a measure of their similarity.
Here, according to the theory of diversity, we first predict coding exons, introns and intergenic sequences of A. thaliana and C. elegans, based on an analysis of the compositional differences near splice sites and in conserved sequence segments of the three kinds of sequences (exons, introns and intergenic sequences) in the complete genomes of these two model organisms. Subsequently, three kinds of coding exons (first, internal and last coding exons) are predicted using the least increment of diversity algorithm. This may be useful for improving the prediction of splice sites.

Data Sample
The A. thaliana and C. elegans genomic DNA sequences were obtained from GenBank. The coding exons, introns and intergenic sequences were extracted from these genomes. According to the length distribution, we divide all sequences of one chromosome into three subsets. The ranges of the three subsets are 30-200 bp, 200-500 bp and ≥500 bp for exon and intron sequences, and 30-2000 bp, 2000-5000 bp and ≥5000 bp for intergenic sequences.
A total of 15,609 first coding exons, 67,408 internal coding exons and 15,791 last coding exons were extracted from the complete A. thaliana genome, and 10,904 first coding exons, 87,743 internal coding exons and 11,035 last coding exons from the complete C. elegans genome. Subsequences of 9 bases flanking the 5' boundary sites (from site -5 to site +4) and the 3' boundary sites (from site -4 to site +5) were also extracted from these genome sequences.
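The length-based partition described above can be sketched as follows. This is only an illustration: the function name `length_subset` is hypothetical, and the handling of the shared endpoints (a 200 bp or 500 bp sequence is placed in the longer subset) is an assumption, since the stated ranges overlap at their boundaries.

```python
def length_subset(length, kind):
    """Return the subset index (1, 2 or 3) for a sequence of the given length.

    kind is "exon", "intron" or "intergenic". Endpoint handling is an
    assumption: a length equal to a cut point goes to the longer subset.
    """
    # Cut points in bp as given in the text; intergenic regions use wider bins.
    low, high = (2000, 5000) if kind == "intergenic" else (200, 500)
    if length < 30:
        return None  # shorter than the smallest range given in the text
    if length < low:
        return 1
    if length < high:
        return 2
    return 3
```

For example, `length_subset(300, "intron")` falls in the second subset, while the same length for an intergenic sequence falls in the first.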
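A minimal sketch of the 9-base window extraction is given below. It assumes 0-based coordinates with an exclusive exon end, that negative sites lie upstream of the boundary in sequence order, and that there is no site 0 (so sites -5 to +4 cover exactly 9 bases). The function name `boundary_windows` is hypothetical.

```python
def boundary_windows(seq, exon_start, exon_end):
    """Return the 9-base windows around an exon's 5' and 3' boundaries.

    exon_start: 0-based index of the first exon base.
    exon_end:   0-based index one past the last exon base.
    A window that would run off the sequence is returned as None.
    """
    def cut(lo, hi):
        return seq[lo:hi] if lo >= 0 and hi <= len(seq) else None

    five_prime = cut(exon_start - 5, exon_start + 4)  # sites -5..-1 and +1..+4
    three_prime = cut(exon_end - 4, exon_end + 5)     # sites -4..-1 and +1..+5
    return five_prime, three_prime
```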

Least Increment of Diversity Algorithm (LIDA)
Because the increment of diversity (ID) can measure the increment of whole uncertainty (or information) between two data sources, it has been widely applied in bioinformatics, for example in protein structural class prediction [30], subcellular location prediction of apoptosis proteins [31] and secretory protein prediction [32]. To improve prediction capability, ID combined with other predictive models has been applied to exon/intron splice site prediction [33], human Pol II promoter prediction [34] and protein predictions [35,36,37,38,39,40,41,42]. For the reader's convenience, the theory of diversity is introduced as follows.
Definition 1. For a state space X{n_1, n_2, …, n_s} consisting of s information symbols, if n_i indicates the number of occurrences of the i-th state, then the diversity of the diversity source X: [n_1, n_2, …, n_s] is defined as [30]
$$D(X) = D(n_1, n_2, \ldots, n_s) = N \log N - \sum_{i=1}^{s} n_i \log n_i, \qquad N = \sum_{i=1}^{s} n_i. \tag{1}$$
It is easily proved that the diversity equals N times the information entropy [43].
Definition 2. If there are two sources of diversity in the same s-dimensional space, X: [n_1, n_2, …, n_s] and Y: [m_1, m_2, …, m_s], we may define the increment of diversity as
$$ID(X, Y) = D(X+Y) - D(X) - D(Y), \tag{2}$$
where D(X+Y) is the measure of diversity of the mixed source X+Y: [n_1+m_1, n_2+m_2, …, n_s+m_s], so that ID(X, Y) is a function of the two sources. It is easily proved that the increment of diversity [Eq. (2)] is nonnegative and symmetric. Therefore, ID(X, Y) is regarded as a quantitative measure of the similarity of two independent systems.
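The two definitions can be illustrated with a short sketch. It assumes the standard form of the diversity measure, D = N log N − Σ n_i log n_i with the convention 0 log 0 = 0, and uses natural logarithms; the logarithm base and the function names are assumptions made for illustration.

```python
import math

def diversity(counts):
    """Diversity D = N log N - sum(n_i log n_i), taking 0 log 0 as 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(n * math.log(n) for n in counts if n > 0)

def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y) for two sources of equal dimension."""
    if len(x) != len(y):
        raise ValueError("sources must have the same dimension")
    mixed = [a + b for a, b in zip(x, y)]
    return diversity(mixed) - diversity(x) - diversity(y)
```

Two sources with proportional counts, such as `[3, 1, 0, 4]` and `[6, 2, 0, 8]`, give an ID of zero up to rounding, which matches the reading of ID as a dissimilarity between sources.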

Prediction of Exon, Intron and Intergenic Sequence
A DNA sequence can be represented by a diversity source X: [S_i, N_jk, M_lk], where S_i is the absolute frequency of the i-th trinucleotide in the sequence (i = 1, 2, …, 4^3); N_jk is the absolute frequency of base k at the j-th position from the 5' boundary (j = 1, 2, …, 15); and M_lk is the absolute frequency of base k at the l-th position from the 3' boundary (l = -1, -2, …, -15). By calculating these 184 (4^3 + 15×4 + 15×4) parameters for the exons, introns and intergenic sequences in the standard sets (training sets), we obtain three standard sources of diversity X^ξ: [n_k^ξ] in the state space of 184 dimensions (here ξ = e, i, g indicates the exon, intron and intergenic sequence, respectively). Three standard measures of diversity can be obtained from equations analogous to Eq. (1), namely
$$D(X^{\xi}) = N^{\xi} \log N^{\xi} - \sum_{k=1}^{184} n_k^{\xi} \log n_k^{\xi},$$
where $N^{\xi} = \sum_{k} n_k^{\xi}$ (k = 1, 2, …, 184; ξ = e, i, g).
Suppose that X is a DNA sequence whose class is to be predicted. In the same state space, the measure of diversity of sequence X can be expressed as
$$D(X) = N \log N - \sum_{k=1}^{184} n_k \log n_k,$$
where $N = \sum_{k} n_k$ (k = 1, 2, …, 184).
The increments of diversity between the diversity source X: [n_k] and the three standard diversity sources X^ξ: [n_k^ξ] (here ξ = e, i, g) are
$$ID(X, X^{\xi}) = D(X + X^{\xi}) - D(X) - D(X^{\xi}).$$
Sequence X is predicted to belong to the class for which the corresponding increment of diversity has the minimum value, which can be formulated as
$$ID(X, X^{\xi}) = \mathrm{Min}\left( ID(X, X^{e}),\ ID(X, X^{i}),\ ID(X, X^{g}) \right), \tag{6}$$
where ξ can be e, i or g and the operator Min means taking the minimum value among those in the parentheses; the ξ in Eq. (6) then gives the class to which the predicted sequence X should belong.
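One way to assemble the 184-dimensional source for a sequence is sketched below. Several details are assumptions rather than statements from the text: trinucleotides are counted in overlapping windows, windows or positions containing symbols other than A, C, G, T are skipped, and a standard source is formed by summing the vectors of all training sequences of a class. The names `sequence_source` and `pooled_source` are hypothetical.

```python
from itertools import product

BASES = "ACGT"
TRINUCLEOTIDES = ["".join(p) for p in product(BASES, repeat=3)]
TRI_INDEX = {t: i for i, t in enumerate(TRINUCLEOTIDES)}

def sequence_source(seq, flank=15):
    """Return the 64 + 4*flank + 4*flank count vector for one sequence."""
    seq = seq.upper()
    if len(seq) < flank:
        raise ValueError("sequence shorter than the flank length")
    tri = [0] * 64
    for i in range(len(seq) - 2):          # overlapping windows (an assumption)
        idx = TRI_INDEX.get(seq[i:i + 3])
        if idx is not None:                # skip windows with N or other symbols
            tri[idx] += 1
    head = [0] * (4 * flank)               # N_jk: base k at position j from the 5' end
    tail = [0] * (4 * flank)               # M_lk: base k at position l from the 3' end
    for j in range(flank):
        k = BASES.find(seq[j])
        if k >= 0:
            head[4 * j + k] += 1
        k = BASES.find(seq[-1 - j])
        if k >= 0:
            tail[4 * j + k] += 1
    return tri + head + tail

def pooled_source(sequences, flank=15):
    """Sum per-sequence vectors of one class into a single standard source."""
    total = [0] * (64 + 8 * flank)
    for s in sequences:
        total = [a + b for a, b in zip(total, sequence_source(s, flank))]
    return total
```

With the default `flank=15` the vector has 64 + 60 + 60 = 184 entries, matching the dimension of the state space described above.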
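The minimum-ID decision rule can be sketched as follows. The diversity functions are repeated so the block runs on its own; the 4-dimensional toy sources are illustrative numbers, not data from the paper, and the natural logarithm is an assumption.

```python
import math

def diversity(counts):
    """Diversity D = N log N - sum(n_i log n_i), taking 0 log 0 as 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(n * math.log(n) for n in counts if n > 0)

def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y)."""
    mixed = [a + b for a, b in zip(x, y)]
    return diversity(mixed) - diversity(x) - diversity(y)

def classify(x, standards):
    """Return (label, ids), where label has the smallest ID with source x."""
    ids = {label: increment_of_diversity(x, src) for label, src in standards.items()}
    return min(ids, key=ids.get), ids

# Toy standard sources for exon (e), intron (i) and intergenic (g) classes.
standards = {
    "e": [50, 10, 30, 10],
    "i": [10, 50, 10, 30],
    "g": [25, 25, 25, 25],
}
label, ids = classify([9, 2, 6, 3], standards)
```

In practice the query vector and the standard sources would be the 184-dimensional count vectors rather than these toy values.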

Prediction of Three Kinds of Coding Exons
For each coding exon, the following three kinds of codon positions are investigated to select optimal parameters.
1) The three bases before the 5' boundary sites of exons (acceptor sites) and after the 3' boundary sites of exons (donor sites) are chosen as information parameters of the diversity source.
AGA GCA↑ATG G……A TGC↑GTA AGA
2) The three bases after the 5' boundary sites of exons (acceptor sites) and before the 3' boundary sites of exons (donor sites) are chosen as information parameters of the diversity source.
AGA GCA↑ATG G……A TGC↑GTA AGA
3) The six bases flanking the 5' boundary sites of exons (acceptor sites) and the 3' boundary sites of exons (donor sites) are chosen as information parameters of the diversity source.
AGA GCA↑ATG G……A TGC↑GTA AGA
(where ↑ indicates the 5' or 3' exon boundary sites)
By calculating the absolute frequencies of the four bases at the above positions near the splice sites of first coding exons, internal coding exons and last coding exons, we obtain three standard sources of diversity S^ξ: {n_ja^ξ | j = 1, 2, 3; a = A, C, G, T} in the state space of 12 dimensions (here ξ = f, i, l, corresponding to first coding exon, internal coding exon and last coding exon, respectively). Then three standard measures of diversity for the three kinds of coding exons can be calculated by Eq. (1), namely
$$D(S^{\xi}) = N^{\xi} \log N^{\xi} - \sum_{k=1}^{12} n_k^{\xi} \log n_k^{\xi},$$
where $N^{\xi} = \sum_{k} n_k^{\xi}$ (k = 1, 2, …, 12) and n_k^ξ denotes the k-th of the 12 components of S^ξ.
Suppose that S is an exon whose class is to be predicted. In the same state space, its measure of diversity can be expressed as
$$D(S) = N \log N - \sum_{k=1}^{12} n_k \log n_k, \qquad N = \sum_{k=1}^{12} n_k.$$
According to Eq. (2), the increments of diversity between source S and the three standard sets are
$$ID(S, S^{\xi}) = D(S + S^{\xi}) - D(S) - D(S^{\xi}), \qquad \xi = f, i, l.$$
Exon S is predicted to belong to the class for which the corresponding increment of diversity has the minimum value, which can be formulated as
$$ID(S, S^{\xi}) = \mathrm{Min}\left( ID(S, S^{f}),\ ID(S, S^{i}),\ ID(S, S^{l}) \right), \tag{9}$$
where ξ can be f, i or l and the operator Min means taking the minimum value among those in the parentheses; the ξ in Eq. (9) then gives the class to which the predicted coding exon S should belong.
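A position-frequency source over a 3-base window can be built as in the sketch below, which yields the 3 positions × 4 bases = 12 counts mentioned above. How the windows at the two exon boundaries are combined into one 12-dimensional source is not specified here, so the sketch handles a single window position; the function name `position_source` is hypothetical.

```python
BASES = "ACGT"

def position_source(windows, width=3):
    """Count base a at position j over equal-length windows.

    Returns a flat list of width * 4 counts ordered by position, then by
    base in the order A, C, G, T. Non-ACGT symbols are skipped.
    """
    counts = [0] * (4 * width)
    for w in windows:
        if len(w) != width:
            raise ValueError("window has the wrong length")
        for j, base in enumerate(w.upper()):
            k = BASES.find(base)
            if k >= 0:
                counts[4 * j + k] += 1
    return counts
```

A standard source would pool the windows of all training exons of one type, while the source of a single exon to be classified contains one count per position.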

Evaluating Predicted Performance of Proposed Method
In order to evaluate the correct prediction rate and reliability of a predictive method, the sensitivity (S_n), specificity (S_p) and correlation coefficient (CC) are used as performance measures.

The Prediction of Exon, Intron and Intergenic Sequence
For a given sequence class ξ, TP denotes the number of sequences correctly predicted to belong to class ξ (true positives), FP denotes the number of sequences incorrectly predicted to belong to class ξ (false positives), TN denotes the number of sequences correctly predicted not to belong to class ξ (true negatives), and FN denotes the number of sequences incorrectly predicted not to belong to class ξ (false negatives). Sensitivity shows the rate of correct prediction. Specificity shows the confidence level of the predictive method. The correlation coefficient (CC) reflects the overall performance of the prediction algorithm.
Based on Eq. (6), the three classes of sequences are predicted using the 184 information parameters. To compare the prediction quality of different information parameters, we also apply our algorithm to predict exons, introns and intergenic sequences using only the 64 trinucleotides. The results on the test sets with 64 and 184 parameters for A. thaliana (A) and C. elegans (C) are compared in Table 2. The number outside the brackets denotes the predicted result for the 1st subset. The two numbers in brackets denote the predicted results for the 2nd and 3rd subsets, respectively.
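The three measures can be computed from the four counts as sketched below. The formulas are assumptions based on conventions common in gene prediction, since the defining equations are not reproduced here: S_n = TP/(TP+FN), S_p = TP/(TP+FP), and CC taken as the Matthews correlation coefficient. Some papers instead define specificity as TN/(TN+FP).

```python
import math

def evaluate(tp, fp, tn, fn):
    """Return (Sn, Sp, CC) for one class from its four prediction counts.

    Assumed definitions: Sn = TP/(TP+FN), Sp = TP/(TP+FP), and CC as the
    Matthews correlation coefficient. Empty denominators give 0.0.
    """
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, cc
```

For a three-class problem the counts are computed one class at a time, treating the other two classes together as the negative set.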

The Prediction of Three Kinds of Coding Exons
For predicting the three types of coding exons, a total of 1000 first coding exons, 1000 internal coding exons and 1000 last coding exons are randomly selected as training sets from the gene sequences of A. thaliana and C. elegans. The remaining sequences are regarded as the test sets. To eliminate the dependence of the predictive results on the training dataset, this selection procedure is repeated 10 times. According to Eq.( 10), the three types of coding exons are predicted using the different information parameters. The results are shown in Table 3. As seen from Table 3, the first parameter selection achieves the best results among the three kinds of parameters.
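The repeated random selection of training sets can be sketched as follows. The function name `repeated_splits`, the fixed seed and the use of Python's `random.sample` are assumptions for illustration; the text only states that 1000 sequences per exon type are drawn at random and that the draw is repeated 10 times.

```python
import random

def repeated_splits(items, n_train=1000, repeats=10, seed=0):
    """Yield (train, test) lists; each repeat draws a fresh random training set."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    for _ in range(repeats):
        chosen = set(rng.sample(range(len(items)), n_train))
        train = [x for i, x in enumerate(items) if i in chosen]
        test = [x for i, x in enumerate(items) if i not in chosen]
        yield train, test
```

Each exon type would be split separately, and the performance measures averaged over the 10 repeats.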

DISCUSSION
The recognition results for exon, intron and intergenic sequences show that the S_n, S_p and CC values with 184 parameters are higher than those with 64 parameters. For A. thaliana (A) and C. elegans (C), the average correct prediction rates on the standard sets are 88.6% and 88.2%, and those on the test sets are 93.6% and 88.4%, respectively. The overall correct prediction rates are 91.1% and 88.4%, respectively.
To evaluate the performance of the proposed method, exons, introns and intergenic sequences of D. melanogaster and S. cerevisiae were predicted using the 184 parameters. Overall accuracies of 92.28% and 94.88% were achieved for D. melanogaster and S. cerevisiae, respectively. We also applied LIDA to predict coding regions and intergenic sequences of E. coli, achieving an overall accuracy of 92.88%.
Despite great progress, however, gene prediction based entirely on DNA analysis is still far from perfect. In a recent comparison of gene-prediction programs, the best algorithms in two well-annotated regions achieved sensitivities (a measure of the ability to detect true positives) and specificities (a measure of the ability to discriminate against false positives) of less than 95% and 90%, respectively, for different genomes [44,45].
In our method, three kinds of sequences (exons, introns and intergenic sequences) are predicted simultaneously. Taking random effects into account, a correct prediction rate for three kinds of sequences corresponds to only 2/3 of the same rate for two kinds of sequences (exons and introns). That is, if two types of sequences are predicted simultaneously, the random correct rate is 1/2; if three types are predicted simultaneously, it is 1/3. For example, a 90% correct prediction rate for two types of sequences is equivalent to only 60% for three types. So a given correct prediction rate in our results represents better performance than the same rate for two kinds of sequences in other methods.
The results for the three types of coding exons indicate that the sensitivity (S_n), specificity (S_p) and correlation coefficient (CC) are best, among the three selections, when the three bases before the 5' boundary sites of exons and after the 3' boundary sites of exons are used. In particular, the correlation coefficient (CC) is clearly higher for the first selection than for the second and third. This is consistent with the highly conserved sequences near the ends of introns and the conserved GT-AG rule. The three kinds of coding exons have not been studied with other methods.
In addition, from the statistical analysis of sequences in the regions near splice sites, we find preferences for certain bases. The results show that the sequence near the splice site is strongly conserved. Besides the GT-AG rule, there is a strong bias toward base G at the -4th site from the 3' end of introns in the A. thaliana genome, whereas base T is favored at the same site in the C. elegans genome. The stop codons of the two model species are biased toward TAA, and the bases GT and AT are favored at the two sites after the stop codon in the A. thaliana and C. elegans genomes, respectively. This may be a signal for stopping translation. Base A is favored at positions -4, -2 and -1 before translation start sites, and the bases G and A are respectively favored at the 4th site after translation start sites (TSS). These biases may be related to translation start signals. In addition, the base bias at the 1st site of the 5' end of internal coding exons and last coding exons differs between the A. thaliana and C. elegans genomes: base G is favored in A. thaliana and base A in C. elegans. Further statistics of the base pairs in the boundary regions of exons show that the first coding exons and internal coding exons in both genomes generally end with AG. The internal coding exons and last coding exons in the A. thaliana genome generally start with GT, whereas those in the C. elegans genome generally start with AT. This is possible additional information for splice sites. These results may be very useful for improving the correct prediction rate of splice sites.
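The scaling argument in this paragraph amounts to multiplying a rate by the ratio of the random-guess rates, (1/3)/(1/2) = 2/3. The sketch below writes this out with exact fractions; the function name `equivalent_rate` is hypothetical, and the rescaling is the heuristic stated in the text rather than a standard chance-corrected statistic.

```python
from fractions import Fraction

def equivalent_rate(rate, classes_from, classes_to):
    """Rescale a correct-prediction rate by the ratio of random-guess rates.

    The random rate is 1/classes for each problem, so the factor is
    (1/classes_to) / (1/classes_from) = classes_from / classes_to.
    """
    return rate * Fraction(classes_from, classes_to)

# 90% on a two-class problem expressed on the three-class scale.
three_class_equivalent = equivalent_rate(Fraction(90, 100), 2, 3)
```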

CONCLUSIONS
This paper proposed a novel algorithm, the increment of diversity, for gene structure prediction. The algorithm can be derived from information entropy. It is well known that mutual information describes how to extract information regarding b from source a if the conditional probability p(b|a) is known [33]. ID, however, is different from mutual information: it describes the increment of complexity between two informational sources. Our prediction results also show that ID is a promising method.

Table 1.
The length distribution of three kinds of sequences in the chromosomes of the two model species.

Table 2.
The results for test sets with 64 and 184 signals of A. thaliana and C. elegans.

Table 3.
The results of prediction for three kinds of exons in A. thaliana and C. elegans genomes.