A New Numerical Method for DNA Sequence Analysis Based on 8-Dimensional Vector Representation
1. Introduction
With the rapid growth of biological data, extracting more information from these large data sets is a challenge for scientists. An important step toward this goal is to find a suitable way to digitize DNA sequences so that sequence comparison can be applied. Because of the computational cost of traditional multiple sequence alignment (MSA), many alignment-free sequence comparison methods have been introduced; for more details, see [1] [2] [3] and the references therein.
One approach is to use a graphical representation of DNA sequences, so that sequences can be compared by defining a suitable feature of the representation. The pioneering work was done by Hamori and Ruskin [4] [5], who used the so-called H-curve representation of a DNA sequence. Following this research, many multi-dimensional representations were considered [6] - [10]. However, these representation curves may degenerate, or the mapping from DNA sequences to curves may not be one-to-one. To overcome these defects, many new curves were introduced [11] - [19], and some new clustering methods were considered [20] [21] [22]. Other representations were applied to protein sequences [23] [24] [25] [26].
In [14] [27] [28], new methods were proposed based on a probabilistic framework. In particular, in [27], obtaining the eigenvector representing the zigzag curve required calculating the maximum eigenvalue of a related matrix, which takes a long time for a very long DNA sequence. In [28], a polynomial curve of order 3 was used to fit the representation curve, but the choice of the order depended on the data sets. To improve on these methods, we characterize the representation curve by a mean and a variance. Following some observations in [27] [28], we provide a map from the space of DNA sequences to 8-dimensional Euclidean space based on a 2D graphical representation of the sequence. Using this map, we study the similarity/dissimilarity of the first exon of the beta-globin gene of eleven species and of 31 mammalian mitochondrial genomes, and obtain promising results.
The remainder of this paper is organized as follows. Section 2 presents the graphical representation of a DNA sequence and explains the procedure for the similarity analysis among sequences. Section 3 presents the similarity results for the coding sequences of the first exon of the beta-globin gene of 11 species and for 31 mammalian mitochondrial genomes. Section 4 compares our results with those in the literature and shows the effectiveness of our method.
2. Methods
Using the fact that A, T and C, G form the two base pairs, Liu [27] [28] introduced two representations of a DNA sequence in which A and T, and likewise C and G, are assigned the same probability. Following this idea, each nucleotide is assigned a vector as follows.
Here the y-coordinates of A and T have the same magnitude but opposite signs, so that the two bases can be distinguished in the curve; the same holds for C and G.
For a DNA sequence, we obtain a zigzag curve by joining all the vectors one after another. For example, the representation of the sequence ATGCCTT is given in Table 1.
The representation curve corresponding to the sequence is shown in Figure 1.
Figure 1. The curve corresponding to ATGCCTT.
Table 1. Representation of sequence ATGCCTT.
The x-coordinate of the curve is increasing, and different nucleotides have different y values, so this representation is a one-to-one map between DNA sequences and curves, without loss of information or degeneracy [11].
Based on the above assignments of the four nucleotides, Liu [27] introduced a representation of a DNA sequence based on four horizontal lines, and then defined a map from the curve to a vector in $\mathbb{R}^4$ via the maximal eigenvalue of a related symmetric matrix. In the rest of this section, we present a map from a DNA sequence to an 8D vector. For two DNA sequences, we compute the Euclidean distance between the two corresponding vectors, which can be regarded as the similarity/dissimilarity between the two sequences. Our method is examined on two data sets, ranging from small to medium size and from exons to genomes.
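The construction of the zigzag curve can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary `STEPS` and the function `zigzag_curve` are hypothetical names, the vectors are assumed to have the form (1, y) and to be joined head to tail, and only the magnitude 0.2 for A/T is suggested by the text, while 0.3 for C/G is a placeholder.

```python
from itertools import accumulate

# Hypothetical y-steps. Only the magnitude 0.2 for A/T is hinted at in the text;
# the value 0.3 for C/G is a placeholder chosen for illustration.
STEPS = {"A": 0.2, "T": -0.2, "C": 0.3, "G": -0.3}


def zigzag_curve(seq, steps=STEPS):
    """Return the points (x_i, y_i) obtained by joining the vectors (1, step) head to tail."""
    ys = accumulate(steps[base] for base in seq.upper())
    # x_i is simply the position i of the nucleotide, so x is strictly increasing.
    return [(i, y) for i, y in enumerate(ys, start=1)]


points = zigzag_curve("ATGCCTT")
for x, y in points:
    print(x, round(y, 2))
```

Because each step has a distinct value, the sequence can be read back from the differences between consecutive y values, which is the one-to-one property described above.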
Given a DNA sequence of length n, we obtain a zigzag curve from the assignment of vectors to bases given above. Let $(x_i, y_i)$ be the coordinates corresponding to the i-th nucleotide of the sequence, and let
$$k_i = \frac{y_i}{x_i}, \qquad i = 1, \dots, n,$$
be the slope of the line joining the origin to the point $(x_i, y_i)$. The mean and the variance of the slopes are then, respectively,
$$\mu = \frac{1}{n}\sum_{i=1}^{n} k_i, \qquad \sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (k_i - \mu)^2, \qquad (1)$$
which give a vector
$$v = (\mu, \sigma^2).$$
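As a rough sketch of the step just described, the slopes and their mean and variance can be computed directly from the curve points. The function name `slope_stats` is hypothetical, and the variance is taken here with the 1/n (population) normalization, which is an assumption rather than something the text specifies.

```python
def slope_stats(points):
    """Return (mean, variance) of the slopes k_i = y_i / x_i for points (x_i, y_i).

    Assumes a non-empty list with x_i > 0, and uses the 1/n (population)
    variance; the normalization is an assumption of this sketch.
    """
    slopes = [y / x for x, y in points]
    n = len(slopes)
    mean = sum(slopes) / n
    variance = sum((k - mean) ** 2 for k in slopes) / n
    return mean, variance


# Two points with slopes 1 and 2: the mean is 1.5 and the variance is 0.25.
print(slope_stats([(1, 1), (2, 4)]))
```

Both quantities need only a single pass over the slopes, with no eigenvalue computation.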
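On the other hand, similarly to [27] [29], we could also assign A to −0.2 and T to 0.2 to obtain another curve, and likewise for the bases C and G, so that there are four curves for a fixed DNA sequence. Since every curve yields a vector consisting of the mean and the variance of its slopes, we obtain four vectors $v_j = (\mu_j, \sigma_j^2)$, $j = 1, 2, 3, 4$. Putting them together, we finally obtain an 8D vector $V$ for a DNA sequence, defined by
$$V = (\mu_1, \sigma_1^2, \mu_2, \sigma_2^2, \mu_3, \sigma_3^2, \mu_4, \sigma_4^2). \qquad (2)$$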
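One possible reading of this construction is sketched below. It assumes that the four curves arise from the two sign choices for the A/T pair combined with the two sign choices for the C/G pair, that the vectors are (1, y) joined head to tail, and that the variance uses 1/n. The name `feature_vector`, the magnitude 0.3 for C/G and the ordering of the eight components are assumptions, not details confirmed by the text.

```python
from itertools import accumulate, product


def feature_vector(seq, a_t=0.2, c_g=0.3):
    """Concatenate (mean, variance) of the slopes over four sign variants of the curve.

    a_t and c_g are the step magnitudes for the A/T and C/G pairs; 0.3 is a
    placeholder. The sequence is assumed to be non-empty and to contain only A, T, C, G.
    """
    vector = []
    # Sign variants in the order (+,+), (+,-), (-,+), (-,-) for (A/T, C/G).
    for s_at, s_cg in product((1, -1), repeat=2):
        steps = {"A": s_at * a_t, "T": -s_at * a_t,
                 "C": s_cg * c_g, "G": -s_cg * c_g}
        ys = accumulate(steps[base] for base in seq.upper())
        slopes = [y / x for x, y in enumerate(ys, start=1)]
        mean = sum(slopes) / len(slopes)
        variance = sum((k - mean) ** 2 for k in slopes) / len(slopes)
        vector += [mean, variance]
    return vector


print(feature_vector("ATGCCTT"))
```

Under this particular reading, flipping both signs negates every slope, so two of the four curves mirror the other two. This is a property of the assumed assignments only, not a statement about the original construction.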
So far, given a DNA sequence, we can obtain an 8D vector. That is, we have a novel DNA map from the space of DNA sequences to 8-dimensional Euclidean space. Note that the term “DNA map” is used here slightly differently from [30], where the map is from a DNA sequence to its representation zigzag curve.
Once the feature vector is determined, two sequences can be compared. Given two DNA sequences, we obtain the two corresponding 8D vectors $U = (u_1, \dots, u_8)$ and $W = (w_1, \dots, w_8)$. The Euclidean distance d between them can then be regarded as a similarity/dissimilarity measure of the two sequences, where
$$d = \sqrt{\sum_{j=1}^{8} (u_j - w_j)^2}.$$
If two DNA sequences are the same, then d is equal to zero. More generally, the smaller the value of d, the more similar the two DNA sequences are considered to be.
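The distance itself is the ordinary Euclidean distance between two feature vectors. A minimal sketch follows, in which the function name `distance` is hypothetical and the vectors are plain lists of eight numbers.

```python
import math


def distance(u, w):
    """Euclidean distance between two feature vectors of equal length."""
    if len(u) != len(w):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, w)))


# Identical vectors are at distance zero.
same = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
print(distance(same, same))
```

Note that a distance of zero only says that the two feature vectors coincide. It does not by itself prove that the two sequences are identical.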
3. Results
In this section, we study the similarities among the coding sequences of the first exon of the beta-globin gene of 11 species, and among 31 mammalian mitochondrial genomes, using the similarity/dissimilarity measure d.
We first consider the beta-globin gene sequences, whose information, taken from GenBank, is listed in Table 2; it updates the information of Table 3 in [1]. The results are shown in Table 3. The table shows that the values of d for Human-Gorilla, Goat-Bovine and Gorilla-Chimpanzee are relatively small, which indicates that these pairs are relatively close. To examine whether our method is effective, we compare our results with those of others. We therefore list some highly cited similarity results between human and other species in Table 4. Following the idea in [27] [28] [31], for convenience, we also use the index normalized to the Human-Goat ratio. In Table 4, most results show that the normalized values of Human-Gorilla and Human-Chimpanzee are small, which is consistent with ours.
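The normalization used for Table 4 can be illustrated as follows. It is read here as dividing every Human-X value by the Human-Goat value, so that Human-Goat becomes 1; this reading is an assumption. The function name `normalize_to_reference` is hypothetical, and the numbers are toy values that are not taken from Table 3 or Table 4.

```python
def normalize_to_reference(human_distances, reference="Goat"):
    """Divide every Human-X distance by the Human-<reference> distance."""
    ref = human_distances[reference]
    if ref == 0:
        raise ValueError("the reference distance must be non-zero")
    return {species: d / ref for species, d in human_distances.items()}


# Toy numbers for illustration only; they are NOT results from the paper.
toy = {"X": 0.5, "Y": 1.5, "Goat": 2.0}
normalized = normalize_to_reference(toy)
print(normalized)
```

After this step, indexes produced by different methods share a common scale and can be compared side by side.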
We now analyze the 31 mammalian mitochondrial genomes and construct a phylogenetic tree. The GenBank information for these genomes can be found in [32], and the results obtained with UPGMA are shown in Figure 2. In this figure, the groups Primates, Perissodactyla and Rodentia include the same species as in Figure 3 in [33] and Figure 2 in [32], while Sheep-Goat, Dog-Wolf, Brown Bear-Polar Bear, and Tiger, Cat and Leopard are also closely grouped. Our results are also consistent with those in [28], which considered 11 of these species.
Figure 2. The phylogenetic tree of 31 mammalian mitochondrial genomes with UPGMA.
Table 2. The coding sequences of the first exon of beta-globin gene of eleven species.
Table 3. The similarity results (×10⁻²) for the coding sequences of the first exon of beta-globin gene of 11 species.
Table 4. The similarity indexes between human and other species. All indexes are normalized to Human-Goat ratio.
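For readers unfamiliar with UPGMA, the clustering step can be sketched in pure Python as below. This is a minimal illustration and not the code used to produce Figure 2: the function name `upgma` and the toy labels are hypothetical, the input is any symmetric matrix of pairwise distances d, and the sketch returns only the tree topology as nested tuples, without branch lengths.

```python
from itertools import combinations


def upgma(names, matrix):
    """Cluster labels by UPGMA (average linkage) and return a nested-tuple tree.

    names  : list of labels (assumed non-empty)
    matrix : symmetric list of lists, matrix[i][j] = distance between names[i] and names[j]
    """
    # Active clusters: id -> (subtree, number of leaves).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Pairwise distances between active clusters, keyed by (smaller id, larger id).
    dist = {(i, j): matrix[i][j] for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of active clusters
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        # Size-weighted average distance from the merged cluster to every other cluster.
        new_dist = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            new_dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        # Drop every entry that involves one of the two merged clusters.
        dist = {pair: d for pair, d in dist.items() if i not in pair and j not in pair}
        dist.update(new_dist)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Toy distance matrix for three hypothetical labels (not data from the paper).
print(upgma(["a", "b", "c"], [[0, 1, 4], [1, 0, 5], [4, 5, 0]]))
```

In practice a library routine would normally be preferred. For example, average linkage in SciPy's hierarchical clustering module is the same algorithm.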
4. Discussions
Our method provides a map from the space of DNA sequences to 8-dimensional Euclidean space. We focus on the slope of the line joining the origin and the representation point of each nucleotide, which reflects the rate of change of the y-coordinate.
Unlike other probabilistic methods [14] [35], which regard the sequence as a sample space, we read a DNA sequence as a random outcome. Compared with the method in [27], our method relies only on the mean and variance of the slopes of the corresponding lines, not on eigenvalues. These are simpler statistics and save computing time.
As applications, we study the similarities among the beta-globin genes of eleven species and among 31 mammalian mitochondrial genomes. In Table 4, Human-Gorilla is the most similar pair, which is supported by all the results. Besides this, our method and that in [29] show that Human-Chimpanzee is the next most similar pair, which is consistent with many existing results, whereas the results in [12] [34] indicate that Human-Rabbit and Human-Rat are closer than Human-Chimpanzee. In addition, Figure 2 covers the corresponding results in [28]. This reflects the usefulness of our method.
In this work, we provide an alternative map from a DNA sequence to a vector in $\mathbb{R}^8$ based on two basic statistical quantities. The idea of our method can also be applied to the analysis of protein sequences. Although the zigzag curve representation of a DNA sequence is one-to-one, the map from curves to $\mathbb{R}^8$ is not; that is, two different DNA sequences may have the same feature vector. In future research, we will try to extend our method to more biological data, for example by finding more suitable vectors that retain more information about the DNA sequence.