Background: Multiple sequence alignment (MSA) algorithms are the traditional way to compare and analyze DNA sequences. However, for long DNA sequences these algorithms are computationally expensive. Objective: We propose a new numerical method to characterize and compare DNA sequences quickly. Method: Based on a new 2-dimensional (2D) graphical representation of DNA sequences, we obtain an 8-dimensional vector using two basic concepts from probability, the mean and the variance. Results: We perform similarity/dissimilarity analyses on two real DNA data sets: the coding sequences of the first exon of the beta-globin gene of 11 species, and 31 mammalian mitochondrial genomes. Conclusion: Our results agree with existing analyses in the literature. We also compare our approach with other methods and find that ours is more effective.

With the rapid growth of biological data, extracting more information from these data is a challenge for scientists. An important step is to find a suitable way to digitize DNA sequences so that they can be compared. Because traditional multiple sequence alignment (MSA) is computationally expensive, many alignment-free sequence comparison methods have been introduced; for more details, please refer to [

To achieve this, one way is to use the graphical representation of DNA sequences so that the sequences can be compared by defining a suitable feature. The pioneering works were introduced by Hamori and Ruskin [

In [

The remainder of this paper is organized as follows. Section 2 presents the graphical representation of DNA sequences and explains the procedure for similarity analysis among them. Section 3 presents the similarity results for the coding sequences of the first exon of the beta-globin gene of 11 species and for 31 mammalian mitochondrial genomes. Section 4 compares our results with those in the literature and shows the effectiveness of our method.

Utilizing the fact that A, T and C, G are two base pairs, Liu [

$$(1, 0.2) \to A, \quad (1, -0.2) \to T, \quad (1, 0.3) \to C, \quad (1, -0.3) \to G.$$

Here the y-coordinates of A and T have the same magnitude but opposite signs, so that the two bases can be distinguished in the curve; the same holds for C and G.

For a DNA sequence, we obtain a zigzag curve by joining all the vectors one after another. For example, the representation of the sequence ATGCCTT can be read as follows (

The representation curve corresponding to the sequence is shown in

Nucleotide | x-coordinate | y-coordinate |
---|---|---|
A | 1 | 0.2 |
T | 2 | 0 |
G | 3 | −0.3 |
C | 4 | 0 |
C | 5 | 0.3 |
T | 6 | 0.1 |
T | 7 | −0.1 |
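The construction of the curve can be sketched in a few lines of Python. This is a minimal illustration under the base-to-vector assignment stated above, not the author's implementation; the names `Y_STEP` and `zigzag_points` are hypothetical, and characters other than A, T, C, G are assumed not to occur.

```python
from itertools import accumulate

# Vector assignment from the text: every base advances x by 1
# and shifts y by the amount below.
Y_STEP = {"A": 0.2, "T": -0.2, "C": 0.3, "G": -0.3}


def zigzag_points(sequence):
    """Return the (x, y) points of the zigzag curve for a DNA sequence.

    Hypothetical helper: raises KeyError on symbols other than A, T, C, G.
    """
    # Running sum of the y-steps gives the y-coordinate at each position.
    ys = accumulate(Y_STEP[base] for base in sequence.upper())
    # x is the 1-based position; y is rounded only to hide float noise.
    return [(i, round(y, 10)) for i, y in enumerate(ys, start=1)]


points = zigzag_points("ATGCCTT")
for x, y in points:
    print(x, y)
```

Because each y-coordinate is a running sum of the steps, each point depends only on the previous point and the current base, so the curve can be built in a single pass over the sequence.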

The coordinate x of the curve is increasing, and different nucleotides have different y values, so this representation is a one-to-one map between the DNA sequences and the curves, without loss of information and degeneracy [

Based on the assignments of the four nucleotides above, Liu [^{4} by the maximal eigenvalue of a related symmetric matrix. In the rest of this section, we present a map from a DNA sequence to an 8D vector. For two DNA sequences, we compute the Euclidean distance between the two corresponding vectors, which serves as a similarity/dissimilarity measure between the sequences. We test our method on two data sets that range from small to medium size and from exons to whole genomes.

Given a DNA sequence of length n, we obtain a zigzag curve from the assignment of bases to vectors given above. Let (x_{i}, y_{i}) be the coordinates corresponding to the i-th nucleotide of the sequence, and let z_{i} = y_{i}/i be the slope of the line joining the origin to the point (x_{i}, y_{i}) (note that x_{i} = i). The mean and the variance of the slopes are then, respectively,

$$m_z = \frac{1}{n}\sum_{i=1}^{n} z_i, \qquad v_z = \frac{1}{n}\sum_{i=1}^{n} (z_i - m_z)^2, \tag{1}$$

which gives a vector K = (m_{z}, v_{z}).
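As a rough sketch of Equation (1), the following Python function computes the slopes z_{i} = y_{i}/i and their mean and population variance for a sequence. The names `Y_STEP` and `slope_mean_variance` are hypothetical, and the code assumes the sequence is non-empty and contains only A, T, C and G.

```python
# Vector assignment from the text (y-step contributed by each base).
Y_STEP = {"A": 0.2, "T": -0.2, "C": 0.3, "G": -0.3}


def slope_mean_variance(sequence):
    """Return K = (m_z, v_z), the mean and variance of the slopes z_i = y_i / i."""
    if not sequence:
        raise ValueError("sequence must be non-empty")
    y = 0.0
    slopes = []
    for i, base in enumerate(sequence.upper(), start=1):
        y += Y_STEP[base]      # y-coordinate of the i-th point
        slopes.append(y / i)   # slope from the origin to (i, y_i)
    n = len(slopes)
    m_z = sum(slopes) / n
    # Variance with the 1/n normalisation used in Equation (1).
    v_z = sum((z - m_z) ** 2 for z in slopes) / n
    return m_z, v_z


m_z, v_z = slope_mean_variance("AT")
```

For the two-letter sequence AT the points are (1, 0.2) and (2, 0), so the slopes are 0.2 and 0, giving m_{z} = 0.1 and v_{z} = 0.01.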

On the other hand, similar to [

$$E = (V_1, V_2, V_3, V_4). \tag{2}$$

Thus, given a DNA sequence, we obtain an 8D vector. That is, we have constructed a new DNA map from the space of DNA sequences to 8-dimensional Euclidean space. Please note that our use of the term “DNA map” differs slightly from that in [

Once the feature vector is determined, one can compare two sequences. Given two DNA sequences, we obtain the two corresponding vectors E_{1} and E_{2}. The distance d between them can then be regarded as a similarity/dissimilarity measure of the two sequences, where

$$d = \lVert E_1 - E_2 \rVert.$$

If two DNA sequences are identical, then d is equal to zero. More generally, the smaller the value of d, the more similar the two DNA sequences are taken to be.
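The distance d can be sketched in Python as follows. The feature vectors are taken as given inputs here, and the sample values below are made up purely for illustration; `dissimilarity` is a hypothetical name. The sketch relies on `math.dist` from the standard library (Python 3.8 or later), which returns the Euclidean distance between two points.

```python
import math


def dissimilarity(e1, e2):
    """Euclidean distance d = ||E1 - E2|| between two feature vectors."""
    if len(e1) != len(e2):
        raise ValueError("feature vectors must have the same dimension")
    return math.dist(e1, e2)


# Made-up 8D vectors, for illustration only (not taken from the paper).
e_a = (0.1, 0.01, 0.2, 0.02, 0.3, 0.03, 0.4, 0.04)
e_b = (0.1, 0.01, 0.2, 0.02, 0.3, 0.03, 0.4, 0.04)

d_same = dissimilarity(e_a, e_b)          # identical vectors
d_345 = dissimilarity((3.0, 0.0), (0.0, 4.0))  # classic 3-4-5 check
```

Identical vectors give a distance of zero, matching the remark above, and the function works for any dimension, so the same call applies to the 8D vectors E.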

In this section, we study the similarities among the coding sequences of the first exon of beta-globin gene of 11 species and 31 mammalian mitochondrial genomes through the similarity/dissimilarity measure d.

Let us first consider the sequences of beta-globin gene, whose information is listed in

Now we want to analyze 31 mammalian mitochondrial genomes and construct a phylogenetic tree. The GenBank information of these genomes can be found in [

Species | Coding sequence |
---|---|
Human | ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAG |
Goat | ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGGCTTCTGGGGCAAGGTGAAAGTGGATGAAGTTGGTGCTGAGGCCCTGGGCAG |
Opossum | ATGGTGCACTTGACTTCTGAGGAGAAGAACTGCATCACTACCATCTGGTCTAAGGTGCAGGTTGACCAGACTGGTGGTGAGGCCCTTGGCAG |
Gallus | ATGGTGCACTGGACTGCTGAGGAGAAGCAGCTCATCACCGGCCTCTGGGGCAAGGTCAATGTGGCCGAATGTGGGGCCGAAGCCCTGGCCAG |
Lemur | ATGACTTTGCTGAGTGCTGAGGAGAATGCTCATGTCACCTCTCTGTGGGGCAAGGTGGATGTAGAGAAAGTTGGTGGCGAGGCCTTGGGCAG |
Mouse | ATGGTGCACCTGACTGATGCTGAGAAGTCTGCTGTCTCTTGCCTGTGGGCAAAGGTGAACCCCGATGAAGTTGGTGGTGAGGCCCTGGGCAGG |
Rabbit | ATGGTGCATCTGTCCAGTGAGGAGAAGTCTGCGGTCACTGCCCTGTGGGGCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGGCAG |
Rat | ATGGTGCACCTAACTGATGCTGAGAAGGCTACTGTTAGTGGCCTGTGGGGAAAGGTGAACCCTGATAATGTTGGCGCTGAGGCCCTGGGCAG |
Gorilla | ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGG |
Bovine | ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGCCTTTTGGGGCAAGGTGAAAGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAG |
Chimpanzee | ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGG |

Species | Human | Goat | Opossum | Gallus | Lemur | Mouse | Rabbit | Rat | Gorilla | Bovine | Chimpanzee |
---|---|---|---|---|---|---|---|---|---|---|---|
Human | 0 | 3.253 | 0.941 | 3.232 | 3.183 | 1.393 | 4.836 | 2.137 | 0.059 | 2.713 | 0.698 |
Goat | | 0 | 3.674 | 1.245 | 2.936 | 3.755 | 2.649 | 1.189 | 3.200 | 0.574 | 2.689 |
Opossum | | | 0 | 3.324 | 4.094 | 2.258 | 5.593 | 2.489 | 0.983 | 3.202 | 1.550 |
Gallus | | | | 0 | 3.965 | 4.144 | 3.882 | 1.254 | 3.193 | 1.423 | 2.882 |
Lemur | | | | | 0 | 2.427 | 2.288 | 2.920 | 3.131 | 2.552 | 2.549 |
Mouse | | | | | | 0 | 4.529 | 2.902 | 1.378 | 3.183 | 1.301 |
Rabbit | | | | | | | 0 | 3.474 | 4.777 | 2.727 | 4.138 |
Rat | | | | | | | | 0 | 2.089 | 0.782 | 1.675 |
Gorilla | | | | | | | | | 0 | 2.659 | 0.640 |
Bovine | | | | | | | | | | 0 | 2.126 |
Chimpanzee | | | | | | | | | | | 0 |

Methods | Goat | Opossum | Gallus | Lemur | Mouse | Rabbit | Rat | Gorilla | Bovine | Chimpanzee |
---|---|---|---|---|---|---|---|---|---|---|
Our work | 1 | 0.29 | 0.99 | 0.98 | 0.43 | 1.49 | 0.66 | 0.02 | 0.83 | 0.21 |
Chi & Ding [ | 1 | 3.71 | 0.82 | 2.73 | 0.69 | 0.50 | 0.48 | 0.07 | 3.59 | 0.58 |
Randic et al. [ | 1 | 2.43 | 1.79 | 1.43 | 1.37 | 0.69 | 0.70 | 0.34 | 1.38 | 0.28 |
Zhang [ | 1 | 2.49 | 2.42 | 1.05 | 0.93 | 1.12 | 1.11 | 0.55 | 0.76 | 2.01 |
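The “Our work” row appears to be the Human row of the distance table divided by the Human–Goat distance, so that Goat becomes the unit of comparison. This is an inference from the numbers, not a procedure stated in the text. The short Python check below reproduces the row under that assumption; the variable names are hypothetical.

```python
# Distances from Human to the other ten species, copied from the
# Human row of the distance table.
human_row = {
    "Goat": 3.253, "Opossum": 0.941, "Gallus": 3.232, "Lemur": 3.183,
    "Mouse": 1.393, "Rabbit": 4.836, "Rat": 2.137, "Gorilla": 0.059,
    "Bovine": 2.713, "Chimpanzee": 0.698,
}

# Assumed normalisation: divide by the Human-Goat distance.
reference = human_row["Goat"]
normalized = {
    species: round(distance / reference, 2)
    for species, distance in human_row.items()
}

for species, value in normalized.items():
    print(f"{species}: {value}")
```

Normalising each method's distances by the same reference pair puts rows produced on different scales into comparable units, which is presumably why every row of the table starts with 1 under Goat.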

also closing similar. Our results are also consistent with that in [

Our method provides a map from the space of DNA sequences to 8-dimensional Euclidean space. We focus on the slope of the line joining the origin to the point representing each nucleotide, which reflects the rate of change of the y-coordinate.

Different from other probabilistic methods [

As applications, we study the similarities among the beta-globin genes of 11 species and among 31 mammalian mitochondrial genomes. In

In this work, we provide an alternative map from a DNA sequence to a vector in R^{8} based on two basic statistical quantities. The idea of our method can also be applied to the analysis of protein sequences. Although the zigzag curve representation of a DNA sequence is one-to-one, the map from curves to R^{8} is not; that is, two different DNA sequences may have the same feature vector. In future research, we will try to extend our method to more biological data, for example by finding more suitable vectors that retain more information about the DNA sequence.

The author declares no conflicts of interest regarding the publication of this paper.

Zhang, D.D. (2019) A New Numerical Method for DNA Sequence Analysis Based on 8-Dimensional Vector Representation. Journal of Applied Mathematics and Physics, 7, 2941-2949. https://doi.org/10.4236/jamp.2019.712204