A New Method to Digitize DNA Sequence

The global description uses composition, transition and distribution to describe an amino acid sequence and has been widely adopted in various fields. Here we integrate it with properties of nucleic acids to form a new method for digitizing DNA sequences. With this method, a DNA sequence can be represented by a 39-dimensional vector. We use exon 1 of the β-globin genes of eight species to verify the method and to compare it with other methods. The results agree with those of other methods, which supports the validity of the approach. The method provides a new strategy for digitizing a DNA sequence and generating its descriptor vector. Unlike other methods, it always produces a 39-dimensional vector, regardless of the length of the DNA sequence.


Introduction
As sequencing is widely used in research, an enormous amount of DNA sequence data is being produced. The 1000 Genomes Project [1] has reported its completion, having reconstructed the genomes of 2504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. Mathematical methods that analyze these data rapidly and accurately are urgently needed.
To generate a mathematical descriptor, the most important task is deciding how to characterize a DNA sequence. Many methods have been proposed for this purpose in earlier studies, and most of them adopt a graphical representation [2]. The idea behind graphical representations is that the character of a sequence can be grasped quickly from the corresponding plot. On this basis researchers developed various techniques for plotting DNA sequences, including 2-D, 3-D, 4-D and other graphical representations, of which the 2-D form is the most widely adopted. Gates [3] used the four cardinal directions to represent A, T, C and G, so that a DNA sequence can be plotted as a series of points on a graph. In Nandy's method [4], an adenine moves the graph one step in the negative x-direction, while guanine, cytosine and thymine are represented by one step in the positive x-direction, positive y-direction and negative y-direction, respectively. The Gates method [3] and the Leong and Morgenthaler method [5] prescribed GTCA and CTAG, respectively, reading clockwise. The 2-D method has been used in various studies by Nandy [4] [6] [7], Raychaudhury and Nandy [4], Nandy and Basak [8], Wu, Liew, Yan, and Yang [9], Yao, Nan and Wang [10] and Ghosh, Roy, Adhya and Nandy [11]. However, almost all of these methods produce overlapping paths, which causes degeneracy.
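The walk described for Nandy's method can be sketched in a few lines of Python. This is a minimal illustration rather than any published implementation: the function name `nandy_walk` and the choice to start at the origin are assumptions made here. It also shows the degeneracy mentioned above, since an A followed by a G retraces the same edge and different sequences can therefore cover the same set of points.

```python
# Step directions as described in the text for Nandy's method [4]:
# A -> negative x, G -> positive x, C -> positive y, T -> negative y.
NANDY_STEPS = {"A": (-1, 0), "G": (1, 0), "C": (0, 1), "T": (0, -1)}


def nandy_walk(seq):
    """Return the list of (x, y) points visited.

    Starting at (0, 0) is an assumption made for this sketch.
    """
    x, y = 0, 0
    path = [(x, y)]
    for base in seq.upper():
        dx, dy = NANDY_STEPS[base]
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(nandy_walk("ACGT"))
    # Degeneracy: "AG" and "AGAG" cover exactly the same points on the plot.
    print(set(nandy_walk("AG")) == set(nandy_walk("AGAG")))
```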
To avoid or reduce this degeneracy, researchers proposed modified methods. He and Wang [12] introduced a method without degeneracy, in which the four bases are divided into groups by purine-pyrimidine, amino-keto, and weak H-bond-strong H-bond. They proved that a DNA primary sequence is uniquely determined by any pair of its three characteristic sequences.
The global description of amino acid sequences [13] has been widely used in various fields. It uses three descriptors, composition, transition and distribution, which are derived from physicochemical properties, to generate the sequence descriptors.
In this work we integrate the characteristic sequences with the global description. First, we characterize the DNA primary sequence by three characteristic sequences. Then, following the global description method, a 39-dimensional descriptor vector is produced from the three characteristic sequences. Finally, we verify the method by comparing it with other methods.

The Method to Generate the Characteristic Sequences
He and Wang [12] developed a method that produces characteristic sequences of a DNA primary sequence based on different properties. The four bases can be divided into groups according to their chemical structures. He and Wang proposed three such properties: purine versus pyrimidine, amino versus keto, and weak versus strong H-bond. First, the bases can be divided into purines R = {A, G} and pyrimidines Y = {C, T}. Second, considering the amino and keto groups, the four bases are divided into the amino group M = {A, C} and the keto group K = {G, T}. Finally, the strength of the hydrogen bond can also serve as a classification criterion, giving weak H-bonds W = {A, T} and strong H-bonds S = {G, C}.
To make sequences easier to compare, 0 is used to replace R = {A, G} and 1 to replace Y = {C, T}. The other two classifications are treated in the same way, so that the remaining characteristic sequences are also represented by 0 and 1. Table 1 lists all the characteristic sequences in terms of 0 and 1. It has been proved that the three characteristic sequences contain all the information of the primary DNA sequence [12].
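The mapping to characteristic sequences can be illustrated with a short Python sketch. The text fixes R as 0 and Y as 1; coding M and W as 0 and K and S as 1 is an assumption made here, as are the function names and the arbitrary input string. The second function illustrates the uniqueness property for one pair of characteristic sequences (R/Y together with M/K).

```python
# Base classes from the text: purine/pyrimidine, amino/keto, weak/strong H-bond.
# The text states R -> 0 and Y -> 1; coding M, W as 0 and K, S as 1 is an
# assumption made here for illustration.
CLASSES = {
    "RY": {"A": 0, "G": 0, "C": 1, "T": 1},
    "MK": {"A": 0, "C": 0, "G": 1, "T": 1},
    "WS": {"A": 0, "T": 0, "G": 1, "C": 1},
}


def characteristic_sequences(dna):
    """Map a DNA string to its three binary characteristic sequences."""
    dna = dna.upper()
    return {name: [code[b] for b in dna] for name, code in CLASSES.items()}


def recover_from_pair(ry, mk):
    """Rebuild the DNA string from the (R/Y, M/K) pair of sequences."""
    lookup = {(0, 0): "A", (0, 1): "G", (1, 0): "C", (1, 1): "T"}
    return "".join(lookup[pair] for pair in zip(ry, mk))


if __name__ == "__main__":
    seqs = characteristic_sequences("ATGGTGCAC")  # arbitrary toy input
    for name, bits in seqs.items():
        print(name, "".join(map(str, bits)))
    print(recover_from_pair(seqs["RY"], seqs["MK"]))
```

Each base corresponds to a distinct pair of bits, which is why two of the three sequences are enough to rebuild the original string in this sketch.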

The Global Descriptors for DNA Sequence
The global descriptor was first proposed for amino acid sequences [13]. It uses three descriptors, composition, transition and distribution, to describe the global composition, the frequency of property changes and the distribution pattern of a property along a given amino acid sequence, respectively. The composition (C) is the number of amino acids of a particular property divided by the total number of amino acids in a protein sequence. In this work, $C = (n_0/N, n_1/N)$, where $n_0$, $n_1$ and $N$ are the number of 0s, the number of 1s and the total length of the characteristic sequence. Transition (T) characterizes the frequency with which a property is followed by a different property: $T = (t_{01,10})$, where $t_{01,10}$ represents the transitions from 0 to 1 and from 1 to 0.
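As a small worked example (not taken from the paper), consider the toy characteristic sequence 00101101 with $N = 8$, $n_0 = 4$ and $n_1 = 4$. It contains five changes of symbol among its seven adjacent pairs. Dividing the number of changes by $N - 1$ is an assumption made here, following the usual convention for the global descriptor, and the symbol $n_{01,10}$ for the raw count is introduced only for this illustration.

```latex
\begin{align*}
C &= \left(\frac{n_0}{N},\ \frac{n_1}{N}\right)
   = \left(\frac{4}{8},\ \frac{4}{8}\right) = (0.5,\ 0.5), \\
t_{01,10} &= \frac{n_{01,10}}{N-1} = \frac{5}{7} \approx 0.714 .
\end{align*}
```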
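Distribution (D) gives the positions of the first, 25%, 50%, 75% and 100% of the occurrences of each property in a characteristic sequence, that is, $(P_0, P_{25}, P_{50}, P_{75}, P_{100})$ for each of the two symbols, where $P_0$ and $P_{100}$ are the locations of the first and last occurrence of the property in the characteristic sequence. There are 13 elements in all: 2 for C, 1 for T and 10 for D. Table 2 lists the corresponding descriptors obtained by the above method.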
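The 13 descriptors of one characteristic sequence could be computed along the following lines. This is a minimal sketch, not the authors' implementation. The function name `global_descriptors` is hypothetical, and several details that the text does not fix are assumptions: the transition count is divided by N - 1, the p-percent point is taken as the occurrence with rank ceil(p × count), positions are 1-based and divided by the sequence length, and an absent symbol yields zeros.

```python
import math


def global_descriptors(bits):
    """Return the 13 descriptors [C0, C1, T, D for symbol 0, D for symbol 1].

    Assumptions (not fixed by the text): T is divided by N - 1, and each
    distribution value is a 1-based position divided by N.
    """
    n = len(bits)
    if n == 0:
        raise ValueError("empty characteristic sequence")

    # Composition: fraction of 0s and fraction of 1s.
    comp = [bits.count(0) / n, bits.count(1) / n]

    # Transition: fraction of adjacent pairs whose symbols differ.
    changes = sum(1 for a, b in zip(bits, bits[1:]) if a != b)
    trans = changes / (n - 1) if n > 1 else 0.0

    # Distribution: where the first, 25%, 50%, 75% and last occurrence sit.
    dist = []
    for symbol in (0, 1):
        positions = [i + 1 for i, b in enumerate(bits) if b == symbol]
        if not positions:
            dist.extend([0.0] * 5)  # symbol absent: zeros (assumed convention)
            continue
        count = len(positions)
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            rank = max(1, math.ceil(frac * count))  # 1-based occurrence rank
            dist.append(positions[rank - 1] / n)
    return comp + [trans] + dist


if __name__ == "__main__":
    print(global_descriptors([0, 0, 1, 0, 1, 1, 0, 1]))
```

Applying the function to each of the three characteristic sequences and concatenating the results gives 3 × 13 = 39 values, whatever the length of the input sequence.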

Method Validation
In this section, a 39-dimensional vector consisting of all the descriptors of the three characteristic sequences is constructed for exon 1 of the β-globin gene of each of the eight species in Table 3. Using these vectors, we analyze the relationships among the eight sequences. If two sequences are similar, their vectors will be close to each other in the 39-dimensional space.
The Euclidean distance is used to measure the similarity between sequences. For two descriptor vectors $x$ and $y$, $d(x, y) = \sqrt{\sum_{i=1}^{39} (x_i - y_i)^2}$.
The smaller the value, the more similar the two sequences. All the values are calculated and listed in Table 4. As the values in Table 4 show, gallus is dissimilar to all the other species, while human-rabbit, lemur-rabbit, human-mouse, mouse-rat and mouse-rabbit are more similar to each other. Similar results have been reported in earlier works using different methods [12] [14] [15].
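The pairwise comparison described above can be sketched as follows. The function names are hypothetical, and the vectors in the example are toy 3-dimensional values chosen for illustration, not the 39-dimensional descriptors or the distances of Table 4.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equal-length descriptor vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_distances(vectors):
    """Return {(name_a, name_b): distance} for every unordered pair."""
    return {
        (a, b): euclidean(vectors[a], vectors[b])
        for a, b in combinations(sorted(vectors), 2)
    }


if __name__ == "__main__":
    # Toy 3-dimensional vectors, NOT values from the paper; real input would
    # be the 39-dimensional descriptor vectors of the compared sequences.
    toy = {"s1": [0.0, 0.0, 0.0], "s2": [3.0, 4.0, 0.0], "s3": [0.0, 0.0, 1.0]}
    ranked = sorted(pairwise_distances(toy).items(), key=lambda kv: kv[1])
    for pair, dist in ranked:
        print(pair, round(dist, 3))
```

Sorting the pairs by distance puts the most similar sequences first, which is how a table of pairwise distances is typically read.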
Table 2. All 13 descriptors for each characteristic sequence.

Results and Discussion
In this work we combine characteristic sequences with global descriptors to form a new method that generates a 39-dimensional vector for a DNA sequence. We also compare our method with other strategies and obtain similar results.
The global descriptor has been widely adopted for protein sequences, and we introduce it to DNA sequences for the first time. For protein sequences, various properties are used to divide the 20 amino acids into groups, such as hydrophobicity, polarity, polarizability, charge, secondary structure and van der Waals volume.
How many properties are needed, at a minimum, to uniquely represent a protein sequence or a DNA sequence? As shown above, three properties are enough to uniquely represent a DNA sequence. A protein sequence will need more properties, and choosing useful properties is the key task for protein sequences. Our method enumerates all the possible characteristic sequences and can uniquely characterize a DNA sequence. We have introduced the global descriptor to DNA sequences directly, without altering the way the vector is produced. We think further effort is needed to modify the global descriptor method and make it more suitable for DNA sequences.

Table 1. Transformation of the human L1 putative promoter sequence into three characteristic sequences.

Table 4. Euclidean distances between the sequences in Table 3.