Research on the Repeated Sequences among tRNA Sequences

Many theories hold that present-day tRNA sequences evolved from short RNA hairpins with a simple stem-loop structure. To identify these significant sequence fragments, the repeated sequences of different lengths within 3420 tRNA sequences are counted and analyzed. The results show that: 1) the probability of occurrence P(i) of a given repeated sequence i follows a power-law distribution when the length K of the repeated sequences is longer than four bases, and in this case the total number N(n) of repeated sequences that occur n times also follows a power-law distribution; 2) the sequence of length K that repeats most often is simply the most repeated sequence of length K-b extended by b bases on its left or right side (b varies between 1 and K-1); 3) the same repeated sequences are found at nearly identical sites in different tRNA sequences when the length K of the repeated sequences is longer than five bases. A hypothesis on the origin and evolution mechanisms of tRNA sequences is then proposed and discussed.


Introduction
Repeated sequences are widespread in genomes and account for a large proportion of eukaryotic genomes in particular. Studies have shown that repeated sequences are of great biological significance: they are important for maintaining the spatial structure of chromosomes, for gene expression and for gene recombination [1] [2]. In recent years the study of repeated sequences has become a hot topic, and many molecular biologists are trying to reveal the structure, function and evolution mechanisms of genes by investigating them. Repeated sequences have also been applied in many fields; for example, short tandem repeats are expected to become the second-generation molecular markers.
All modern tRNA sequences are thought to have evolved from a common ancestral short RNA hairpin [3]-[6], but their evolutionary mechanism remains an open question. Normally, a gene fragment with an important function should be conserved. To find the distribution and content of such important fragments in modern tRNA sequences, 3420 tRNA sequences are treated as a whole in this paper, and the repeated sequences of different lengths within all of them are counted. Based on the analysis of these repeated sequences, the origin and evolution mechanisms of tRNA sequences are discussed further.

The Source of tRNA Sequences
The tRNA sequence database was first created in 1998 by Sprinzl [7] and has been updated continually since, so that more and more tRNA sequences have been collected in it. All the tRNA sequences used in this paper were downloaded from the database (http://trna.bioinf.uni-leipzig.de/DataOutput/). The database contains 3719 tRNA sequences in total, covering 61 different anticodons and 429 different species, which belong to three kingdoms: Archaea, Bacteria and Eucarya. To accommodate the variable loop, each tRNA sequence is given 99 base positions, and missing bases are represented by a dash "-" by Sprinzl et al.

The Method of Counting Repeated Sequences
First, we compare all 3719 tRNA sequences and remove highly similar or identical sequences, which leaves 3420 tRNA sequences for use in this paper. For a fixed length K, we then count the K-string sequences (substrings of K bases) that actually appear among the 3420 tRNA sequences. Because K-strings overlap and every three bases may represent coding information, we use a step of three bases when counting: counting begins at the first base and the window advances by three bases each time until the end of each tRNA sequence. In this way the repeated sequences of every length are counted and analyzed.
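The counting procedure described above can be sketched as follows. This is a minimal illustration rather than the authors' program: the function name `count_k_strings`, the toy sequences, and the choice to skip windows containing the gap symbol "-" are all assumptions, since the paper does not say how gaps are handled during counting.

```python
from collections import Counter

def count_k_strings(sequences, k, step=3):
    """Count K-strings in each sequence, advancing the window `step` bases at a time."""
    counts = Counter()
    for seq in sequences:
        # Window starts at the first base and moves forward by `step` bases.
        for start in range(0, len(seq) - k + 1, step):
            window = seq[start:start + k]
            if "-" in window:
                # Assumption: windows that overlap an alignment gap are ignored.
                continue
            counts[window] += 1
    return counts

# Hypothetical toy input, not real tRNA data.
toy_sequences = ["GTTCGAATC", "GTTCAAATC"]
toy_counts = count_k_strings(toy_sequences, k=4)
gap_counts = count_k_strings(["GT-CGAATC"], k=4)
```

With the toy input, the windows start at positions 0 and 3 of each sequence, so "GTTC" is counted twice while "CGAA" and "CAAA" are counted once each.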

The Repeated Sequences of Different Length with the Highest Occurrences among tRNA Sequences
For convenience of analysis, only the repeated sequences with the highest occurrences are listed in Table 1. From Table 1 we observe the following. 1) The highest occurrence count of repeated sequences decreases as the length K increases. This is because the total number of K-string sequences that actually appear within all the tRNA sequences decreases as K increases. 2) The repeated sequence with the highest occurrence count is "TT" (7183 times) when K = 2, "GTT" (3282 times) when K = 3, and "GTTC" (2080 times) when K = 4. On closer inspection, the most frequent repeated sequence of length K is simply the most frequent repeated sequence of length (K-b) with b bases added on its left or right side (b varies between 1 and (K-1)). This seems to indicate that the most frequent repeated sequences of different lengths lie at the same site of the tRNA sequences, and that tRNA sequences may have used one of the core fragments as a primer for amplification during their formation.
3) When K is between 1 and 3 bases, the most frequent sequences can be found in any arm of the tRNA sequence. When K is between 4 and 6 bases, however, they are found at nearly the same site, and when K is larger than 6 bases they lie entirely at the same site (see Figure 1). In addition, the locations of the various repeated sequences in the tRNA sequences are counted in this paper. Our results suggest that the same repeated sequences are found at nearly identical sites in the tRNA sequences when the length K of the repeated sequences is longer than five bases. 4) Repeated sequences account for approximately 82.22% of all the tRNAs. The longest repeated sequence (AAGATTACCCAAGTCCGGCTGAAGGGATCGGTCTTGAAAACCGAGAGTCGG, containing 51 bases) occurs twice; its anticodon is "TGA" and it is derived from Mycoplasma.
In Figure 1, the location of the most repeated sequences is shown in the secondary structure of tRNA sequences. These repeated sequences lie mainly in the anticodon arm and the TψC arm of the tRNA sequence. It seems that the repeated sequences take the anticodon arm and the TψC arm as centers and expand in both directions as the length K increases (see Figure 1, where the arrow represents the direction of expansion of the repeated sequence). This may indicate that the anticodon arm and the TψC arm were more significant for tRNA during its evolution.
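Observation 2 can be checked mechanically: each most frequent sequence should contain the most frequent sequence of the next shorter length as a contiguous substring. The sketch below uses only the three sequences quoted in the text for K = 2, 3 and 4; the helper name `is_extension` is hypothetical.

```python
def is_extension(shorter, longer):
    """True if `longer` equals `shorter` with extra bases on its left and/or right."""
    return len(longer) > len(shorter) and shorter in longer

# Most frequent repeated sequences for K = 2, 3, 4 as quoted in the text.
top_by_length = {2: "TT", 3: "GTT", 4: "GTTC"}

lengths = sorted(top_by_length)
# Compare each length with the next longer one.
nested = all(
    is_extension(top_by_length[a], top_by_length[b])
    for a, b in zip(lengths, lengths[1:])
)
```

For these three values the chain holds: "GTT" is "TT" with one base added on the left, and "GTTC" is "GTT" with one base added on the right.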
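One way to quantify how strongly a repeated sequence is tied to a single site is to record where it starts in each aligned tRNA sequence and report the share of hits at the most common start position. This is a sketch under assumptions: positions are taken as 0-based indices in the aligned sequences, every position is scanned (not every third), and the function names and toy data are hypothetical.

```python
from collections import Counter

def site_distribution(aligned_sequences, fragment):
    """Count the start positions at which `fragment` occurs across all sequences."""
    positions = Counter()
    for seq in aligned_sequences:
        start = seq.find(fragment)
        while start != -1:
            positions[start] += 1
            # Continue searching just past the previous hit (overlaps allowed).
            start = seq.find(fragment, start + 1)
    return positions

def dominant_site_share(positions):
    """Return the most common start position and the fraction of hits found there."""
    total = sum(positions.values())
    if total == 0:
        return None, 0.0
    site, hits = positions.most_common(1)[0]
    return site, hits / total

# Hypothetical toy alignment, not real tRNA data.
toy_alignment = ["AAGTTCGA", "CCGTTCGA", "GTTCAAAA"]
toy_positions = site_distribution(toy_alignment, "GTTC")
toy_site, toy_share = dominant_site_share(toy_positions)
```

In the toy alignment "GTTC" starts at position 2 in two sequences and at position 0 in one, so two thirds of its occurrences share a site. A share close to 1 would correspond to the "identical site" behavior reported for longer K.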

The Power-Law Behavior of the Repeated Sequences
Power-law behavior is frequently observed in different fields, such as population distributions, social interactions [8], the World Wide Web [9] and so on. It is also known as Zipf's law [10], which was first widely recognized for word usage in text documents. Previous studies [12]-[14] have suggested that the number of distinct parts with a given genomic occurrence follows a power-law distribution. Power-law behavior is also observed in our study of repeated sequences among the 3420 tRNA sequences.
The occurrence count of a given repeated sequence i divided by the total number of repeated sequences that actually appear may be taken as the probability of appearance of the repeated sequence i among all the tRNA sequences. In Figure 2, the abscissa denotes the repeated sequence i and the ordinate denotes the probability P(i) of i. Owing to limited space, only two diagrams are included, for repeated sequences of length K = 6 and K = 10. Figure 2 shows that P(i) follows a power-law distribution for K = 6 and K = 10, which means that a few repeated sequences occur many times while most occur infrequently among all the tRNA sequences. Moreover, our results suggest that P(i) always follows a power-law distribution when the length K of the repeated sequences is longer than four bases.
In Figure 3, the abscissa n denotes the number of times a repeated sequence occurs, and the ordinate denotes the total number N(n) of repeated sequences that occur n times. As Figure 3 shows, N(n) also follows a power-law distribution in n when K = 6 and K = 10. Here too, a few repeated sequences occur many times and most occur only a few times among all the tRNA sequences. Likewise, N(n) always follows a power-law distribution when K is longer than four bases.
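The definition of P(i) can be illustrated with a short sketch. Two points are assumptions: the denominator is taken to be the total number of counted K-string occurrences, and the sequences are ordered by decreasing count so that i can be read as a rank. The counts are invented for illustration.

```python
from collections import Counter

def occurrence_probabilities(counts):
    """Return (sequence, P) pairs sorted by decreasing count, with P = count / total."""
    total = sum(counts.values())
    # Sort by descending count; break ties alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(seq, n / total) for seq, n in ranked]

# Hypothetical counts for K = 6, not taken from the paper.
toy_k6_counts = Counter({"GTTCGA": 6, "TTCGAA": 3, "CGAATC": 2, "AATCCT": 1})
toy_probs = occurrence_probabilities(toy_k6_counts)
```

Here the total is 12, so the top-ranked sequence has P = 0.5 and the probabilities sum to 1. Plotting P against rank on log-log axes would give an approximately straight line if the distribution follows a power law.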
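N(n) is a histogram of the occurrence counts, and a rough check for power-law behavior is the slope of a straight line fitted to log N(n) against log n. The paper does not state how its fits were made, so the least-squares fit below is an assumption, and it is a heuristic rather than a rigorous statistical test. The toy counts are constructed so that N(n) = 8/n.

```python
import math
from collections import Counter

def occurrence_histogram(counts):
    """Map each occurrence count n to N(n), the number of sequences occurring n times."""
    return Counter(counts.values())

def loglog_slope(points):
    """Least-squares slope of log(y) versus log(x); needs at least two distinct x values."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical counts: 8 sequences occur once, 4 twice, 2 four times, 1 eight times.
toy_occurrences = {}
for n, how_many in [(1, 8), (2, 4), (4, 2), (8, 1)]:
    for j in range(how_many):
        toy_occurrences[f"seq_{n}_{j}"] = n

toy_hist = occurrence_histogram(toy_occurrences)
toy_slope = loglog_slope(sorted(toy_hist.items()))
```

For the toy data the fitted slope is -1 up to floating-point error, which corresponds to N(n) proportional to 1/n.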

Conclusion and Discussion
The repeated sequences of different lengths within all the tRNA sequences are counted. Our results show that: 1) the probability P(i) of a given repeated sequence i follows a power-law distribution when the length K of the repeated sequences is longer than four bases, and in this case the total number N(n) of repeated sequences that occur n times also follows a power-law distribution; 2) the most frequent sequence of length K is simply the most repeated sequence of length K-b extended by b bases on its left or right side (b varies between 1 and K-1); 3) the same repeated sequences are found at nearly identical sites in different tRNA sequences when the length of the repeated sequences is longer than five bases.
Many views have been put forward on the evolutionary relationships of tRNA sequences; for example, a new tRNA gene may arise through a point mutation in the anticodon sites [15]. Subsequently, complementary duplication was proposed as the primary mechanism of modern tRNA evolution, with point mutation as a supporting mechanism [16] [17]. So many repeated fragment sequences are distributed in the tRNAs: do they hide important information about tRNA evolution, and how did they arise? We hypothesize that modern tRNA sequences were formed by fragment sequences acting as primers for duplication and amplification during their formation. Suppose that there were only a few fragment sequences at the earliest stage and that these were later amplified by replication. The growing tRNA sequences would not have formed a stable structure, and would not have stopped amplifying, until they reached their present length; sequences shorter or longer than modern tRNA sequences could not survive. Because fragment sequences can be affected by the natural environment or subjected to AT/GC pressure, they may undergo random mutations (such as base substitutions, deletions and insertions) during evolution, generating new fragment sequences [18] [19]. Apart from mutations, Ragan [20] considers that lateral gene transfer can also be a source of new fragment sequences. The new fragment sequences can likewise be used as core primers for duplication and amplification.
Each tRNA sequence must at first have randomly selected some fragment sequences as core primers for duplication and amplification before it became a stable molecular structure. Naturally, the more often a fragment sequence occurs, the greater its chance of being chosen as a core primer for replication. These fragment sequences underwent selective evolution over such a long period that the result is a few repeated sequences occurring many times and most occurring infrequently among all the tRNA sequences, with all tRNA sequences highly similar in function and structure. The repeated sequences that occur many times may be closer to the earliest fragment sequences.
On the one hand, our hypothesis supports the theory proposed by Crick et al. in 1976 [21] that a primitive tRNA consists of seven bases, and is consistent with the possibility that a hairpin RNA was the precursor of the tRNA molecule [6]. On the other hand, it not only supports the view that a hairpin structure produced another hairpin structure through indirect duplication, which then evolved into the tRNA molecule through base changes, insertions and deletions [5], but also better supports the model based on direct duplication of a hairpin structure [22].

Figure 1. The location of the most repeated sequence in the secondary structure of tRNA sequences.

Figure 2. The probability P(i) of one given repeated sequence i versus the given repeated sequence i. (a) K = 6; (b) K = 10.

Figure 3. The total number N(n) of repeated sequences that occur n times versus the occurrence count n.

Table 1. The repeated sequences of different length with highest occurrences. In Table 1, Ac represents acceptor arm, D represents D arm, An represents anticodon arm, E represents extra arm, and T represents TψC arm.