Tandem repetitions in transcriptomes of some Solanaceae species

Characterization of the occurrence, density and motif sequence of tandem repeats in transcribed regions is helpful in understanding the functional significance of these repeats in modern genomes. We analyzed tandem repeats present in expressed sequences of thirteen species belonging to the genera Capsicum, Nicotiana, Petunia and Solanum of the family Solanaceae and the genus Coffea of Rubiaceae to investigate the propagation and evolutionary sustenance of these repeats. Tandem repeat containing sequences constituted 1.58% to 7.46% of the sequences analyzed. Tandem repeats with motif sizes of 2, 15, 18 and 21 bp were the most frequent. Repeats with unit sizes of 21 and 22 bp were also abundant in genomic sequences of potato and tomato. While mutations occurring in these repeats may alter the repeat number, genomes adjust to these changes by keeping the translated products unaffected. Surprisingly, in the majority of the species under study, tandem repeat motif length did not exceed 228 bp. Conserved tandem repeat motifs of sizes 180, 192 and 204 bp were also abundant in the genomic sequences. Our observations lead us to propose that these tandem repeats are remnants of ancestral megasatellite repeats, which have split into multiple repeats due to frequent insertions over the course of evolution.


INTRODUCTION
The extent of repetitiveness in nucleotide base sequences varies remarkably across genomes and generally exceeds the statistically derived expected values [1]. Given their direct and indirect influence on the survival of the organism [2,3], it is not surprising that repetitive DNA constitutes a major portion of present-day genomes [4,5]. Tandem repeats are ubiquitous in a broad sense, as they occur at telomeres, centromeres, genic regions, intergenic regions and even at interspersed sites [6]. A deeper analysis of eukaryotic genomes suggests a non-random distribution of tandem repeats [5,7]. Comparative genomics focusing on the tandem repeats lying within or close to genes helps in understanding the functional significance of these repeats in modern genomes. Comprehensive experiments involving tandem repeats may be instrumental in generating valuable information about various other biological features related to the C-value paradox, organization and evolution of genomes, transcription, etc. [5].
Genome analysis of a number of plant species representing the important family Solanaceae has revealed striking similarities in terms of gene content and organization [8,9]. The wealth of sequence information pertaining to the members of Solanaceae has expanded rapidly in recent times. Currently, genome projects are underway for many members of the Solanaceae including Capsicum annuum (Pepper), Nicotiana benthamiana (Benthamiana tobacco), Nicotiana tabacum (Tobacco), Solanum bulbocastanum, Solanum demissum (Hexaploid Mexican wild potato), Solanum lycopersicoides (Wild nightshade), Solanum lycopersicum (Tomato), Solanum melongena (Brinjal), Solanum peruvianum (Wild tomato) and Solanum tuberosum (Potato) (see database "genome projects" at http://www.ncbi.nlm.nih.gov/genomeprj/?term=Solanaceae). Such sequence resources provide an opportunity to gain insights into the evolutionary history of closely related species. That is, if the sequences are identical between two species, the two species probably diverged from each other fairly recently. Points of disagreement in the sequence homology indicate a longer evolutionary distance between the given species, which is also reflected in their taxonomic positions. These lines are explored in this paper by comparative analysis of the organization and distribution of tandem repeats in unigenes and EST sequences of thirteen members of the family Solanaceae and two members of a closely related family, Rubiaceae. We believe that such studies will be helpful in addressing some of the most interesting questions in the field of genomics and transcriptomics concerning the patterns and significance of tandem repetition of sequences, and the factors that maintain and propagate these tandem repeats over the generations.

Sequence Resources and Initial Processing
The unigene sequences of potato, tomato and tobacco were downloaded from the UniGene database of NCBI. Similarly, EST sequence data for twelve species (Table 1) were downloaded from dbEST of NCBI (http://www.ncbi.nlm.nih.gov/nucest/). All the sequence data were downloaded in FASTA format. ESTs were clustered using the CAP3 program [10]. Subsets of these data were further randomly clustered based on sequence homology using the standalone version of BLASTn at various stages during the study. The purpose of including the latter step was to construct cross-species clusters of EST-SSRs. NCBI descriptions thus obtained were retained for the best hit as long as the E-value was less than 1e-10 and the alignment score was >200.
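The best-hit criterion described above can be illustrated with a short sketch. It assumes that BLASTn hits have already been parsed into simple records; the `Hit` type, its field names and the function `best_hit` are hypothetical and are not part of BLAST or of the pipeline used in this study. The text does not state whether the alignment score is a bit score or a raw score, so the sketch simply calls it `score`.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Hit:
    """One parsed BLASTn hit (hypothetical record layout)."""
    query: str
    subject: str
    evalue: float
    score: float  # "alignment score" in the text; bit vs. raw score is an assumption


def best_hit(hits: Iterable[Hit],
             max_evalue: float = 1e-10,
             min_score: float = 200.0) -> Optional[Hit]:
    """Return the highest-scoring hit with E-value < max_evalue and score > min_score."""
    passing = [h for h in hits if h.evalue < max_evalue and h.score > min_score]
    if not passing:
        return None  # no description is retained for this query
    return max(passing, key=lambda h: h.score)
```

A query whose hits all fail either threshold yields `None`, which corresponds to discarding the description for that sequence.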
In addition, 5 Mb and 90 Mb of potato and tomato genomic sequences, respectively, available in the public domain were also analyzed for the presence of tandem repeats.

Identification of Tandem Repeats and Cross-Species Comparisons
Tandem repeats were identified using the search tool Tandem Repeats Finder (TRF) [11] with parameter values of 2, 7, 7, 80, 10, 50 and 500 for match, mismatch, indels, matching probability, indel probability, minimum alignment score and maximum period size, respectively. As TRF can report more than one repeat at the same site on the basis of alignment score, we resolved this redundancy by retaining only the repeat with the smallest motif. Wherever there was a tie on the basis of motif size, the longer sequence was considered. If the tie also extended to the length span, the repeat with lower entropy was preferred. As entropy stands for randomness in thermodynamics, higher entropy in sequence analysis means more randomness (or less orderliness) in the sequence of nucleotides. Lower entropy means an ordered occurrence of nucleotides, which leads to the formation of repeats. Repeats with a motif size of 2-6 bp were identified as microsatellites and the rest of the sequences were termed minisatellites. Considering that a number of (A/T)n stretches would actually be non-genomic poly-A tails, mononucleotide repeats were excluded from the present analysis if they occurred at the ends of the sequences. The microsatellite repeats were grouped into different classes according to Jurka and Pethiyagoda [12].
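The rule for choosing among redundant TRF records can be sketched as follows. This is a minimal illustration and not the code used in the study: the `Repeat` record, the functions `resolve_overlaps` and `classify`, and the decision to treat any two overlapping records as lying at "the same site" are all assumptions made here. Labelling period-1 records as "mononucleotide" is likewise an assumption, since the text only defines the 2-6 bp class and "the rest".

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Repeat:
    """One TRF record reduced to the fields used by the selection rule (hypothetical)."""
    start: int      # 1-based start coordinate
    end: int        # inclusive end coordinate
    period: int     # motif (repeat unit) size in bp
    entropy: float  # entropy value reported for the repeat

    @property
    def span(self) -> int:
        return self.end - self.start + 1


def overlaps(a: Repeat, b: Repeat) -> bool:
    # Assumption: "same site" means the two coordinate ranges overlap.
    return a.start <= b.end and b.start <= a.end


def preference_key(r: Repeat) -> Tuple[int, int, float]:
    # Smallest motif first, then longest span, then lowest entropy.
    return (r.period, -r.span, r.entropy)


def resolve_overlaps(repeats: Iterable[Repeat]) -> List[Repeat]:
    """Greedily keep the preferred record at each site, drop overlapping alternatives."""
    kept: List[Repeat] = []
    for r in sorted(repeats, key=preference_key):
        if not any(overlaps(r, k) for k in kept):
            kept.append(r)
    return sorted(kept, key=lambda r: r.start)


def classify(r: Repeat) -> str:
    """2-6 bp motifs are microsatellites; longer motifs are minisatellites."""
    if r.period < 2:
        return "mononucleotide"  # assumption: period-1 records are labelled separately
    return "microsatellite" if r.period <= 6 else "minisatellite"
```

The greedy pass is one possible reading of the rule; a stricter definition of "same site" (for example, identical start and end coordinates) would only require changing `overlaps`.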
To predict the cross-species transferability of these repeats, all the sequences were also scanned by VNTRfinder [13]. This exercise limited the output to the PCR-amplifiable, transferable repeats showing length polymorphism when compared with another species. The conservation of repeats across the species was also studied using BLASTn according to the parameters described above.
Synteny mapping between potato and tomato contigs was carried out using the glocal algorithm [14] in the Vista Genome Browser (http://pipeline.lbl.gov/cgi-bin/gateway2) [15] in an all-versus-all pattern. The output of the Vista Genome Browser was retrieved by e-mail.

Abundance of Tandem Repeats in Solanaceae Transcriptomes
The occurrence of tandem repeats in the transcriptomes analyzed varied in several respects depending upon the species concerned. As evident from Table 1, tandem repeat containing sequences ranged from a minimum of 1.58% in Nicotiana sylvestris to a maximum of 7.46% of sequences in Petunia axillaris. In terms of transcriptome coverage, most of the species showed 0.5-0.6% of sequences harbouring tandem repeats (Table 1).
The average GC content of tandem repeated sequences was ~41%. Among the tandem repeats with longer motifs, mononucleotide A was the most common followed by T. Minisatellite repeats essentially occurred either in the exonic regions or overlapped with the exonic regions. Tandem repeats with smaller unit size, in general, were more abundant than the repeats with longer repeat units. Interestingly, a marked dominance of tandem repeats with repeat unit size (bp) in multiples of three was noticed (Figure 1). In fact, 64% of all the repeats identified in this study showed this characteristic. Among all the repeats mined, repeats with motif sizes of 2, 15, 18 and 21 bp were the most abundant. Tandem repeats with a repeat unit size of 2 bp were extraordinarily abundant in Capsicum, constituting 14% of all the tandem repeats. Interestingly, 27% of all the dinucleotide repeats reported in the Solanaceae transcriptomes under study originated from Capsicum sequences. A similar dominance of dinucleotide repeats was also prevalent in Coffea canephora. Repeats of unit sizes 21 and 22 bp also represented the most abundant tandem repeats in genomic sequences of potato and tomato, and also in rice and humans (our unpublished data). When the repeat richness of unigene sequences was compared with genomic sequences in potato and tomato, no definite trend could be observed, except that a higher frequency of tandem repeats was observed in genomic sequences. The repeats with unit sizes ranging from 15 to 22 bp were markedly more abundant in genomic sequences, as seen in Figure 2. Evidently, tandem repeats with motif sizes between 7 and 30 bp account for the maximum number of loci and longer arrays both in the genomic and in the transcribed sequences of Solanaceae.
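The proportion of repeats whose unit size is a multiple of three can be computed from a list of motif sizes as sketched below. The function names are hypothetical and the input values in the usage note are illustrative only; they are not data from this study.

```python
from collections import Counter
from typing import Iterable


def period_counts(periods: Iterable[int]) -> Counter:
    """Number of repeats observed for each repeat unit size (bp)."""
    return Counter(periods)


def fraction_multiple_of_three(periods: Iterable[int]) -> float:
    """Fraction of repeats whose unit size is divisible by three."""
    periods = list(periods)
    if not periods:
        return 0.0
    return sum(1 for p in periods if p % 3 == 0) / len(periods)
```

For an illustrative input such as `[2, 3, 15, 18, 21, 22, 7, 2]`, four of the eight unit sizes are multiples of three, so the function returns 0.5.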

Cross-Species Comparisons
While the cross-species conservation within a genus was more visible (Table 2), the probability of finding an orthologue in a different genus was quite low. For many tandem repeats, the encoded repetitive peptide sequence was found to be longer than that expected using ORFpredictor (data not shown). Interestingly, a larger number of orthologous pairs of tandem repeats was observed using BLASTn than predicted by the ePCR module of VNTRfinder. For example, in the tomato-potato pair, more than 50% of the microsatellite containing sequences had an orthologous match in the other species database; however, not all of those contained a microsatellite. A similar observation was made for the N. tabacum and N. benthamiana pair. As VNTRfinder predicts cross-species PCR amplification based on a number of parameters and not merely on sequence similarity, it is quite possible that most of the orthologues fail to cross-amplify under optimal PCR conditions. Although the exact composition of a tandem repeat could not be traced in the orthologous sequences in some instances, considerable sequence similarity and the reading frame may still be preserved. With the available data, it was not possible to conclude which of the alleles among the orthologues was the ancestral one. Identification and study of a common ancestor (or its direct descendant) could be partially useful. Synteny mapping between tomato and potato genomes for tandem repeats revealed different trends: for example, a few of the tandem repeats were conserved between the two genomes, while others showed variations in otherwise conserved genomic regions. The mapped synteny was not absolute, and indels as well as micro-inversions have frequently occurred since the divergence of potato and tomato (Figure 3). The overall repetitive sequence content in potato and tomato was comparable in terms of genomic coverage (Figure 2). Most of these tandem repeats could not be characterized, as except for a single instance of accumulation of telomeric/centromeric heptanucleotide repeats, no other telomeric or centromeric repeats could be identified.

Tandem Repeat Richness and Motif Length
While searching for tandem repeats in this study, we had set an upper length limit of 500 bp for motif size. Surprisingly, in the majority of the species, tandem repeat motif length did not exceed 228 bp. Further, among all the repeats with unit length longer than 100 bp, repeats with unit lengths in multiples of 114 (114×), and particularly 228 bp, were most abundant. As shown in Figure 4, except for Coffea arabica and Petunia × hybrida, the longest repeat belonged to the 114× category. Moreover, repeats with a unit length of 228 bp not only showed a marked abundance among the repeats with longer motifs, but also spanned much greater lengths (Figure 5). Interestingly, repeats belonging to the 114× family could not be traced in the genomic sequences of potato and tomato, indicating that they were split over two or more exons. Repeats with motif sizes of 180, 192 and 204 bp were more abundant in genomic sequences. A similar abundance of tandem repeats with 180 bp and 192 bp motifs was also seen in rice (our unpublished results), and by using BLAST, such repeats were annotated as transposable element proteins. Another interesting feature of the genomic contigs of potato and tomato was a marked accumulation of tandem repeats with motifs of the same size, causing a significant deviation in the values of the mean and mode of repeat lengths within these contigs (Table 3). Sequence comparison of the repeat units of these tandem repeats revealed a high level of similarity (>90% identities in the aligned sequences).
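Membership in the 114× category among the long-unit repeats can be checked with a few lines of code. This is a sketch under stated assumptions: the function name `family_114_summary` is hypothetical, and "longer than 100 bp" is interpreted as a unit length of at least 101 bp.

```python
from typing import Iterable, Tuple


def family_114_summary(periods: Iterable[int],
                       base: int = 114,
                       min_period: int = 101) -> Tuple[int, int]:
    """Return (number of long repeats in the 114x family, number of long repeats).

    A repeat counts as "long" when its unit length is at least min_period bp,
    and as a family member when that unit length is a multiple of base.
    """
    long_periods = [p for p in periods if p >= min_period]
    in_family = [p for p in long_periods if p % base == 0]
    return len(in_family), len(long_periods)
```

With the illustrative unit lengths `[21, 114, 228, 228, 180, 342, 100]`, five repeats qualify as long and four of them (114, 228, 228 and 342 bp) fall in the 114× family.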

DISCUSSION
Tandem repeats represent a considerable proportion of eukaryotic genomes, and yet remain a poorly understood component. Opinions differ on their structural and functional significance in the genomes [3]. Various roles have been proposed for tandem repeats, highlighting their effect on chromatin organization, crossing over, regulation of gene activity, etc. [16]. Some data are available on the distribution of microsatellites in various genomes [6,17], but virtually no information is available to date on the genomic distribution of minisatellites and satellites. Our experience of working with microsatellites, i.e., tandem repeats with shorter repeat motifs [7,18], suggests that the structure of tandem repeats may be regulated by the neighbouring components of the genome, as also reported for their mutability [19]. However, coding and non-coding regions of a genome are regulated by different constraints, and thus the fine genomic environment at these sites differs from one to another. Along the same lines, repeated sequence motifs are tolerated in transcriptomes in accordance with the requirements of the ultimate products in the system. The study of tandem repeats present in transcribed sequences thus makes an interesting area of contemporary research. In the present study, following the dynamics and conservation of tandem repeats in genic regions of some members of Solanaceae, we obtained certain interesting insights about their existence in transcriptomic sequences, not previously reported on this scale, and also about periodicities in the anticipated protein sequences. The frequency with which tandem repeats occur in ESTs offers a new area for exploration, because their translation into protein sequences may provide different abilities to the proteome of an organism.
In the present study, we found that repeat containing transcriptomic sequences are slightly fewer than reported earlier, and also slightly lower than the genomic coverage values for angiosperms [5-7]. The low GC content of tandem repeats might be reflected in the functional utility of these tandem repeats. For example, the repeat motifs AG and AAG generally occur in the 5'-UTR regions of genes and have been suggested to form non-B-DNA, potentially playing important roles in the regulation of gene activity [20]. (CTT)n repeats, complementary to (AAG)n, are also potential sites of cytosine methylation and therefore provide candidate sites for inhibiting transcription elongation in plants [21,22]. Hypervariability of these repeats in exonic regions might lead to novel amino acid sequences that may in some cases lead to disorders, as widely known in humans [23]. Nevertheless, to date no specific function has been assigned to amino acid sequence expansions, and it is reasonable to think that this process might be a contributor to the evolution of newer genes. De Grassi and Ciccarelli [24], on the basis of their studies on "internal tandem repeats" in genes lying in duplicated regions of the human genome, observed that modifications in tandem repeats always occurred in the terminal exon of the genes. The event is favourable, as this would not affect the original composition of proteins [24] and would make the gene available for alternative splicing. In fact, the effect of polymorphisms at tandem repeat sites on gene expression is slowly becoming established [23], even if tandem repeat polymorphism is generally confined to introns [25]. When tandem repeats occur at an intron-exon boundary, novel introns may be formed due to modifications of their length or sequence, leading to the formation of alternative transcripts [24,26].

The marked dominance of tandem repeats with unit lengths in multiples of three may be considered an extension of the observation that trinucleotide repeats are predominantly present in genic sequences, particularly in exons [7,27]. However, such an observation also contrasts with the trend seen in genomic sequences, where the abundance falls linearly with increasing repeat unit size [5]. The predominance of repeats with unit sizes of 2, 15, 18 and 21 bp was interesting. While the occurrence of dinucleotide repeats in the 5'-UTR could be explained by their expected participation in the transcription machinery as transcription factor binding sites [20], more intriguing was the abundance of tandem repeats with a repeat unit size of 21 bp as the second most abundant class in the transcriptomes under study (Figure 1). However, given the universality of such abundance, we believe that they have some important function, for which they are retained in the genomes. In general, tandem repeats with unit lengths in multiples of three are more abundant in genomes, and certain genomic forces have facilitated their longer iterations. Nevertheless, the abundance of tandem repeats with unit sizes of 7-30 bp over the longer ones is in accordance with that reported earlier by Navajas-Perez and Patterson [5] for other plant genomes. Brandstorm et al. [28] suggested that these sequences serve as hot spots of recombination. Sharma and Raina [29] also demonstrated that tandem repeats of various types represent species-specific and chromosome-specific heterochromatin patterns.
Conservation of tandem repeats and their evolution in plant genomes is likely to be dictated by features such as the length and sequence of the basic repeat unit [30]. However, Richard and Dujon [31] also reported the transferability of minisatellites across genera. Thus, despite prevalent insertion, deletion and substitution events, tandem repeats in genes are still under positive natural selection. Evidence in support of this proposition is available from studies in humans [32,33]. Jordan et al. [34] also endorsed similar observations and conclusions on the basis of their cross-species comparisons in Neisseria spp., and suggested the significance of this phenomenon in providing adaptability to the host. This view later also gained support from Verstrepen et al. [35] and Levdansky et al. [36] following their studies in yeast and Aspergillus fumigatus, respectively.
A combination of polymerase slippage and point mutations [37] can either elongate or shorten a tandem repeat. A longer allele, if considered ancestral, can be shortened in two ways: either a mutation occurs at one of the ends of the locus, thereby reducing the repeat number of the locus, or a mutation occurs in the middle of the locus, breaking it into two smaller loci. If a shorter allele is considered ancestral, it can be elongated either by the joining of two nearby loci or by increasing its length by one repeat at a time [38]. Tandem repeats have probably undergone a complicated set of mutational events altering their length and have maintained high mutation rates even in expressed regions [39]. Trifonov [40] suggested that microsatellites in genes have an adaptive advantage against stress conditions. Longer repeat sequences modulate the expression of genes under stress. The ESTs harbouring microsatellites, and those where a cross-generic orthologue is conserved, might have a range of functions such as coding for signaling proteins, kinases, transcription factors or a MADS box gene. Fujimori et al. [41] found that 46.5% of translation-related housekeeping genes in plants have a microsatellite region in their predicted 5'-UTR. Microsatellite repeats in untranslated regions probably regulate gene expression through certain DNA-protein interactions [42,43]. While the mutations occurring in these repeats may reduce the repeat number, genomes adjust to these changes by keeping the translated products unaffected. Since their occurrence is prevalent in conserved housekeeping genes, it is suggested that these repeats might have been inherited from a common ancestor and, owing to the vital nature of their functions, these repeats or their remnants can still be distinctly identified. We do not rule out the possibility that organisms harbour mutations in these genes in response to ecological or environmental stress, as each of these species has faced different environmental and domestication requirements. These issues probably require further investigation in vertical lineages instead of horizontal comparisons among different species.
The occurrence and abundance of repeats with longer units also raised curiosity. According to De Grassi and Ciccarelli [24], tandem repeats with 30 bp repeat units are at least four times more frequent among those causing modifications in human genes in duplicated regions of the genome. Tandem repeats with longer units, according to De Grassi and Ciccarelli [24], are more variable than repeats with higher repeat number. If translated, these repeats would induce periodicities in protein structures. This might well be a situation exploited by the cellular machinery in preferring single subunit proteins that play the roles of multi-subunit proteins. The energetics and kinetics of TR-containing proteins provide new insights into folding rates and protein stability [44]. The understandable benefit of a single subunit protein is its ensured availability independent of stoichiometry. In fact, the presence of tandem repeats in protein sequences is well recorded [39], with most of them displaying a smaller repeat unit of 5-20 amino acids. Repeated domains in proteins are known to be associated with a variety of functions [39]. Kashi and King [3] also suggested that repeated sequences may result in open reading frames (ORFs) of substantial length, integrated into an actively transcribed region. Richard and Dujon [31] reported minisatellite containing genes to be associated with genes encoding cell wall proteins.
Since a high level of similarity was observed in the sequences of clustered repeats with long units (those which were discovered in the same contigs), we propose that these tandem repeats are remains of an ancestral megasatellite repeat, which has split into multiple repeats due to frequent insertions during the course of evolution. Each of the broken units has in turn accumulated a number of indels and substitutions over time, downgrading them from the "identical" units of the past to units that are "nearly identical" to each other. A generalized mechanism creating such accumulation of repeats in genomic regions is depicted in Figure 6.
Figure 6. A generalized mechanism leading to "accumulated" tandem repeats with identical repeat unit from an ancestral "mega-satellite" of the past. This leads to the accumulation of repeats with similar size in genomic contigs.
There have been suggestions that tandem repeats might have served as a mode for the evolution of novel genes [24,45], simply by altering the number of times a sequence motif is repeated. In the process, tandem repeats might have contributed to the fitness of the organism in the prevailing environment. Marcotte et al. [46] suggested that repeat expansion shaped many protein domain families, such as the leucine-rich repeat family, and this is an important mode of evolution of eukaryotic genomes [47]. Vergnaud and Denoeud [48] used a method similar to ours, but a different definition, to analyze minisatellites in human chromosome 22, Arabidopsis thaliana chromosome 4 and Caenorhabditis elegans chromosome 1 using the TRF software, and reported the preferential occurrence of these repeats near telomeric and centromeric regions of the genomes. Richard et al. [43], however, maintained that there is no such bias when complete genomic sequences are analyzed. Nevertheless, at this stage any conclusion would be premature, as minisatellites are less studied genomic constituents than microsatellites [49,50].

Figure 1. Abundance of tandem repeats with different repeat units in Solanaceae.

Figure 2. Occurrence of tandem repeats in unigene and genomic sequences of tomato and potato. A different pattern of distribution of genomic versus transcriptomic tandem repeats with motif sizes of 7-30 bp is clearly visible.

Figure 3. A small region displaying the degree of synteny between tomato and potato genomes and various repetitive sequences present in this region.

Figure 4. Repeat unit sizes of the longest repeat in different species. Noticeably, 228 bp is a preferred length among the longer repeats in the majority of the species.

Figure 5. Average repeat number for all the tandem repeats with repeat unit length higher than 100 bp in different species.

Table 1. Summary of dataset analyzed for the occurrence of tandem repeats in transcriptomic sequences of Solanaceae and the extent of repetitiveness present.

Table 2. Cross-species PCR transferability in Solanaceae, as predicted by VNTRfinder.

Table 3. Tomato genomic contigs showing accumulation of tandem repeats with similar repeat units.