In Silico Mining of EST-SSRs in Jatropha curcas L. towards Assessing Genetic Polymorphism and Marker Development for Selection of High Oil Yielding Clones

In recent years, Jatropha curcas L. has gained popularity as a potential biodiesel plant. The varying oil content reported between accessions belonging to different agroclimatic zones has necessitated the assessment of the existing genetic variability to generate reliable molecular markers for selection of high oil yielding varieties. EST-derived SSR markers are more useful than genomic markers as they represent the transcriptome and are thus directly linked to functional genes. The present report describes the in silico mining of microsatellites (SSRs) using J. curcas ESTs from various tissues, viz. embryo, root, leaf and seed, available in the public domain of NCBI. A total of 13,513 ESTs were downloaded. From these ESTs, 7552 unigenes were obtained and 395 SSRs were generated from 377 SSR-ESTs. These EST-SSRs can be used as potential microsatellite markers for diversity analysis, MAS etc. Since the Jatropha genes carrying SSRs have been identified in this study, EST-SSRs directly linked to genes will be useful for developing trait-linked markers.

Jatropha curcas L. is commonly known as purging nut or Barbados nut. This plant belongs to the family Euphorbiaceae, is native to Mexico and Central America, and was later introduced into many parts of the tropics and subtropics. J. curcas is commonly known to be a poisonous plant. It is a semi-evergreen shrub or small tree reaching a height of 6 m (20 ft). It can survive arid conditions and can therefore be grown on drylands and wastelands. The seeds of this plant are highly toxic but produce oil that can be used as biodiesel after transesterification, as well as in soap and candle making. Although the plant was traditionally considered a weed, its oil has recently gained importance as the "fuel of the future" or "green fuel" and has been in the news, with transport companies eager to run trains, cars and aeroplanes on biodiesel to cut down on both cost and pollution.
The oil content in Jatropha curcas is reported to vary between accessions belonging to different agroclimatic zones of India (40% to 58% in kernels) [1]-[3]. In recent years, emphasis has been laid on producing high oil yielding Jatropha plants, which can be achieved through genetic selection and crop improvement methods. As a means to this end, it is necessary to assess the existing genetic variability and generate reliable molecular markers for selection.
DNA markers are not typically influenced by environmental conditions and can therefore be used to describe patterns of genetic variation among plant populations and to identify duplicated accessions within germplasm collections [4]. To assess genetic diversity, several popular PCR-based markers such as RAPD (Random Amplified Polymorphic DNA) [5], ISSR (Inter Simple Sequence Repeat) [6] [7] and AFLP (Amplified Fragment Length Polymorphism) [8] [9] are routinely used because they require no prior sequence information [3].
The existing information regarding the extent and pattern of genetic variation in J. curcas populations is limited [10]. Common molecular markers such as AFLP [3], and RAPD and ISSR [10] [11], have been used to assess the genetic diversity of J. curcas. The assessment of genetic diversity using molecular markers disclosed low inter-accessional variability in local J. curcas germplasm [12]. Basha and Sujatha [11] used RAPD, ISSR and SSR markers to study the diversity between J. curcas accessions from different countries, which revealed low genetic variability between accessions from the same country and maximum divergence between Indian accessions and a non-toxic Mexican accession. They also developed SCAR markers to differentiate Indian accessions from the non-toxic Mexican accession.
There are less popular but extremely useful markers such as SSRs (Simple Sequence Repeats) and SNPs (Single Nucleotide Polymorphisms) [13], which can be used for genetic diversity profiling. Of these markers, SSRs [14], also known as microsatellites or tandem repeats, are short repeating nucleotide sequences in DNA that provide greater confidence in the assessment of genetic diversity and relationships [15]. These are the markers of choice for plant genetics and breeding applications [16] [17], as the data generated by these markers can be used for selection during backcross breeding programs [15], and also because of their reproducibility, multiallelic nature, codominant inheritance, relative abundance and good genome coverage [17]. Marker Assisted Selection (MAS) has proved to be the best resource for improvement of many crops [18]. SSRs have been used for MAS in crops like rice [19] and common bean [20].
The traditional methods of developing SSR markers are usually time-consuming and labor-intensive [21] [22]. In contrast, in silico mining of SSRs from ESTs available in public databases, where data are accumulating at a fast rate, is an expeditious and cost-effective alternative [21]. The search for SSRs in ESTs (representing genes or coding regions) has become more attractive in the wake of a report that SSRs are more abundant in single- or low-copy sequences than in repetitive or non-coding sequences, contrary to what was assumed earlier [23]. Therefore, SSRs can be searched for in EST databases and employed for designing locus-specific primers [24]. Such markers are termed EST-SSRs. By convention, the EST sequences containing SSRs are generally referred to as SSR-ESTs, whereas the markers developed from SSR-ESTs are called EST-SSRs [17] [25]; the same convention is followed throughout this paper.
Expressed Sequence Tags (ESTs) are generated by end sequencing of a large number of randomly picked clones from a cDNA library constructed using mRNA isolated from a specific tissue or a specific developmental stage of an organism. EST-derived SSR markers are generally less polymorphic than genomic SSRs [26] owing to the lower polymorphism of coding regions in contrast to non-coding ones [27]. There are also reports of moderate [28] to very high polymorphism associated with EST-SSRs [29] [30]. In spite of these contrasting reports about the level of polymorphism of EST-SSRs, there are several advantages to using expressed sequences rather than genomic sequences as genetic markers. As EST-derived markers represent the functional component of the genome and are transferable across species [31], they can serve as an efficient tool for gene discovery and genetic mapping of genes [32] [33]. Therefore, EST-SSRs enhance the role of genetic markers by assaying variation in transcribed genes of known function [21] [26] [34]. In spite of several studies, no genetic map of Jatropha had been reported until recently [22], although an SNP-based linkage map has very recently been reported by Wang et al. [35]. There is also a need to develop molecular markers for MAS of high oil yielding varieties and for assessing genetic diversity.
The present report describes the in silico mining of microsatellites (SSRs) using the J. curcas ESTs from various tissues, viz. embryo, root, leaf and seed, available in the public domain of NCBI. At the time of mining, a total of 13513 ESTs were available and downloaded. From these ESTs, 7552 unigenes were obtained, and 395 EST-SSRs were generated from 377 SSR-ESTs. The EST-SSRs obtained through the computational method in this study can be used as potential microsatellite markers for various studies such as diversity analysis, MAS etc. Since the Jatropha genes carrying SSRs have been identified in this study, EST-SSRs directly linked to genes will be useful for developing trait-linked markers.

Search for EST-SSRs and Primer Designing
EST sequences of J. curcas were downloaded from NCBI's dbEST database (http://ncbi.nlm.nih.gov/) [36], which contains sequences generated from different tissue-specific cDNA libraries of embryo, root, leaf and seed. These sequences were arranged in a single FASTA file, which was used for the sequence analysis using different software and analysis tools.
The SSR search was carried out for repeat motifs ranging from mono- to hexa-nucleotides. The minimum number of contiguous repeat units required for each motif type was: mononucleotide, 20; dinucleotide, 10; trinucleotide, 7; tetranucleotide, 5; pentanucleotide, 4; and hexanucleotide, 4. The space between SSRs was set to 100 and the space between imperfect SSRs to ≤ 5. After obtaining the motifs, sequence complementarity was taken into consideration, and complementary motifs such as AG and CT, AC and GT, or AAC and GTT were grouped into a single class under the mono-, di-, tri-, tetra-, penta- or hexa-nucleotides, as appropriate. After the SSRs were obtained, primers were designed from the flanking regions using the same software as for the SSR search. The parameters provided in the software for primer designing are given in Table 1.
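The search criteria described above can be illustrated with a minimal Python sketch. It is not the software used in this study: it assumes a simple regular-expression scan for perfect repeats only, and the function names `find_perfect_ssrs` and `strand_class` are hypothetical. The spacing parameters for adjacent and imperfect SSRs are not implemented.

```python
import re

# Minimum number of contiguous repeat units per motif length (values from the text).
MIN_REPEATS = {1: 20, 2: 10, 3: 7, 4: 5, 5: 4, 6: 4}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_perfect_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (motif, repeat_count, start, end) for perfect SSRs in seq.

    Illustrative only: imperfect repeats and the inter-SSR spacing
    parameters mentioned in the text are ignored.
    """
    seq = seq.upper()
    hits = []
    for size in sorted(min_repeats):
        needed = min_repeats[size]
        # A motif of `size` bases followed by at least (needed - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, needed - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. AGAG is really the dinucleotide AG).
            if any(size % k == 0 and motif == motif[:k] * (size // k)
                   for k in range(1, size)):
                continue
            hits.append((motif, len(match.group(0)) // size,
                         match.start(), match.end()))
    return hits


def strand_class(motif):
    """Group a motif with its reverse complement, e.g. CT -> 'AG/CT'."""
    motif = motif.upper()
    revcomp = motif.translate(_COMPLEMENT)[::-1]
    return "/".join(sorted({motif, revcomp}))


if __name__ == "__main__":
    example = "CCC" + "AG" * 12 + "TTT"
    for motif, count, start, end in find_perfect_ssrs(example):
        print(strand_class(motif), count, start, end)
```

In this sketch a run of twelve AG units is reported once as a dinucleotide SSR and assigned to the AG/CT class, while a run of only nine units falls below the dinucleotide threshold of 10 and is ignored.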
The EST sequences that contributed to primer designing were searched for their gene annotations using BLASTX at The Arabidopsis Information Resource (TAIR) (http://www.arabidopsis.org/index.jsp) [39]. These data were used to obtain the Gene Ontology (GO) annotations and functional categorization of the ESTs using locus identifiers at the Bulk Data Retrieval System of TAIR (http://www.arabidopsis.org/tools/bulk/go/index.jsp) [40].

Assembling of ESTs as Unigenes
The size of the available EST data used in this study has been calculated in accordance with the size of the genome of J. curcas (C = 416 Mb) reported by Carvalho and coworkers [41]. The ESTs of J. curcas generated from tissue-specific cDNA libraries of various tissues (viz. embryo, root, leaf and seed), available in NCBI's public database dbEST, were downloaded and pooled. These downloaded ESTs included the seed-specific ESTs generated in our laboratory. The pooled set consisted of 13513 ESTs (~6.2 MB) in all, comprising 9844 ESTs from the embryo, 1000 from the leaf, 1304 from the root and 1375 from the seed library. Using EGassembler, all the sequences were categorized into singletons and contigs. EGassembler segregated the 13513 ESTs into 6098 singletons and 7415 redundant sequences, and then assembled the redundant sequences into 1454 contigs. The contigs and singletons were together grouped as 7552 (~3.8 MB) unigenes. These data showed that 45% of the total ESTs downloaded from the database were singletons and the remaining 55% were assembled into contigs (Figure 1). Assembling the redundant ESTs into contigs reduced errors in sequence analysis and removed redundancy, so that only the unigenes were used for SSR mining and annotation. As reported by Raji and coworkers [18], using unigenes for the mining of SSRs results in a realistic estimate of the microsatellite repeat frequency and ensures that non-redundant EST-SSR markers corresponding to unique loci in the genome are obtained. Therefore, in this study the unigenes were used for the SSR search. The mining of the EST-SSRs, starting with the downloading of all the Jatropha ESTs, is outlined in Figure 1.
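The assembly statistics quoted above can be cross-checked with a few lines of Python. This is only a bookkeeping sketch using the counts stated in the text; the variable names are illustrative and no assembly is performed.

```python
# Counts as reported in the text.
total_ests = 13513
singletons = 6098
contigs = 1454

# ESTs that are not singletons are the redundant sequences merged into contigs.
redundant = total_ests - singletons

# Unigenes are the singletons plus the assembled contigs.
unigenes = singletons + contigs

singleton_pct = round(100 * singletons / total_ests)
redundant_pct = round(100 * redundant / total_ests)

print(redundant, unigenes, singleton_pct, redundant_pct)
```

The values agree with the reported 7415 redundant sequences, 7552 unigenes, and the 45% / 55% split between singletons and ESTs assembled into contigs.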

Occurrence and Frequency of Microsatellites
For searching the SSRs, repeat motifs from mono- to hexa-nucleotide were selected in the software, as the frequency of occurrence of SSRs is drastically reduced above this motif range. Thus, the SSRs were obtained in the form of repeat motifs ranging from mono- to hexa-nucleotides. Out of the 7552 unigenes searched for SSRs, 395 SSRs (Table 2) were generated from 377 unigenes. According to the convention, these 395 SSRs are termed EST-SSRs and the 377 unigenes possessing SSRs are termed SSR-ESTs. The 377 SSR-ESTs amounted to approximately 5% (inclusive of the mononucleotide repeat motif) of the total unigenes and 2.78% of the total downloaded EST data set. Various studies show a representation ranging from 2.65% - 16.82% [25] to 26.84% [42] in dicot species and 7% - 10% [43] in cereals or monocots. Workers who have carried out similar studies [17] [21] [25] are of the view that the variation in percentage may be due to differences in sample size, search criteria, size of database, and the tools used for EST-SSR development. The comparatively low percentage of SSR-ESTs in the present study could be owing to the more stringent preset parameters for mining compared to other similar studies [21] [42] that reported a higher percentage of SSR-ESTs.
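As a worked check of the frequencies above, the following Python sketch recomputes the two percentages from the reported counts. The variable names are illustrative. Note that the second value rounds to 2.79%, so the 2.78% quoted in the text appears to be a truncated rather than a rounded figure.

```python
# Counts as reported in the text.
unigenes = 7552
total_ests = 13513
ssr_ests = 377   # unigenes containing at least one SSR
ssrs = 395       # total SSRs found

pct_of_unigenes = 100 * ssr_ests / unigenes
pct_of_all_ests = 100 * ssr_ests / total_ests

# More SSRs than SSR-ESTs means some unigenes carry more than one SSR.
extra_ssrs = ssrs - ssr_ests

print(round(pct_of_unigenes, 2), round(pct_of_all_ests, 2), extra_ssrs)
```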

Distribution of Microsatellite Classes and Motifs
The overall analysis of the distribution of the microsatellites into the various classes of repeat types (mono-, di-, tri-, tetra-, penta- and hexa-nucleotides) showed that the number of microsatellites decreased with increasing motif size (Figure 2, Table 3). Mononucleotide repeats were the most abundant (representing 54% of the total microsatellites), followed by dinucleotides (27%) and trinucleotides (11%). The least frequent were tetra-, penta- and hexa-nucleotides (2% - 3%). The abundance of mononucleotides is in accordance with several previous reports [23] [25] [44], and the fact that they contributed nearly half of all the SSRs is similar to that in certain dicot species analysed previously [25]. The dinucleotides were the second most abundant class, as reported across most of the dicots investigated by Kumpatla and Mukhopadhyay [25], suggesting an over-representation of UTRs (untranslated regions) compared with ORFs (Open Reading Frames).
The non-dominance of trinucleotides compared to other classes, which produces the decreasing trend across classes with increasing motif size, is in contrast to several earlier studies but in concurrence with that reported for several dicots [25]. These observations on the abundance, and therefore the dominance, of one SSR motif category over the others are significant with respect to the chances of fixation of mutations against selection pressure [45]. The trinucleotides have more chances of getting fixed against mutation pressure due to selection against frameshift events [45]. The prevalence of di- over tri-nucleotides in this study could be attributed to (1) the increased stringency of the preset parameters in this study compared to previous studies [21] [22], adopted so as not to compromise on polymorphism level and thus the utility of the SSRs as markers, and (2) a bias in the representation of 5' and 3' UTRs in the EST dataset used for mining. The results were also computed with a relaxed minimum repeat length, which gave a higher percentage of total SSRs, especially trinucleotides (data not shown); the results reported here are those obtained with the more stringent minimum repeat length. A lowered representation of tetranucleotides, as observed in this study, is also suggestive of an under-representation of 3' UTRs [25].
The various classes of repeat motifs, when analyzed further, showed that some motifs in each category were more abundant than others (Table 4). For example, among the dinucleotide repeats, the AG/CT motif was the most common (33%), followed by GA/TC (31%), and the least common was AC/GT (0.94%). The abundance of AG/CT/GA/TC motifs is in concurrence with previous studies [25] [43] [44] where ESTs were used for mining SSRs, in contrast to the abundance of the AT motif when genomic data were used for mining SSRs [44] [46]. Thus, the abundance of these motifs is attributed to a systematic bias resulting from the use of ESTs (coding sequences) instead of genomic sequences (non-coding) as a source for SSR mining [43]. The CG motif was found to be totally absent, which is in concurrence with earlier studies, where it was observed to be either the least frequent [43] or absent [44]. Among the trinucleotide repeats, the most common motif was the AGA/TCT subclass, amounting to 26.6%, with the rest ranging from 2% - 13% of the total microsatellites in this class. The CCG/CGG motif is reported to be the rarest motif in dicots [23] [25] and was observed to be absent in this study. In the tetranucleotide repeats, most of the motifs were AT rich. The most common motif was AAGA/TCTT (22%) and the rest were each ~11% of the total microsatellites in this class. In the pentanucleotide class, the most common motifs were AAGAA/TTCTT and TATTT/AAATA (18% each) and the others were each 9%. In the hexanucleotide class, TTTCTC/GAGAAA (30%) formed the most abundant subclass and the rest were 10% each. In general, the motifs were observed to be AT rich, with fewer GC-rich motifs, similar to that observed for dicots [25].
The analysis of repeat units under each motif class revealed a varying range of repeat units in each class. In dinucleotide motifs, repeat units ranged from 10 - 45; in trinucleotide motifs, from 7 - 13; in tetranucleotide motifs, from 5 - 6; in pentanucleotide motifs, from 4 - 6; and hexanucleotide motifs were represented by a single class of 6 repeat units only. Further analysis of the number of repeat units in every class of SSRs, especially tri-, tetra-, penta- and hexa-nucleotides, showed that the number of microsatellites decreased with increasing number of repeat units, with little variation; e.g., among trinucleotide motifs, SSRs with 7 repeats accounted for 42.2% while those with 13 repeat units accounted for 2.2%. Amongst the pentanucleotide SSRs, the category with 4 repeat units made up as much as 45.5% of the class, in comparison to 9% for seven repeat units (Figure 3). Therefore, as the SSR motif size increases (tetra-, penta- and hexa-nucleotides), a higher proportion, in fact 100%, of the microsatellites fall in the category of <10 repeat units (Table 2), which is similar to that observed by Varshney and co-workers [43]. These results clearly indicate the effect of the increased stringency of parameters, which was maintained during this study to retain the polymorphism level and utility of the SSRs as markers, because the probability of polymorphism increases with increasing length of SSRs [47]-[49] and a higher number of repeats followed by shorter stretches would be beneficial for marker development [48]. The polymorphism reported in Jatropha in earlier studies was very low; therefore, the parameters for mining the SSRs were kept more stringent in this study, which led to a lower frequency of SSRs but with a longer repeat length. For instance, in the case of trinucleotide repeats, setting the minimum repeat length to 7 resulted in this class not being the most abundant, unlike in other similar studies.

Designing of Primers towards Marker Development
For the use of SSRs as markers, it is necessary to design primers. The SSRs commonly used for marker development are those belonging to the di-, tri- and tetra-nucleotide classes [25]. Mononucleotides are useful for population genetic analyses of chloroplast genomes [50] and can also be useful in filling gaps in linkage maps created by di-, tri- and tetra-nucleotide repeats [25], but at the same time they cause difficulties in the accurate sizing of polymorphisms [18]. Therefore, the mononucleotide repeats were not included when designing primers for potential SSR markers. Thus, out of the 395 EST-SSRs generated from 377 SSR-ESTs, primers were designed for only 181 SSRs.
For each of the SSRs, a pair of forward and reverse primers was designed by the software from the flanking regions of the respective SSR-EST. The 181 SSRs generated from 172 SSR-ESTs were used for primer designing and yielded 79 SSR-mediated primer pairs (data not shown). These 79 primer pairs were designed from 76 SSR-ESTs, as some of these contained more than one SSR, e.g. JES 56 and 57 (Supplementary Table A). The 76 SSR-ESTs that contributed to primer designing have been termed ESTs-PD and were further annotated. Primers could not be designed for some of the EST-SSRs from their respective SSR-ESTs. As reported by Varshney and coworkers [42], this could be due to any or all of the following reasons: (a) the SSR-ESTs are too short, (b) the EST-SSRs are too close to the cloning site of the SSR-ESTs, or (c) the flanking sequences are not unique, as was also observed for some of the SSRs in this study.
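Reasons (a) and (b) above can be illustrated with a small Python sketch that checks whether an SSR leaves enough flanking sequence on both sides for a primer pair. This is not the primer-design software used in the study: the function name `primer_flank_status` and the 25-base minimum flank are hypothetical, and the actual parameters of Table 1 are not reproduced here. Reason (c), non-unique flanking sequence, is not tested.

```python
def primer_flank_status(seq_len, ssr_start, ssr_end, min_flank=25):
    """Classify whether an SSR at [ssr_start, ssr_end) leaves room for primers.

    min_flank is an assumed minimum number of bases required on each
    side of the SSR; it is illustrative, not a value from the study.
    """
    ssr_len = ssr_end - ssr_start
    left_flank = ssr_start
    right_flank = seq_len - ssr_end
    # (a) The whole sequence cannot hold the SSR plus two flanks.
    if seq_len < ssr_len + 2 * min_flank:
        return "sequence too short"
    # (b) The SSR sits too near one end of the sequence.
    if left_flank < min_flank or right_flank < min_flank:
        return "SSR too close to sequence end"
    return "ok"


if __name__ == "__main__":
    print(primer_flank_status(300, 100, 124))  # SSR well inside a long EST
    print(primer_flank_status(60, 20, 44))     # short EST
    print(primer_flank_status(300, 5, 29))     # SSR near the 5' end
```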

Functional Annotation of the ESTs-PD
The GC level of the genome of J. curcas is typical of core dicots; therefore, it should be easy to annotate by sequence comparison with Arabidopsis [41]. Hence, the ESTs-PD were searched for their gene annotations using BLASTX at TAIR. The Gene Ontology (GO) annotations and functional categorization of the ESTs-PD obtained using locus identifiers are given in Supplementary Table A.
The data showed that most of the ESTs-PD express functional proteins, although there are still some for which the protein is not yet predicted. On the basis of the functions related to the predicted protein, the ESTs-PD were classified into three major classes, viz. Cellular Component, Biological Process and Molecular Function (Figure 4). Within the limits of the data available in the public database for the ESTs of J. curcas, it was found that one of the ESTs-PD (Contig1345), containing the SSR JES35, expresses a gene of the oil biosynthesis pathway (AT1G48750).

Conclusion
The in silico mining of EST-SSRs of Jatropha was carried out in this study, taking advantage of the availability of enormous EST data in the public database, the importance of ESTs in SSR mining, and the potential of modern bioinformatics tools combined with their speed and ease of use. The stringency of the preset parameters was kept high so as not to compromise on the level of polymorphism in potential EST-SSRs, and thus their utility as markers, particularly because low levels of polymorphism have been reported in Jatropha. The functional annotation of the SSR-ESTs showed that most of them are associated with expressed proteins and therefore with trait-linked genes. Thus, in this study, the genes of Jatropha carrying SSRs were identified. The EST-SSRs generated would be useful for developing trait-linked markers. As expressed sequences are highly conserved, the SSRs developed from ESTs are characterized by transferability across species. Owing to this characteristic, these SSRs could also be useful as markers across closely related species like Ricinus, thus saving the time and resources required to repeat SSR mining, or for related species with limited or no sequence information. EST-SSRs like JES35, generated from an EST expressing a gene of the fatty acid biosynthesis pathway (AT1G48750), would be of utmost importance towards marker development in Jatropha. With more data being submitted at a rapid pace to the public database, more such SSRs can be looked for in comparative genomic studies, and the knowledge generated in this study is a step towards the development of markers in this plant and in related species.

Figure 1. Overview of the study indicating the major steps and the statistics leading to generation of the EST-SSRs.

Figure 2. Distribution of SSRs into various classes.

Figure 3. Distribution of SSRs as per repeat unit size in different types.

Figure 4. Functional categorization of ESTs-PD by loci. A: Cellular component, B: Biological process, C: Molecular function.

Table 1. Parameters for primer designing.
genome of J. curcas (C = 416 Mb) reported by Carvalho and coworkers

Table 2. Categorization of SSRs by repeat units and repeat motif.

Table 3. Abundance of SSRs of various types.

Table 4. Most abundant motifs and their relative abundance in each of the SSR types.