Analysis of Simple Sequence Repeats Information from Floral Expressed Sequence Tags Resources of Papaya (Carica papaya L.)
1. Introduction
Papaya (Carica papaya L.) is an edible fruit crop of the family Caricaceae, native to Central and South America and now distributed in tropical and subtropical regions worldwide. It is a diploid (2n = 18), dicotyledonous species with a small genome of 372 Mbp [1] [2] [3] . It is a short-lived, semi-woody, herb-like, perennial tropical plant. Fruit production starts nine to ten months after germination [4] [6] .
Ranked by percentage of the US recommended daily allowances, papaya is first among the 35 most commonly consumed fruits. It is highly nutritious and contains antioxidant vitamins (A, C and E), thiamine, folate, riboflavin, niacin, potassium, iron, calcium and fibre. It contains no starch and is low in calories [5] . Papain (EC: 3.4.22.2), a proteolytic enzyme extracted from the latex of unripe fruit, is commonly used in food processing (tenderization of meat, clarification of beer and juice) and in industry for making soap, shampoo, lotions, skin care products and toothpastes [6] . It also has several medical applications, such as improving digestion and treating fever, ulcers, muscular dystrophy and osteoporosis [7] .
Papaya is a trioecious species with three sex types: female, hermaphrodite and male. Hermaphrodite plants are widely grown because every hermaphrodite plant produces fruit. Female plants are commercially important for papain production, while male plants have no use except pollination [8] . Female plants need 6% - 10% male plants in the field for fruit production [9] . Since seeds produce seedlings of unknown sex, farmers have to plant a large number of seedlings and thin them to female or hermaphrodite plants after 3 to 4 months, when the sex of the seedlings can be identified from their floral buds [10] . If the sex of papaya could be identified at the seedling stage, before transplantation to the field, a desired ratio of male and female plants (5% males: 95% females) could be achieved for cultivation, and resources such as planting space, fertilizers and water could be devoted to female and hermaphrodite plants. Papaya is considered a model fruit crop for genetic, genomic and molecular studies owing to several features such as its short generation time, small genome size, primitive sex chromosomes and efficient breeding system [11] .
Microsatellites, or simple sequence repeats (SSRs), consist of tandemly repeated motifs of one to six base pairs (mono-, di-, tri-, tetra-, penta- and hexa-nucleotides) and are found in all genomes, both prokaryotic and eukaryotic [12] [13] . They are also termed simple sequence length polymorphisms [14] , microsatellites [15] or short tandem repeats [16] . They are located in both coding and non-coding regions of the genome [17] . SSRs are preferred over other PCR-based molecular markers such as random amplified polymorphic DNA (RAPD), inter simple sequence repeats (ISSR) and amplified fragment length polymorphism (AFLP) because of their sequence specificity, multi-allelic nature, co-dominant inheritance, wide distribution in the genome, easy detection by PCR, high rate of transferability, hyper-variability and high reproducibility [13] [18] [19] [20] . The polymorphic nature of SSRs, which arises from variation in the number of repeats, was observed by Litt and Luty (1989). Microsatellites originate and evolve through slippage of the DNA strand, which creates mispairing [21] and repetitive errors during DNA replication [22] , or through unequal recombination between sister chromatids during meiosis [23] . Polymorphism is detected by designing primers from the sequences flanking the microsatellite repeat motif, amplifying genomic DNA with these primers by PCR, and resolving the products on an agarose or denaturing polyacrylamide gel to visualize allelic variation. SSRs are of two types on the basis of their location: 1) genomic-SSRs, which are distributed throughout the genome, and 2) genic-SSRs or Expressed Sequence Tag-SSRs (EST-SSRs), which are found in the genic or expressed portion of the genome. Genic-SSRs act as functional molecular markers because their “putative function” can be determined from publicly available databases by computational approaches.
There are two traditional methods for the development of genomic SSR markers: 1) construction of an SSR-enriched genomic library and 2) construction of a non-enriched genomic library. Both methods involve construction of a genomic DNA library followed by hybridization with tandemly repeated oligonucleotide probes, cloning and sequencing of candidate clones [24] , which makes them tedious, time-consuming, expensive and labor-intensive [25] . In contrast, with the advancement of modern genomics, genic or EST-SSRs are comparatively easy to develop because large numbers of ESTs from various organisms are available in public data banks. The availability of these large amounts of freely accessible data makes it possible to develop EST-based SSR markers through database mining. The development of EST-SSRs or genic-SSRs through an in silico approach is fast and efficient, and requires less cost, time and labor than the development of genomic-SSRs [26] [27] .
ESTs are short (200 - 800 bases), single-pass, random sequence reads of cDNAs derived from cDNA libraries. EST-SSRs have advantages over genomic SSRs: they take less time to develop, are readily available, are cheaper to develop, detect variation in the expressed portion of the genome and are sequence-specific. Moreover, EST-SSRs show a high rate of transferability, which means that EST-SSR markers isolated from one species can be transferred to related species or genera within the same family because genic regions are conserved [27] [28] . EST-SSRs have therefore been used in several plants for applications such as the study of genetic diversity [29] , cross-transferability [30] , comparative analysis [31] and linkage map construction [32] . In papaya, several microsatellite markers have been developed for the study of genetic diversity [33] [34] and marker-assisted selection (MAS) [35] , but most of these SSRs are genomic in nature.
The complete papaya genome has been sequenced by Ming et al. [5] , which generated an enormous number of ESTs and other DNA sequences that are freely accessible at NCBI (http://www.ncbi.nlm.nih.gov). Together with the availability of several SSR mining tools such as MISA [13] , TROLL [36] , SciRoKo [37] and Msat commander [38] , this makes it possible to use the available ESTs to develop genic SSRs that could be applied to papaya crop improvement. Only a few studies of microsatellite analysis from genomic sequences [39] and from ESTs [40] have been performed in papaya. Moreover, only a limited number of genic or EST-SSR markers, which are of particular importance because they derive from the transcribed portion of the genome, are available in C. papaya. The present study was therefore undertaken to develop genic SSRs from the available EST database of C. papaya. The study had two objectives: 1) to mine SSRs in silico from the papaya ESTs available in the NCBI database, and 2) to develop EST-SSR primers. The developed primers could be used for estimation of genetic diversity, cross-transferability across species and genera, comparative genomics studies and identification of sex-specific markers in papaya.
2. Materials and Methods
The methodology of in silico mining and development of EST-SSR primers from papaya floral ESTs is shown in Figure 1.
2.1. Retrieval of Floral Papaya EST Sequences
EST sequences of C. papaya are available at NCBI (www.ncbi.nlm.nih.gov/nucest/). A total of 75,846 papaya floral EST sequences (from male, female and hermaphrodite flowers at pre-meiotic and post-meiotic stages) were retrieved from the EST database (dbEST) of NCBI in FASTA format. These EST sequences were submitted by Ming et al. [5] .
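As an illustration of the input to the pipeline, the following minimal Python sketch shows how a FASTA file of downloaded ESTs could be read into memory and counted. It is not part of the original workflow; the function name `read_fasta` and the two example records (identifiers and sequences) are hypothetical and do not correspond to actual dbEST entries.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict of {identifier: sequence}."""
    records = {}
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records[header] = "".join(chunks)
            fields = line[1:].split()
            header = fields[0] if fields else "unnamed"
            chunks = []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Hypothetical two-record example (not real dbEST entries).
example_fasta = """>EST_example_1 floral bud
ACGTTGCAAGCT
TTGACC
>EST_example_2 floral bud
ggcatcgatcga
"""

ests = read_fasta(example_fasta.splitlines())
print(len(ests), "sequences read")
```

For a real file, the same function can be applied to an open file handle, e.g. `read_fasta(open(path))`, where `path` is the location of the downloaded FASTA file.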
2.2. EST Sequences Processing
ESTs are single-pass DNA sequences and are therefore error-prone. EST sequences may contain vector/adaptor contamination, low complexity sequences and poly-A/T tails. The EST sequences were therefore first screened with the DDBJ VecScreen tool (http://ddbj.nig.ac.jp/vecscreen/) to identify vector contamination. This tool detects vectors, adaptors and other suspect contaminants using NCBI’s UniVec core vector/adaptor library. The EST sequences were then processed using SeqTrim NEXT [41] with its default parameters. The program takes a FASTA format sequence file as input and, according to the given parameters, removes vector/adaptor contamination and low complexity regions and trims poly-A and poly-T tails from the EST sequences.
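The tail-trimming step can be illustrated with a short Python sketch. This is a simplified stand-in and not the algorithm used by SeqTrim NEXT: it only removes a 3' poly-A run or a 5' poly-T run, and the threshold `min_run = 10` is an assumed value chosen for illustration, not a parameter reported in this study.

```python
import re


def trim_poly_tails(seq, min_run=10):
    """Remove a trailing poly-A run and a leading poly-T run.

    min_run is an assumed illustrative threshold: runs shorter than
    this are left untouched.
    """
    seq = seq.upper()
    seq = re.sub(r"A{%d,}$" % min_run, "", seq)  # 3' poly-A tail
    seq = re.sub(r"^T{%d,}" % min_run, "", seq)  # 5' poly-T stretch
    return seq


with_tail = "ACGTACGT" + "A" * 12
trimmed = trim_poly_tails(with_tail)
print(len(with_tail), "->", len(trimmed))
```

Short A or T runs inside or at the end of a sequence are kept, so genuine mononucleotide SSRs below the threshold are not removed by this step.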
2.3. Assembly of Floral Papaya EST Sequences
All the processed floral EST sequences were assembled using the SeqMan program of DNASTAR Lasergene ver. 9.0 with its default parameters (minimum matching percent = 80%). This software reports contigs, singletons and assembly statistics. Sequences that cannot be grouped because of their low similarity to other ESTs remain as singletons. Contigs and singletons together constitute a non-redundant dataset and were therefore used for SSR identification.
![]()
Figure 1. Flowchart showing methodology of in silico mining and development of EST-SSR primers from papaya floral ESTs.
2.4. Detection of Genic Microsatellite
Potential SSRs were detected in the assembled floral ESTs by submitting the sequences to the SSR mining tool SciRoKo, version 2.1. The minimum number of repeat units was set to 4 for mono- and di-nucleotides and 3 for tri-, tetra-, penta- and hexa-nucleotides [42] (i.e. the minimum number of times the motif is repeated). Imperfect SSRs were analysed under the mismatched, fixed penalty search mode of SciRoKo. The program takes a FASTA formatted sequence file as input and produces an output file with the sequence name, SSR count, SSR type, SSR motif, repeat number, sequence length and GC content. SciRoKo is freely available on the internet and can be downloaded and installed on a PC.
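The search criteria for perfect SSRs can be sketched in Python as follows. This is a simplified illustration and not SciRoKo's implementation: it reports only perfect repeats, scans each motif length independently, and uses the minimum repeat numbers stated above (4 for mono- and di-nucleotides, 3 for tri- to hexa-nucleotides). The function names and the example sequence are hypothetical.

```python
# Minimum number of repeat units per motif length, as stated in the text.
MIN_REPEATS = {1: 4, 2: 4, 3: 3, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter motif."""
    n = len(motif)
    for period in range(1, n):
        if n % period == 0 and motif[:period] * (n // period) == motif:
            return False
    return True


def find_perfect_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_number) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for k, min_rep in sorted(min_repeats.items()):
        i = 0
        while i + k <= len(seq):
            motif = seq[i:i + k]
            if set(motif) <= set("ACGT") and is_primitive(motif):
                n = 1
                # Count how many times the motif repeats in tandem.
                while seq[i + n * k:i + (n + 1) * k] == motif:
                    n += 1
                if n >= min_rep:
                    hits.append((i, motif, n))
                    i += n * k  # continue after the repeat tract
                    continue
            i += 1
    return sorted(hits)


# Hypothetical sequence containing (A)5, (CT)5 and (GAA)3.
example = "GGCT" + "A" * 5 + "G" + "CT" * 5 + "GAA" * 3 + "GTC"
for start, motif, repeats in find_perfect_ssrs(example):
    print(start, motif, repeats)
```

With such low thresholds, a run of only four identical bases already counts as a mononucleotide SSR, which should be kept in mind when comparing SSR counts between studies that use different search criteria.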
2.5. Primer Designing
Floral EST sequences containing microsatellites were used to design flanking forward and reverse EST-SSR primer pairs using the online software BatchPrimer3 v1.0 with default parameters (http://probes.pw.usda.gov/cgi-bin/batchprimer3/batchprimer3.cgi). BatchPrimer3 is a primer design tool based on Primer3 [43] that accepts up to 500 sequences at a time as input. The major criteria for primer design were as follows: primer length (18 - 23 bp, with optimum value 20 bp); Tm (57˚C - 63˚C, with optimum value 60˚C); GC content (40% - 60%, with optimum value 50%); maximum Tm difference between forward and reverse primer 1.5˚C; and product size range (100 - 300 bp, with optimum value 150 bp). Twenty-eight primer pairs were custom synthesized from these designed primers by Eurofins Genomics, Bangalore, India.
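Some of the design criteria listed above can be expressed as a simple filter, sketched below in Python. This is only an illustration of the length, GC content and product size limits; it does not reproduce BatchPrimer3 or Primer3, and Tm is deliberately left out because it depends on the thermodynamic model used by Primer3. The function names and the two primer sequences are hypothetical.

```python
def gc_percent(primer):
    """GC content of a primer as a percentage of its length."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def primer_ok(primer, min_len=18, max_len=23, min_gc=40.0, max_gc=60.0):
    """Check primer length (18 - 23 bp) and GC content (40% - 60%)."""
    return (min_len <= len(primer) <= max_len
            and min_gc <= gc_percent(primer) <= max_gc)


def pair_ok(forward, reverse, product_size, min_size=100, max_size=300):
    """Check both primers and the expected product size (100 - 300 bp)."""
    return (primer_ok(forward) and primer_ok(reverse)
            and min_size <= product_size <= max_size)


# Hypothetical primer pair, for illustration only.
forward = "AGCTGACCTGAAGTCCATGA"
reverse = "TTGACCGTAGCATCCAGTTC"
print(gc_percent(forward), gc_percent(reverse), pair_ok(forward, reverse, 150))
```

A filter of this kind can be useful as a sanity check on exported primer tables, but the actual primer selection in this study was done by BatchPrimer3.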
3. Results
3.1. Retrieval, Processing and Assembly of Papaya Floral ESTs
A total of 75,846 papaya floral ESTs were downloaded from NCBI in FASTA format. All EST sequences were screened with DDBJ VecScreen to identify vector and adaptor contamination, low complexity sequences and poly-A/T tails, and were then processed using SeqTrim NEXT to remove them. A total of 59,522 floral EST sequences were obtained after processing (Table 1). The processed floral EST sequences were assembled using the SeqMan program of DNASTAR Lasergene ver. 9.0 with its default parameters. A total of 26,039 floral unigenes (7960 contigs and 18,079 singletons) were generated by the assembly (Table 1). These assembled floral unigenes were then used for mining of SSRs.
![]()
Table 1. Summary of in silico mining of EST-SSRs from papaya floral EST database.
3.2. Frequency, Distribution and Characterization of SSR Repeat Types
SSRs in the floral ESTs were mined using the SSR mining tool SciRoKo. The mined EST-SSRs were classified into three types on the basis of their repeat sequences: perfect SSRs, containing a single uninterrupted motif; imperfect SSRs, in which a pair of bases that does not match the motif sequence is present within the repeat; and compound SSRs, containing more than two adjacent different motifs [35] . Mining of the floral unigenes resulted in 433,782 perfect SSRs with an average density of 3610.84 SSR/Mb, 204,968 compound SSRs with an average density of 5118.52 SSR/Mb and 6061 imperfect SSRs with an average density of 50.45 SSR/Mb (Figure 2). The frequency distribution of the mined perfect SSR repeat types is presented in Figure 3 and Figure 4. Mononucleotide repeats (411,156; 94.7%) were the most abundant perfect repeat type in floral unigenes, followed by trinucleotide repeats (13,792; 3.1%) and dinucleotide repeats (7697; 1.7%). Tetra-, hexa- and penta-nucleotide repeat types accounted for only 772 (0.17%), 203 (0.04%) and 162 (0.03%) of the perfect SSRs in floral unigenes, respectively (Figure 3, Table 2).
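The quantities reported in this section are simple ratios, as the following Python sketch illustrates. It assumes that density is the number of SSRs divided by the total length of searched sequence in megabases; the total sequence length is not stated in the text, so the density example uses made-up numbers. The counts per repeat type are those given in the paragraph above, with labels following the order in which that paragraph lists them.

```python
def ssr_density(ssr_count, total_bp):
    """SSRs per megabase (Mb) of searched sequence."""
    return ssr_count / (total_bp / 1_000_000)


def shares(counts):
    """Percentage share of each repeat type among all counted SSRs."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


# Perfect SSR counts per repeat type, as listed in the text above.
perfect_counts = {
    "mono": 411156,
    "tri": 13792,
    "di": 7697,
    "tetra": 772,
    "hexa": 203,
    "penta": 162,
}

total_perfect = sum(perfect_counts.values())
perfect_shares = shares(perfect_counts)
print(total_perfect)
for name, pct in perfect_shares.items():
    print(name, round(pct, 2))

# Made-up example: 50 SSRs found in 2,000,000 bp of sequence.
print(ssr_density(50, 2_000_000), "SSR/Mb")
```

The per-type counts add up to the reported total of 433,782 perfect SSRs, which provides a quick consistency check on the tabulated values.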
3.3. Frequency, Distribution and Characterization of SSR Repeat Motifs
![]()
Figure 2. Abundance of different SSR types in papaya floral unigenes.
![]()
Figure 3. Frequency distribution of perfect SSR repeat types in papaya floral unigenes.
![]()
Figure 4. Frequency distribution of different repeat motifs in papaya floral unigenes.
During standardization, the reverse complements of microsatellite motifs were considered, and equivalent microsatellite motifs were grouped together. For example, a poly-A repeat is equivalent to a poly-T repeat on the complementary strand, and AC is equivalent to CA in a different reading frame and to TG and GT on the complementary strand. Similarly, the trinucleotide motif AGC is equivalent to GCA and CAG in different reading frames and to GCT, CTG and TGC on the complementary strand. Thus, there are two possible mononucleotide motifs, four dinucleotide, ten trinucleotide, 33 tetranucleotide, 102 pentanucleotide and 350 hexanucleotide motifs (Table 2). The frequency distribution of the abundant mono-, di-, tri-, tetra-, penta- and hexanucleotide perfect SSR motifs is presented in Figure 5. Among mononucleotide repeats, the most abundant SSR motif was A/T (69.3%), while G/C accounted for only 30.6% in floral unigenes. Among dinucleotide repeats, the most frequent motif was AG/CT (61%), while GC/CG (2.1%) was the least frequent. Among trinucleotide repeats, the most abundant motif was AAG/CTT (31%), while GGC/GCC (2.3%) was the least frequent. Among tetranucleotide repeats, AAAG/CTTT (21.3%) was the most frequent motif. Among pentanucleotide repeats, AAAAG/CTTTT (17.9%) was the most abundant motif. Among hexanucleotide repeats, AAAAAG/CTTTTT (7.8%) was the most frequent in floral unigenes.
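The motif standardization described in this section, in which reading-frame shifts and reverse complements of a motif are treated as one motif class, can be sketched in Python as follows. The sketch is an illustration rather than the procedure implemented in SciRoKo; the function names are hypothetical, and the choice of the alphabetically smallest member as the representative of each class is an assumption made for convenience. Motifs that are themselves repeats of a shorter motif (e.g. CTCT) are excluded before counting.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif):
    """Reverse complement of a DNA motif."""
    return motif.translate(COMPLEMENT)[::-1]


def is_primitive(motif):
    """True if the motif is not a repeat of a shorter motif."""
    n = len(motif)
    for period in range(1, n):
        if n % period == 0 and motif[:period] * (n // period) == motif:
            return False
    return True


def canonical(motif):
    """Smallest motif among all frame shifts on both strands."""
    candidates = []
    for strand in (motif, reverse_complement(motif)):
        for i in range(len(strand)):
            candidates.append(strand[i:] + strand[:i])
    return min(candidates)


def count_motif_classes(k):
    """Number of distinct motif classes for motifs of length k."""
    classes = set()
    for letters in product("ACGT", repeat=k):
        motif = "".join(letters)
        if is_primitive(motif):
            classes.add(canonical(motif))
    return len(classes)


class_counts = [count_motif_classes(k) for k in range(1, 7)]
print(class_counts)  # → [2, 4, 10, 33, 102, 350]
```

For example, `canonical("CAG")`, `canonical("GCT")` and `canonical("TGC")` all return `"AGC"`, so these motifs are counted as a single trinucleotide class.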
A total of 176 different motif types of imperfect SSRs were identified in floral unigenes. Among imperfect SSRs, mononucleotide repeats (3426; 56.5%) were the most abundant (Table 3).
3.4. Primer Designing
In this study, a total of 3807 primer pairs for floral papaya ESTs (excluding mononucleotide repeats) were successfully designed using BatchPrimer3 v1.0. Twenty-eight primer pairs were custom synthesized from these designed primers. The details of the EST-SSR primers, along with their Tm, product size, GC% and corresponding SSR motifs, are listed at https://goo.gl/sTJUdn.
![]()
Table 2. The occurrence of different perfect SSR motif types in papaya floral unigenes.
![]()
Figure 5. Frequency distribution of abundant mono-, di-, tri-, tetra-, penta- and hexanucleotides perfect SSR motifs in papaya floral unigenes.
![]()
Table 3. Summary of the imperfect SSR identified in papaya floral unigenes.
4. Discussion
4.1. Identification and Characterization of Papaya Floral EST-SSRs
In this study, all 75,846 available C. papaya floral ESTs were downloaded from NCBI. Assembly generated a total of 26,039 floral unigenes. All three SSR types, namely perfect, imperfect and compound SSRs, were identified using the SciRoKo program. Perfect SSRs (433,782) were the most abundant, followed by compound (204,968) and imperfect SSRs (6061) in papaya floral unigenes. The number of perfect EST-SSRs (433,782) mined in the present study is higher than in an earlier report on papaya, in which 10,688 SSRs were identified [39] . In the present study, the average densities of perfect, imperfect and compound SSRs were 3610.84 SSR/Mb, 50.45 SSR/Mb and 5118.52 SSR/Mb, respectively; the density of perfect SSRs is higher than in previous studies of papaya, which reported 1340 SSR/Mb [39] , 746 SSR/Mb [44] and 656 SSR/Mb [45] . Such variation in the density of identified microsatellites is usual among different reports and is mainly due to differences in the algorithms, parameter settings, minimal repeat length, SSR search criteria, size of the dataset analysed and database-mining tools [46] .
Characterization of the SSRs revealed that mononucleotide repeats were the most abundant (411,156; 94.7%) in papaya floral unigenes, which is similar to previous reports on several plant species, namely pea [47] , olive [48] , Camellia sinensis [49] , tobacco [50] and Taxodium zhong shansa [51] , but in contrast to those on pineapple [52] and forage legume [53] , in which di- and tri-nucleotides were identified as the most abundant repeats. The frequency of mononucleotides (94.7%) was the highest among the repeat types in our study, which is similar to a previous report on papaya in which mononucleotides made the largest contribution (69.1%) [40] . The abundance of mononucleotide repeats in assembled ESTs suggests that they are present within the expressed regions and not at the ends of the mRNA sequences [54] . Mononucleotide repeats have been used to study population genetics in chloroplast genomes [55] . Trinucleotides (13,792; 3.1%) were the second most abundant repeats, followed by dinucleotides (7697; 1.7%). The remaining SSR types, tetra-, penta- and hexa-nucleotides, were found at low frequencies of only 772 (0.17%), 203 (0.04%) and 162 (0.03%), respectively.
Among mononucleotide repeats, the most abundant motif was A/T (69.3%), which is similar to previous reports on Allium sativum [54] and Humulus lupulus [56] . Among dinucleotides, the AG/CT motif (61%) was the most abundant in this study. A similar trend for AG/CT was also found in several plant species such as coffee [57] , Madagascar periwinkle [58] and European hazelnut [59] . AG/CT is a common dinucleotide motif among plant genomes [12] . One possible explanation is that AG/CT motifs frequently occur in 5' UTRs and are involved in gene regulation [46] . The GC/CG motif (2.1%) was the least frequent in floral unigenes, indicating selective pressure against this class of repeats. Among trinucleotides, the AAG/CTT motif (31%) was the most abundant and AGT and GGC were the least frequent, which is in agreement with earlier reports on Pisum sativum [60] and on Salix and Eucalyptus [61] , but in contrast to finger millet [62] , in which the CGG motif was the most abundant, indicating that the abundance of EST-SSRs usually varies between plant species. According to Morgante et al. [63] , AAG/CTT is the most common trinucleotide motif among dicotyledonous plants, while CCG/CGG is a specific feature of monocot genomes [12] .
In this study, imperfect SSRs were also mined, among which mononucleotide repeats (3426; 56.5%) were the most abundant. The variation in frequency, distribution and abundance of SSRs among different plant species depends on various factors, such as the SSR search criteria, the size of the dataset analysed and the database-mining tools [46] .
4.2. Development of Floral EST-SSR Primers
Microsatellites are usually characterized by the presence of conserved flanking sequences. In this study, a total of 3807 primer pairs for floral papaya ESTs were successfully designed. The remaining SSR-containing sequences failed to yield a primer pair for one or more of the following reasons: 1) the flanking sequences were too short, 2) no flanking site was available for primer design, or 3) the sequence did not meet the primer design criteria of the BatchPrimer3 v1.0 software [46] .
5. Conclusion
In papaya, the large amount of available EST data has facilitated the identification of genic SSRs. The present study examined the frequency, type and distribution of microsatellites in floral ESTs of papaya and describes the development of EST-SSR primers in papaya. Developing EST-derived SSRs in silico saves both cost and time and is less labor-intensive. A total of 26,039 floral unigenes were generated by assembly of the papaya floral EST sequences. In these floral unigenes, 433,782 perfect SSRs, 204,968 compound SSRs and 6061 imperfect SSRs were identified. A total of 3807 primer pairs for floral papaya ESTs were designed, and 28 primer pairs were custom synthesized from these designed primers. The floral EST-derived SSR primers reported in this study are being used in genetic diversity analysis, in cross-transferability analysis among Carica and its related genera, and in identification of sex-specific markers among female, hermaphrodite and male plants of various papaya varieties. These primers could also be useful in comparative mapping to study the order of genes among species closely related to C. papaya, and for marker-assisted selection of desirable traits (e.g. disease resistance) in papaya.
Acknowledgements
The financial assistance in the form of research projects sanctioned by Department of Science and Technology (DST), Government of India, New Delhi and Council of Science and Technology (CST), Uttar Pradesh, India and Junior Research Fellowships (JRFs) to Priyanka and Dileep Kumar by Department of Biotechnology (DBT), Government of India, New Delhi and University Grants Commission (UGC), Government of India, New Delhi, respectively, are gratefully acknowledged.