Genetic Diversity and Population Structure of Tomato (Solanum lycopersicum) Germplasm Developed by Texas A&M Breeding Programs

Genetic variation developed in plant breeding programs is fundamental to creating new combinations that result in cultivars with enhanced characteristics. Over the years, tomato (Solanum lycopersicum) breeding programs associated with the Texas A&M University system have developed morphologically diverse lines of tomatoes selected for heat tolerance, fruit quality, and disease resistance to adapt them to Texas growing conditions. Here we explored the intraspecific genetic variations of 322 cultivated tomato genotypes, including 300 breeding lines developed by three Texas A&M breeding programs, as an initial step toward implementing molecular breeding approaches. Genotyping by sequencing using low coverage whole-genome sequencing (SkimGBS) identified 10,236 high-quality single-nucleotide polymorphisms (SNPs) that were used to assess genetic diversity, population structure, and phylogenetic relationship between genotypes and breeding programs. Model-based population structure analysis, phylogenetic tree construction, and principal component analysis indicated that the genotypes were grouped into two main clusters. Genetic distance analysis revealed greater genetic diversity among the products of the three breeding programs. The germplasm developed at Texas A&M programs at Weslaco, College Station, and by Dr. Paul Leeper exhibited genetic diversity ranges of 0.175-0.434, 0.099-0.392, and 0.183-0.347, respectively, suggesting that there is enough variation within and between the lines from the three programs to perform selection for cultivar development. The SNPs identified here could be used to develop molecular tools for selecting various traits of interest and to select parents for future tomato breeding.

How to cite this paper: Kandel, D.R., Bedre, R.H., Mandadi, K.K., Crosby, K. and Avila, C.A. (2019) Genetic Diversity and Population Structure of Tomato (Solanum lycopersicum) Germplasm Developed by Texas A&M Breeding Programs. American Journal of Plant Sciences, 10, 1154-1180. https://doi.org/10.4236/ajps.2019.107083

Received: May 29, 2019; Accepted: July 22, 2019; Published: July 25, 2019

Copyright © 2019 by author(s) and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY 4.0). http://creativecommons.org/licenses/by/4.0/
Open Access

modern molecular breeding techniques for population management, including methods to obtain desired genetic heterogeneity in the end-product cultivars.
One of the first steps in implementing molecular breeding approaches is to estimate the genetic variation within the breeding lines. Heritable genetic variability is indispensable in plant breeding aimed at developing new cultivars that express desirable characteristics generation after generation [13]. Furthermore, the development of improved varieties is enhanced when parents are selected based on genetic heterogeneity [14], making estimation of genetic variation necessary in breeding programs to allow the selection of parental lines either to increase breeding population variation or to develop hybrids for cultivar release [15].
Genetic variation between breeding lines can be effectively determined through the use of molecular markers. In tomato, genetic diversity has been extensively studied using a wide range of molecular data. Miller and Tanksley (1990) [16] used restriction-fragment-length polymorphism (RFLP) markers for genetic diversity analysis of self-incompatible and self-compatible tomato species. To unveil the genetic variations that underlie fruit sugar and organic acid production, Zhao et al. (2016) [17] conducted a genetic diversity analysis of 174 tomato accessions using simple sequence repeat (SSR) markers. To gain insight into the morphological traits of fruits, Sacco et al. (2015) [18] performed a genetic diversity analysis of 123 tomato genotypes using single-nucleotide polymorphisms (SNPs). Similarly, Lin et al. [19] and Aflitos et al. [20] performed evolutionary studies of tomato and its wild relatives using SNPs.
The advent of next-generation sequencing technologies coupled with bioinformatics has led genetic diversity studies into a new era. Sequencing of tomato has resulted in the discovery of large numbers of SNPs distributed throughout the genome [20] [21] [22]. Furthermore, the cultivated tomato genome has been fully sequenced [23], and genotyping by sequencing (GBS) has emerged as a powerful tool for sequencing large populations. The availability of large numbers of SNPs distributed throughout the genome, a reference genome, and the GBS technique [23] [24] [25] has made large intraspecific studies possible. This is important because most prior studies focused on interspecific variation, and only a few intraspecific studies have been performed [19] [20] [26] [27]. The SNPs identified in such intraspecific studies offer better clues to the genetic control of agronomic traits and can be used to deduce phylogenetic relationships. Parent selection based on such genetic information can greatly enhance breeding efficiency and help to achieve breeding goals such as high quality (flavor, color, shape), long shelf life, disease resistance, and heat tolerance.
In the present study, we used three representative sets of tomato breeding A&M breeding programs possess a high level of genetic diversity that, upon selection, can be used to develop high-yielding adapted cultivars for Texas production. Furthermore, intraspecific SNPs identified in the present study could be used to identify economically important traits in cultivated tomatoes. Finally, based on the results of phylogenetic and genetic distance analyses, hybridization strategies can be developed to increase diversity and optimize hybrid development within and between breeding programs.

Plant Material
A total of 322 tomato (Solanum lycopersicum) genotypes were evaluated in this study. Among them, 300 genotypes were developed by three independent tomato breeding programs in the Texas A&M University (  (Table S1). These genotypes were developed by hybridizing Texas A&M germplasm with a diverse set of parents, including accessions from the USDA National Germplasm System and the other public breeding programs mentioned below, followed by selfing up to the F9 generation. Pedigree information for all the breeding lines developed in Leeper's program and some from Crosby's program has been lost (Table S1). Breeding lines from all three breeding programs harbor good phenotypic variation in tomato fruit shape, size, and color. Besides the genotypes from the Texas A&M University breeding programs, 16 genotypes from the USDA collection, 3 from the Asian Vegetable Research and Development Center (AVRDC), and 3 developed by the University of Florida tomato breeding program (designated FLA) were also included in the present study (Table S1).

DNA Extraction
Leaves from twelve four-week-old seedlings of the respective genotypes were collected and combined into a single bulk sample. Tissue was lyophilized, homogenized, and stored at −20 °C until extraction. Genomic DNA was extracted from 50 mg of homogenized tissue using the CTAB method [28]. Qualitative and quantitative tests of the DNA were performed by electrophoresis and Qubit 2.0 fluorometry (Life Technologies, Carlsbad, CA), respectively. For each sample, 1.2 µg of DNA was sent to the Texas A&M Genomics and Bioinformatics services (College Station, TX) for sequencing.

GBS, SNP Discovery, and Population Structure
Genotyping of 322 tomato genotypes was performed using low-coverage whole-  [30]. The aligned BAM files were sorted, quality filtered for mapping, and filtered for duplicate reads using SAMtools [31] and Picard (http://broadinstitute.github.io/picard/index.html). The GATK HaplotypeCaller (HC) [32] was used for SNP calling from the aligned data of the 322 tomato genotypes. These raw polymorphic SNPs were filtered to remove SNPs with a high percentage of missing genotypes and low minor allele frequency (MAF). The resulting genotypes were imputed using Beagle (v4.00) [33]. The imputed genotypes were further filtered to keep only genotypes with probability ≥0.9. The polymorphic SNPs were subsequently filtered to remove the SNPs with >30% missing genotypes.
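The missing-rate and MAF filters described above can be illustrated with a minimal sketch. It is not the pipeline used in this study (which relied on GATK and Beagle); it assumes biallelic SNPs coded as alternate-allele counts (0, 1, 2) with None for a missing call. The function names and the toy SNP identifiers are hypothetical. The 30% missing-rate threshold comes from the text, and the 5% MAF threshold is taken from the Figure S1 caption.

```python
from typing import Dict, List, Optional

Genotype = Optional[int]  # alt-allele count 0/1/2, or None when the call is missing


def missing_rate(calls: List[Genotype]) -> float:
    """Fraction of samples with no genotype call at this SNP."""
    return sum(c is None for c in calls) / len(calls)


def minor_allele_frequency(calls: List[Genotype]) -> float:
    """MAF computed from the non-missing calls of a biallelic SNP."""
    observed = [c for c in calls if c is not None]
    if not observed:
        return 0.0
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1.0 - alt_freq)


def filter_snps(snps: Dict[str, List[Genotype]],
                max_missing: float = 0.30,
                min_maf: float = 0.05) -> Dict[str, List[Genotype]]:
    """Keep SNPs with missing rate <= max_missing and MAF >= min_maf."""
    return {
        snp_id: calls
        for snp_id, calls in snps.items()
        if missing_rate(calls) <= max_missing
        and minor_allele_frequency(calls) >= min_maf
    }


# Toy data: hypothetical SNP identifiers, four samples each.
toy = {
    "snp_ok": [0, 2, 2, 0],             # no missing calls, MAF 0.5
    "snp_missing": [0, None, None, 2],  # 50% missing calls
    "snp_rare": [0, 0, 0, 0],           # monomorphic, MAF 0
}
kept = filter_snps(toy)
```

In this toy input only the first SNP passes both filters; the other two fail the missing-rate and MAF thresholds, respectively.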
The population structure and hybrid forms of tomato genotypes were inferred using the Bayesian model-based clustering program STRUCTURE (v2.3.4) [34] using polymorphic SNPs obtained from the GBS analysis. To determine the number of populations among the genotypes, STRUCTURE was run with a burn-in period of 5000 iterations followed by 5000 Markov-chain Monte Carlo (MCMC) steps, using an admixture model with correlated allele frequencies among populations. The program was run independently three times for each value of K from 1 to 10.
To detect the true value of K (the number of populations), we used the uppermost level of structure identified with the ΔK method described by Evanno et al. (2005) [35]. The tomato genotypes were then assigned to each true population (Q) based on their proportion of population membership at that K.
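As an illustration of the ΔK criterion, the sketch below computes ΔK from the log-likelihoods of replicate STRUCTURE runs. It follows the form commonly used in practice, ΔK = |mean L(K+1) − 2·mean L(K) + mean L(K−1)| / sd(L(K)), with means and standard deviations taken across replicates. The function name and the log-likelihood values are made up for illustration and are not results from this study.

```python
from statistics import mean, stdev
from typing import Dict, List


def evanno_delta_k(lnp: Dict[int, List[float]]) -> Dict[int, float]:
    """Return Delta K for every K that has runs at K-1, K, and K+1.

    lnp maps K to the log-likelihoods, L(K), of its replicate runs.
    Delta K = |mean L(K+1) - 2 * mean L(K) + mean L(K-1)| / sd(L(K)).
    """
    delta = {}
    for k in sorted(lnp):
        if k - 1 not in lnp or k + 1 not in lnp:
            continue  # Delta K is undefined at the ends of the K range
        sd = stdev(lnp[k])
        if sd == 0:
            continue  # skip K when replicates are identical (division by zero)
        second_diff = mean(lnp[k + 1]) - 2 * mean(lnp[k]) + mean(lnp[k - 1])
        delta[k] = abs(second_diff) / sd
    return delta


# Made-up log-likelihoods for three replicate runs at K = 1..4 (illustration only).
toy_lnp = {
    1: [-1000.0, -1001.0, -999.0],
    2: [-800.0, -801.0, -799.0],
    3: [-790.0, -795.0, -785.0],
    4: [-788.0, -780.0, -796.0],
}
delta_k = evanno_delta_k(toy_lnp)
best_k = max(delta_k, key=delta_k.get)  # K with the largest Delta K
```

Because ΔK needs runs at both K−1 and K+1, it cannot be evaluated at the smallest or largest K tested, which is one reason to run a range of K wider than the expected answer.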
The population structure of 322 tomato genotypes was visualized using a bar plot (sorted by Q) in the Python matplotlib package.
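The assignment and sorting step can be sketched as follows. The text does not state the exact assignment rule, so this example assumes each genotype is placed in the cluster with its largest membership proportion; the function name, genotype names, and Q values are hypothetical. The returned order (by cluster, then by decreasing membership) is the order one might pass to a bar-plot routine.

```python
from typing import Dict, List, Tuple


def assign_clusters(q_matrix: Dict[str, List[float]]) -> List[Tuple[str, int, float]]:
    """Assign each genotype to the cluster with its largest membership proportion.

    Returns (genotype, cluster_number, max_q) tuples ordered by cluster and then
    by descending membership, suitable for a bar plot sorted by Q.
    """
    rows = []
    for name, q in q_matrix.items():
        best = max(range(len(q)), key=lambda i: q[i])
        rows.append((name, best + 1, q[best]))  # clusters numbered from 1 (Q1, Q2, ...)
    return sorted(rows, key=lambda r: (r[1], -r[2]))


# Hypothetical membership proportions for K = 2 (names and values are made up).
toy_q = {
    "line_A": [0.95, 0.05],
    "line_B": [0.10, 0.90],
    "line_C": [0.60, 0.40],
    "line_D": [0.02, 0.98],
}
ordered = assign_clusters(toy_q)
```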

Phylogenetic and Principal Component Analysis
Phylogenetic analysis was performed using the unweighted pair-group method with arithmetic mean (UPGMA) algorithm implemented in TASSEL v5.2.52 [36]. The phylogenetic tree obtained from TASSEL was visualized using iTOL v4.3.3 and each population was annotated using customized annotation files [37]. The pairwise genetic distance matrix between each pair of genotypes was calculated using TASSEL v5.2.52 and visualized using the Python matplotlib package. The PCA was performed using the PCA function in TASSEL. The first three principal components were exported and visualized as a three-dimensional (3D) scatter plot using the Python matplotlib package.
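To make the UPGMA step concrete, the following is a minimal sketch of the algorithm on a small distance matrix. It is a plain-Python illustration, not TASSEL's implementation; the function name, leaf names, and distances are hypothetical. At each step the two clusters with the smallest average leaf-to-leaf distance are merged, and the merge height is half that distance.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple


def upgma(names: List[str],
          dist: Dict[FrozenSet[str], float]) -> List[Tuple[str, str, float]]:
    """Cluster by UPGMA and return the merges as (cluster_a, cluster_b, height).

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    Height is half the average distance between the two merged clusters.
    """
    # Each active cluster is stored as label -> list of member leaves.
    clusters = {name: [name] for name in names}
    merges = []

    def avg_dist(a: str, b: str) -> float:
        # Unweighted average over all leaf pairs spanning the two clusters.
        pairs = [(x, y) for x in clusters[a] for y in clusters[b]]
        return sum(dist[frozenset((x, y))] for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: avg_dist(*p))
        merges.append((a, b, avg_dist(a, b) / 2))
        merged = clusters.pop(a) + clusters.pop(b)
        clusters[f"({a},{b})"] = merged  # Newick-like label for the new cluster
    return merges


# Made-up distances between four hypothetical genotypes.
toy_names = ["A", "B", "C", "D"]
toy_dist = {
    frozenset(("A", "B")): 0.1,
    frozenset(("C", "D")): 0.2,
    frozenset(("A", "C")): 0.4,
    frozenset(("A", "D")): 0.4,
    frozenset(("B", "C")): 0.4,
    frozenset(("B", "D")): 0.4,
}
merges = upgma(toy_names, toy_dist)
```

In the toy example A and B join first, then C and D, and the two pairs join last, giving two clearly separated groups.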

Generation of High-Quality Tomato GBS Data
We generated a total of ~598 million sequence reads (paired-end, 150 bp) using across all 322 tomato genotypes, and SNPs with low genotype probability (<0.9) (Figure S1 and Figure S2). We used the remaining 10,236 high-quality SNPs for downstream analysis. SNPs were not distributed evenly across all chromosomes SNPs were mapped to unanchored scaffolds (Chr00).

Genetic Distance between Tomato Genotypes
We calculated the pairwise genetic distance matrix for the 322 tomato genotypes in TASSEL v5.2.52. Genetic distance between tomato genotypes ranged from 0.092 to 0.443, with an average distance of 0.270 (Table 1 and Table S2). Among them, the combination of genotypes TAM-CS-138 and USDA-273 revealed the smallest genetic distance (0.092). Genotype TAM-CS-138 is an F5 inbred heirloom type with large, pink fruit, developed by the Texas A&M College Station breeding program, and genotype USDA-273 is a cherry tomato that produces small red fruit, from the USDA germplasm bank (Table S1). Among all possible 100,142 combinations between the 322 genotypes, the largest genetic distance (0.443) was observed between genotypes TAM-CS-111 and TAM-W-322 (Table 1).
Genotype TAM-CS-111 is an F5 inbred that produces small, round red fruit, from    Table 1). The sets of genotypes from the AVRDC and Florida breeding programs used in the present study showed mean genetic diversities of 0.296 and 0.298, respectively (Table 1).
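The range and mean of pairwise distances reported for a set of genotypes can be computed as in the sketch below. It assumes a simple identity-by-state style distance (mean absolute difference in alternate-allele counts, scaled to 0-1, over sites called in both lines); TASSEL's exact calculation may differ, particularly for heterozygous calls. The function names and toy genotypes are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Optional, Tuple

Calls = List[Optional[int]]  # alt-allele counts 0/1/2, None for missing


def ibs_distance(a: Calls, b: Calls) -> float:
    """Mean allele-count difference, scaled to 0..1, over sites called in both lines."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        raise ValueError("no sites genotyped in both lines")
    return sum(abs(x - y) for x, y in shared) / (2 * len(shared))


def summarize(genotypes: Dict[str, Calls]) -> Tuple[float, float, float]:
    """Return (min, max, mean) over all pairwise distances in the set."""
    d = [ibs_distance(genotypes[p], genotypes[q])
         for p, q in combinations(genotypes, 2)]
    return min(d), max(d), sum(d) / len(d)


# Three made-up inbred lines genotyped at four SNPs.
toy_lines = {
    "g1": [0, 0, 2, 2],
    "g2": [0, 0, 2, 0],
    "g3": [2, 2, 0, None],
}
d_min, d_max, d_mean = summarize(toy_lines)
```

Applying `summarize` to the lines of a single breeding program, or to the members of one STRUCTURE cluster, gives the kind of within-group range and mean reported in the text.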

Population Structure
We explored the population structure of tomato genotypes using a model-based  main population clusters (Q1 and Q2) (Figure 2(b)). Out of the entire population evaluated in this study, 32 tomato genotypes (9.9%) were grouped into Q1, while the remaining 290 genotypes were placed into Q2 (90.1%) (Table 2). Of the two clusters, the genetic diversity assessment indicated that Q1 is more diverse, and it included the two genotypes with the largest genetic distance observed (genotypes TAM-W-322 and TAM-CS-111, Figure 3(a)). The range of genetic distances between genotypes assigned to cluster Q1 was 0.288-0.443, and the mean was 0.346 (Figure 3(a)). In cluster Q2, the range of genetic distances between genotypes was 0.092-0.334, with a mean of 0.268, and this cluster included the two genotypes with the smallest genetic distance (0.092), TAM-CS-138 and USDA-273 (Figure 3(b)).
The population structure analysis also revealed that genotypes from the breeding programs were distributed between the Q1 and Q2 clusters, while all evaluated genotypes from the USDA collection belonged to the Q2 cluster (Table 2 and Table S1).

Phylogenetic Tree and Principal Component Analysis
Next, we constructed a phylogenetic tree based on the 10,236 SNPs and found that it also divided the 322 tomato genotypes into two groups and that these groups corresponded with the two population clusters Q1 and Q2 (Figure 4).
Thus, the phylogenetic tree displayed consistency with the population structure revealed by the model-based clustering analysis with STRUCTURE v2.3.4 (Figure 2). Figure 4 shows that the genotypes producing the smallest genetic distance (USDA-273 and TAM-CS-138) had the shortest branches arising from the lowermost clade. Similarly, genotype TAM-W-322, which was one of the two genotypes producing the largest genetic distance with another, was placed on the breeding programs, respectively, had the potential to yield greater genetic diversity when combined with other genotypes. We also performed PCA to check the number of population structure groups; Figure 5 presents the distribution of tomato genotypes in scatter plots of the first three principal components in a 3D space. This PCA also revealed that the tomato genotypes clustered into two groups, with some overlap indicative of the small genetic distances between some genotypes in Q1 and Q2.

Discussion
Genetic diversity studies have increased in recent years due to advances in high-throughput sequencing technologies and the availability of high-resolution SNPs. For example, 5.4 million SNPs were identified between wild and cultivated tomato genomes during the sequencing of the tomato reference genome from the cultivar Heinz 1706 [23]. Likewise, 11.6 million SNPs were found from the sequencing of 360 accessions that included both cultivated and wild tomato species [19] and 180,000-350,000 SNPs from the sequencing of four large-fruited cultivated tomato accessions [38]. In the present study, sequencing of 322 tomato genotypes from cultivated S. lycopersicum resulted in the discovery of 3.   and Figure 4). Genotypes from all three Texas A&M breeding programs and also from AVRDC and Florida lines were observed in both the Q1 and Q2 clusters (Figure 4). Additionally, the grouping of genotypes into two clusters with some overlaps was further validated by the PCA. Life breeding population include the gene Mi-1, which confers resistance against root-knot nematodes (Meloidogyne spp.) and was introgressed from Solanum peruvianum [44]; Sw-5, which confers resistance to tomato spotted wilt virus (TSWV) and was introgressed from S. peruvianum [45] [46]; Ty-2 and Ty-3, which confer resistance to tomato yellow leaf curl virus (TYLCV) and were introgressed from S. habrochaites [47] [48] and S. chilense [49], respectively; I-2, which confers resistance to vascular wilt caused by Fusarium oxysporum race 2 (Fol2) and was introgressed from S. pimpinellifolium [50]; and I-3, which confers resistance to Fol1, Fol2, and Fol3 and was introgressed from S. pennellii [51] [52]. Thus, introgressions of disease-resistance genes during hybridization could have played an important role in producing the genetic diversity among breeding lines observed in the present study and thus in grouping the genotypes into two clusters.
The present study revealed that the tomato breeding lines developed by the Texas A&M breeding programs possess a high level of genetic diversity and thus should be capable, upon selection, of yielding a variety of cultivars adapted for Texas production. Furthermore, the broad genetic base of the breeding lines and the higher recombination generated through hybridization could be utilized to uncover QTLs for complex traits. As the SNPs identified here were intraspecific, they could be valuable for uncovering economically important traits within cultivated tomato. Finally, our work here suggests that through the use of a phylogenetic tree and genetic distances, it is possible to develop crossing strategies to increase diversity and encourage hybrid development within and between breeding programs.

Figure S1. Distribution of the SNP missing rate (a) before imputation and (b) after imputation. SNPs with >50% missing, rare alleles with minor allele frequency (MAF) < 5% across all 322 tomato genotypes, and SNPs with low genotype probability (<0.9) were imputed.

Figure S2. Average missing rate of SNPs across 322 tomato genotypes (a) before imputation and (b) after imputation. SNPs with >50% missing, rare alleles with minor allele frequency (MAF) < 5% across all 322 tomato genotypes, and SNPs with low genotype probability (<0.9) were imputed.