Exploitation of Concatenated Olive Plastome DNA Markers for Reliable Varietal Identification for On-Farm Genetic Resource Conservation

Rapid and reliable identification of olive plants using DNA markers has been attempted in the past, but the selection of polymorphic regions for discrimination at the varietal level remained obscure. Recent sequencing of the plastid genome of the olive provides high-resolution Cp markers for olive DNA fingerprinting. Using this information, we designed a combination of chloroplast markers to amplify genes involved in photosynthesis, ribosomal function and NADH energy metabolism for varietal identification of olive plants. Concatenated DNA sequences of more than 100 unknown and 10 reference plant samples were analyzed using various bioinformatics and phylogenetic tools. Conserved blocks of nucleotide sequences were detected in multiple alignments. Phylogenetic reconstruction differentiated the unknown plants into various clusters with known varieties. Further narrowing down of the samples through a UPGMA tree clearly separated the plants into Arbosana, Frantoio and Koroneiki as the major varieties. Multiple alignments of these clusters revealed important variety-specific SNPs, including G and T nucleotides at specific positions. Sequence identity at the intra-cultivar level was more than 98.79%, while it dropped to 97%, and even to 96%, at the inter-varietal level. Furthermore, a neighbor-net network analysis separated these three clusters, thus validating the results of the UPGMA tree. Overall, out of 100 plant samples, 49 plants were identified that fall into 10 varieties: Arbosana, Carolea, Chetoui, Coratina, Domat, Frantoio, Gemlik, Koroneiki, Leccino and Moraiolo.


Introduction
One of the characteristic fruit trees of the Mediterranean area is the evergreen and long-lived olive (Olea europaea L.). It is a diploid species with 46 chromosomes belonging to the family Oleaceae [1]-[3]. Olive trees can live for 500 years, and trees more than 2000 years old are also on record. It is a medium-sized tree with grey-green leaves arranged opposite one another. The olive belongs to the genus Olea, which has three subgenera: Paniculatae, Olea and Tetrapilus [4]. Olea europaea L. is the only species that bears edible fruit [5] [6]. The origin of the olive is still unclear, but the main hypothesis suggests that it originated on the eastern shores of the Mediterranean [7].
The fruit and oil of the olive are of prime importance worldwide. Although 90% of world olive production is used for oil extraction [8], the consumption of table olives is also growing globally. Today, the olive tree is grown commercially between latitudes 30˚ and 45˚ in both the Northern and Southern hemispheres, where climatic conditions are similar to those of the Mediterranean basin, with mild winters and warm, dry summers [9] [10]. Pakistan lies in this belt, hence it is a potential area for olive cultivation. The suitable areas include Pothwar, Khyber Pakhtunkhwa, Swat, Dir, Malakand, Loralai, Khuzdar and Quetta districts. Edible oil is the biggest food import item of Pakistan. Pakistan imports olive oil and fruit every year, and large funds are spent on them. Self-sufficiency in edible oil can be attained by cultivating olive orchards on the marginal lands of Pakistan (more than 3 million acres; 30% of total land). Under different projects, the total olive tree cover is more than 800 ha, comprising 106,048 trees. These plants are at the fruiting stage and some of them are giving very good yields. The biggest problem restricting their large-scale propagation is that these olive plants are unidentified: there is no record of which variety or cultivar they belong to. Therefore, oil extracted from these plants is mixed and does not fetch a premium price in the market. The unavailability of known high-yielding varieties that produce quality oil is the biggest hurdle for large-scale propagation of olive in the poor lands of Pakistan. Furthermore, the unavailability of true-to-type olive nurseries is also impeding olive propagation in the potential regions.
The olive's ancient origin, easy propagation and popularity have resulted in numerous cultivars across the world. Several cultivars may have the same name (homonyms), or the same cultivar may be called by different names (synonyms) in different areas [11]-[13]. Many areas of botany depend on the ability to discriminate plant genotypes and to estimate the amount of diversity and similarity in a group of genotypes. This has traditionally been done through morphological and biochemical markers, and at present through DNA markers or DNA fingerprinting [14]. Molecular markers are preferred because they have several advantages over the alternatives. For example, they are co-dominantly inherited and highly polymorphic. They can be easily visualized and are spread evenly over the whole genome. They are stable, quick, inexpensive and simple to use. They require a small amount of DNA and do not require any prior information about the genome [15]. The olive gene pools have also been characterized using the high resolving capacity of molecular markers. Many researchers have traced the origin of olive germplasm using different molecular markers such as RAPD [16].
An advanced genome screening technique is plastome sequencing, or screening of the chloroplast DNA through specific markers. The chloroplasts are inherited maternally in cultivated olives [17]. Plastid variability is low in cultivated olives in contrast to that detected at the subspecies level. Both the mitochondria and the chloroplasts undergo recurrent mutations, but the level of mutation is low, as demonstrated by [18]. Taking advantage of the highly conserved nature of cpDNA, universal primers for cpDNA introns have been developed for numerous plant species [19]. Besnard and his colleagues detected 14 polymorphisms in three chloroplast regions (trnT-L, trnQ-R and matK) in the Olea europaea complex [20].
For this study, the latter approach was employed, drawing on the sequencing of the entire chloroplast genome of Olea europaea subsp. europaea cv. Frantoio to identify the polymorphic regions. The availability of the entire plastome map made it possible to evaluate the sequence arrangement of the plastid genome in Olea europaea and to identify new organellar polymorphisms that could discriminate between cultivated olive varieties [21]. In order to propagate only the better and high-yielding cultivars, there is a dire need to screen the cultivated olive plants in Pakistan to identify their variety or cultivar. The desired varieties can also be grafted onto wild plants. This can enhance the olive fruit and oil yield in Pakistan. Work of this nature has not been done in Pakistan to date. Olive growers can accurately name some cultivars with distinctive phenotypic traits, but they are confused when differentiating cultivars with similar morphological characters. Due to this problem, certified, good-quality material for the establishment of new olive orchards is not available. Hence rapid and reliable identification of unknown olive plants growing at various olive farms through DNA markers is essential. Therefore, the objectives of the present study were to screen unknown and known plants through specific chloroplast DNA markers for identification of polymorphic regions, to identify unknown cultivars of olive growing at different orchards in Pakistan using DNA markers, and to infer their evolutionary relationship through phylogenetic reconstruction. The results demonstrate that the olive plastome harbours some very advantageous polymorphic sites which can be employed for the reliable screening of unknown olive varieties through cultivar-specific SNPs. The evolutionary relationship explored by phylogenetic investigation also helped in identifying the plants. Finally, the neighbor-net network analysis validated the clustering of plants into specific varieties.

Selection of Materials and Sampling Plan
Information about the olive plants growing at different locations in Pakistan was obtained from the National Director of the Olive Project, Pakistan Agricultural Research Council, Islamabad, Pakistan. Different areas of Khyber Pakhtunkhwa province were selected for plant sampling. Each plant was labeled using the olive farm name, orchard number, row number and plant number. After plant labeling, fresh leaf tissue was harvested from the plants. The samples were stored at −80˚C until DNA extraction was performed.

DNA Extraction, PCR Amplification and Sequencing
A total of 110 plant leaf samples were used for DNA extraction using the CTAB method [23]. For quality assessment, the DNA was run on a 0.8% agarose gel. The diluted DNA samples were used as templates for PCR amplification with three primer pairs, named CP3, CP4 and CP5. The sequence of the CP5 forward primer was 5'-CTGACAATTCATTTCTATTTCTAGA-3' and that of the reverse primer was 5'-CATTATTTATCTATAATTCGTTGGA-3'. They span positions 8986 to 9705 of the cpDNA and amplify a fragment of 720 bp.
Each PCR reaction (50 μL) contained 10 ng DNA template, 10× reaction buffer, 5 μL MgCl2, 1 μL dNTPs, 1 μL of each primer, and 0.5 μL of Taq DNA polymerase (Promega, Madison, WI, USA). The reaction mixtures were incubated in a thermocycler (Applied Biosystems Inc) for 5 min at 95˚C, followed by 36 cycles of 1 min at 94˚C (denaturation), 1 min at the annealing temperature of 58˚C, and 1 min at 68˚C (extension). PCR products were run on a 1.2% agarose gel to check amplification success. The PCR products were sent to Macrogen (Korea) for sequencing.

Sequence Analysis and Multiple Alignments
The sequence files obtained were edited and analyzed with the MacVector 7.2 program [22]. A blastn search was done for target identification in the NCBI database (http://blast.ncbi.nlm.nih.gov/Blast.cgi). The BioEdit software [24] was used to trim the sequences to remove the mismatched/flanking regions from both ends. ClustalW multiple alignment of the sequences was done using BioEdit and MEGA6 software [24] [25]. The mutations were detected, recorded and matched with previously available data for different olive cultivars. Furthermore, sequence identity at the intra- and inter-varietal level was calculated through pairwise alignments. In this way, different olive cultivars were discriminated based on sequence similarities. A dataset was prepared that comprised the marker region sequences of 100 unknown and 10 known plants, to be analyzed with bioinformatics software.
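The pairwise identity calculation described above can be illustrated with a minimal Python sketch. It is not BioEdit's implementation: the function name `percent_identity` and the choice to skip columns where either sequence has a gap are assumptions made for illustration, and a tool that counts gaps as mismatches would report slightly lower values.

```python
def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percent identity between two aligned sequences.

    Assumption: columns in which either sequence carries a gap are
    skipped rather than counted as mismatches.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # ignore gapped columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * matches / compared


# Toy aligned sequences (hypothetical, not taken from the study data)
reference = "ACGTACGTAC"
unknown = "ACGTACGTAT"
identity = percent_identity(reference, unknown)
```

With real data, each unknown sequence would be compared against every reference sequence after trimming both to the same aligned region.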

Phylogenetic Reconstruction
In order to infer the evolutionary relationship among different cultivars, phylogenetic reconstruction using the UPGMA algorithm was done in MEGA6. The data generated were also helpful in cultivar identification.
It is well demonstrated that a phylogenetic network can reveal evolutionary history, including hybridization, recombination and homoplasy, better than a tree-like structure. Therefore, a neighbor-net network reconstruction was carried out in the SplitsTree4 package with default parameters, using the uncorrected P distance method [26].
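To make the clustering step concrete, the following Python sketch builds a UPGMA tree from uncorrected p-distances. It is a simplified illustration under stated assumptions, not the MEGA6 or SplitsTree4 implementation: the function names `p_distance` and `upgma` are hypothetical, gapped columns are skipped per pair, ties are broken by insertion order, and branch lengths are not computed.

```python
from itertools import combinations


def p_distance(a: str, b: str, gap: str = "-") -> float:
    """Uncorrected p-distance: fraction of differing gap-free columns."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    return sum(x != y for x, y in pairs) / len(pairs)


def upgma(names, seqs):
    """Average-linkage (UPGMA) clustering; returns a nested-tuple topology."""
    # cluster id -> (subtree, number of leaves in the subtree)
    clusters = {i: (names[i], 1) for i in range(len(names))}
    dist = {
        frozenset((i, j)): p_distance(seqs[i], seqs[j])
        for i, j in combinations(range(len(names)), 2)
    }
    next_id = len(names)
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)  # closest pair of clusters
        i, j = sorted(pair)
        (tree_i, n_i) = clusters.pop(i)
        (tree_j, n_j) = clusters.pop(j)
        for k in clusters:
            d_ik = dist.pop(frozenset((i, k)))
            d_jk = dist.pop(frozenset((j, k)))
            # size-weighted average of the two merged clusters' distances
            dist[frozenset((next_id, k))] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        del dist[pair]
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Toy aligned sequences (hypothetical): A and B differ at one site, C at four or more
tree = upgma(["A", "B", "C"], ["AAAAAAAA", "AAAAAAAT", "TTTTAAAA"])
```

In this toy case the two most similar sequences are joined first, which mirrors how an unknown plant ends up in the same cluster as its closest reference variety.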

Unknown Plant Identification
The results from cultivar-specific mutations (i.e. SNPs), multiple alignments and phylogenetic reconstruction were combined and analyzed for plant identification. The identified plants were tabulated and are shown graphically in the results section.

The Selected Marker Genes in Olive Plastome Are Polymorphic
Mariotti and his colleagues sequenced the entire chloroplast genome of the Frantoio cultivar and reported a number of polymorphic markers [21]. Using this information, we set out to find the most variable regions with high resolving power that can be used to identify olive plants at the variety level. Scanning of the olive chloroplast genome revealed three polymorphic regions (Supplementary Figure S1). Region 1, coding for the photosystem thylakoid membrane (psb-A) and transfer RNA (trnL) genes, is located near the start of the genome, from 8986 bp to 9705 bp. This region spans a length of 720 bp. It is the most polymorphic region, as it harbors six mutations of three types: two SNPs, two indels and two SSRs. The details of these mutations are given in Table 1. Similarly, region 2 is located between 83112 bp and 83852 bp, a stretch of 740 bp. This region was also quite polymorphic and encodes ribosomal protein S (rps). Region 3 is located in the extreme distal portion. It yields an amplicon of 1334 bp between 101263 bp and 102599 bp. Ribosomal protein S (rpsT) and NADH dehydrogenase (ndhF) are encoded by these marker genes. Based on this information, three primer pairs, CP5, CP4 and CP3, were designed for the amplification of the selected regions 1, 2 and 3, respectively, using the "primer tool" in MacVector 7.2 software (Supplementary Table S1).
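The relation between the plastome coordinates and the region sizes quoted above can be checked with a short calculation. The sketch below assumes 1-based coordinates with both end positions included; the helper name `inclusive_span` and the dictionary labels are illustrative only. Under that assumption, region 1 gives exactly 720 bp, whereas regions 2 and 3 give 741 bp and 1337 bp, close to but not identical with the 740 bp and 1334 bp stated in the text, which may reflect a different coordinate convention or primer trimming.

```python
def inclusive_span(start: int, end: int) -> int:
    """Length of a region when both start and end positions are counted."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    return end - start + 1


# Coordinates as quoted in the text; the labels are illustrative
regions = {
    "CP5 (region 1)": (8986, 9705),
    "CP4 (region 2)": (83112, 83852),
    "CP3 (region 3)": (101263, 102599),
}

spans = {name: inclusive_span(*coords) for name, coords in regions.items()}
for name, length in spans.items():
    print(f"{name}: {length} bp")
```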
Initially, PCR amplification followed by sequencing analysis of five known cultivars grown at NARC (Carolea, Gemlik, Domat, Leccino and Moraiolo) revealed that CP5 gave the best amplification and sequencing results in comparison with the CP3 and CP4 primers. Fewer polymorphic sites were detected in the regions amplified using the CP3 and CP4 primers. Furthermore, their products were also longer than that of CP5 (data not shown). CP5, on the other hand, revealed a number of polymorphic sites. Hence the CP5 primer pair was selected for the amplification of olive samples. Moreover, the product obtained with CP5 was smaller (less than 720 bp) and could be easily amplified, which reduced the sequencing cost as well. At least three PCR products were sequenced for each sample. The sequences were edited using the BioEdit program and trimmed in order to eliminate errors introduced by the sequencing procedure and to obtain reliable sequences for analysis.
To explore the variability in the upstream regions of chloroplast genes, five reference plants sampled from NARC were compared with the Frantoio sequence from the NCBI database. For this purpose a multiple alignment was generated in the BioEdit program. The alignment in Figure 1 shows that the selected region is quite polymorphic. In a short span of 600 bp, 14 mutations can be identified. These mutations include SNPs and deletions/insertions. There are two deletions, located at positions 445 bp and 514 bp, where A is deleted. The most frequent substitutions involve A and G nucleotides. There are specific SNPs in NARC Carolea, including A at positions 46, 86, 294 and 296. Similarly, another SNP, the nucleotide G, is present only in NCBI Frantoio at position 238. These mutations seem to be cultivar specific. The above results allow us to infer that the upstream region of the olive plastome is highly polymorphic, with cultivar-specific SNPs. Thus, this region, amplified by the CP5 primer pair, can be used to identify plants at the variety level.
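The idea of a cultivar-specific SNP, a position where one cultivar carries a base that all other aligned cultivars lack, can be expressed as a small scan over alignment columns. The sketch below is illustrative only: the function name `private_snps` is hypothetical, the toy sequences are not the study data, columns containing gaps are skipped, and at least three sequences are assumed so that "all others agree" is meaningful.

```python
def private_snps(alignment: dict) -> list:
    """Find columns where exactly one sequence differs from a base shared by all others.

    Returns a list of (1-based position, sequence name, private base).
    Assumes equal-length aligned sequences and at least three entries.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    hits = []
    for col in range(length):
        column = {name: alignment[name][col] for name in names}
        if "-" in column.values():
            continue  # indels are not treated as SNPs in this sketch
        for name, base in column.items():
            others = {b for n, b in column.items() if n != name}
            if len(others) == 1 and base not in others:
                hits.append((col + 1, name, base))
    return hits


# Toy alignment (hypothetical sequences): Carolea carries a private A at column 4
toy = {
    "Carolea": "ACGA",
    "Gemlik": "ACGT",
    "Domat": "ACGT",
}
snps = private_snps(toy)
```

A base that is private to one reference in a small panel may still be shared with cultivars outside the panel, so such sites are candidates rather than proof of variety specificity.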

Phylogenetic Reconstruction Clustered the Unknown with Known Varieties of Olive
After sampling, the leaf material was processed for DNA extraction using the CTAB method [23]. A total of 110 samples were run on agarose gel for quantification. Chloroplast DNA was also present in this genomic DNA. These DNA samples were labelled and stored at −20˚C. Because the CP5 primer pair amplified the most polymorphic region, a short stretch of 720 bp containing 6 different mutations, it was used to amplify the Oe-psbK-psbI and Oe-trnS-trnG-1-4 regions of the olive plastome DNA. It was possible to amplify an entire plate of 96 samples in a single PCR run. The amplified products were resolved on agarose gel against a 1 kb ladder (Figure 2). The quality and quantity of the PCR products were good enough for sequencing.
Sequencing of all 110 samples was carried out using the services of Macrogen (Korea). Target samples were selected using a BLAST search. The sequences were edited using BioEdit software [24], then trimmed and aligned. This region contains all the SNPs, indels and SSRs showing polymorphism in different samples.
Based on the sequence data, three in silico approaches were adopted to identify the unknown olive samples/sequences: firstly, comparison of unknown sequences with known sequences through multiple alignments; secondly, identification of variety-specific SNPs, indels and SSRs in unknown plant samples; and thirdly, phylogenetic reconstruction of unknown plants with known plants using UPGMA and neighbor-net network analysis. These three approaches were combined to obtain the final identification of the plant samples.
Multiple alignments of all the samples were generated (Supplementary Figure S2). The sequences of all the samples were highly conserved, but different groups of plants with specific mutations were detectable. SNPs and indels were found throughout the aligned regions. The conserved regions are shaded, while the sites of mutations are not, as shown in Supplementary Figure S2. Because the chloroplast, like the mitochondrion, is inherited only from the maternal parent, it is exempt from genetic recombination during meiosis. Although the major portion of the cpDNA is conserved, sequencing of the whole plastome of olive revealed that mutations such as SNPs, indels and SSRs are present. Some of the mutations are variety specific, and this level of polymorphism is suitable for cultivar identification.
In order to differentiate the unknown plants, phylogenetic reconstruction was carried out for all the samples, comprising 100 unknown plant samples along with 10 known plants. A circular phylogenetic tree (Figure 3) was reconstructed with MEGA6 using the UPGMA method. The reference plants with the most matches were Frantoio and Gemlik (8 plants each), while the olive varieties with the fewest matches were Carolea, Domat and Moraiolo (2 plants each) (Table 2). Koroneiki is found at the basal position, while Frantoio is the most recent variety. The rest of the samples did not cluster with any of the reference samples. They clustered together, separately from the known varieties, and remained unknown. They constitute the majority of the samples (51).

Figure 3. Phylogenetic circular tree of all the 110 olive plant samples. The evolutionary history was inferred using the UPGMA method. The optimal tree with the sum of branch length = 0.77561405 is shown. The evolutionary distances were computed using the Kimura 2-parameter method and are in the units of the number of base substitutions per site. The analysis involved 110 nucleotide sequences. All positions containing gaps and missing data were eliminated. There were a total of 523 positions in the final dataset. Evolutionary analyses were conducted in MEGA6. The clusters with coloured branches were selected for further validation in two other phylogenetic reconstructions.
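The Kimura 2-parameter distance named in the Figure 3 caption corrects the observed differences by treating transitions (A↔G, C↔T) and transversions separately: with P and Q the proportions of transitional and transversional differences, d = −½ ln(1 − 2P − Q) − ¼ ln(1 − 2Q). The Python sketch below applies this formula to one pair of aligned sequences. It is an illustration rather than MEGA6's code: the function name `k2p_distance` is hypothetical, and gaps or ambiguous bases are skipped per pair, whereas the analysis in the text removed such positions from the whole dataset.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    sites = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    p = transitions / sites
    q = transversions / sites
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


# Toy sequences (hypothetical): a single A->G transition in 10 sites
d_transition = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")
```

For closely related sequences such as those of one cultivar, the corrected distance stays close to the uncorrected proportion of differences.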

Variety Specific SNPs, Indels, SSRs Can Be Detected in Amplified Regions
To zoom in, the data were divided into smaller sets. For example, the first set contains the sequences of only Arbosana, Frantoio and Koroneiki and of the unknown plants in their clusters. A smaller UPGMA phylogenetic tree was constructed in MEGA6. Figure 4 demonstrates that all three clades retained their integrity, with the same unknown plants clustering with their reference plants as in the circular tree, thus validating the results obtained from the circular tree. The neighbour-net network reveals recombination, homoplasy and evolutionary relationships better than a tree-like structure. To further validate our results, the neighbour-net network of the sequences of the three clusters was constructed in SplitsTree4 software (Figure 5). The resulting network exhibited the same clusters of reference plants and unknown plants. It is clearly differentiated into three clusters. Although the branches in Koroneiki are scattered and distant from one another, they belong to the same cluster. Furthermore, the network retained the same topology as the UPGMA tree. It can therefore be concluded from all three phylogenetic reconstructions that the mutations in the marker regions are variety specific. This marker region is reliable for the identification of olive varieties.
The multiple alignment of the marker region sequences of 16 plants showed a number of different SNPs at specific positions (Figure 6). Variety-specific SNPs are present in the marker region sequences of Frantoio and its clustered plants collected from Ternab at positions 82, 258, 275 and 357. Similarly, Koroneiki and the unknown plants in its cluster taken from Ternab have SNPs at positions 147, 163 and 221. Arbosana and the unknown plants in its cluster from Ternab likewise have a common SNP at position 163, where T has been substituted. Two deletions are also found at positions 532 and 536. The unknown plants have mutations corresponding to those of their reference plants, and on this basis they have clustered together with a unique reference plant. They can be considered to belong to that variety, sharing similarities in the chloroplast DNA sequence. This small dataset validated our results.
Overall, the data show that 49 plants are differentiated into 10 varieties: Arbosana, Carolea, Chetoui, Coratina, Domat, Frantoio, Gemlik, Koroneiki, Leccino and Moraiolo (Figure 7). A total of 188 mutations, including SNPs and indels, are present in the 110 plants in the region amplified with the CP5 marker, as shown in Supplementary Figure S2.

Identification of Olive Plants Using Multiple Alignments and Phylogenetics
Forty-nine unknown plants were identified when a circular UPGMA (Unweighted Pair Group Method with Arithmetic Mean) tree was reconstructed with MEGA6 (Figure 3). The remaining 51 unknown plants either clustered together or were placed separately, but not with any of the known varieties. They remained unidentified. The identified plants are listed against their respective known varieties in Table 3.
The Frantoio reference sampled from Ternab clustered with 8 unknown plants. Gemlik, sampled from NARC, also clustered with 8 unknown plants. This means that the plants clustered with Frantoio are all Frantoio; the assignment is based on the similarity of the marker region, which places plants with their respective varieties. Five plants were found to be Coratina, 6 clustered with Arbosana and 6 with Chetoui; 2 were Carolea, 2 Domat, 2 clustered with Moraiolo and 3 were found to be Leccino. A total of 49 unknown plants could be identified, while the remaining 51 remained unidentified (Figure 7). They might also be identified by including more reference controls. In order to assess closeness and differentiation at the cultivar level, pairwise alignments were generated using BioEdit software to calculate the percent identity. For this purpose, three cultivars represented in the circular UPGMA tree (Tn_Arbosana, Tn_Frantoio and Tn_Koroneiki) were tested. The identity is 99.26% - 99.81% between Tn_Arbosana and its clustered samples. It is 99.44% - 99.81% between Tn_Frantoio and its clustered plants. Similarly, Koroneiki and its samples are 98.17% - 99.16% identical, as given in Table 4. This means that the plants within each cluster are closely related and represent one cultivar.
In contrast, the identity between the different known cultivars was 98% or less, falling as low as 96%. Tn_Arbosana and Tn_Frantoio have 98% identity, Tn_Arbosana and Tn_Koroneiki have 97%, and Tn_Frantoio and Tn_Koroneiki have 96% identity. Hence we can infer that an identity of 98% or less indicates a different cultivar, while an identity above 98% indicates the same cultivar or plant.
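The decision rule inferred above, same cultivar above 98% identity and a different cultivar at or below it, can be written as a small helper. The sketch is hypothetical: the function name `assign_variety`, the default threshold and the example values are illustrative, and because the reported within-cluster minimum (98.17%) lies close to the between-cultivar maximum (98%), the cut-off should be treated as specific to this marker and dataset.

```python
from typing import Dict, Optional


def assign_variety(identities: Dict[str, float], threshold: float = 98.0) -> Optional[str]:
    """Return the best-matching reference if its identity exceeds the threshold.

    `identities` maps reference names to percent identity with the unknown plant.
    Returns None when no reference is similar enough (plant stays unidentified).
    """
    if not identities:
        return None
    best = max(identities, key=identities.get)
    return best if identities[best] > threshold else None


# Hypothetical identity values for two unknown plants
plant_1 = assign_variety({"Tn_Arbosana": 99.26, "Tn_Frantoio": 98.0, "Tn_Koroneiki": 97.0})
plant_2 = assign_variety({"Tn_Arbosana": 97.4, "Tn_Frantoio": 96.9, "Tn_Koroneiki": 97.8})
```

In practice such a threshold would be combined with the SNP and clustering evidence rather than used on its own.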

Discussion
Varietal identification of olive plants is very important for further propagation and for the marketing of olive oil. The majority of the cultivated olive plants present in Pakistan were brought from foreign countries, mostly Afghanistan, and their variety names are not known; this is a serious problem that farmers have been facing for years. They can differentiate these plants only by their morphology and have no idea of the exact variety or cultivar name. Morphological as well as biochemical parameters have the limitations of being unreliable and very time consuming [27]. There was therefore a need to develop a rapid, reliable and cost-effective protocol for accurate identification through DNA markers as an alternative. Molecular markers can detect DNA polymorphism to discriminate different cultivars very effectively [28].
The chloroplast genome of olive is the best platform for resolving mixed and unknown olive plants into their exact varieties [29]. CpDNA is mostly conserved but has enough polymorphic regions to be used for this purpose. In this regard, the recent sequencing of the entire chloroplast genome of the Frantoio cultivar is a major landmark. Mariotti and colleagues revealed 40 polymorphic regions in the CpDNA, providing high-resolution Cp markers for olive DNA fingerprinting [21]. Using this information, we designed a combination of chloroplast markers to amplify genes involved in photosynthesis, ribosomal function and NADH energy metabolism. Concatenated sequences of more than 100 unknown plants and 10 reference plant samples were analyzed using various bioinformatics and phylogenetic tools.
Scanning of the entire chloroplast genome revealed 3 polymorphic regions. Multiple alignments of Frantoio and 5 NARC cultivars exhibited cultivar-specific SNPs and deletions/insertions, which paved the way to extend this work to identify plants from 100 samples with more reference controls sampled from Ternab. Besnard and colleagues designed three markers in this region for identification of species or plants [30]. The plastid DNA regions screened by them showed a higher level of polymorphism within the genus Olea than the rps16 and trnL-trnF sequences used in a previous study [31]. The trnS-trnG intergenic spacer was the most variable region and was highly recommended for phylogenetic reconstructions of Oleaceae.
In this study, the marker region sequences of 100 unknown olive plants were analyzed. In order to investigate the evolutionary relationship, a phylogenetic tree was constructed including 10 known reference plants. The tree clearly separated the samples into 10 clusters: Arbosana, Carolea, Chetoui, Coratina, Domat, Frantoio, Gemlik, Koroneiki, Leccino and Moraiolo. This relationship shows that these plants have sequences similar to those of the known plants and might be the same variety. Multiple alignments were generated for all the samples. The alignments revealed conserved groups among these plants on the basis of sequence similarities. This dataset was divided into smaller groups. Three clusters, Arbosana, Frantoio and Koroneiki, were subjected to phylogenetic reconstruction again. There was a clear separation of these clusters along with the unknown plants. This clustering was further validated using a neighbor-net network in the SplitsTree4 package. In order to find variety-specific SNPs, a multiple alignment of these three clusters was generated. There was an obvious differentiation into three groups. "A" was specific to Koroneiki, while "C" seemed to be the characteristic SNP for Frantoio. This is supported by the pairwise alignments generated to calculate the percent identity between the samples of the three clusters in the circular phylogenetic reconstruction. The identity is 99.26% - 99.81% between Tn_Arbosana and its clustered samples. It is 99.44% - 99.81% between Tn_Frantoio and its clustered plants. Similarly, Koroneiki and its samples are 98.17% - 99.16% identical. This means that the plants within each cluster are closely related and represent one cultivar. In contrast, the identity between the different known cultivars was 98% or less, falling as low as 96%. Hence we can infer that an identity of 98% or less indicates a different cultivar, while an identity above 98% indicates the same cultivar or plant.
Taken together, the data from all the approaches demonstrate that 49 out of 100 plants could be identified and assigned to 10 varieties. It is important to mention that 51 plant samples could not be identified. They did not cluster into any of the known sequence clades. This means that there exist other varieties in these orchards for which we do not have any reference sequence. There are two solutions to this problem. First, there is a need to sequence more known varieties growing in Pakistan, or to acquire the DNA of these varieties from other olive-growing countries, to be used as known references. Secondly, we need to sequence another nearby marker to expand the gene region. Both sequences will be joined, which is referred to as concatenation of the sequences. A concatenated sequence has more resolving power than a single sequence. Hence both sequences will be concatenated for alignment and phylogenetic reconstruction. This will provide more sequence diversity for identifying plants. An alternative strategy is to use nuclear markers (Cos markers), for which many olive varieties have already been sequenced. The implication of the above study is that all the fruit-bearing unknown olive plants can be identified. The advent of high-throughput genotyping through SNP base calling has revolutionized DNA fingerprinting. It is now possible to sequence the entire genome of an organism, and this technology is becoming cheaper with every passing day. For plants, and especially olive, it may become practical to sequence the entire plastome of all the samples.
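Concatenation as described above amounts to joining, for each plant, the aligned sequences of the individual markers in a fixed order. The Python sketch below shows one way this could be done; the function name `concatenate_markers`, the nested-dictionary layout and the toy sequences are assumptions for illustration, and plants lacking any marker are simply left out of the result.

```python
from typing import Dict, List


def concatenate_markers(markers: Dict[str, Dict[str, str]], order: List[str]) -> Dict[str, str]:
    """Join per-marker aligned sequences into one sequence per sample.

    `markers` maps marker name -> {sample name -> aligned sequence}.
    Only samples present for every marker in `order` are returned.
    """
    shared = set.intersection(*(set(markers[m]) for m in order))
    return {
        sample: "".join(markers[m][sample] for m in order)
        for sample in sorted(shared)
    }


# Toy data (hypothetical sequences): Tn_50 lacks the CP5 marker and is dropped
toy_markers = {
    "CP5": {"Tn_16": "ACGT", "Tn_49": "ACGA"},
    "CP4": {"Tn_16": "GG", "Tn_49": "GC", "Tn_50": "GG"},
}
combined = concatenate_markers(toy_markers, ["CP5", "CP4"])
```

Keeping the marker order fixed ensures that every column of the concatenated alignment refers to the same site in all plants.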

Conclusion
In a nutshell, our data reveal that the chloroplast genome of olive has polymorphic sites with variety-specific SNPs and indels, and that these have enough resolving power to discriminate olive plants at the variety level. The CP5 primer pair successfully identified 49 of the 100 unknown olive plants, assigning them to 10 varieties, through detection of mutations by alignment of the marker region sequences followed by phylogenetic reconstruction with different bioinformatics software. This strategy can be further extended to characterize, reliably, efficiently and at low cost, the olive tree germplasm distributed throughout the country in search of better varieties. Once the better varieties have been identified, olive oil and fruit production in Pakistan can be enhanced through on-farm preservation and the provision of authentic germplasm to olive growers for the establishment of new olive orchards.

Figure 1. Multiple alignments of the marker region sequences of 5 olive varieties collected from NARC and one sequence of Frantoio retrieved from the NCBI database, generated using BioEdit software. The shaded regions show the conserved sequences in the marker region of the chloroplast DNA of these different varieties. The regions that are not shaded are the sites of mutations, i.e. SNPs and indels. SNPs are substitutions of single nucleotides. The gaps are the indels.

Figure 2. PCR product amplified with the CP5 primer pair, visualized on agarose gel. Each fragment is about 720 bp in length. 1→110 indicates sample and control PCR products. These include 100 unknown samples and 10 known reference samples. "M" denotes the marker (1 kb).

Figure 4. UPGMA phylogenetic tree showing unknown plants along with their reference plants. The tree was constructed using MEGA6 software. The topology of the tree is the same as that of the corresponding clusters in the circular tree. The numbers on the nodes indicate bootstrap values for 1000 replicates.

Figure 5. Neighbour-net network constructed with SplitsTree4. The clusters retained their integrity, thus further validating the corresponding clusters of the circular tree.

Koroneiki and the unknown plants in its cluster taken from Ternab have SNPs at positions 147, 163 and 221. Similarly, Arbosana and the unknown plants in its cluster from Ternab have a common SNP at position 163, where T has been substituted. Two deletions are also found at positions 532 and 536.

Figure 6. Multiple alignment of the sequences of the marker regions generated in BioEdit software. These are the sequences of the CP5-amplified marker regions of three reference plants (Arbosana, Frantoio and Koroneiki) and the 13 unknown plants (Tn_16, Tn_49….). Variety-specific SNPs can be seen in the unknown and reference plants. The regions that are not shaded are the sites of SNPs. The shaded regions are the conserved sequences in this region of the cpDNA of the olive plants shown here.

Figure 7. Graph showing the number of identified and unidentified olive plants on the basis of DNA sequence variations, multiple alignment and phylogenetic reconstruction. Frantoio and Gemlik revealed the maximum number of matches (8 each).

Table 1. Mutations detected in the selected polymorphic region of the olive plastome. The type and position of each mutation are also given.

Table 2. List of plants identified using multiple alignments and phylogenetic reconstruction.

Table 3. Identified olive varieties and number of plants from Tarnab olive orchard.

Table 4. Sequence identity percentage calculated in BioEdit software through pairwise alignment of the samples in the three clusters of Tn_Arbosana, Tn_Frantoio and Tn_Koroneiki.