The Potential of Dna Barcode-based Delineation Using Seven Putative Candidate Loci of the Plastid Region in Inferring Molecular Diversity of Cowpea at Sub-species Level

The novelty and suitability of the mitochondrial gene CO1 in DNA barcoding as a reliable identification tool in animal species are undisputed. This is attributed to its standardized sequencing segment of the mitochondrial cytochrome c oxidase-1 gene (CO1) which has the necessary universality and variability making it a generally acceptable barcode region. CO1 is a haploid single locus that is uniparentally-inherited. Protein-coding regions are present in high-copy numbers making it an ideal barcode. The mitochondrial oxidase subunit I (COI) gene is a robust barcode with a suitable threshold for delineating animals and is not subject to drastic length variation, frequent mononucleotide repeats or microinversions. However, a low nucleotide substitution rate of plant mitochondrial genome [mtDNA] precludes the use of CO1 as a universal plant DNA barcode and makes the search for alternative barcode regions necessary. Currently, there exists no universal barcode for plants. The plastid region reveals leading candidate loci as appropriate DNA barcodes yet to be explored in biodiversity studies in Kenya. Four of these plastid regions are portions of coding genes (matK, rbcL, rpoB, and rpoC1), and three noncoding spacers (atpF-atpH, trnH-psbA, and psbK-psbL) which emerge as ideal candidate DNA loci. While different research groups propose various combinations of these loci, there exists no consensus; the lack thereof impedes progress in getting a suitable universal DNA barcode. Little research has attempted to investigate and document the applicability and extend of effectiveness of different DNA regions as barcodes to delineate cowpea at subspecies level. In this study we sought to test feasibility of the seven putative


Background
DNA-barcoding is a technique used for the taxonomic characterization and phylogenetic analysis of organisms and entails the use of defined regions within the DNA genetic material of an organism, which though exposed to evolution mechanism, is conserved between and within the species.This region serves as a tool to uniquely identify two individuals with unique ancestral-lineage.DNA barcoding is a sequence-based identification system that may be constructed of one or several loci taken together as a complementary unit in delineating relationships and inferring patterns of change among related organisms.It employs short highly variable regions of the genome to delineate organism that are closely related.In animals including human, cytochrome oxidase I (coI-gene) has vastly been utilized as a unique barcoding region for phylogenetic analysis.The mitochondrial coxidase subunit I (COI) gene has demonstrated significant reliability and recoverability notwithstanding its limited application in the plant kingdom [1].The Consortium for the Barcode of Life (CBOL) plant-working group proposes seven Barcode regions for use as barcoding markers [2].The feasibility of DNA barcoding and the use of plastid regions in biodiversity studies can be an important tool of utility at molecular level.Plastid DNA candidate loci are universally present and conserved in the plant target lineages and can provide a rapid and reproducible molecular identification away from the Linnaean system of nomenclature.The informativeness of each barcode region was therefore explored singly and in combination with a view to assessing the feasibility of these candidate loci in infra-specific and intraspecific discrimination of phylogeographic groups among Kenyan cowpea variants.This was informed by the fact that chloroplast genes exist in a single copy and are conserved among plant eukaryotic genomes.These regions undergo limited mutation overtime and are considered novel in revealing evolutionary divergences and ancestral lineages within species.A single and multilocus approach was explored using seven chloroplast genic candidate DNA regions (rbcL gene, atpF-atpH spacer, matK gene, rpoB gene, rpoC1 gene, trnH-psbA spacer and psbK-psbI spacer.

MatK
matK is considered an important tool in plant evolutionary studies and systematics.matK gene loci is the only putative group II intron maturase encoded in the chloroplast genome of plants and is the only plastid gene containing this putative maturase domain in higher plants [3].matK is Maturase-Kinase gene, a plastid gene responsible for the chloroplast post-transcriptional processing.It has an unusual evolutionary tempo, with relatively high substitution rates at both nucleotide and amino acid levels according to [4].The strong phylogenetic signal from matK gene renders it invaluable gene loci in plant systematics and evolutionary studies at various evolutionary depths.This gene is proposed as the only chloroplast-encoded group II intron maturase, and is suggested to play a role in chloroplast post-transcriptional processing.The ability of Maturase kinase sequence to solely work as a barcoding candidate was assayed together with the six other candidates in the present study to delineate distance characterization.matK has in the recent past emerged as an invaluable locus in plant biodiversity and systematics based on its highly informative ability to decode phylogenetic distinctiveness unlike other candidate loci [5].This genelocus has 1500 base pair nested in the group II intron of the 50 and 30 exons of trnK in the large single copy region of the chloroplast genome of green plants.matK gene sequence is one of the seven putative gene loci widely utilized in the DNA barcoding of land plants [6].Phylogenetic analysis based on matK, against other candidate genes has demonstrated excellent parsimony informative characters with significantly more phylogenetic structure pereach parsimony-informative site contrary to the highly conserved chloroplast/plastid region.matK sequence information has been reported to generate robust phylogenies [7] and is considered to have reliable evolutionary rate, suitable length and good inter specific divergence as well as a low transversion rate [8].matK is however difficult to amplify universally demonstrating that matK barcode albeit informative, may be inadequate and inconclusive when used in isolation as a universal barcode.This study therefore considered matK alongside other six barcode.

rbcL
A region of the chloroplast gene rbcL-RuBisCo large subunit has been considered an P. Okoth et al.
ideal candidate barcode region in plants and is famed as the most abundant protein on earth.The region, RuBisCo (Ribulose-1, 5-bisphosphate carboxylase oxygenase) is used in the catalysis of the first step of carbon fixation and is a target region in phylogenetic investigations due to its easy amplification, sequencing and alignment.Many taxonomists consider rbcL gene as ideal DNA barcoding region for plants at both family and generic level.However, rbcL sequences have the limitation of slow evolution.The rbcL locus has the lowest divergence of plastid genes in flowering plants according to [1].Studies by (CBOL Plant Working Group [9]- [12] report modest discriminatory power of this locus.Other studies however indicate that rbcL remains one of the best candidate barcodes based on the straightforward recovery of the gene sequence, easy accessibility and discriminatory power [13].

trnH-psbA
The trnH-psbA region is a straightforward region easily amplifiable across land plants, and is one of the most variable intergenicspacers [14].It has been used successfully in a range of barcoding studies.[15] report that the trnH-psbA, non-coding intergenic region exhibits significant sequence divergence with notable insertion/deletion rates.Studies by [16] indicate that this plastid region has highly conserved coding sequences that makes it an attractive marker.These attributes make trnH-psbA an important plant barcode for species discrimination [15].However, the complex molecular evolution and considerable length variation of trnH-psbA limits it as a barcode singly [17].However, trnH-psbA is reported to suffer high rates of insertion or deletion in larger families of angiosperms.The trnH-psbA putative gene loci albeit a standard barcode region in most plants has been reported to suffer frequent inversions in some lineages of plants and singly as a barcode marker, may result to overestimation of genetic divergence and consequently inaccurate assignment of phylogenetic position [18].

atpF-atpH
The second International Barcode of Life Conference proposes that at pF-atpH intergenic spacer is a potential plant barcode region [9].The fact that atpF-atpH marker has not been widely used in studies of plant systematic and phylogeographics has led to paucity of data on its performance as barcodes.However, the CBOL Plant Working Group indicate that atpF-atpH has relatively modest discriminatory power, intermediate sequence quality and universality and could be used as a plant DNA barcode.Recent studies document positive reports on the performance of atpF-atpH as a plant barcode region [19].Ki-Joong Kim, pers.Comm reports usefulness of atpF-atpH on the Korean flora biodiversity studies.Studies on duckweeds [20] also demonstrated that atpF-atpH, a noncoding spacer could serve as a universal DNA barcoding marker for species-level identification.In their study, the utility of this non cording region in identification of new species by reason of its ease of amplification, straightforward sequence alignment and rates of DNA variation was reported [20].In the same study, it's documented that DNA barcoding made significant contribution to the taxonomical structure in duckweeds as opposed to the less informative morphological classification and therefore recommends atpF-atpH as an important barcode region in biodiversity studies.The current study therefore seeks to among others test the in formativeness of this barcode region in delineating cowpea diversity at sub-species level.
Previous studies by [15] support the earlier observation that trnH-psbA coupled with rbcL can correctly delineate and discriminate among related species.In the same study, a combination of the non-coding trnH-psbA spacer region and a portion of the coding rbcL locus are considered ideal two-locus global land plant barcode that provides the necessary universality and species discrimination that meets good threshold of CBoL.
In this study, we sought to test these combinations in delineating cowpea accessions at varietal level.An investigation of the relevance of DNA barcoding to correctly delineate and discriminate between closely related cowpea variants is therefore presented for biodiversity analysis and to evaluate the overall utility of chloroplast DNA barcode candidates in reconstructing their ancestry.

Statistical Analysis
BIOEDIT Software was used to analyze raw sequence data chromatograms by trimming and assembling.MEGA 6.06 software was then used for multiple sequence alignment.
The UPGMA was used for clustering and generating an accurate topology with reference to the molecular clock based on the greatest similarity amongst pairs [35].Distance estimation model used the Kimura 2 Parameter (K2P) model.This was because K2P distances are best when distance is low especially in highly similar sequences [36].This was further informed by the fact that our sequences were from closely related cowpea sub variants.The K2P distances generated were exported into an Excel format which was uploaded into GraphPad software and STATA for further statistical analysis.GraphPad Prism v. 7 was then used to plot the histograms on Intra-specific and Infra-specific distances after importing the excel file containing the K2P distances.Wilcoxon Signed Ranks test was performed to separately compare the intra-specific and infra-specific distances of between markers at single and multi-locus level.The choice of Wilcoxon test was informed by the fact that data was non-parametric and did not assume a normal Gaussian distribution [37].STATA 13 was then used to perform the Moods median Test and the Wilcoxon Two Sample Test for Table 7. Moods median test was used because the data was not normally distributed and yet the mean would not be a good representative of "location" [37]; hence the need to employ a statistical tool which utilized the median as a better estimator of "location".The Moods Median test was then preferred to the Sign test.The Wilcoxon two sample tests acted as a quality control for the median test by comparing the P-values of both for significance.Young leaves were sampled eight days after planting in 1.5 mL Eppendorf tubes.They were immediately frozen in liquid nitrogen and stored at −80˚C.Leave samples were then manually ground using a micropestle.DNA quality and quantity control was done using Nano-drop spectrophotometer.DNA was normalized by adjusting its concentration to 25 ngμL in an optical 96-well Reaction plates using sterile de-ionized water.Total DNA was extracted using Qiagen plant DNA extraction protocol as per the manufacturer's guidelines with slight modifications.Total DNA was extracted from each sample and quantified using the DNA Nano Drop ND-1000 by Thermo Fisher at the Kenya Medical Research Institute [KEMRI-Kisumu Kenya].This was followed by amplification of three intergenic spacers' atpF-atpH, psbK-psbI and trnH-psbA and four genes: matK, rbcL, rpoB, rpoC1using the primers as shown in Table 1.The amplicons were resolved on 3% agarose gel at 80 V for 48 minutes.The gels were observed for bands using a UV trans-illuminator (FotoDyne model 3 3500 Foto-Prep).Photographs of the bands were taken using the software "Strata-gene Eagle View" that was integrated with the digital camera on the UV trans-illuminator.The bands containing DNA of interest which is 400 bp long was excised and the DNA purified using the DNA purification kit from Qiagen.Sequenced products were then analyzed using an automatic sequencer, ABI3730XL (Applied Biosystems).The sequence chromatograms were analyzed and the terminals were trimmed using the BioEdit software (Thomas Hall & Abbott).Similarity searches were conducted based on Basic Local Alignment Search Tool (BLAST) located at www.ncbi.nih.gov/blast; with the parameters set as follows: database-non redundant; The sequences were then named as per the molecular characterization.Three subspecies were identified namely: Vigna unguiculata (L.) Walp.Subsp.Cylindrica; Vigna unguiculata var.serotina Bertoni; Vigna unguiculata subvar.Deflexa Bertoni.Also included in our analysis were two other species namely Rhynchosia minima and Vignaluteola to act as out groups for the phylogenetic analysis.The sequences were then as- sembled into a single Fasta file format in BioEdit software version 7 (Thomas Hall & Abbott).This was followed by performing a local alignment using muscle in Mega 6 software with UPGMA as the clustering method.The intra-specific and infra-specific distances were determined using Kimura-2-parameters in mega 6.Since all the sequences were from the genus Vigna unguiculata but different sub-variants; therefore in Mega 6;

Materials and Methods
the intraspecific distances were determined as "between the various group mean distances" and the infraspecific sequences inferred based on "within group mean distance".

Sequence Alignment, Analysis and Amplification Efficiency
Consensus sequences were generated and sequences of the candidate DNA barcodes aligned using muscle.Genetic distance matrices were calculated on the basis of Kimura 2-Parameter (K2P) substitution model for the seven chloroplast candidate DNA loci and the average values between subpopulations inferred.Combined DNA barcode sequences showed significant intra specific but low infra-specific variation rates (Table 3 and Table 4).Average infra and intra-specific distance, mean theta and coalescent depth were calculated to determine infra and intra-specific variation [Table 3 and Table 4].Wilcoxon signed-rank tests were performed.The distribution of intra-specific versus infra specific variability was evaluated by assessment of the DNA barcoding gaps.The average intra-specific distance, mean theta, and coalescent depth were calculated to evaluate the intra-specific variation [Table 3 and Table 4].Wilcoxon signedrank tests were used [Table 5 and Table 6].Infra-and intra-specific genetic divergences were calculated based on each putative candidate loci.To characterize intraspecific divergence it was necessary to invoke three different metrics.Genetic distances between cowpea variants were used to characterize intra-specific divergence.For each barcode, pairwise distances were calculated with the simplest K2P model followed by Wilcoxon Signed Rank Tests to compare infra-and intra-specific variability for every barcodes following Kress and Erickson [Table 5 and Table 6.DNA barcoding gaps were evaluated by comparing the distribution of infra-versus intra-specific divergences.Median and Wilcoxon Two-Sample Tests were used to evaluate any overlaps in the distributions with a view to establishing a suitable single and multilocus barcode for cowpea at sub-species level [Table 6].

DNA Barcoding Success and Levels of Variability
Overall, PCR amplifications were largely successful and any low quality sequences and dubious amplicons were excluded from the analyses [Table 2].Similar observations were made by [11]; [15].However, all the primer pairs designed for each DNA region proved highly successful [Table 1].

Evaluating the Feasibility of Using DNA Barcodes in Delineating Subspecies of Cowpea
Infra-and Intra-Specific Diversities Performances of each of the seven candidate DNA barcode loci was assessed by means of intra-and infra-specific diversity calculated from K2P (Kimura's two parameters) pairwise distance matrices [12].The highest intraspecific diversity was reached by rbcL [Table 3] followed by matK and psbL-psbK respectively.However, the lowest intraspecific distance was reported by rpoB [Table 3 and Table 4].The mean coalescent depth was slightly superior to the average of overall intraspecific distances because it takes into consideration only the highest distance.Results showed the highest mean of intraspecific differences was recorded by rbcL (Table 4).Three different metrics were used to characterize intra and infra-specific divergence between average of all the pairwise distances between all individuals sampled within the samples and mean theta with theta being the average pairwise distances calculated for each sample and the coalescent depth [Table 3   Legend: Table 3 above and Table 4 below: The intra-specific and infra-specific genetic divergences were calculated for each DNA barcode.Measures used include: the average pair wise distances between all the sampled within subspecies having a minimum of two representatives; the "mean-theta" where theta is the average pair wise distance computed for each species having more than one representative thus eliminating partiality due to uneven sampling among taxa; and the average coalescent depth which is the depth of a node linking all samples.The aforementioned parameters were deemed important in characterizing infra-specific divergence.the seven loci at single and multi-loci level for sub-species identification, BLAST1 searches and the nearest genetic distance were used (Table 3 and Table 4).The goal was to identify the most informative single locus candidate DNA barcode gene markers that show the best discriminatory power at the varietal level in common cowpea accessions.Our results revealed that at intraspecific level, rbcL [50.15%] possessed the highest identification efficiency among the seven loci followed by matK [27.76%] and psbK-psbL [23.77%] respectively at single locus level [Table 2 and Table 3 To circumvent the challenges associated with single locus approach, study took a multilocus analysis (MLA) as a useful option in delineating closely related cowpea variants based on the following multigenic combination matK + atpF-atpH; matK + trnH-psbA; matK + atpF-atpH + psbKpsbI; matK + psbK-psbI; matK + trnH-psbA + atpFatpH and matK + trnH-psbA + psbKpsbI.It is worth noting that the success rates of combined barcodes were higher than those of the single locus for intraspecific variation as well as infraspecific variation as expected with matK + psbK-psbL + atpF-atpH and psbK-psbL + atpF-atpH + matK giving the highest identification at 108.6% [Table 3 and Table 4].Overall, however, the means of all infraspecific distances were significantly lower than those of intraspecific distances as expected [Table 3 & Table 4].
Overall, the findings of this study reveal the novelty of chloroplast genes in delineating the diversity of cowpea at sub-species level.The superiority of rbcL as a suitable single loci candidate in delineatingcowpea variants is demonstrated and continues to reveal multigenic combinations as being equally informative.

Single Locus Approach [SLA]
The infra and intraspecific genetic divergence was inferred based on the seven candidate DNA barcode loci at a scale of 0.001 distance units [Figure 1 for single loci and  2].In this study however, no distinct barcoding gaps typical of CO1 may have been reported but it lends credence to a clearly defined range where the infraspecific variation is significantly lower than the intraspecific divergence as expected.Out of the seven candidate loci, rbcL loci reveal a relatively well separated distribution followed by matK at single locus level [Figure 1 & Figure 2].Furthermore, it was confirmed that the intraspecific divergences of all the seven loci was significantly higher than that of the corresponding infraspecific variations by Wilcoxon two-sample tests [Table 5 and Table 6].

Multi Locus Approach [MLA]
Previous studies have raised concerns about SLA in discriminating between closely related organisms [9]; [12]; [15]; [25]; [26] and [28].To circumvent this, a multilocus approach (MLA) was employed in delineating infra and intra-specific genetic divergencebetween the samples based on the following multigenic combination matK + atpF-atpH; matK + trnH-psbA; matK + atpF-atpH + psbKpsbI; matK + psbK-psbI; matK + trnH-psbA + atpFatpH and matK + trnH-psbA.Overall, the findings of this study reveal the informativeness of chloroplast genes in delineating the diversity of cowpea at sub-species level and continue to reveal further that MLA is more informative [Tables 3-5].The superiority of rbcLas a suitable single loci candidate in delineating cowpea variants is demonstrated, and continues to reveal multigenic combinations as being more informative.Two clear peaks appear distinguishable albeit some overlap of intra-and infra-specific distances [Figure 2].These observations are confirmed by median and Wilcoxon two samples statistical tests differentiating the medians for the former and the medians plus the distributions between the infra-and intra-specific distances for the latter Table 5 and Table 6].For each distribution, Median and Wilcoxon two sample tests were largely significant which agrees with studies by [12].The histograms above give a visual impression of the bar-coding gap for each potential marker.A good marker for DNA bar-coding should have "good" gap and no overlaps between the peaks [12]; [28], [38].However, a marker with overlaps between the extreme peaks is not necessarily a bad candidate for bar-coding.Therefore to test this, a statistical tool was employed to evaluate each of the markers and the different marker combinations as a potential barcode candidate despite the overlaps and issues with the bar-coding gap.The use of the Wilcoxon signed rank tests was specifically due to the fact that our data on the Kimura Two Parameters Pairwise Distances (K2P) do not assume a normal Gaussian distribution as well as sample means did not assume a normal distribution.So instead of testing for the difference between the means we tested for the difference in distribution.Consider the case of matK vs atpF-atpH below; W+ = 25350, W− = 31600, N = 1446, P = 0.0808, matK = atpF-atpH.Looking at the histograms, the marker atpF-atpH seems to be more superior to matK because matK has a lot of overlaps.However upon conducting the Wilcoxon Signed Sum Rank test comparing the data on the two markers, we find that there is no significant difference in the distribution of the datasets within the two markers since P = 0.0808 which is more than the P value threshold at 95% confidence (P = 0.05).In Table 7 above, a comparison of intra and infraspecific distance medians was explored in order to determine the best barcode marker.In this table, the P value for the median test is the probability that the difference between intra and infraspecific  divergence does not occur for example, the P value for matK is 0.0387 signifying a 3.87% type two error rate.Therefore, in determining intra and infra specific distances, matK has an accuracy of more than 95% where P < 0.05.This is confirmed by the Wilcoxon test where P = 1.66 −5 which is the type one error rate.Compared to the marker atpF-atpH, the median test P value = 0.8473 signifying an 84.73% error rate with Wilcoxon P value = 0.1523 making it an unsuitable marker for delineating intra and infraspecific distances.

Conclusion
This study sought to investigate the plastid sequences in Kenyan cowpea variants looking for the loci that could be used for delineating different phylogeographic groups.
Based on the results presented here, the study concludes the best locus combinations for DNA-barcoding of Kenyan cowpea.The evidence presented here clearly demonstrates the overall utility of DNA barcoding in delineating molecular diversity of Ke-nyan cowpea at sub-species The results of this study demonstrate the informativeness of plastid region in delineating intra and infra-specific distances at single loci level; matK, trnH-psbA, psbK-psbL, and rbcL.rbcL and matK distinguish themselves as ideal barcodes at single loci level.However, among the combinations, matK + trnH-psbA, rpoB + atpF-atpH + matK appear to be the best barcodes in delineating genetic distances.The current study however demonstrates that using combinations of DNA barcodes [MLA] improves accuracy of delineation.This study therefore recommends a multi locus approach in delineating cowpea at varietal level.

4. 1 .
DNA Extraction, PCR Amplification, and Sequencing 57 cowpea accessions from different phylogeographic locations of Kenya deposited in the National Gene Bank of Kenya were used for this study.The accessions were collections from different agro-ecological zones of Kenya.Seeds were planted in the green house at Masinde Muliro University of Science and Technology under strict conditions.

Figure 2
Figure 2 for multiple loci combinations].The goal was to identify the most informative single locus candidate DNA barcode gene markers that show the best discriminatory power at varietietal level in common cowpea.DNA barcoding gaps were evaluated by comparing the distribution of intra-versus infra-specific divergences [38].Median and Wilcoxon Two-Sample Tests were used to evaluate whether these distributions overlapped.The assessment of the informativeness of each candidate loci was therefore done by analyses of intra and infraspecific K2P distance matrices.The purpose was to delineate the barcoding gap.It is noteworthy that the distance distribution for each single loci gene displayed the characteristic peaks [Figure 1 & Figure 2].In this study

Table 1 .
List of cpDNA genes/intergenic spacers amplified in the present study including primers and approximate amplicon lengths.
search-megablast and the expectant value set at 10 −9 .The sequence that had the lowest expectant value (E-value) and the highest identity score was considered to be a similar sequence.The NCBI taxonomy tool (https://www.ncbi.nih.gov/taxonomy) was then used to determine the complete classification of the sequence which was confirmed in the International Plant Names Index website (https://www.ipni.org/ipni/plantnamessearchpage.do).

Table 3 .
Measurements of intra-and infra-specific K2P distances matrices for four potential barcode regions, three intergenic spacers and multi locus combinations

Table 4 .
Measurements of intra-and infra-specific K2P distances matrices for three potential barcode regions, three intergenic spacers and multi locus combinations