Quantitative Analysis of Pathway Enrichment within Faba Bean Seeds RNA-Seq ( Vicia faba L)

Faba bean (Vicia faba L) seeds are an important source of plant protein for humans and animals. A total of 15,697 Differentially Expressed Genes (DEGs) with pathway annotation were discovered in RNA-Seq of the faba bean seeds. A total of 75 significant KEGG pathways abundance were discovered and 9 pathways were conserved within all genotypes. 41 significant pathways were found to be partially conserved within comparisons of 2 to 6 pairs of genotypes and 25 significant pathways were unique to single pairs of genotypes. There were 8 specific significant pathways discovered related to the faba bean seed Hydration Capacity trait and 9 specific significant pathways discovered related to the PSbMV seeds staining trait. The DEGs dem-onstrated the genetic distance between these varieties was confirmed by the breeding pedigree selection information and a PCA graph clearly illustrated the genetic distance within these genotypes.


Introduction
Faba bean (Vicia faba L) is one of the most important food and feed legume crops which provide a source of high protein for humans and animals, and it contributes to increased soil fertility through biological nitrogen fixation. It is an ancient crop and is cultivated by small holder Bronze age farms in the Mediterranean region [1]. Currently it is widely grown and the annual world production Rana, selected from Manafest//611/Manafest, medium-large seed, mid flowering, low level of seed staining due to PSbMV.

RNA-Seq (Quantification) and Bioinformatics analysis
RNA-Seq (Quantification) is used to analyze gene expression of certain biological objects under specific conditions [30] [31].

Total RNA Extraction
RNA extraction using RNeasy kit (Qiagen, Australia) according to the manufacturer's instructions. The RNA-Seq experimental process shows the steps for the experimental pipeline. During the QC step, Agilent 2100 Bioanaylzer was used to qualify and quantify the sample library. The library products were ready for sequencing on the Ion Proton platform performed by the Beijing Genomics Institute (BGI-Shenzhen).

RNA Solutions Preparation and Sequencing
The total RNA samples were first treated with DNase I to degrade any possible DNA contamination. Then the mRNA was enriched by using the oligo (dT) magnetic beads. Mixed with the fragmentation buffer, the mRNA was fragmented into short fragments (about 200 bp). Then the first strand of cDNA was synthesized by using random hexamer-primed reverse transcription. Buffer, dNTPs, RNase H and DNA polymerase I were added to synthesize the second strand. The double strand cDNA was purified with magnetic beads. End reparation was then performed. After the previous step, adaptors were ligated to the end of these fragments. Next, ligation products were selected by size and purified on TAE-agarose gel. Finally, the fragments were enriched by PCR amplification, then purified by magnetic beads and dissolved in the appropriate amount of Epstein-Barr solution. During the QC step, Agilent 2100 Bioanaylzer was used to qualify and quantify the sample library. The library products were then ready for sequencing via Ion Proton platform.

Bioinformatics Analysis Pipeline
Primary sequencing data that produced by Ion Proton, called as raw reads, were subjected to quality control that determined if a resequencing step was needed. After quality control, raw reads were filtered into clean reads which were transformed to fq format, and aligned to the reference sequences at the same time.
QC of alignment was performed to determine if resequencing was needed. The alignment data was utilized to calculate distribution of reads on reference genes and mapping ratio. If alignment result passed QC, we proceeded with downstream analysis including gene expression and deep analysis based on gene expression such as PCA/correlation/screening differentially expressed genes, and further perform deep analysis based on DEGs, including cluster analysis, Gene Ontology analysis and Pathway enrichment analysis.

Sequencing Data Assess
The original image data was transferred into sequence data via base calling, which is defined as raw data or raw reads and saved as BAM file. As the raw reads may contain low quality reads or adaptor sequences, preprocessing was necessary before starting further analysis.

Raw Data Statistics
As there are some adaptor sequences and/or low quality reads present in the raw reads data filtering was carried out to obtain high quality reads as the clean reads (clean data). Filtering steps were as follows: 1) Remove reads where length was less than threshold; 2) Trim reads adapter, if length of trimmed reads was less than threshold, then remove it; 3) Calculate the average quality of 15 bases from 3' end until the average quality was larger than 10, then trim the bases that have been counted.
After filtering, the remaining reads are called "clean reads" and used for downstream bioinformatics analysis.

Base Composition, Quality and Length of Clean Data
We performed quality control on clean data through drawing base composition, quality and length distribution charts. ATGC base content refers to the proportion of the total bases of the four types of bases. Through detecting the content of ATGC each base to measure the stability of library and sequencing eligibility.
The nucleotide distributions at each position were stable under the normal circumstances, non-AT or GC separation. Bases quality reflects the accuracy which can be affected by RNA-Sequencing, reagents or sample quality. If the percentage of the bases with low quality is low, then it indicated the sequencing quality is good.

Alignment Assessment
After data quality statistics, clean reads were mapped to gene reference and/or genome reference set.

Alignment Statistics
We used TMAP to map clean reads to gene reference and/or genome reference.
In general, the higher ratio of alignment, indicated the closer the genetic relationship between the samples and the reference faba bean species.

Sequencing Saturation Analysis
Sequence saturation analysis was used to measure the sequencing data of the sample. The number of detected genes increased with an increase in the number of reads. However, when the number of reads reaches a certain amount, the growth curve of detected genes flattens, which indicates that the number of detected genes tends to saturation.

Reads Distribution on Gene
If the randomness is good, the reads in every position would be evenly distributed. If the randomness is poor, reads preference to specific gene region will directly affect subsequent bioinformatics analysis. We used the distribution of reads on the reference genes to evaluate the randomness. A distribution of reads position on genes shows that the randomness of RNA fragmentation is better than cDNA fragmentation [32]. Therefore, we used the mRNA fragments to construct the library in the experiment.

Visualization of Genome Alignment
We provide genome alignment data in BAM format and recommend using IGV (Integrative Genomics Viewer) tool to visualize BAM file in different scales. IGV supports loading of multiple samples to do comparison in the same scale, and can view distribution of reads on the Exon, Intron, UTR, and Intragenic regions, which makes it very convenient and intuitional.

Gene Expression
Expression levels of individual Unigenes were quantified using the software package Sailfish [33]. Sailfish quantifies genes expression by k-mer, first, built index files use reference and length of k-mer, then computes maximum likelihood abundance estimates using the Expectation-Maximization (EM) algorithm for its statistical model to determine which transcripts are isoforms of the same gene. Expression level was measured in reads per kilobase per million mapped reads (RPKM) according to Equation (1): 6 3 10 RPKM 10 where C is the number of reads that are uniquely aligned to a specified gene (A), N is the total number of reads uniquely aligned to all genes and L is the length of the specified gene (A) in bases. The RPKM values can be directly used for comparing the difference of gene expression among samples. If there was more than one transcript for a gene, the longest one was used to calculate its expression level and coverage. Linear correlation of RPKM values was used to assess the robustness of experimental comparisons made between genotypes.

Deep Analysis of Gene Expression
For multiple samples, we can do more deep analysis based on gene expression to do a comprehensive assess on the whole project.

Correlation between Genotypes
We calculated the correlation value between each two genotypes based on RPKM results. According to the standard that Encode plan recommends, the square of the correlation value should be ≥0.92 (under ideal experiment environment and with reasonable samples).

Cluster Tree of All Samples
The distances of expressed genes in samples were calculated by Euclidean method. Meanwhile, the algorithm of Sum of Squares of Deviations was used to calculate the distance between samples so that a cluster tree could be built.

Cluster Analysis of Gene Expression
Genes with similar expression patterns usually have same the functional correlation. We performed cluster analysis of gene expression patterns with cluster [34] [35] and java Treeview software [36].

Screening of Differentially Expressed Genes (DEGs)
Screening for DEGs was based on the Poisson distribution method described by Audic & Claverie [37] and corrected P-values using the Bonferroni method [38].
Since DEGs analysis generates a large multiplicity of problems in which thousands of hypotheses (is gene x differentially expressed between the two groups) are tested simultaneously, correction for false positive (type I errors) and false negative (type II) errors was performed using the False Discovery Rate (FDR) method [39]. We used FDR ≤ 0.001 and the absolute value of Log 2 Ratio ≥ 1 as the threshold to judge the significance of gene expression difference. DEGs screening is aimed to find differentially expressed genes between (within) samples and perform further function analysis on them. We provide three optional methods for screening analysis of differentially expressed genes, respectively: Based on the analysis method of the Poisson distribution; Based on Noiseq package method; Based on EBSeq package method.
According to the analysis method of the Poisson distribution which screening DEGs between two genotypes. Referring to the significance of digital gene expression profiles, we have developed a strict algorithm to identify differentially expressed genes between two samples. Denote the number of unambiguous clean tags (which means reads in RNA-Seq) from gene A as x, given every gene's expression occupies only a small part of the library, x yields to the Poisson distribution with Equation (2): The total clean tag number of sample 1 is N 1 , and total clean tag number of sample 2 is N 2 ; gene A holds x tags in sample 1 and y tags in sample 2. The probability of gene A being expressed equally between two samples can be calculated with Equation (3): We do correction on P-value corresponds to differential gene expression test using Bonferroni method. Since DEG analysis generates a large multiplicity problems in which thousands of hypotheses (is gene x differentially expressed between the two groups) are tested simultaneously, correction for false positive (type I errors) and false negative (type II) errors are performed using FDR method. Assume that we have picked out R differentially expressed genes in which S genes really show differential expression and the other V genes are false positive. If we decide that the error ratio "Q = V/R" must stay below a cutoff (e.g. 5%), we should preset the FDR to a number no larger than 0.05. We use "FDR ≤ 0.001 [40] and the absolute value of Log 2 Ratio ≥ 1" as the threshold to judge the significance of gene expression difference. More stringent criteria with smaller FDR and bigger fold-change value can be used to identify DEGs.
Based on Noiseq package method (screening DEGs among two groups which contain biological replicates) Noiseq method [41] can screen differentially expressed genes between two groups, showing a good performance when comparing it to other differential expression methods, like Fisher's Exact Test (FET), edgeR, DESeq and baySeq. Noiseq maintains good True Positive and False Positive rates when increasing sequencing depth, while most other methods show poor performance. In addition, Noiseq models the noise distribution from the actual data, so it can better adapt to the size of the data set, and is more effective in controlling the rate of false discoveries.
First, Noiseq uses the sample's gene expression in each group to calculate log 2 (foldchange) M and absolute different value D of all pair conditions to build noise distribution model with Equation (4): Second, for gene A, Noiseq computes its average expression "Control-avg" in control group and average expression "Treat-avg" in treatment group. Then the foldchange (M A = log 2 ((Controlavg)/(Treat-avg))) and absolute different value D(D A = |Control-avg-Treat-avg|) will be derived. If M A and D A diverge from noise distribution model markedly, gene A will be defined as a DEG. There is a probability value to assess how M A and D A both diverge from noise distribution model with Equation (5): Finally, we screen differentially expressed genes according to the following criteria: Fold change ≥ 2 and diverge probability ≥ 0.8.

Deep Analysis of DEGs
Cluster analyses of DEGs with similar expression patterns usually have the same functional correlation. We perform cluster analysis of DEGs with cluster and java Treeview software [36]. Expression differences are shown in different colors. Red means up regulation and green means down regulation.

Pathway Enrichment Analysis of DEGs
Genes usually interact with each other to play roles in certain biological functions. Pathway-based analysis helps to further understand genes biological func-   with three replicated per genotype/experiment. The standard Hydration Capacity test consists of soaking 50 seeds in 150 ml RO water at 22˚C for 16 hours and determining the % change in weight due to uptake of water.
2) 20 samples of each faba bean genotype were sown in a growth room in 2018 and inoculated with PSbMV virus (mechanical inoculation with macerated infected field pea leaves mixed with carborundum powder and rubbed on the faba bean leaves) when they were at the two leaf growth stage. Faba bean leaves were collected 6 weeks after inoculation to validate infection by PSbMV using both ELISA assay and RT-PCR sequence. Seeds were harvested at maturity and assessed for staining due to PSbMV. Seeds were assigned to one of 6 categories rated as 0 for no staining to 5 for very severe staining and an index Fx of overall staining for the individual sample was calculated by the following Equation (6): where n = the number of seeds within a category, N = the total number of seeds.

Results and Discussion
Genes interact with each other to play roles in certain biological functions.
Pathway-based analysis helps to further understand the genes biological functions. KEGG (Kyoto Encyclopedia of Genes and Genomes) is the major public pathway-related database (http://www.kegg.jp/kegg/kegg1.html), which is used to perform pathway enrichment analysis of DEGs. This analysis identifies the most significant enriched pathway in DEGs as compared with the whole genome. Detecting the most significantly enriched pathway of DEGs, allowed us to see detailed pathway information in the KEGG database, no matter if the genes are up-regulated or down-regulated. We generated a scatter plot for KEGG enrichment results. Rich Factor is the ratio of number of differentially expressed genes in this pathway to the number of all genes annotated in there. The greater the Rich Factor means the greater the degree of pathway enrichment. The Q value is the corrected p value with the range 0 -1, where a lower value means greater intensiveness. The top 20 pathways enriched in comparisons between individual genotypes and between phenotype groupings are presented in each comparison (See Figure 1 for Farah vs Nura and Supplementary files Figures S1-S14). These results indicate that most significant enriched pathway is Metabolic pathways (ko01100) followed by the Biosynthesis of secondary metabolites (ko01110); Plant hormone signal transduction (ko04075); Plant -pathogen interaction (ko04626); ABC transporters (ko02010); Zeatin biosynthesis (ko00908); Stilbenoid, diarylheptanoid and gingerol biosynthesis (ko00945); Phenylpropanoid biosynthesis (ko00940); Phenylalanine, tyrosine and tryptophan biosynthesis (ko00400); Pentose and glucuronate interconversions (ko00040); N-Glycan biosynthesis (ko00510); mismatch repair (ko03430); Limonene and pinene degradation (ko00903); Isoflavonoid biosynthesis (ko00943); Glycine, serine and threonine metabolism (ko00260); Glycerolipid metabolism (ko00561); Flavonoid biosynthesis (ko00941); Fatty acid biosynthesis (ko00061); Caffeine metabolism (ko00232) and Base excision repair (ko03410).

Common KEGG Pathways Abundance Discovered within Faba Bean Seeds RNA-Seq
A total of 47,621 expressed genes were identified in faba bean seeds RNA-Seq [16]. There were 15,697 significant Differentially Expressed Genes (DEGs) with pathway annotation. A total of 266 significant or highly significant enriched pathways were discovered in faba bean seeds by single variety pair comparisons. This was reduced to 75 unique abundance pathways after removal of redundancies (See Table 1 and Table 2). In the comparison of individual pairs of genotypes, the most common pathways were Metabolic pathways (ko01100) and Biosynthesis of secondary metabolites (ko01110) where commonality of the pathways was significant to highly significant for 14 out of a total of 15 (93.3%) of all pairwise comparisons. Only one pair comparison (PBAZahra versus AF06125) was not significant within these two pathways. The second most common pathway was Zeatin biosynthesis (ko00908) which was significant for 13 out of 15 (86.7%) pair comparisons. Four pathways were significant for 9 out of 15 (60%) pair comparisons:   Note: *means 0.01 ≤ P ≤ 0.05, **means 0.001 ≤ P ≤ 0.01, ***means P < 0.001. () If the significance level is different in the same KEGG pathway within different pairs, the number in the bracket represents the number of pairs at the same level of significance of faba bean seeds in common in total of 15 pairs analyzed. 2. AF06125-VS-Nura High seed staining Note: *means 0.01 ≤ P ≤ 0.05, **means 0.001 ≤ P ≤ 0.01, ***means P < 0.001.

Trait Related KEGG Specific Pathways Abundance Discovered within Faba Bean Seed
The comparison of genotypes grouped on the basis of particular seed traits was much more specific than the comparison of individual genotype pairs. Regarding the seed trait Hydration Capacity the comparison between the Lowest and the Highest groups identified 5 pathways where abundance was highly significant: Zeatin biosynthesis (ko00908), Ribosome biogenesis in eukaryotes (ko03008), RNA degradation (ko03018), Terpenoid backbone biosynthesis (ko00900), and Homologous recombination (ko003440), while three pathways were significant: Biosynthesis of secondary metabolites (ko01110), Porphyrin and chlorophyll metabolism (ko00860), and Stilbenoid, diarylheptanoid and gingerol biosynthesis (ko00945) (See Figure 1). For the comparison between Intermediate Hydration Capacity and Highest Hydration Capacity, there was only one pathway that was highly significant: Base excision (ko03410), and there were two significant pathways: Cutin, suberine and wax biosynthesis (ko00073) and Zeatin biosynthesis (ko00908) (See Figure 2(a) and Figure 2(b)).

PCA Analysis Result
Principal component analysis (PCA) display the distance of the relationship between each sample, including visual effect of clusters groups. The PCA 3D Figure (See Figure 4) shows the genetic distance relationship between these genotypes generated by the DEGs. The closest genetic distance of these samples is between Farah and PBA Zahra and the second closest variety to Farah is AF06125. PBA Zahra and AF06125 are close in the same vertical line. The third closest variety to Farah is Nura, and then PBA Rana, and the most distant variety to Farah is PBA Warda. These results are perfectly matching the breeding selection pedigree information. Farah is PBA Zahra's paternal parent (50%) and Farah is also in the pedigree of AF06125 (12.5%). This figure also explains the much lower number of significant common pathways between PBA Zahra and AF06125; about one third of average pathways number compare to all other pairs comparison in Table 2.

Validation Outcome
Hydration Capacity The hydration capacity of AF06125 was highest with an average of 95.4% and PBA Zahra was lowest with 74.3%. The data were for samples harvested from two sites growing in South Australia from 2007 to 2017. The other four samples were intermediate. (See Table 3) These results supported the pathway abundance of Hydration Capacity index samples information.   PSbMV Index The PSbMV Index of AF06125 was lowest on 33% and PBA Zahra was highest on 73%, where the others were intermediate (See Table 4). The ELISA test and RT-PCR test were both positive for PSbMV virus indicating that all varieties are susceptible to PSbMV (See Table 5). These results confirmed the information of pathway abundance of PSbMV index samples information.
There are reports of pathway studies on faba bean at the seedling and flowering stages, however there are no prior reports of pathway studies on mature faba American Journal of Plant Sciences   bean seeds. Hence the pathways information reported here is the first deep pathway analysis of faba bean seeds RNA-Seq. It can be a source of knowledge for future studies on faba bean seeds. These results reveal the genetic distance within these varieties and genomic information for faba bean seeds, and fill the knowledge gaps of faba bean seed pathways. The KEGG pathways information will help understanding the faba bean seed genes functional activities. The genetic distance of these varieties are confirmed both by DEGs generated from RNA-Seq and breeding pedigree information.
There are pathway analysis studies on other plants and crops such as rice, maize, and chickpeas.
Due to the limited number of faba bean genotypes using for sequencing in this report, it may not fully identify all the pathways information in the faba bean seeds. Hence, investigating more genotypes in future sequencing for analysis would be recommended.
RNA-Seq (Quantification) is a cost-effective quantification method that produces high reproducibility, high accuracy and wide dynamic range. It can be applied in drug response, biomarker detection, basic medical research, and drug R&D. And also applied in gene expression analysis, differential gene expression analysis, expression profile analysis of DEGs, and Gene ontology classification and pathway enrichment analysis.

Conclusion
In summary, there are a total of 75 significant or highly significant KEGG pathways discovered within these faba beans seeds RNA-Seq. There are a total of 9 significant pathways (over 53.3% pairs) that are conserved in abundance within all the seeds, 41 significant pathways found within 2 to 6 pair comparisons and 25 significant pathways were unique to single pair comparisons. There were 8 specific significant pathways associated with the faba bean seed Hydration Capacity trait and 9 different specific significant pathways associated with the PSbMV seeds staining trait. The seeds hydration level has been validated by 10 years of standard hydration capacity testing. The seeds staining level of these samples has been validated by the field observation in South Australia and a growth room inoculation test in Adelaide. ELISA and RT-PCR experiment confirmed that inoculated plants were positive for PSbMV. The genetic distance between these varieties in the PCA 3D graph confirmed breeding pedigree selection information.

Ethics Approval and Consent to Participate
Not applicable.

Consent for Publication
Not applicable.

Availability of Data and Materials
BioProject ID PRJNA319071 and RNA-Seq (Quantification) data are deposited at National Centre for Biotechnology Information (NCBI) gene bank. Reference number is SRA accession: SRP074308. The faba bean Assembly data deposited in Figshare: The data DOI is: Digital Object Identifier 10.6084/m9.figshare.4910039.