Prediction of Monophyletic Groups Based on Gene Order and Sequence Similarity in Organelle DNA

Organelle genomics has become its own field of study. Much information can be gleaned from the study of cell organelles. The differences in the genomes of organelles, such as the mitochondrion and the chloroplast are amenable to phylogenetic and cladistic studies. These differences include the genome sequence, GC%, genome length and gene order. The conserved nature of the organelle genomes and the gene inventory of both mitochondrial and chloroplast genomes also make this easier to accomplish. This paper includes a review of existing organelle genome software. These include gene annotation and genome visualization tools as well as organelle gene databases for both mitochondrion and plastid. A new R tool, available on github, called “Orga-nelle DNA Lineages”, or ODL, was written to compare and classify organelle genomes based on their genome sequence and gene order. The software was run on the mitochondrial genomes of a set of 51 cephalopod species, deline-ating ten separate monophyletic groups, including argonauts, nautiluses, oc-topuses, cuttlefish, and six squid groups. This new tool can help enrich and expand the field of organelle genomics.


Introduction
The DNA inside mitochondria and chloroplasts can very useful in analyzing phylogenetic relationships between species. Several characteristics demonstrate why organelle DNA is amenable to such analyses. But despite its small size, organelle DNA is easy to isolate and sequence. Many species which do not have (OXPHOS), the tricarboxylic acid cycle (TCA), the β-oxidation of fatty acids, calcium handling, regulating apoptosis, and participation in the cell cycle. Because mtDNA lacks histones, it is more prone to molecular stress. Damaged mtDNA lies behind a number of human diseases [5] [6] [7]. Other illnesses related to mitochondrial dysfunction include skeletal muscle dysfunction, lung injury, acute renal failure, and immune function dysregulation [8].
An example of the classical mtDNA genome is that of human (NC_01290, Figure 1), which is 16,569 bp long and has 37 genes, coding for 13 protein subunits, 14 tRNAs and 2 rRNAs [9]. The 13 protein subunits form the oxidative phosphorylation complexes I, II, IV and V [7]. The mtDNA consists of two strands, the heavy (H) strand and the light (L) strand, which are differentiated based on their GC-content, the H-strand having a higher GC-percentage. This correlates to the sense and antisense terminology. The codon usage of the mtDNA also differs from that of the nuclear genome (see Table 2), and also has a ten-fold higher mutation rate [10]. The thirteen protein subunits coded on the mtDNA all take part in energy metabolism. A list of genes in the mitochondrial genome can be seen in Table 3.
The mtDNA of different groups can also differ based on the presence or absence of specific genes. These genes serve as diagnostic markers of these groups. For example, several algal lineages have a DnaB helicase encoded in the plastid genome [11]. Opisthokonts (animals and fungi) use DNA polymerase γ (Polγ) for the replication of the mtDNA [12]. The number and type of DNA polymerase also varies across eukaryotes. For example, parasite apicomplexans have lost them altogether [11].   Besides genes, the presence or absence of certain DNA motifs can help classify species into different groups because of their genome structure. They are abundant, have a simple genomic structure, and are also conserved, and can be used as genetic markers [13]. For example, Xi Wen et al. [14] found that 218 simple sequence repeats (SSR) in the plastid genome of M. grandiflora correlate with three other Magnoliaceae species. These SSRs range from mononucleotides to hexanucleotides.
In contrast with the mtDNA, the plDNA is much larger than the mtDNA, and contains around 250 genes, including tRNAs and rRNAs, involved in photosynthesis and plastid gene regulation. Xiao-Ming et al. [15] found that in a study of 126 plDNA genes in 272 angiosperm species, 106 (84.1%) were found in 245 species (90.1%). Genome sizes range from 10 -250 kbp in size, and can be present inside the plastid between 100 to 1000 copies Kbp [16]. Some genes in the plastid genome of some species also have introns, such as in Magnolia grandiflora [14]. Similar to the mtDNA, the plDNA is inherited from one parent only, except for 20% of angiosperms, where plastids are inherited from both parents.
Plastid genes are also double in number and three-fold in silent substitution rates as compared to mitogenomes [17]. Plastid genomes are also conserved in structure and are poor in repeat elements. There are several types of plastids, which fulfil different roles in the plant. The well-known chloroplasts harvest light energy during photosynthesis. Chromoplasts are colored chloroplasts which store different pigments. Leucoplasts are colorless plastids, which store proteins, fat, monoterpenes, tannins, polyphenols, and other secondary metabolites [18].
In this paper, a new resource called the Organelle DNA Lineages (ODL) software will be described and compared to existing organelle DNA resources. The software removes duplicate species entries. Using the "getAnnotations-GenBank" function from the "ape" package, it downloads and stores the annotation for each accession. Using the "read.GenBank" function, it retrieves the genome sequence for each accession. The software adds all of the organelle genome sequences into a DNAStringSet, and then calls the "msa" function to create a multiple sequence alignment. Then, in an all-versus-all pairwise fashion, the script calculates the sequence similarity between all possible pairs of genome American Journal of Molecular Biology sequences. The sequence similarity values are then stored in a symmetric square sequence similarity matrix.

Description of Software
The gene, tRNA and rRNA elements are also written to an output file, which is compatible with the input format of the CREx software [19]. The output can be directly uploaded to the CREx website at http://pacosy.informatik.uni-leipzig.de/crex. The GenBank records for all sequences are also written to a subdirectory as part of the output. These can be used as input for the GenomeVx visualization software [20].
A gene order distance matrix is calculated for each species pair by comparing the indexes of each gene/tRNA/rRNA between the two species using the following equation: In Equation (1), the difference between the mean index value of element i (meaning genes, tRNAs, and rRNAs) between species a and b is summed up for all common elements between the two species. The sum also includes the unique number of elements pertaining to species a and b, respectively. Figure 2 depicts a simple example of calculating the gene order distance between three hypothetical species. For example, the distance between species 1 and species 3 is 3 + 1 + 2.5 + 3 = 9.5. The distance in the index value for gene A It is possible that a certain gene/rRNA/tRNA is present in multiple copies. In this instance, the average index value is taken for that particular gene/rRNA/tRNA.
The distance value between two species may be greater than 1, since we are Figure 2. Schematic diagram depicting how gene order similarity is calculated between species. In a given study, the gene order distance between all possible species pairs is calculated. Species within the same monophyletic group have a similar gene order (species 1 vs. species 2). Species from two different groups have different gene orders (species 1 vs. species 3, and species 2 vs. species 3). The distances are calculated according to Equation (1), and then normalized to get a number between 0 and 1.
The genome sequence similarity matrix and the gene order similarity matrix are then visualized on a heatmap. Furthermore, these two matrixes are also combined into a "combined" matrix, giving equal weight to both the genome sequence similarity matrix and the gene order similarity matrix. The "heatmap" function is called, using the "ward.D2" method to perform species clustering. A Silhouette plot is also created by the software, which depict the average silhou- Clustering is performed using the "ward.D2" method on the sequence similarity matrix. Several statistics measures are written to an output file, such as the Hopkins clustering statistic, which denotes how well the matrix forms clusters.
Other parameters are also recorded for each of the clusters (which correspond to monophyletic groups), including the number of species in the cluster, the average oDNA length ± one standard deviation, the minimum, mean, and maximum similarity value, as well as the standard error of the mean (SEM), as well as the p-value, which calculates statistical significance between similarity values between species within the cluster and between species within the cluster and outside the cluster. This p-value is a good measure of discontinuity between monophyletic groups.
The main output of the software is a gene order map, showing the order of genes/rRNAs/tRNAs along the organelle genome per species per cluster. A color-coded legend helps the user identify a given gene/rRNA/tRNA along the organelle genome of a given species.

Other Software
GenomeVx was used to create the image of the human reference mitogenome in

Used Sequences
The mitogenomes for 51 cephalopod species were downloaded from the Organelle Genome Browser, specified in the previous subsection. If a given species occurred more than once, only one of its accession number was randomly selected.

Review of Existing Organelle Software and Databases
There are a lot of different databases and programs which are tailored for storing and visualizing genome and synteny. Some of these programs include software American Journal of Molecular Biology which are amenable for the visualization of organelle genomes and gene order.
In the following, some of these software and databases will be reviewed. A list of the most commonly used mitogenome and plastid genome annotation tools are listed in Table 4.
GenBank and RefSeq are the standard databases used for storing sequence data, including organelle genomes. Refseq offers curated and non-redundant data for users. Several databases exist, which are built upon GenBank and RefSeq, which store data derived from these databases and which have improved annotation and data quality. Such databases include OGRe [21], MetAMIGA [22], MitoZOA [23], Mamit-tRNA [24], MamMiBase [25], and tRNAdb [26].
CREx is a software which calculates the hypothetical most parsimonious gene order rearrangement between two organelle genomes, and then visualizes the output. Possible rearrangements include transpositions, reverse transpositions, reversals and tandem duplication random losses (TDRLs) [19]. GenomeVx is a web-based software which allows the user to create editable, colorful, publication-ready images of circular genomes, such as that of mitochondrial and plastid genomes as well as large plasmids. As input, it takes raw feature positions or GenBank files, which the user can upload to the GenomeVx website [20]. Mitofish is a database of annotations and re-annotations for mitochondrial genome for a large number of fish species. Its companion program, Mitoannotator is an extremely rapid, high quality annotation pipeline for fish mitogenomes [28]. The MITOchondrial genome annotation Server (MITOS) is a well-known, high-quality mitochondrial genome annotation pipeline. MITOS uses standardized gene names and gene boundary designations. It uses BLAST to compare existing mitochondrial proteins to the raw genome sequence to annotate the location of these genes. It also has a web server where users can upload their raw genome sequences. After selecting a translation code, the server will calculate and present results for the user. These results include tabular summary of annotated genetic elements and visualization of genes within the uploaded genome sequence. The results can then be downloaded by the user [29] [30]. DOGMA is another web service, which uses BLAST against internal databases to detect protein-coding genes and also rRNAs, and use tRNAscan-SE to discover tRNAs, however, according to Guyeux et al. [31] it is outdated. DOGMA and MITOS both use metazoan databases data [29] [32].
The ChloroMitoSSRDB database is an open-source repository of perfect and imperfect microsatellites, two to six nucleotides long. Information include the position of repeats, size, motif and length polymorphisms. The repeat sequences are hyperlinked to annotated gene regions at NCBI [38].
The current software differs from all the previous methods in that it clusters the organelle genomes based on both gene order and sequence similarity. It then depicts a linear organelle genome map for each cluster and species in the study showing the position of each gene. The software also produces accompanying statistics files and also heatmaps showing species relationships based on gene order and sequence similarity. Other software calculates genome distance by the number of rearrangements needed to transform the organelle genome of one species into another [39]. The present software does this by calculating the difference between the index of the order of each gene.

Analysis of Cephalopod Mitogenomes
Since the main goal of this paper is to present a new bioinformatics tool, usage of the ODL software will be showcased here in the mitochondrial mapping of cephalopods. Cephalopods are a class of species in the phylum Mollusca (mollusks). They have two subclasses, Nautiloidea (nautiluses) and Coleoidea, which is made up of two superorders, Decapodiformes (squids and cuttlefish), and Octopodiformes (octopuses and argonauts). The mitogenomes of 51 species as well as the outlier American Journal of Molecular Biology Danio rerio were analyzed to discover putative monophyletic groups.
After running the software, ten putative monophyletic groups were discovered, besides Danio rerio, the outlier species. A list of species and accession numbers can be found in Supplementary File 1. The statistics for each cluster can be seen in Table 5. The matrixes, clusters, and statistics for the mitochondrial gene order, sequence similarity, and combined analyses can be found in Supplementary Files 2 -4. Here results for only the combined matrix are reported. The mitogenome map can be seen in Figure 3, and the corresponding combined heatmap can be seen in Figure 4. The Hopkins clustering statistic for the combined matrix is 0.81, which denotes good clustering. Supplementary   Figures 1-3 show the Silhouette plot for clustering based on gene order similarity, mitogenome sequence similarity, and combining both methods.  Each pixel in the heatmap shows how similar a pair of species is. The heat map is symmetrical, mirrored across the diagonal going from bottom left to top right. Darker, redder colors denote species with high similarity to one another, while brighter, yellow colors denote species which are less similar to one another. The heat map shows the same ten clusters shown in Figure 3.

The Ten Cephalopod Clusters
Argonauta and Nautilus form two separate clusters. Nautilus also has a significantly large non-coding region between the tRNAs for glutamine (Q) and threonine (T) [40]. Nautilus also has a mean GC% of 40% ± 0.26%, whereas for Argonauta this  [41] is classified as a member of this group.
Ten species of cuttlefish from the genera Sepia and Sepiella also form a monophyletic group. Their gene order is very conserved, but very different from all other cephalopod groups. The genome length is also very conserved, between 16,163 to 16,244 bp. The GC% is 24.6% ± 1.47%. Takumiya et al. [42] also found significant differences between coleoid cephalopods based on mtDNA, separating cuttlefish from all other groups. Following are several squid groups, including four monophyletic groups and two species which don't belong anywhere else. The first of the four groups is Loliginidae (pencil squids), with species from the genera Doryteuthis, Heterololigo, Loliolus, Sepioteuthis, and Uroteuthis. The species Sepioteuthis lessoniana is an outlier due to the translocation of three tRNA sequences (for isoleucine, valine and tryptophan) and l-rrna dn s-rrna in its mt genome. Yokobori et al. [43] also found that this squid group is monophyletic.
Four species, Architeuthis dux, Dosidicus gigas, Stenoteithis oualaniensis, and Todarodes pacificus form a separate group. These four species have a mean GC% of 29.8% ± 1.4% and a genome length from 20,254 to 20,331 bp. Xu, et al. [44] also separate them from all other squids.
Bathyteuthis abyssicola (the deepsea squid) belongs to its own family, Bathyteuthidae, in the suborder Oegopsina. It possibly even belongs to its own order, according to Uribe and Zardoya [45]. Kawashima et al. [46] found that the mitogenome structure of Semmirossia patagonica (Patagonian bobtail squid), from the family Sepiolidae is unique among decapods. For example, like Nautilus species, the ATP6 and ATP8 genes are not adjacent to one another. Ommastrephes bartrami (the neon flying squid) and Watasenia scintillans (the firefly squid) and Chiroteuthis picteti and Ilex argentinus cluster together into the same putative groups.

Conclusion
Studying organelle genomes is a new and expanding field of genomics research. Several tools have been devised to determine phylogeny based on gene order. Several tools and databases exist to annotate and visualize organelle genomes and to store organelle sequences. The present software, Organelle DNA Lineages was designed to cluster species into monophyletic groups based on genome sequence similarity and gene order and visualize the results. The software was run on a set of 51 cephalopod species and found ten clusters. This software can help expand the field of organelle genomics.