Multiple Chromosomes in Bacteria: Low Level of Evolutionary Constraint Drives the Rapid Genetic Divergence of Chromosome II

Multiple chromosomes in bacteria are designated as a larger primary chromosome (CI) and smaller accessory chromosomes (CII and CIII). Although previous studies examined multiple chromosomes in several bacterial species, the evolutionary mechanisms for the origin of CIIs still remain unclear. In this study, the four following hypotheses were tested. 1) CIIs exhibit lower sequence conservation and sequence divergence compared to their corresponding CIs across species of Proteobacteria. 2) The differential sequence divergence of CI and CII depends on pathogenic and non-pathogenic lifestyles. 3) CIIs harbor a higher level of horizontal gene transfers (HGTs) than CIs. 4) Orthologs located on CIIs experience less purifying selection than their corresponding orthologs on CIs. Results reveal a higher level of sequence conservation of CIs than the sequence conservation of CIIs. There is no significant difference in HGT estimates between CIs and CIIs. A majority of orthologous genes of CIs and CIIs experience purifying selection; however, genes on CIIs were significantly less constrained than the corresponding ones on CIs. This finding is true for both pathogenic and non-pathogenic bacteria,


Introduction
There is a long-held paradigm that the bacterial genome is comprised of a single circular chromosome and additional dispensable non-essential plasmid(s). This view has been revised because of the existence of multiple chromosomes in several bacterial species including Rhodobacter sphaeroides 2.4.1 [1] [2], Brucella melitensis 16M [3], Agrobacterium tumefaciens C58 [4], Leptospira interrogans Verdun and RZ11 [5], Paracoccus denitrificans Pd1222 [6], and Vibrio cholerae AI1837 [7]. The explosion and impact of genome sequencing technology [8] [9] further illuminated the presence of accessory chromosomes in ten percent of the total sequenced bacterial genomes in the NCBI database, and these species were widely distributed throughout different lineages [10]. Therefore, the existence of multiple chromosomes is now an accepted paradigm of the bacterial genome structure.
Three strains of R. sphaeroides (2.4.1, ATCC 17025, and ATCC 17029) have been previously examined for their sequence divergences and the result indicated that genes on CIIs have lower nucleotide identity, which suggests the rapid evolution of CIIs in R. sphaeroides [11]. The study was further expanded among strains of both within-species and between-species of Proteobacteria, including species of α-, β-, and γ-Proteobacteria [12]. Out of the ten within-species comparisons, eight organisms displayed higher levels of CII-specific DNA sequence divergences. Also, trans-genera pairwise comparisons indicated an average of ~42% nucleotide identity among CIs and a significantly lower nucleotide identity of ~27% among CIIs. The previous studies further demonstrated a high divergence and rapid evolution of CII-specific sequences compared with CI-specific sequences in a majority of bacterial species [10].
The current study tests the following four hypotheses. 1) CIIs exhibit lower sequence conservation and sequence divergence compared to their corresponding CIs across species of Proteobacteria. 2) The differential sequence divergence of CI and CII depends on pathogenic and non-pathogenic lifestyles. 3) CIIs harbor a higher level of HGTs than CIs; thus they diverge more rapidly. 4) Orthologs located on CIIs experience less purifying selection than their corresponding those on CIs.

Comparative Genome Alignment
One hundred bacterial strains (see Supplementary Table S1) with multipartite genomes were identified using the National Center for Biotechnology Information (NCBI) database, of which 83 of the strains belong to the Proteobacterial phylum. A total of 16 species that contain two or more strains and represent α-, β-, γ-subgroups of Proteobacteria were selected for whole genome alignment.
They include 11 pathogenic and 5 non-pathogenic species (see Supplementary   Table S2).
DNA sequence files (in FASTA format) for CI and CII from 50 strains were downloaded from the NCBI database. Mauve 2.3.1 was utilized to obtain whole genome alignments of the 16 Proteobacterial species CIs and CIIs, separately.
Both percentage conservation from the local collinear blocks (LCBs) and percentage nucleotide identity from the entire conserved regions were extracted. A paired t-test was performed to assess the significant difference in percent conservation and percent nucleotide identity between CIs and CIIs. These results were further classified by subgroup and lifestyle. Again, a paired t-test was used to estimate the significant difference in percent conservation and percent nucleotide identity between CIs and CIIs.

Estimation of HGT
HGTs in CIs and CIIs were estimated using Island Viewer [14] which identifies genomic islands (GI) using three methods. Two of the three are sequence composition-based methods, namely, SIGI-HMM and Island Path-DIMOB. The third one, Island Pick, is a comparative genomics method that assesses putative islands by performing alignment of closely related genomes using Mauve [13].
SIGI-HMM, based on a hidden Markov model (HMM), exploits codon usage bias to discriminate between the native and GI sequences. Island Path-DIMOB recognizes GIs by common characteristics, such as mobility genes (e.g. integrases, transposases) and anomalous sequence composition. Island Viewer takes an integrated approach to increase the accuracy; it also allows access to results from each component method.
A recently published genome-mining tool, called GEMINI [15], based on a recursive segmentation and clustering procedure, was utilized to quantify HGTs.
The framework recursively segments a genome sequence into two parts at each position where the compositional difference between the resulting sequence segments is maximized in terms of information-theoretic measure.

Selective Constraints on Genes in Multipartite Genomes
To quantify the evolutionary divergence of orthologs shared by multiple strains of the same species, BLASTP was employed to compare genes of CI and CII separately within and across species. A total of the 50 strains representing the 16 species within α-, β-, and γ-subgroups were analyzed for further determination of the selective pressures on CI and CII, separately. Each comparison began by pre-processing which gives specific instruction for file extraction, preparation, processing, and storage. The query and target sequences in GenBank flat file format (.gbk) were downloaded from the NCBI database and converted to FASTA (.faa) format. The target files in FASTA format were organized into a local database for sequence comparison. Then, BLASTP was initiated and the query sequence was compared to all sequences in the target local database.
Post-processing converted aligned results from the previous step into files available for further analysis. Next, orthologs for CIs and CIIs were identified with BLAST E-values < 10ˉ2 0 and aminoacid identity > 30%. Orthologs satisfying these criteria were used to determine whether CI and CII have different modes of selection.
The rates of non-synonymous substitution (Ka), synonymous substitution (Ks), and the selective constraints (ω = Ka/Ks) were calculated using KaKs_Calculator [18], which requires DNA sequences and computes Ka, Ks, and ω, considering biases in transition to transversion ratio and the codon redundancy. The amino acid sequences corresponding to their nucleic acid sequences of orthologs were aligned using MUSCLE [19] and then reverse translated to DNA sequences using PAL2NAL [20]. The resulting nucleic acid sequences were used as an input to the KaKs_Calculator, and then Ka, Ks, and ω were calculated. Negative selection is inferred if ω < 0.3, implying that the orthologs likely have maintained identical function. Neutral selection is inferred if 0.3 < ω < 3, suggesting orthologues are nearly equivalent and more likely to perform similar functions. Positive selection is inferred if ω > 3, allowing new gene functions to evolve.

Multipartite Genome Evolution in Prokaryotes
Chromosome size, gene count, and gene density (genes/kb) on CIs, CIIs, and CIIIs of the 100 bacterial strains are shown in Supplemental Whole genome alignments for the related strains of the species within each α-, β-, and γ-subgroup in Proteobacteria were examined. As an example of visual representation, the whole genome alignment of CIs and CIIs for the four strains of R. sphaeroides is depicted in Figure 1. The colored blocks represent local collinear blocks (LCBs), which are areas of conserved regions between four strains and the vertical lines link each matching LCB displaying large-scale chromosomal alterations, such as large deletions, insertions, translocations, and inversions. The alignment results indicate 80% conservation of CIs, while 27% sequence conservation of CIIs among all the four strains of R. sphaeroides. Also, inversions, deletions, and translocations are more abundant on CIIs, possibly contributing to the rapid divergence of CIIs in R. sphaeroides.
The whole genome alignment results, including the percent conserved regions and percent nucleotide identity for CIs and CIIs in the 50 strains within 16 species, are depicted in Figure 2(a) and Figure 2(b). Also, the data associated with  and CII (in green) among the 11 pathogenic species (left) and the 5 non-pathogenic species (right). The whole genome alignment for the pathogenic strains was completed using 35 sequenced multipartite genomes within 4 genera of α-, β-, and γ-subgroups of Proteobacteria. Also, the whole genome alignment for the pathogenic strains was completed using 15 sequenced multipartite genomes within 4 genera. The number of strains used in each species is specified in parentheses.; (b) Percent nucleotide identity of CI and CII within the total conserved regions (Local Collinear Blocks, LCBs) among the 11 pathogenic species (left) and the 5 non-pathogenic species (right). The whole genome alignment for the pathogenic strains was completed using 35 sequenced multipartite genomes within 4 genera of α-, β-, and γ-subgroups of Proteobacteria. Also, the whole genome alignment for the pathogenic strains was completed using 15 sequenced multipartite genomes within 4 genera. The number of strains used in each species is specified in parentheses.  Figure 2 exhibits that of the 16 whole genome alignments, 11 CI alignments show higher % conservation than CII. To be more specific, 4 out of 5 species in α-subgroup have higher % conservation in CI than in CII; 4 out of 7 species in β-subgroup have higher % conservation in CI than in CII, and 3 out of 4 species in γ-subgroup have higher % conservation on CI than CII. A paired t-test was performed on all the 16 alignments, and there was a significant difference (P-Value = 0.037) in the total conserved regions on CIs compared to those on CIIs.

Sequence Conservation in Nucleotide Identity between Pathogenic and Non-Pathogenic Life-Styles
The impact of lifestyle on sequence conservation was also examined by classifying the species as pathogens and non-pathogens. The 11 pathogenic strains in Figure 2(a) show that sequence conservation between CIs and CIIs was not significantly different (Paired t-test, P-Value = 0.690). However, among the five non-pathogenic strains also shown in Figure 2(a), sequence conservation for CIs were significantly higher than that for CIIs (Paired t-test, P-Value = 0.016).
These results indicate that CIIs in non-pathogenic strains within α-, β-, and γ-subgroups in Proteobacteria have diverged more rapidly than CIs.
All the 16 whole genome alignments also revealed that the percent nucleotide identity of the conserved regions in CII was significantly lower than that of the CI conserved regions (Paired t-test, P-Value = 0.036), as shown in Figure 2(b).
The percent nucleotide identity between pathogenic and non-pathogenic species was shown in Figure 2(b). In the pathogenic bacterial strains, percent nucleotide identity for CIs is not significantly different from that for CIIs (Paired t-test, P-Value = 0.437). However, percent nucleotide identity for CIs is significantly higher than that for CIIs in the non-pathogenic bacteria (Paired t-test, P-Value = 0.005).

Horizontal Gene Transfer Estimates for CIs and CIIs
Island Viewer estimates of HGT among the 83 strains within the α-, β-, and γ-subgroups of Proteobacteria are listed in Supplemental Table S3. There is no significant difference in HGT estimates between CIs and CIIs (Paired t-test, P-Value = 0.790). In each subgroup, the following number of strains had higher HGT estimates on CI compared to CII: 17 out of 27 strains in α-subgroup, 19 out of 35 strains in β-subgroup, and 12 out of 21 strains in γ-subgroup. The result of paired t-test indicated no significant difference in the HGT levels between CIs and CIIs within every subgroup as follows: α-subgroup (P-Value = 0.707), β-subgroup (P-Value = 0.793), and γ-subgroup (P-Value = 0.857). For both αand β-subgroups, the average HGT estimates are slightly higher in CIs than in CIIs. However, in γ-subgroup, the average HGT estimate for CIs is slightly lower than that for CIIs, but not significantly different. Furthermore, HGT estimates for the 83 strains were compared between the two subcategories: pathogens and The HGT estimates for CI and CII of R. sphaeroides were obtained from a recently proposed integrated segmentation and clustering method, GEMINI and they are schematically represented in Figure 3. Furthermore, the different HGT estimates of α-, β-, and γ-subgroups of Proteobacteria are shown in Figure 4. A total of 48 out of 76 strains had higher HGT estimates (% nucleotides labeled "alien") on CIs than those on CIIs as follows: α-subgroup (   ("Alien") DNAs within the chromosomes of α-, β-, and γ-subgroup Proteobacteria, deciphered using an integrative method of segmentation and clustering. Both segmentation and clustering were performed within a statistical hypothesis framework with significance levels for segmentation and two-stage clustering set at 10 −5 , 10 −6 and 10 −7 , respectively. Both segmentation and clustering procedures employed Markovian Jensen-Shannon divergence measure for quantifying difference between DNA sequences. Homogeneous Markov models of order 2 were used for analysis.

Selective Constraints of CI and CII
Orthologs present on CI and CII within the 50 strains were identified and selective constraint analysis of CIs and CII of the 16 species was performed. The number of orthologs between each species, their multi-way comparisons, and their average Ka, Ks, and ω values are presented in Supplemental Table S4. In all the 16 species, the majority of the orthologs on CI and CII are under negative selection as shown in Figure 5; however, the average selective constraint value for CI (ω = 0.124) was lower than that for CII (ω = 0.196). T-test indicated a significant difference between CI and CII selective pressures among the 16 species (P-Value = 0.038).

Differential Evolution of CI and CII in Prokaryotes
Mauve analysis revealed that sequence conservation and DNA sequence diver-  In contrast, there is a significant difference in genetic divergences between CI and CII in non-pathogenic bacteria. The rapid evolution of CII in the majority of bacterial species, especially in non-pathogenic strains, is possibly due to relaxed selection on CII compared to CI and/or more frequent HGTs on CII than on CI.

Similar HGT Estimates in CI and CII
HGT plays an important role in the evolution of bacterial genomes [21], metabolic functions [22], and re-patterning of the regulation mechanisms [23]. Proteobacteria. Even when the genomes were examined separately by pathogenic and non-pathogenic group, the results still indicated no difference in HGT levels between CI and CII. The role of HGT in the initial phase of origin and development of the accessory chromosome may be important; however, it may not contribute much through remaining evolutionary processes.
Since HGT is a continuing process, the amount of HGTs present in replicon is possibly indicative of the relative evolutionary history. The high HGT estimates in certain CIIs could suggest a recent origin of the accessory chromosome, while similar HGT estimates on CI and CII may suggest ancient origin of CII. In the case of R. sphaeroides, CI and CII have similar HGT estimates. However, three out of the four strains have slightly higher HGT estimates on CII. The oldest strain ATCC 17025 has higher HGT estimates on CI, which supports the theory of the ancient origin of CII [10] [11], while the other three strains that emerged later display higher HGT estimates on CII than on CI. The older CIIs may have lost a substantial fraction of earlier acquired alien DNAs. It is also possible that the old alien genes may have ameliorated their composition to that of the native genome and therefore may not be detected by Island Viewer. On the other hand, foreign genes in recently originated CIIs might not have substantially ameliorated yet and therefore they may be identified by Island Viewer. Furthermore, among all the three subgroups, α-subgroup displayed slightly higher HGT estimate for CIthan CII.

CI and CII Are under Purifying Selection
Selective constraint (ω) is a measure of the ratio of the synonymous substitution rate (K s ) and nonsynonymous substitution rate (K a ). It signifies the mode of selective pressures [24]: negative, neutral, or positive selection. The majority of orthologous gene pairs located on both chromosomes have selective constraint (ω) values < 0.3, which demonstrates that regardless of their locations, these orthologs were maintained under purifying selection. Although both chromosomes are under purifying selection, selective constraints on CII are more relaxed than that on CI. The differential selective pressure on multipartite genomes supports the rapid divergence of CII. As genes on the accessory chromosomes experience weaker negative selection because they are possibly semi-essential, not always expressed, redundantly connected within gene network, and/or more susceptible to mutation [25]; thus, they diverge more rapidly than their corresponding primary chromosomes. Previous studies have demonstrated that the synonymous codon usage orderliness (SCUO) was significantly less on CII than on CI [26]. This codon usage bias is reflective of the decreased purifying selection on CII. Additionally, codon usage bias varied greatly among genomes and was highly dependent on their G + C composition. Furthermore, as the reduced SCUO is an inherent attribute of genes on CII, they experience reduced selection for translation efficiency because of either their reduced expression or greater protein dispensability [27]. CIIs of most multipartite ge- nomes harbor more nonessential genes including genes of hypothetical protein and unknown functions. This suggests that the nonessential genes possibly evolve faster and therefore they are much more suited for biodiversity including in the emergence of new strains and species.

Evolutionary Mechanisms for the Origin of CIIs
Two different models have been previously invoked to explain the origin of CIIs in bacteria [10]. In the first model, when a native or a newly captured plasmid from another species secures some essential genes via intra-or inter-genomic gene transfer from CI, this plasmid becomes non-dispensable for bacterial survival an growth under all growth conditions. In this scenario, the presence of a plasmid type of replication-origin and some essential genes for the cell survival

Evolutionary Role of CIIs
The multipartite genome structure in bacteria has three advantages, which include shortening the genome replication time, specializing the CII for a specific or pathogenic environment, and evolving new metabolic functions for generating species diversity. The first advantage is to reduce the duration of the genome replication and cell doubling or cell cycle, when a large primary chromosome splits into two chromosomes; therefore, the multipartite genome structure helps bacteria to cope with the nutrient-rich environments. The second advantage is to differentially regulate the expression of genes on different chromosomes under a specific host or free-living condition. It was validated that in Vibrio cholera, a pathogenic bacterium, CII-specific genes involved in various metabolisms and pathogenicity are highly expressed under pathogenic conditions, while CI-specific genes are expressed at similar levels under both free-living and pathogenic conditions [10]. The current study has shown that the orthologs located on CII compared to the corresponding orthologs present on CI are structurally and functionally less constrained. Therefore, CII-specific genes may evolve more rapidly and thus generate high level of strain or species diversity within a specific group of bacteria.

Conclusion
This study concludes that CII-specific sequences have higher sequence diver- gence than their corresponding CI-specific sequences, and therefore CII evolved rapidly. The results of previous and current studies suggest that both differential selective constraint and pathogenic/non-pathogenic lifestyle seem to be the major driving force for the genetic divergence between the primary and the accessory chromosomes. Although HGTs do not seem to be a major factor for the differential sequence divergences of CIs and CIIs, the role of HGT for genome evolution should not be underestimated based on the limited number of species examined. Future studies focused on transcriptome and proteome analyses will help establish if gene orthologs located on primary and accessory chromosomes are differentially expressed for adaptation in specific host-environments or free-living growth conditions.   Table S4. Average Ka, Ks, and ω values of CI and CII in16 species within α-, β-, γ-subgroup of Proteobacteria. The number of strains used in each comparison is indicated in parentheses.