Evolution of PE35 and PPE68 Gene Families in Mycobacterium: Roles of Horizontal Gene Transfer and Evolutionary Constraints

Mycobacterium is a genus of bacteria with over a hundred non-pathogenic and pathogenic species, best recognized for certain members known to cause diseases such as tuberculosis and leprosy. Two novel protein families important in the pathogenesis of Mycobacterium species are the PE and PPE families. These two protein families affect the antigenic profiles, disturbing host immunity. To better understand the origin and evolution of these gene families and the differences in their composition between pathogenic and non-pathogenic strains, several bioinformatic analyses were conducted both among Mycobacterium and closely related species that contain PE35 and PPE68 gene homologs. The methods included protein homology searches (BLASTP), horizontal gene transfer analysis (IslandViewer), phylogenetic analysis, gene cluster analysis and structural and functional constraints. Results revealed that PE and PPE gene homologs were not only limited to Mycobacterium, but also existed in three other non-mycobacterial genera, Rhodococcus, Tsukamurella and Segniliparus, and were possibly initially acquired from non-mycobacterial microorganisms by multiple horizontal gene transfers. Results also demonstrated that PE and PPE genes were more diverse and more rapidly evolving in pathogenic Mycobacterium as compared with non-pathogenic Mycobacterium and other non-mycobacterial species. These findings possibly shed light on the diverse functions and origins of the PE/PPE proteins among these organisms.

Mycobacterium is a genus best recognized for certain members such as Mycobacterium tuberculosis and Mycobacterium leprae, which cause the diseases tuberculosis [1] and leprosy [2], respectively. Growth rate and pathogenicity are two defining characteristics of the Mycobacterium species. Pathogenic strains generally grow slowly, forming colonies on solid media in several weeks to months, while environmental species are non-pathogenic and grow rapidly within a week [3] [4].
Pathogenicity is defined as the ability of a microorganism to cause disease that is harmful to the host. Furthermore, pathogenic characteristics result from specific interactions between the pathogen and its host, and the components of these characteristics are coded by their genomes. PE and PPE gene families are a surprising discovery in the genome of M. tuberculosis and they represent ~10% of its genome [5] [6]. The genome of M. tuberculosis strain H37Rv is annotated with 99 and 69 PE and PPE proteins, respectively. Members of the PE protein family have conserved ~110 amino acid N-terminal domains with the proline-glutamic acid motif at positions 8-9 [5] [7] [8]. Members of the PPE family have longer ~180 amino acid conserved N-terminal domains with the proline-proline-glutamic acid motif at positions 7-9 [5] [8]. Compared with the conserved N-terminal domains, the C-terminal domains of both PE and PPE vary in sequence and length, often containing repetitive regions.
PE and PPE genes are further classified into subfamilies based on consensus motifs in the C-terminal regions [8]. For example, the polymorphic GC-rich-repetitive sequence (PGRS), which contains repeats of glycine-glycine-alanine or glycine-glycine-asparagine, is the largest PE subfamily of M. tuberculosis [8] [9]. On the other hand, the PPE family has two major subfamilies: SVP (GxxSVPxxW) and major polymorphic tandem repeat (MPTR) [5] [10]. Several genes in PE-PGRS subfamilies serve as cell surface constituents necessary for cell-cell interactions, cellular structures and infectivity of host cells [1] [11]-[15]. These genes encode surface-exposed proteins associated with the cell wall, which provide antigenic diversity. A majority of PE and PPE coding genes are differentially expressed under different experimental growth conditions [6] [16]-[19], which suggests that the differential expression of these genes may contribute to varied antigenic potential in the changing microenvironments within the host. Further, several PE and PPE family members are secreted by the ESX-5 protein secretion system in M. marinum [20]. This is consistent with ESX-5 being related to ESAT-6, and PE/PPE gene members often being located within ESAT-6 clusters in mycobacteria.
Fast growing species possess fewer PE/PPE genes than slow growing species, suggesting that these gene families originated in rapidly growing mycobacteria and were then laterally transferred into, and expanded further in, slow growing mycobacterial species. However, some slow growing species, such as M. ulcerans and M. avium subsp. paratuberculosis, have fewer PE/PPE genes [21]. PE and PPE genes from the ESAT-6 cluster region 1, M. tuberculosis Rv3872 (PE35) and Rv3873 (PPE68) respectively, were considered as the ancestors of PE/PPE families [8]. PE35 and PPE68 were more conserved among MTB strains as compared with other PE and PPE family members, and this idea was compatible with the gene duplication model for ESAT-6 clusters at that time. The authors also concluded, consistent with others, that no PE/PPE gene homologs were present in species outside of the genus Mycobacterium.
The following two hypotheses were investigated in the present study. First, PE and PPE genes were initially acquired by horizontal gene transfers (HGT) from related species and then those genes further spread into multiple copies using gene duplication and/or transposition in the genome. Second, PE/PPE genes in pathogenic mycobacterial species evolved faster than their homologs in the non-pathogenic species, and therefore they would be less evolutionarily constrained and more duplicated than the homologs present in non-pathogenic species. In this study, the identified ancestral PE (PE35) and PPE (PPE68) genes were analyzed among mycobacterial and related species to ascertain their possible evolutionary relationships. A number of bioinformatics approaches (protein homologies, evolutionary constraints, horizontal gene transfer analyses, gene cluster analyses and phylogenetic analyses) were used to ascertain the evolutionary relationship of PE/PPE proteins and the level of selection they experienced.

Identification of PE35/PPE68 Gene Pair Homologs among Species
Since the PE35 and PPE68 gene pair are considered the most ancestral genes of the two respective gene families [8], protein similarity searches were conducted to determine the highest matches for PE and PPE protein-pairs. To help elucidate the origin and evolution of this pair across different organisms, the PE35 and PPE68 protein sequences for each Mycobacterium were compared against the National Center for Biotechnology Information (NCBI) microbial database using BLASTP [22]. Proteins were selected using the lowest e-value homolog found for respective organisms.
The reference PPE68 and PE35 proteins were selected from Mycobacterium tuberculosis CDC1551 (100% identical in H37Rv) and blasted against fully sequenced Mycobacterium genomes found in the NCBI database. For each Mycobacterium, the protein with lowest e-value was selected for further analysis. Subsequently, each of these selected PPE68/PE35 protein homologs was blasted against fully sequenced non-Mycobacterium organisms in the NCBI microbial database. Given the potential rapid divergence of such gene families and the likelihood of ancient relationships between gene homologs, a simple filtering of genes based on a single measure such as e-value or bit-score would likely miss relevant relationships. Rather, all genes which showed homology to Mycobacterium reference sequences were selected for further scrutiny. For these genes, inspection of sequences and similarity regions was performed to ascertain which genes were to be included in subsequent analysis. If multiple genes within a single organism were found with significant homology to the reference strains, then usually the gene with lowest e-value was selected.
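The "lowest e-value per organism" selection step described above can be illustrated with a minimal Python sketch. The function name, the tuple layout, and the example hits are hypothetical and are not taken from the study; the manual inspection of sequences and similarity regions that the text also describes is not modeled here.

```python
def best_hit_per_organism(hits):
    """Keep, for each organism, the hit with the lowest e-value.

    `hits` is an iterable of (organism, protein_id, evalue) tuples,
    a hypothetical layout chosen for this illustration.
    """
    best = {}
    for organism, protein_id, evalue in hits:
        current = best.get(organism)
        if current is None or evalue < current[1]:
            best[organism] = (protein_id, evalue)
    return best


# Hypothetical example hits (not real BLASTP output).
example_hits = [
    ("organism_A", "prot_1", 1e-40),
    ("organism_A", "prot_2", 3e-12),
    ("organism_B", "prot_3", 2e-8),
    ("organism_B", "prot_4", 5e-25),
]

selected = best_hit_per_organism(example_hits)
for organism, (protein_id, evalue) in sorted(selected.items()):
    print(organism, protein_id, evalue)
```

A single-measure filter of this kind is only the starting point; as noted in the text, candidates were additionally inspected by hand before inclusion.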
For the organisms for which a significant match was obtained with a PE/PPE gene, their corresponding 16S rRNA gene sequences were additionally obtained to serve as a comparison group. Upon identification of homologs, full length DNA and protein sequences were obtained through the NCBI database using their respective accession numbers. The 16S rRNA accession numbers can be found listed in Table 1 and the PE35/PPE68 numbers in Table S3.

Functional Constraints Analysis
Protein sequence alignments were carried out using MUSCLE (Multiple Sequence Comparison by Log-Expectation) [23], a program known for its accuracy and speed. For the functional constraints analysis, comparisons were conducted across all Mycobacterium strains whose genomes were sequenced and annotated. More specifically, the functional constraints analyses were performed independently for the PE and PPE genes. The synonymous substitution rates (Ks) and nonsynonymous substitution rates (Ka), along with the nonsynonymous-synonymous substitution rate ratio (ω), were calculated by the modified Yang-Nielsen method [24] [25] using the Ka/Ks calculator [26].
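Once Ka and Ks have been estimated, the ratio ω and its interpretation are simple arithmetic. The sketch below assumes Ka and Ks are already available (it does not implement the modified Yang-Nielsen estimation itself) and uses the cutoffs that appear in the results of this study (ω < 0.3 for strong negative selection, 0.3 to 1 for relaxed or neutral constraint). The function names and the label for ω ≥ 1 are illustrative assumptions.

```python
def omega(ka, ks):
    """Return the nonsynonymous/synonymous rate ratio Ka/Ks."""
    if ks <= 0:
        raise ValueError("Ks must be positive to compute Ka/Ks")
    return ka / ks


def classify_constraint(w, strong_cutoff=0.3, neutral_cutoff=1.0):
    """Label a Ka/Ks ratio using the cutoffs quoted in the text."""
    if w < strong_cutoff:
        return "strong negative selection"
    if w < neutral_cutoff:
        return "relaxed or neutral constraint"
    # Label for w >= 1 is an assumption made for this illustration.
    return "neutral or positive selection"


# Mean ratios reported in the results section of this study.
print(classify_constraint(0.078))  # PPE68, non-pathogenic mean
print(classify_constraint(0.359))  # PE35, pathogenic mean
```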

Phylogenetic Analysis
Geneious 4.6 was used to organize and perform the protein similarity searches, generate alignments, and construct phylogenetic trees [27]. Only organisms with completely sequenced genomes were chosen to avoid poor or incomplete sequence data from shotgun or partial genome sequencing projects.
The 16S rRNA nucleotide sequences as well as PE35 and PPE68 homolog nucleotide sequences for all species were obtained from the NCBI gene database. Phylogenetic analysis was performed using PhyML [28] with the Tamura-Nei (TN) model [29] to generate unrooted, maximum likelihood trees. For all trees, bootstrap values were calculated using 100 replications.

Gene Cluster Analysis
For all relevant organisms, information concerning gene location, direction, and content was obtained from the NCBI database. More specifically, the genes adjacent to the PE35 and PPE68 homologs were analyzed in content, direction, and length to look for similarities and differences in organization and structure. Relative gene maps were then constructed showing the distribution of genes around PE35/PPE68. Some gene maps were subsequently grouped together to provide for a visual comparison of related gene clusters.

PE/PPE Copy Number
The N-terminal domains of PE35 (conserved N-terminal domain, residues 5-162) and PPE68 (conserved N-terminal domain, residues 5-163) as identified by Pfam [30] were compared to their respective genomes using BLASTP. All proteins with hits meeting an e-value cutoff of 0.01 were then further classified as PPE or PE.
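The copy-number count can be sketched as a filter followed by a tally. This minimal Python example assumes that "meeting the cutoff" means an e-value at or below 0.01, and it uses a hypothetical list of (protein_id, family, evalue) tuples; neither the data layout nor the protein names come from the study.

```python
E_VALUE_CUTOFF = 0.01  # cutoff quoted in the text


def count_copies(hits, cutoff=E_VALUE_CUTOFF):
    """Count distinct proteins per family with e-value <= cutoff.

    `hits` holds (protein_id, family, evalue) tuples; this layout
    is an assumption made for the illustration.
    """
    per_family = {}
    for protein_id, family, evalue in hits:
        if evalue <= cutoff:
            per_family.setdefault(family, set()).add(protein_id)
    return {family: len(ids) for family, ids in per_family.items()}


# Hypothetical hits for one genome (not real data).
genome_hits = [
    ("prot_1", "PE", 1e-30),
    ("prot_2", "PE", 0.5),     # fails the cutoff
    ("prot_3", "PPE", 2e-15),
    ("prot_4", "PPE", 0.009),
    ("prot_3", "PPE", 1e-10),  # same protein hit twice, counted once
]

copy_numbers = count_copies(genome_hits)
print(copy_numbers)
```

Using a set per family keeps a protein from being counted twice when it matches the query domain in more than one region.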

Genome Characteristics and Life Styles of Mycobacterial Species
The detailed genome and lifestyle characteristics of 22 Mycobacterium and 8 other related species are shown in Table 1. Mycobacterium species exhibit varying genome sizes, ranging from ~3.3 Mbp in M. leprae to ~7.0 Mbp in M. smegmatis. The percentage GC content varies from 57.8% in M. leprae to 69.3% in M. bovis. Genomes of most mycobacterial species have ~90% coding capabilities, with the exception of M. leprae and M. ulcerans, which have 49% and 72% coding capabilities, respectively.

PE35 and PPE68 Protein Families in Mycobacterium
Pairwise amino acid identities between PPE68 and PE35 homologs across the genomes of mycobacteria and the related species are listed in Table S1 and Table S2, respectively. The protein homology searches indicate that PE and PPE proteins are not limited to the genus Mycobacterium. More specifically, organisms within the genera Rhodococcus, Segniliparus, and Tsukamurella were found to contain genes with significant homology (>50%) to the ancestral PE35 and PPE68 genes found in Mycobacterium. Consistent with this, the genes from the non-mycobacterial species are annotated as PE/PPE family members in their respective database entries (Table S3), except for the PE68 homolog entry in R. opacus, which is listed as a hypothetical protein but does include the PE domain.
The blast of the conserved N-terminal domains, for both PE35 and PPE68, against their own respective genomes shows that pathogenic Mycobacterium have high copy numbers of PPE68 homologs (≥30 gene copies) as compared to non-pathogenic Mycobacterium (<10 PPE gene copies), with the exception of M. leprae, whose genome contains only 4 gene copies of PPE (Table 2). The numbers of PE homologs in mycobacteria were lower than the numbers of PPE copies in their respective genomes, except in three non-pathogenic species, including M. smegmatis. In addition, pathogenic mycobacterial species contain more copies of PE as compared to the PE gene copies in non-pathogenic mycobacterial species. Pathogenic Mycobacterium contain 8.3 ± 5.4 PE genes and 51.2 ± 30.5 PPE genes, while non-pathogenic contain 3.1 ± 1.9 PE and 2.6 ± 0.8 PPE genes. Thus pathogenic species have experienced a strong expansion of the PPE family. Combined PE/PPE copy numbers in the other related species did not exceed seven. Three of the eight species did not have any noted PE genes: Rhodococcus equi ATCC 33707, Rhodococcus jostii RHA1, and Tsukamurella paurometabola DSM 20162.
The GC compositions of PPE68 and PE35 were relatively similar to their genome GC compositions in mycobacteria. However, within the related species, three notable differences between PPE GC content and genome GC content were observed. The PPE GC content of Rhodococcus equi 103S is 76.4% while its genome GC content is 68%. Similarly, Rhodococcus jostii RHA1 and Rhodococcus opacus B4 have PPE GC contents of 71.7% and 73.7%, respectively, while their genome GC compositions are both 67%, suggesting recent HGT.
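The comparison above reduces to computing a GC percentage for a gene and subtracting the genome-wide value. The following minimal Python sketch shows both steps; the `gc_percent` helper is illustrative, and the percentages in `reported` are the values quoted in the text rather than values recomputed from sequence.

```python
def gc_percent(sequence):
    """Return GC content (%) of a nucleotide sequence, ignoring non-ACGT symbols."""
    sequence = sequence.upper()
    gc = sum(1 for base in sequence if base in "GC")
    total = sum(1 for base in sequence if base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * gc / total


# (PPE homolog GC %, genome GC %) as quoted in the text.
reported = {
    "Rhodococcus equi 103S": (76.4, 68.0),
    "Rhodococcus jostii RHA1": (71.7, 67.0),
    "Rhodococcus opacus B4": (73.7, 67.0),
}

# Excess GC content of the gene over its genome, in percentage points.
differences = {
    name: round(gene_gc - genome_gc, 1)
    for name, (gene_gc, genome_gc) in reported.items()
}
print(differences)
```

On these reported values, the gene-to-genome differences range from roughly 5 to 8 percentage points.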
None of the PE/PPE genes were found in predicted HGT regions except for the PE35 gene homologs of M. smegmatis and M. avium subsp. paratuberculosis. Data were obtained using the program IslandViewer and its datasets [31]. The PE35 and PPE68 genes in all other species were not located in HGT regions identified by IslandViewer.

Phylogenetic Analysis of Mycobacterium and Other Related Species
Phylogenetic relationships based on 16S rRNA gene sequences are a standard tool to reflect evolutionary histories of species. As such, the phylogenetic tree shown in Figure 1 revealed that slow growing pathogenic species and rapid growing environmental species of Mycobacterium form two distinct evolutionary groups, as found in numerous other studies. Also, organisms within the genera Rhodococcus, Segniliparus, and Tsukamurella, which contain genes with significant homology to PE/PPE genes found in Mycobacterium, are found together as an "out-group" at the base of the tree and are distantly related to mycobacterial species. The closest relative to mycobacteria was Tsukamurella paurometabola and the most distant was Segniliparus rotundus. The Rhodococcus genus is grouped between the Tsukamurella and Segniliparus genera.
The phylogenetic trees based on PPE68 and PE35 homolog gene sequences are shown in Figure 2 and Figure 3, respectively, and these two gene trees do not completely parallel the 16S ribosomal tree shown in Figure 1, or each other. The PPE68 tree, as shown in Figure 2, reveals that genes in the MTB complex, Mycobacterium leprae, and Mycobacterium marinum are closely related to Tsukamurella paurometabola, while non-pathogenic and several other pathogenic species, such as Mycobacterium avium, Mycobacterium bovis, Mycobacterium ulcerans, and Mycobacterium abscessus, are more closely related to Segniliparus rotundus and Segniliparus rugosus as compared to their counterpart mycobacterial species, as reflected in Figure 1. On the other hand, the PE35 tree, as shown in Figure 3, reveals that the PE35 gene homologs of several pathogenic mycobacterial species are closely related to Rhodococcus erythropolis, non-pathogenic, environmental mycobacterial species are clustered together, and evolutionary groups, both pathogenic and non-pathogenic, are related to Rhodococcus opacus. It is interesting to note that the pathogenic species Mycobacterium avium and Mycobacterium abscessus are clustered with two non-mycobacterial species, Segniliparus rotundus and Segniliparus rugosus. These results strongly suggest that both PE35 and PPE68 genes have been acquired among these species by HGT.
Table footnote. Bold: indicative of recent horizontal gene transfer of the PE35 homolog as ascertained by IslandViewer. Genes were found by using the conserved domains (residues listed in text) of the PE35 and PPE68 genes from MTB CDC1551, which were compared to the respective genome sequences, and included with a minimum e-value of 0.01. These numbers are not the same as the current genome annotations. GC content is from the closest homolog to PPE68 or PE35 in that respective genome.

Functional Constraints Analysis
For the functional constraints analysis, pairwise comparisons were conducted between each PE35 and PPE68 best homolog of the 22 mycobacterial strains. The relationship between Ka and Ks of PPE68 and PE35 homologs is shown in Figure 4. The results revealed that the PE35 and PPE68 genes from non-pathogenic Mycobacterium strains are under strong negative selection (ω < 0.3), while in many of the pathogenic mycobacterial species these two genes have evolved under relaxed or neutral constraints (0.3 < ω < 1). The Ka/Ks (mean ± standard deviation) of PPE68 homologs in pathogenic Mycobacterium is 0.291 ± 0.092 while in non-pathogenic is 0.078 ± 0.036. The Ka/Ks of PE35 homologs in pathogenic species is 0.359 ± 0.165 while in non-pathogenic is 0.115 ± 0.041. The constraint analyses also revealed that the PPE family overall is experiencing more intense negative selection as compared to the PE family (Welch's t-test, two-tailed p-value = 0.0162). Furthermore, there is a greater degree of variation and lesser constraint among pathogenic strains as compared to non-pathogenic strains.
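For readers unfamiliar with Welch's t-test, the statistic and its Welch-Satterthwaite degrees of freedom can be computed from two samples as in the minimal sketch below. The two lists of ω values are hypothetical placeholders, since the per-gene values are not listed in the text, so the output does not reproduce the reported p-value.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of each sample mean.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Hypothetical Ka/Ks values for two gene families (not the study's data).
ppe_omega = [0.08, 0.12, 0.25, 0.31, 0.19]
pe_omega = [0.11, 0.35, 0.52, 0.40, 0.28]

t_stat, dof = welch_t(ppe_omega, pe_omega)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

Turning t and df into a two-tailed p-value requires the Student t distribution, which the Python standard library does not provide; `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the whole test in one call if SciPy is available.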

Gene Mapping Analysis of PE35 and PPE68 Genes among Mycobacterium and Related Species
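The organization of genes around PE35 and PPE68 homologs for the Mycobacterium and their related species is shown in Figure 5(a) and Figure 5(b). As revealed in the upper section of Figure 5(a), the PE35/PPE68 gene cluster homologs in the pathogenic MTB complex members (M. tuberculosis, M. bovis, and M. marinum) and Tsukamurella paurometabola are all flanked by transmembrane protein genes located upstream from PE35 and EsxB and EsaT-6 (esxA) protein genes located directly downstream from PPE68. The only exceptions are the BCG strains, which have extensive genomic rearrangements due to growth in laboratory culture. Furthermore, among these pathogenic mycobacterial genomes only one of the three M. bovis strains, M. bovis AF2122/97, is present while, from Figure 5(b), the other two, M. bovis BCG str. Tokyo and M. bovis BCG str. Pasteur, show significantly different organization, containing genes that encode transposase enzymes involved in the movement of a DNA fragment from one site in the genome to another, and PE35 and PPE68 homologs not located adjacent to each other. It is unclear at this time if this genomic rearrangement is related to the attenuation of these strains.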
In the second section of Figure 5(a), a majority of the non-pathogenic Mycobacterium also have PPE68/PE35 gene homologs flanked by identical sets of genes. More specifically, downstream from PE35 there are two cell division proteins while upstream from PPE68 they tend to have antigen and EsaT6 proteins. In the third section of Figure 5(a), Rhodococcus equi 103S is shown to have significant similarity to the pathogenic Mycobacterium in the first section, with the notable difference of a phosphoglycerate mutase located upstream from PPE68 instead of EsaT6.

Genomic Characteristics
The data revealed that the pathogenic strains tend to have smaller genome sizes, with the exception of M. marinum and M. ulcerans Agy99. Smaller genome size has been attributed to a narrow host range of the pathogen [32]. However, the above two strains are outliers which can be most likely ascribed to more recent acquisition of virulence and thus insufficient time to delete those genes which are necessary only to the free-living lifestyle [32] [33]. It should be noted that two pathogenic strains in particular, Mycobacterium leprae Br4923 and Mycobacterium leprae TN, have the smallest genome sizes among all pathogenic species. Moreover, M. leprae utilizes the smallest gene set among these organisms, as its genome codes for only ~1600 proteins (49% of genome), which reveals an extreme case of reductive evolution and massive degeneration resulting from the obligate intracellular pathogenic lifestyle. Despite causing a chronic infection in humans, M. leprae has lost almost all of the PE/PPE genes, thus these have little role in pathogenesis of that organism.

PE35 and PPE68 Gene Homologs in Mycobacteria and Their Related Species
Organisms within the genera Rhodococcus, Segniliparus, and Tsukamurella were found to contain genes with significant homology to several PE/PPE genes found in mycobacteria. It is interesting to note that PPE68 gene homologs are more diverse and found in all these species while M. leprae lacks any significant PE35 gene homolog, consistent with previous findings [8]. It should be noted that five PE genes are annotated in the M. leprae genome, based on homology of partial blocks of sequence.
The results showed that pathogenic mycobacteria had higher copy numbers of PPE genes as compared to non-pathogenic mycobacteria, again with the exclusion of M. leprae. This may be indicative of the leprae species splitting off before amplification of these genes occurred in the sister lineage, or of gene decay in the existing genome. Only five PPE genes are annotated in the M. leprae genome. Furthermore, the low PPE copy number (<11) for all the non-pathogenic mycobacteria indicates that these genes are specifically needed for virulence and host infectivity and may not be needed in the environment, consistent with previous literature [8]. Aquatic environmental mycobacteria may have pathogenic phases in their life cycles, interacting with protozoa [34].
The low copy numbers of PE, as compared to PPE, indicate that PPE is more diverged than PE. The related species also have PE/PPE genes, but those that are pathogenic have low copy numbers for both PE and PPE gene families. This may be indicative of three possible scenarios: 1) PE/PPE genes have less important roles in virulence and host infectivity, 2) this HGT is more recent and thus the gene families in these related species have not had sufficient time to expand, or 3) these related pathogenic species cause acute infections that do not need to evade the host and thus have little need for PPE and PE antigen variation. Yet Rhodococcus can cause persistent pneumonia and other infections in horses. Upon examination of the respective trees it is evident that since these species have not diverged more recently, the most likely suggestion is that PPE and PE gene families are not prominent in virulence for these species. This is a fertile area for experimental investigation.

Horizontal Transfer of PE35 and PPE68 Genes among Mycobacterium and Related Species
Only PE35 gene homologs in M. smegmatis and M. avium subsp. paratuberculosis were found in predicted horizontal transfer regions. However, this is likely because the IslandViewer program only takes recent HGTs into account and cannot account for ancestral events. This is important to note as older HGTs would likely be homogenized with the surrounding genome and would be substantially more difficult if not impossible to find using conventional bioinformatics techniques. Consistently, GC content in the PE/PPE genes and surrounding genomes is similar in most cases.
As compared to the 16S rRNA tree, both the PPE and PE phylogenetic trees show significant differences. This is possibly due to a variety of scenarios. It can be indicative of HGT taking place in the form of three different scenarios: from Mycobacterium to Mycobacterium, from Mycobacterium to its related species, and from related species to Mycobacterium. On the other hand, or in conjunction, different rates of divergence of the genes could result in alternative trees. With inductive reasoning it is possible to posit transfer events based on the ordering of phylogenies and the placement of tree nodes. More specifically, the rearrangement of PE/PPE nodes on a tree can be suggestive of gene transfer events in comparison to tree structures expected from 16S comparisons. In the PPE68 tree, the clades containing M. gilvum, M. sp. Spyr1, M. vanbaalenii and M. smegmatis show slight reordering as compared to the 16S rRNA reference tree. However, although ordering may be slightly different, they are still clustered together very well, the cluster is within the expected tree segment, and they have not diverged very significantly, so it is unlikely that any ascertainable transfer events occurred, as this is more easily explainable by simple genetic drift. However, there are three notable HGTs that may have taken place, as inferred using the aforementioned methodology: two between a Mycobacterium and a Mycobacterium-related species (depicted by asterisks in Figure 2). Tsukamurella paurometabola DSM 20162 moved up the PPE tree from its original 16S rRNA placement, indicative of a transfer by a common ancestor of several pathogenic mycobacterial species to this organism. Similarly, the location of Segniliparus is incongruent with its expected tree placement. As such, it can be suggested that it experienced an HGT event from a Mycobacterium ancestor in the nearby clade and thereby obtained its PPE68 homolog. Furthermore, although the differences within these trees signify that significant HGT of PPE68 and PE35 has taken place, since the mechanisms and host/environment interactions of the eight Mycobacterium-related species are not fully understood, it is not possible to identify exactly what role the PPE and PE genes hold within these organisms. For non-pathogens, these cell envelope-located proteins may play roles in cell attachment to other cells and surfaces.
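The idea of reading candidate transfer events from tree incongruence can be illustrated by listing the groupings (clades) that occur in a gene tree but not in the reference tree. The sketch below is a deliberately simplified illustration: it treats trees as rooted nested tuples with hypothetical taxon labels, whereas the trees in this study are unrooted maximum likelihood trees, and it ignores branch lengths and bootstrap support.

```python
def clades(tree):
    """Return all clades of a nested-tuple tree as frozensets of leaf names."""
    found = set()

    def collect(node):
        # A leaf is a string; an internal node is a tuple of children.
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(collect(child) for child in node))
        found.add(members)
        return members

    collect(tree)
    return found


def incongruent_clades(gene_tree, reference_tree):
    """Clades present in the gene tree but absent from the reference tree."""
    return clades(gene_tree) - clades(reference_tree)


# Hypothetical four-taxon trees (labels are placeholders, not real species).
reference_tree = ((("A", "B"), "C"), "D")
gene_tree = ((("A", "D"), "B"), "C")

for clade in sorted(incongruent_clades(gene_tree, reference_tree), key=len):
    print(sorted(clade))
```

Here taxon D, which sits outside the others in the reference tree, groups with A in the gene tree. An incongruent grouping of this kind is only a candidate signal: as the text notes, unequal divergence rates can produce similar patterns without any transfer.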
For the PE35 tree, M. leprae was not found to have any significant PE35 homolog and hence is missing from the tree. Otherwise, there are three notable differences within the PE35 tree compared to the 16S rRNA tree. First, similar to the PPE68 tree, the clades containing M. gilvum, M. sp. Spyr1, M. vanbaalenii and M. smegmatis show slight reordering as compared to the 16S rRNA tree. Once again, however, although ordering may be slightly different, they are still clustered together very well and are not diverged very significantly, so it is unlikely that these represent transfer events either. The other two differences are more notable and may represent HGTs that have taken place, specifically between Mycobacterium and Mycobacterium-related species (depicted by asterisks in Figure 3). The first involves Segniliparus obtaining a PE35 homolog. It is unlikely that it acquired its PE35 homolog through convergent evolution because of the close relationship established in the PPE tree. As such, an HGT event, likely involving the PE/PPE region, occurred whereby the PE35 homolog from either M. avium or a precursor of M. abscessus was transferred to it. The second event may be similar, in which Rhodococcus opacus B4 or its precursor potentially obtained its PE35 gene homolog from a Mycobacterium ancestor in the nearby clade. However, its gene location is not significantly incongruent and may be the result of convergent evolution or retained homology from a common ancestor to the Mycobacterium in the associated clade.
To note, within both the PE35 and PPE68 trees, M. avium is far removed from its location within the 16S rRNA tree and is in a place that it should not be if divergence patterns and rates were fairly consistent across the Mycobacterium organisms. As such, this may be indicative of an HGT event of PPE from another Mycobacterium into the M. avium group. Alternatively, since the M. avium complex members are opportunistic pathogens with a dominantly environmental lifestyle, mutations may have accumulated in these genes in the absence of host selection.
On the other hand, the PE35 and PPE68 trees fairly clearly suggest that the genes originated from a common ancestor of the Mycobacterium and Mycobacterium-related species or that an ancient gene transfer event occurred among the ancestors of these two groups of organisms. Subsequently, the genes diverged and specialized their functions in the Mycobacterium species. For instance, the PE and PPE homolog genes are not very diverged in the Rhodococcus organisms in comparison to the Mycobacterium, and of the five Rhodococcus that contain a PPE homolog, one does not contain a PE homolog, suggesting that it may have been deleted. To reiterate then, these genes may not play as integral a role in pathogenicity in these organisms. Conversely, the rapid divergence of the PE and PPE genes in Mycobacterium may lend credence to its success as a chronic pathogen.

Gene Mapping Analysis
As revealed in Figure 5(a), PPE68 gene homologs are flanked by identical sets of genes in the majority of pathogenic mycobacterial genomes and Tsukamurella paurometabola. However, M. bovis BCG str. Tokyo and M. bovis BCG str. Pasteur show significantly different organization within that region, which also contains genes that encode transposases. This is not surprising as the BCG strains have been cultured in vitro throughout the years and have large deleted and rearranged regions in their genomes [35].
Another point to note is that Tsukamurella paurometabola DSM 20162 has a similar upstream and downstream pattern to the other pathogenic Mycobacterium species in that section. This gives credence to the HGT hypothesis established earlier, in which it was stated that Tsukamurella obtained its PPE68 gene homolog from several pathogenic Mycobacterium species. However, given the hypothetical nature of the Tsukamurella proteins in Figure 5(a), further analysis was conducted. More specifically, the two proteins downstream to PE35 and the two proteins upstream to PPE68 in Tsukamurella were compared via MUSCLE pairwise alignment [23] to their corresponding proteins in M. tuberculosis F11. The two proteins downstream to PE35 in Tsukamurella, Tpau_0324 and Tpau_0325 (labeled "Pseudo"), have sequence lengths of 1,551 bp and 3,546 bp, respectively. Their corresponding proteins in M. tuberculosis F11, TBFG_13905 (labeled "Transmembrane protein") and TBFG_13906, have sequence lengths of 2,244 bp and 1,776 bp. Upon comparison it was found that the comparisons of Tpau_0324 to TBFG_13906 and Tpau_0325 to TBFG_13905 showed greater pairwise identities (45%, 37.5%) than the comparisons to their corresponding genes, specifically Tpau_0324 to TBFG_13905 and Tpau_0325 to TBFG_13906 (39.5%, 31.3%). Also, taking into account that the sequence lengths are more similar across these "crossed" pairs (1,551 and 1,776; 2,244 and 3,546), it may be inferred that these two genes in Tsukamurella were rearranged. For the two genes upstream to PPE68 in Tsukamurella (Tpau_0328 and Tpau_0329), the same information was gathered. The pairwise identities are as follows: Tpau_0328 to TBFG_13909 48%, Tpau_0328 to TBFG_13910 44.3%, Tpau_0329 to TBFG_13909 47.9% and Tpau_0329 to TBFG_13910 43%. In this case, there were no significant differences in pairwise identity between the sets and no significant differences in sequence lengths (all four have lengths of ~300 bp), and hence it may be suggested that these genes (Tpau_0328 and Tpau_0329 in Tsukamurella paurometabola DSM 20162) did not rearrange but rather stayed in a similar gene pattern. Furthermore, these may represent duplicate gene pairs. Lastly, in the upper section of Figure 5(a), it is notable that MTB complex members show very similar patterns. This is consistent with low genetic variation seen in Mycobacterium tuberculosis [36]. More specifically, given that PE35 and PPE68 are involved in pathogenicity, it is not surprising that the surrounding regions in the obligate pathogen Mycobacterium tuberculosis are highly conserved, as they are likely integral for continued organism success. Furthermore, it has been proposed that this low level of genetic variation suggests that the entire population resulted from clonal expansion following an evolutionary bottleneck around 35,000 years ago.
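The "direct versus crossed" comparison of flanking genes can be restated as a small calculation on the pairwise identities quoted above. In this minimal Python sketch the identities are the percentages reported in the text, while the function name and the 5-percentage-point margin used to call a difference are assumptions made for illustration; the study itself does not state a numeric threshold.

```python
def compare_pairings(identity, genes_a, genes_b, margin=5.0):
    """Compare summed identity of direct (same-order) and crossed gene pairings.

    `margin` (percentage points) is a hypothetical threshold, not from the study.
    """
    a1, a2 = genes_a
    b1, b2 = genes_b
    direct = identity[(a1, b1)] + identity[(a2, b2)]
    crossed = identity[(a1, b2)] + identity[(a2, b1)]
    if crossed - direct > margin:
        return "crossed"
    if direct - crossed > margin:
        return "direct"
    return "no clear difference"


# Pairwise identities (%) as quoted in the text.
downstream = {
    ("Tpau_0324", "TBFG_13905"): 39.5,
    ("Tpau_0324", "TBFG_13906"): 45.0,
    ("Tpau_0325", "TBFG_13905"): 37.5,
    ("Tpau_0325", "TBFG_13906"): 31.3,
}
upstream = {
    ("Tpau_0328", "TBFG_13909"): 48.0,
    ("Tpau_0328", "TBFG_13910"): 44.3,
    ("Tpau_0329", "TBFG_13909"): 47.9,
    ("Tpau_0329", "TBFG_13910"): 43.0,
}

downstream_call = compare_pairings(
    downstream, ("Tpau_0324", "Tpau_0325"), ("TBFG_13905", "TBFG_13906")
)
upstream_call = compare_pairings(
    upstream, ("Tpau_0328", "Tpau_0329"), ("TBFG_13909", "TBFG_13910")
)
print(downstream_call, "|", upstream_call)
```

With these numbers the downstream pair favors the crossed arrangement by about 12 percentage points in total, whereas the upstream pair differs by only about 1 point, which matches the qualitative reading given in the text.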

Diversification of PE and PPE Protein Families in Pathogenic Mycobacteria Is Due to Less Evolutionary Constraint
PE35 and PPE68 genes of non-pathogenic and non-mycobacterial species are under strong negative selection (ω < 0.3) while pathogenic mycobacterial species are under relaxed or neutral constraint (0.3 < ω < 1). Also, there is a greater degree of variation and lesser constraint among pathogenic strains as compared to non-pathogenic strains. These results suggest, as expected, that the antigenic variation function of PE/PPE genes and pressure from the host immune system have resulted in amplification and divergence of these genes in MTB and other pathogens. A significant question that remains is the function of PE/PPE genes outside the host, if any.

Conclusions
In this study, we have shown that significant homologs to the ancestral PE/PPE genes exist outside the mycobacterial lineage. Mycobacterium and their related species have acquired PE and PPE genes through horizontal gene transfers from each other.

Figure 1. The phylogenetic relationship of 16S rRNA genes among 22 Mycobacterium and 8 Mycobacterium-related species. Maximum likelihood trees were developed using the Tamura-Nei (TN) model [29]. The scale bar at the bottom of the tree indicates the number of substitutions per site, and the numbers on the tree branches are bootstrap values.

Figure 2. The phylogenetic relationship of PPE68 gene homologs among 22 Mycobacterium and 8 Mycobacterium-related species. Maximum likelihood trees were developed using the Tamura-Nei (TN) model [29]. The scale bar at the bottom of the tree indicates the number of substitutions per site, and the numbers on the tree branches are bootstrap values.

Figure 3. The phylogenetic relationship of PE35 gene homologs among 20 Mycobacterium and 6 Mycobacterium-related species. Maximum likelihood trees were developed using the Tamura-Nei (TN) model [29]. The scale bar at the bottom of the tree indicates the number of substitutions per site, and the numbers on the tree branches are bootstrap values.

Figure 4. Ka-Ks correlations of PPE68 (22 Mycobacterium and 8 Mycobacterium-related species) and PE35 gene homologs (20 Mycobacterium and 6 Mycobacterium-related species). Ka and Ks values were estimated using MYN (Modified Yang-Nielsen algorithm). ω = 0.3, 1 and 3 were used for negative, neutral, and positive selection, respectively.

Figure 5(a) and Figure 5(b). As revealed in the upper section of Figure 5(a), the PE35/PPE68 gene cluster homologs in the pathogenic MTB complex members (M. tuberculosis, M. bovis, and M. marinum) and Tsukamurella paurometabola are all flanked by transmembrane protein genes located upstream from PE35 and by EsxB and EsaT-6 (esxA) protein genes located directly downstream from PPE68. The only exceptions are the BCG strains, which have extensive genomic rearrangements due to growth in laboratory culture. Furthermore, among these pathogenic mycobacterial genomes only one of the three M. bovis strains, M. bovis AF2122/97, is present while, from Figure 5(b), the other two, M. bovis BCG str. Tokyo and M. bovis BCG str.

Figure 5. (a) Gene maps of regions surrounding PPE68/PE35 homologs. In this figure PPE68 and PE35 are located adjacent to each other. Organisms were grouped together based on similarities in their flanking regions. Two genes downstream from PE35 and two genes upstream from PPE68 are shown. (b) Gene maps of regions surrounding PPE68/PE35 homologs. PPE68 and PE35 are not located adjacent to each other and have been illustrated in their respective regions. Two genes upstream and downstream from both PE35 and PPE68 are shown.

Figure 5(b) contains all the Mycobacterium and their related species from which gene cluster patterns could not be distinguished. Also, all of the species in Figure 5(b), with the exception of M. leprae Br4923 and M. leprae TN (for which no PE35 homologs were found), have PE35 and PPE68 homologs that are not located directly adjacent to each other.

Table 1. Genomic characteristics of Mycobacterium and their related species.
N/A: not applicable; the accession number is for the respective complete genome sequence.

Table 2. Sequence characteristics of PE35 and PPE68 homologs of Mycobacterium and their related species.

Table S2. Pairwise protein identities between PE35 protein homologs of Mycobacterium and their related species.