Towards a Comprehensive Search of Putative Chitinases Sequences in Environmental Metagenomic Databases

A. S. Romão-Dumaresq et al.

Chitinases catalyze the hydrolysis of chitin, a linear homopolymer of β-(1,4)-linked N-acetylglucosamine. The broad range of applications of chitinolytic enzymes makes their identification and study very promising. Metagenomic approaches offer access to functional genes in uncultured representatives of the microbiota and hold great potential for the discovery of novel enzymes, but tools to explore these data extensively are still scarce. In this study, we develop a chitinase mining pipeline to facilitate the comprehensive search for these enzymes in environmental metagenomic databases and to explore phylogenetic relationships among the retrieved sequences. UniProtKB fungal and bacterial chitinase sequences belonging to glycoside hydrolase (GH) families 18, 19 and 20 were used to generate 15 reference datasets, which were then used to generate high-quality seed alignments with the MAFFT program. Profile Hidden Markov Models (pHMMs) were built from each seed alignment using the hmmbuild program of the HMMER v3.0 package. The best-hit sequences returned by hmmsearch against two environmental metagenomic databases (Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis, CAMERA, and Integrated Microbial Genomes, IMG/M) were retrieved and further analyzed. The NJ trees generated for each chitinase dataset showed some variability in the catalytic domain region of the metagenomic sequences and revealed common sequence patterns among all the trees. Scanning the retrieved metagenomic sequences for chitinase conserved domains/signatures using both the InterPro and the RPS-BLAST tools confirmed the efficacy and sensitivity of our pHMM-based approach in detecting putative chitinase sequences. These analyses provide insight into the potential reservoir of novel molecules in metagenomic databases while supporting the chitinase mining pipeline developed in this work. By using our chitinase mining pipeline, a larger number of previously unannotated metagenomic chitinase sequences can be classified, enabling further studies on these enzymes.


Introduction
Enzymes are catalysts that support the development of environmentally friendly industrial processes. At present, most industrial enzymes of major importance are of microbial origin, so the search for novel catalysts of this kind is a key step towards the development of innovative bioprocesses. Chitinases are enzymes responsible for the hydrolysis of chitin, a linear homopolymer of β-(1,4)-linked N-acetylglucosamine, which is the second most abundant biopolymer in nature. A set of different enzymes is needed to drive the complete hydrolysis of chitin to free N-acetylglucosamine (GlcNAc), involving diverse modes of action known to be synergistic and consecutive [1] [2]. The endochitinases (EC 3.2.1.14) randomly cleave the chitin chain at internal sites, whilst the exochitinases (EC 3.2.1.52) catalyze either the successive removal of sugar units from the non-reducing end or the hydrolysis of terminal non-reducing sugars [3] [4].
The low discovery rate of novel natural products from culturable microorganisms [17], coupled with the fact that only a small portion (estimated at less than 1%) of the microbial community is capable of growing under artificial conditions [18] [19], has brought about the need to explore metagenomic approaches to speed up the finding of new biomolecules potentially useful in biotechnology [20]. To date, a great number of environmental metagenomic studies have been performed, such as the extensive studies on the Sargasso Sea [21] and the Global Ocean Expedition [22] [23]. As a result, a huge amount of sequence data has been generated but has not been entirely explored. Different projects have been implemented to provide an open infrastructure for metagenomic sequence data storage and analysis, such as CAMERA ("Community Cyberinfrastructure for Advanced Microbial Ecology Research & Analysis") [24], MG-RAST ("Metagenomic Rapid Annotation using Subsystem Technology") [25], and IMG/M ("Integrated Microbial Genomes") [26]. The current challenge is to fully exploit the metagenomic sequence information using appropriate data-management and data-analysis methods.
Typical metagenomic analyses rely on similarity searches against databases, followed by annotation of the output. The most frequently used similarity search tool is BLAST [27], but as it requires significant computational capacity for large datasets, faster search tools have been developed, such as PatternHunter [28] [29] and BLAT [30]. However, comprehensive searches for specific genes or gene families require more sensitive tools. Methods are therefore needed to find subtler similarities between sequences and to assign putative structure and functional characterization to new proteins [31]. Pipelines based on Hidden Markov Models (HMMs) [32] are very promising, since an HMM is a statistical representation of a protein family's conservation pattern extracted from a multiple sequence alignment, and this representation has been demonstrated to be very effective in detecting distantly related homologues [33]-[35].
The aim of this work was to develop and validate a data mining strategy based on profile HMMs (pHMMs) in order to broadly explore environmental metagenomic databases for putative chitinase sequences. The results confirmed the efficacy of our pipeline in detecting chitinase sequences and highlighted the power of pHMM-based strategies to identify remote homologues.

Environmental Metagenomic Databases
Two environmental metagenomic databases were selected to test our chitinase mining strategy. The first one was CAMERA v2.0 [36], available at http://camera.calit2.net/, which contains 84 unannotated metagenomic datasets with 135,704,056,943 nucleotide sequences. Six-frame translation of the nucleotide sequences was performed using the EMBOSS Transeq tool available at http://www.ebi.ac.uk/Tools/st/, and a total of 75 Gb of sequences was generated. The second database was IMG/M [26], available at http://img.jgi.doe.gov/cgi-bin/m/main.cgi/, which includes 364 automatically annotated metagenomic datasets containing 119,059,610 amino acid sequences, making a total of 20 Gb. Database sequences were downloaded to a local server by June 2011.
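The six-frame translation step can be illustrated with a minimal Python sketch. It is not the EMBOSS Transeq implementation: it assumes the standard genetic code, marks codons containing ambiguous bases as X, and uses its own frame labels (+1 to +3, -1 to -3), which are hypothetical names and not necessarily Transeq's numbering convention.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption:
# no alternative translation table is needed for the data at hand).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return the six translations keyed by a frame label (hypothetical names)."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frames("ATGGCCTAA").items():
        print(label, protein)
```

Each nucleotide read yields six candidate protein sequences, which is why the translated CAMERA database is several times larger than a single-frame translation would be.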

Construction of Profile HMMs and Search for Putative Chitinase Homologues
First, multiple sequence alignments (seed alignments) were generated for each chitinase reference set using the automatic strategy selection ("--auto") of the MAFFT v6.717b program [37] [38]. Alignments were visualized in Jalview version 2 [39]. The quality of each seed alignment was checked manually and, in a few cases, manual editing was necessary. Profile HMMs (pHMMs) were then built from each seed alignment using the hmmbuild program of the HMMER v3.0 package (http://hmmer.janelia.org/). The 15 pHMMs generated were used to search the two environmental databases, CAMERA and IMG/M, with the hmmsearch program of the same package and an e-value threshold of 1.0E−05.
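The three command-line steps described above can be sketched as follows. This is an illustrative reconstruction, not the authors' script: the file names and the prefix are hypothetical, the manual checking and editing of the alignment is not represented, and depending on the HMMER version the alignment may have to be converted to Stockholm format before hmmbuild. The options used (`--auto` for MAFFT, `-E` and `--tblout` for hmmsearch) are standard options of those programs; the tabular output is a convenience added here and is not mentioned in the text.

```python
import shlex


def build_commands(reference_fasta, database_fasta, prefix, evalue=1e-5):
    """Return (argv, stdout_file) pairs for one chitinase reference set.

    All file names are hypothetical placeholders.
    """
    alignment = f"{prefix}.aln.fasta"   # seed alignment written by MAFFT
    model = f"{prefix}.hmm"             # profile HMM written by hmmbuild
    table = f"{prefix}.tbl"             # per-sequence hit table from hmmsearch
    return [
        (["mafft", "--auto", reference_fasta], alignment),
        (["hmmbuild", model, alignment], None),
        (["hmmsearch", "-E", str(evalue), "--tblout", table,
          model, database_fasta], None),
    ]


def to_shell(argv, stdout_file=None):
    """Render one command as a shell line, redirecting stdout if requested."""
    line = shlex.join(argv)
    if stdout_file is not None:
        line += " > " + shlex.quote(stdout_file)
    return line


if __name__ == "__main__":
    for argv, out in build_commands("gh18_bacterial.fasta", "imgm.faa", "gh18_bacterial"):
        print(to_shell(argv, out))
```

Printing the commands instead of executing them keeps the sketch runnable without MAFFT or HMMER installed; in practice the loop would be repeated for each of the 15 reference sets and both databases.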

Mining Strategy Validation
The resulting sequence database searches (described in detail in Section 2.3) were used to extract the best-hit sequences of each metagenomic dataset, that is, the hits with the lowest e-value among all the sequences of a metagenomic project. Best-hit sequences were retrieved in FASTA format using the fastacmd program of the BLAST package [27] [40] and then scanned for the occurrence of chitinase conserved domains/signatures using both InterPro v4.7 (http://www.ebi.ac.uk/interpro/) and RPS-BLAST v2.2.21, with an e-value threshold of 1.0E−05. InterPro v4.7 combines predictive models and protein signatures from 10 member databases (Gene3D, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY and TIGRFAMs) [41], and RPS-BLAST v2.2.21 integrates seven conserved domain databases (CDD v2.25, Pfam v24.0, SMART v5.1, COG v1.0, KOG, TIGRFAM v9.0 and PRK v5.0). These conserved domain and protein signature databases were downloaded from EBI and NCBI in October 2010. InterPro and RPS-BLAST search results were parsed into spreadsheets using an in-house Ruby script, and the frequency of the different chitinase conserved domains/signatures was calculated.
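The best-hit selection can be illustrated with a short Python sketch. It assumes that the hmmsearch results are available in HMMER3's tabular (`--tblout`) format, where the first column is the target sequence name and the fifth is the full-sequence e-value, and that the metagenomic project can be derived from the sequence identifier. The `project_of` function and the identifiers in the test data are hypothetical; the original analysis may have mapped sequences to projects differently.

```python
def parse_tblout(lines):
    """Yield (target_name, full_sequence_evalue) from hmmsearch --tblout lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        yield fields[0], float(fields[4])


def best_hits(hits, project_of):
    """Keep, for each project, the hit with the lowest e-value.

    `project_of` maps a sequence identifier to its metagenomic project
    (a hypothetical helper supplied by the caller).
    """
    best = {}
    for seq_id, evalue in hits:
        project = project_of(seq_id)
        if project not in best or evalue < best[project][1]:
            best[project] = (seq_id, evalue)
    return best


if __name__ == "__main__":
    # Invented example rows, for illustration only.
    rows = [
        "# target name  accession  query name  accession  E-value  score  bias",
        "projA|seq1 - GH18 - 1e-30 100.0 0.1",
        "projA|seq2 - GH18 - 1e-50 160.0 0.0",
        "projB|seq9 - GH18 - 2e-08 35.0 0.2",
    ]
    print(best_hits(parse_tblout(rows), lambda s: s.split("|")[0]))
```

The identifiers returned per project would then be the ones passed to a sequence retrieval tool such as fastacmd.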

Phylogenetic Analysis of Putative Chitinase Sequences
Best-hit sequences (described in detail in Section 2.4) were selected for phylogenetic reconstruction using the Neighbor-Joining (NJ) algorithm implemented in MEGA 5.05 [42], with the p-distance model and 1000 bootstrap replicates. Catalytic domain amino acid sequences from the chitinase reference sets and the selected best-hit sequences were combined to generate a multiple sequence alignment using MAFFT v6.717b [37], which was used as input to build the NJ trees with MEGA 5.05.
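For readers unfamiliar with the p-distance model, it is simply the proportion of aligned sites at which two sequences differ. The sketch below computes it for aligned amino acid sequences. The treatment of gaps is an assumption (sites with a gap in either sequence are skipped, i.e. pairwise deletion), since the gap option used in MEGA is not stated in the text, and the sequence names in the example are invented.

```python
def p_distance(a, b, gap="-"):
    """Proportion of differing sites between two aligned sequences.

    Sites with a gap in either sequence are ignored (pairwise deletion,
    an assumption of this sketch).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == gap or y == gap:
            continue
        compared += 1
        if x != y:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped sites in common")
    return differences / compared


def distance_matrix(aligned):
    """Pairwise p-distances for a dict of name -> aligned sequence."""
    names = list(aligned)
    return {
        n1: {n2: p_distance(aligned[n1], aligned[n2]) for n2 in names}
        for n1 in names
    }


if __name__ == "__main__":
    toy = {"ref_chitinase": "ACDE-G", "metagenomic_hit": "ACDFHG"}
    print(distance_matrix(toy))
```

A matrix of this kind is the input from which the NJ algorithm iteratively joins the closest pairs of sequences into a tree.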

Results
The construction of chitinase reference sequence sets was a key step in the success of the mining strategy applied in this work. The collection and grouping of chitinase sequences into subsets allowed the generation of 15 chitinase groups covering all three chitinase GH families, of which nine were fungal GH family-18, three were bacterial GH family-18, one was bacterial GH family-19, one was fungal GH family-20 and one was bacterial GH family-20 (Figure 1). The use of these chitinase reference subsets enabled the production of high-quality multiple sequence alignments and, consequently, the proper construction of chitinase pHMMs.
The hmmsearch analysis performed against the CAMERA and IMG/M environmental metagenomic databases retrieved a total of 708, 104 and 256 best-hit sequences of putative GH family-18, 19 and 20 chitinases, respectively. Scanning these sequences with RPS-BLAST revealed the presence of chitinase conserved domains in 74.6%, 97.1% and 97.7% of the GH family-18, GH family-19 and GH family-20 sequences, respectively (Figures 2(a)-(c)). Only a small portion of the sequences presented hits with conserved domains other than the chitinase ones (4.8% of GH family-18 and 0.8% of GH family-20). Sequences with no hits accounted for 20.6% of GH family-18, but just 2.9% of GH family-19 and 1.6% of GH family-20 (Figures 2(a)-(c)). The InterPro search inferred the occurrence of chitinase signatures in 81.7%, 89.4% and 98.8% of the metagenomic sequences belonging to GH family-18, 19 and 20, respectively (Figures 2(d)-(f)). Compared to the RPS-BLAST search, the InterPro analysis revealed a higher percentage of sequences hosting protein signatures other than the chitinase ones (10.3% of GH family-18, 8.7% of GH family-19 and 0.4% of GH family-20) and a smaller percentage of sequences presenting no hits against the databases examined (8.0% of GH family-18, 1.9% of GH family-19 and 0.8% of GH family-20) (Figures 2(d)-(f)).
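The three categories reported above (chitinase domain, other domain, no hit) can be tallied with a sketch like the following. It is not the in-house Ruby script used in the study. The rule that a sequence counts as "chitinase" when at least one of its hits belongs to a reference set of chitinase-related domains is an assumption, and the sequence names and the OTHER_DOMAIN label in the example are invented; PF00704 is the Pfam accession of the glycosyl hydrolase family 18 domain.

```python
from collections import Counter

CATEGORIES = ("chitinase", "other", "no_hit")


def classify(domains, chitinase_domains):
    """Assign one sequence to a category from its set of domain hits."""
    if not domains:
        return "no_hit"
    if set(domains) & set(chitinase_domains):
        return "chitinase"
    return "other"


def summarize(hits_by_sequence, chitinase_domains):
    """Return the percentage of sequences in each category (one decimal)."""
    total = len(hits_by_sequence)
    if total == 0:
        raise ValueError("no sequences to summarize")
    counts = Counter(
        classify(domains, chitinase_domains)
        for domains in hits_by_sequence.values()
    )
    return {cat: round(100.0 * counts[cat] / total, 1) for cat in CATEGORIES}


if __name__ == "__main__":
    # Invented toy data, for illustration only.
    toy_hits = {
        "seq1": {"PF00704"},
        "seq2": {"PF00704", "OTHER_DOMAIN"},
        "seq3": {"OTHER_DOMAIN"},
        "seq4": set(),
    }
    print(summarize(toy_hits, chitinase_domains={"PF00704"}))
```

Because every sequence falls into exactly one category, the three percentages for a family should sum to 100% up to rounding, as they do for the values reported in Figure 2.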
A large difference in diversity among the three chitinase GH families was revealed by the RPS-BLAST and InterPro analyses. GH family-19 and GH family-20 presented no more than 12 types of conserved domains, and most of the sequences shared the same conserved domain hits (Tables 1 and 2). In contrast, GH family-18 displayed up to 34 different types of conserved domains, and no set of conserved domains was predominant among the majority of the sequences (at most, half of the sequences shared the same conserved domain hits) (Tables 1 and 2). In addition, the scanning of IMG/M sequences showed that some sequences annotated as hypothetical proteins exhibited chitinase conserved domain hits, demonstrating the sensitivity of our mining pipeline.
The phylogenetic analysis generated NJ trees corresponding to each chitinase dataset. All datasets showed some variability in the amino acid sequence of the catalytic domain region, except for the two active site residues (aspartate and glutamate in GH family-18 and 20, and two glutamates in the case of GH family-19), which were conserved in almost all sequences examined (data not shown). In addition, the NJ tree analysis revealed two common sequence patterns: all the trees presented metagenomic sequences phylogenetically related to characterized chitinases, and all the trees also displayed metagenomic sequences that did not cluster with any characterized chitinase (Figures 3-6). Interestingly, some metagenomic sequences annotated as "hypothetical protein" in the IMG/M database were retrieved by our mining pipeline and grouped with chitinase GH family-18 reference sequences in the NJ phylogenetic analysis (Figure 4), indicating that they are putative chitinase sequences.

Discussion
The broad range of applications of chitinolytic enzymes makes their identification and study very promising. Metagenomic approaches offer access to functional genes in uncultured representatives of the microbiota and hold great potential for the discovery of novel enzymes, but tools to explore these data extensively are still scarce. This study aimed to develop a chitinase mining pipeline to facilitate the comprehensive search for these enzymes in metagenomic databases. The use of a pHMM-based strategy allowed sensitive and efficient detection of putative chitinase sequences.
The generation of representative seed alignments and the selection of the homology detection method are key steps in sequence mining pipelines. The quality of an alignment is critical to its utility in different approaches, such as functional analysis, evolutionary studies and structure prediction [43]. For instance, the quality of the alignment between query and template sequences is a major determinant of model quality in comparative modeling studies [44]. In fact, the higher the alignment quality, the higher the sensitivity in detecting homologous sequences [43]. However, obtaining a high-quality alignment depends on the relatedness of the sequences being aligned. Alignments of sequences sharing high levels of similarity, or about 50% identity, are generally unambiguous and easier to generate automatically, but alignments of more distant sequences, as in some protein families (sharing 30% identity or less), will usually need to be checked manually to reach higher quality. For most alignment methods, the quality increases significantly at about 20% identity [45]. The algorithm implemented in the MAFFT program is considered to be faster than, yet as accurate as, other methods such as ClustalW and T-Coffee [38]. MAFFT is therefore regarded as one of the best global alignment tools currently available [46] [47], which justifies its use in our mining pipeline. In this study we put effort into properly generating chitinase reference sets representative of the different subgroups of sequences belonging to GH families 18, 19 and 20. Well-characterized chitinase sequences were chosen and organized in subsets of at least five sequences. Seed alignments were generated and manually checked, and then used to build reliable pHMMs.
pHMMs are statistical models that use multiple alignments of homologous sequences to quantify amino acid frequencies and the position-specific probabilities of insertions and deletions along the alignment [32] [48]. They are broadly used for modeling conserved motifs of protein families since they contain more information about the sequence family than a single sequence [32] [48] [49]. pHMMs have been described as very efficient at detecting conserved patterns in multiple sequences [35] [50] [51] and as performing better than simple profile-sequence methods such as PSI-BLAST [48] [49]. This higher sensitivity is very promising for comprehensive searches aimed at finding remote homologues, as is the case in our study. Two software packages are frequently used to build pHMMs and to perform profile-sequence searches, SAM [33] and HMMER [52]. The latter has been reported as more suitable for searches of large sequence datasets [53] and was therefore used in the analyses of the present work.
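The idea of position-specific amino acid frequencies can be made concrete with a deliberately simplified Python sketch. It estimates, for each alignment column, the probability of each of the 20 amino acids using a uniform pseudocount. This is only the emission part of a profile and is not how hmmbuild works internally: HMMER additionally weights sequences, uses more elaborate priors and estimates insert and delete transition probabilities, none of which is modeled here. The toy alignment is invented.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(alignment, pseudocount=1.0):
    """Per-column amino acid probabilities with a uniform pseudocount.

    `alignment` is a list of equal-length aligned sequences; gaps and
    non-standard symbols are ignored when counting.
    """
    if not alignment:
        raise ValueError("empty alignment")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("sequences must have the same aligned length")
    profile = []
    for i in range(length):
        counts = Counter(
            seq[i].upper() for seq in alignment if seq[i].upper() in AMINO_ACIDS
        )
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


if __name__ == "__main__":
    toy_alignment = ["DG", "DG", "EG"]  # invented example
    profile = column_frequencies(toy_alignment)
    for position, column in enumerate(profile, start=1):
        top = max(column, key=column.get)
        print(position, top, round(column[top], 3))
```

A strongly conserved column concentrates probability on one residue, whereas a variable column spreads it out, which is the information a single query sequence cannot carry.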
The best-hit sequences (those retrieved after the hmmsearch analysis) were scanned for chitinase conserved domains and motifs/signatures in order to evaluate the performance of our chitinase mining pipeline in detecting putative chitinase sequences. Many annotation pipelines use searches against conserved domain databases since these regions are evolutionarily conserved units in proteins [54]. The recognition of a conserved domain footprint in a protein sequence usually indicates its cellular or molecular function [55] and provides more reliable protein classification than sequence similarity analysis. The RPS-BLAST and InterPro searches performed in this work found high percentages of chitinase-related domains and motifs in the best-hit metagenomic sequences, validating our chitinase mining pipeline. The best-hit metagenomic sequences showing no hits to any conserved domain may represent putative novel chitinases that possibly would not be identified using sequence-sequence similarity searches. Furthermore, some IMG/M metagenomic sequences annotated as hypothetical proteins resulted in hits with chitinase conserved domains in our analysis, indicating that our pipeline may have high sensitivity and is able to detect remote homologues.
The results obtained in the RPS-BLAST and InterPro analyses emphasized the large differences in diversity among the three chitinase GH families 18, 19 and 20. As described in previous reports, GH family-18 holds higher variability in evolutionary terms and contains the greatest number of protein members [4] [7]. The diversity observed in GH families 18, 19 and 20 was also assessed in the phylogenetic reconstructions for the metagenomic and the chitinase reference sequences. Interpreting phylogenetic relationships among sequences is particularly important since it allows gene function [56], genetic variability and protein evolution to be inferred. Phylogeny-based classification systems have been used before to identify enzymes in metagenomic sequence datasets [57] [58]. Based on the phylogenetic relationships observed in the NJ trees generated in this study, two common sequence patterns were identified. One includes metagenomic sequences phylogenetically related to characterized chitinases, which may help to understand their origin and classification. The other comprises metagenomic sequences that did not cluster with any characterized chitinase, suggesting a great reservoir of putative new chitinases to be exploited in these metagenomic databases. Our results reinforced the sensitivity and efficiency of our mining pipeline in detecting putative chitinase sequences from metagenomic databases.

Conclusion
Traditional sequence search pipelines are frequently unable to exploit metagenomic databases extensively. The current flood of sequence data from metagenomic studies and the wide range of applications of chitinases brought about the need to develop a new data search pipeline. The chitinase mining pipeline developed in this work was based on the generation of high-quality seed alignments from reliable chitinase reference sets, which were then used for the construction of chitinase pHMMs. The searches using these pHMMs retrieved high percentages of putative chitinase sequences, which were confirmed in silico by scanning for chitinase conserved domains and motifs/signatures and by NJ phylogenetic reconstructions. The results confirmed the efficacy of our pipeline in detecting chitinase sequences and highlighted the sensitivity of pHMM-based strategies in identifying remote homologues. These analyses provide insight into the potential reservoir of novel molecules in metagenomic databases while supporting the in silico chitinase mining pipeline developed in this work and identifying phylogenetic relationships among the chitinase sequences. By using our chitinase mining pipeline, a larger number of previously unannotated metagenomic chitinase sequences can be classified, enabling further exploration of these enzymes.

Figure 1. Workflow of the methodology applied in this study. The first step was to generate fungal and bacterial chitinase reference sets for the glycoside hydrolase (GH) families 18, 19 and 20. Fifteen subsets were created, of which nine were fungal GH family-18, three were bacterial GH family-18, one was bacterial GH family-19, one was fungal GH family-20 and one was bacterial GH family-20. The second stage consisted of the generation of profile Hidden Markov Models (pHMMs) for each chitinase reference sequence subset, followed by a sequence database search against CAMERA and IMG/M. The best-hit sequences of each metagenomic project were retrieved and used in the last step of our analysis. The validation of the mining strategy was carried out by performing both an InterPro and an RPS-BLAST search against protein signature, conserved domain and motif databases. The phylogenetic analysis of the metagenomic sequences together with the chitinase reference sequences generated NJ trees for each chitinase subset.

Figure 2. Pie charts representing the percentage of metagenomic sequences (the hmmsearch best-hit sequences) that exhibited chitinase-related domains and/or signatures after RPS-BLAST ((a), (b), (c)) and InterPro ((d), (e), (f)) searches against different conserved domain databases. The plots represent each GH family separately: GH family-18 results are presented in (a) and (d), GH family-19 in (b) and (e), and GH family-20 in (c) and (f). * Percentage of metagenomic sequences showing conserved domains other than the ones found in the representative chitinase sequences; ** Percentage of metagenomic sequences with no hit in these searches against conserved domain databases.

Table 1. Conserved domain hits recovered after an RPS-BLAST search using the metagenomic sequences (hmmsearch best-hit sequences) against seven conserved domain databases (CDD, COG, KOG, Pfam, Prk, SMART and TIGRfam).
a Only the conserved domain hits found in more than 10% of the sequences analyzed are displayed in the table; b Percentage of sequences that showed a hit with that conserved domain.