Current Trend of Metagenomic Data Analytics for Cyanobacteria Blooms

Cyanobacterial harmful algal blooms are a major threat to freshwater ecosystems globally. To deal with this threat, research into cyanobacteria blooms in freshwater lakes and rivers has been carried out all over the world. This review presents an overview of studies on cyanobacteria blooms. Conventional studies mainly investigate the environmental factors influencing the blooms and are limited by their lack of insight into microbial community structure. Metagenomic studies provide insight into the internal community structure of the cyanobacteria during a bloom, and some researchers have reported that sequence data were a better predictor than environmental factors, which further underlines the significance of metagenomic study. However, a large number of metagenomic studies appear to be confined to presenting a snapshot of microbial community diversity and structure. This type of investigation has been valuable and important. An emerging trend, and a challenge for the field, is to integrate the conventional approaches, which largely focus on control by environmental factors, with the metagenomic approaches, which reveal microbial community structure and diversity, through machine learning techniques, in order to gain a holistic and more comprehensive insight into the cause and control of cyanobacteria blooms.


Introduction
Cyanobacteria blooms are commonly associated with toxin production in drinking water supplies and pose a severe risk to human health [1] [2]. The blooms have become a major threat to freshwater ecosystems globally and a worldwide challenge [3] [4]. To deal with this threat, studies of cyanobacteria blooms have been carried out on freshwater systems in Asia [5]-[14], Europe [15]-[19], North America [20]-[26], Oceania [27] [28] and Africa [29] [30]. For decades, this work has mainly been characterized by investigating nutrient control (such as nitrogen and phosphorus) and the influence of other environmental factors (temperature and pH, for example) on the blooms, by deploying combined hydrodynamic and microbial models to predict the blooms, or by introducing machine learning methods such as the Artificial Neural Network (ANN), with environmental factors as input variables, to understand the cause and development of the microbial community's explosive reproduction [8] [11] [31] [22] [33]. On the other hand, new-generation high-throughput sequencing techniques based on the 16S rRNA gene make it possible to examine the composition of the microbial community quickly and comprehensively in different habitats, giving insight into profiles of community composition [34] [35] [36]. With the rapid development of next-generation sequencing techniques, metagenomic data analysis has been applied to the study of cyanobacteria blooms. Applying metagenomics to investigate the genetic and metabolic diversity of mixed populations helps to understand the interactions of different microbial populations and their functions in the blooming process. Recent research by Tromas et al. [25] shows that sequence data were a predictor similar to or better than environmental factors. In this article, we review recent studies of harmful cyanobacteria blooms, paying particular attention to cases in China, which has been suffering from severe and growing water quality problems [37]-[38].

Conventional Approach: Investigating the Relation between Environmental Factors and the Blooms
The conventional approach focuses on the impact of environmental factors on the bloom. Kong, F. and Fao, G. [39] identified environmental elements as control factors of algal blooms. Temperature and dissolved oxygen at the sediment surface were observed, and they concluded that hydrological and meteorological conditions would cause the algae to float up to the water surface and then form the water bloom. Wilhelm et al. [40] investigated the relationships between nutrients, cyanobacterial toxins and the microbial community in Taihu, China. They provided an independent confirmation that both total nitrogen and total phosphorus concentrations are strongly related to cyanobacterial biomass in the system, and demonstrated that both nitrogen and phosphorus inputs play a role in microbial community biomass production and structure. They also indicated that the toxicity of the community is not closely coupled to the key factors leading to bloom formation. McCarthy et al. [41] studied nitrogen dynamics and microbial food web structure during a summer cyanobacterial bloom in Lake Taihu. They found that N limits or colimits primary production in and near central Lake Taihu, contrary to the previous paradigm of exclusive P limitation, and regarded this result as an example of the importance of characterizing N cycling in freshwater systems, where most studies have focused on P dynamics. They also stated the importance of water column N recycling relative to sediment processes.
An example from North America is a study undertaken by Graham et al. [20]. Physicochemical data were collected from 241 lakes in Missouri, Iowa, northeastern Kansas, and southern Minnesota, U.S.A., during May-September 2000-2001, to determine the environmental variables associated with high concentrations of the cyanobacterial hepatotoxin microcystin (MC). Relationships between particulate MC values and environmental variables were developed using nonparametric Spearman rank correlation (α = 0.05). The following environmental factors were measured: Secchi transparency, surface temperature, total phosphorus (TP), total nitrogen (TN), TN:TP ratio, total suspended solids (TSS), chlorophyll (Chl), and Chl:TP ratio. They found that the presence and concentration of microcystin increase along a gradient of increasing lake trophic status. Ma et al. [42] reported the influence of N, P and pH on Microcystis growth and colony formation in field simulation experiments in Lake Taihu (China). Krausfeldt et al. [43] examined the spatial and temporal variability in the nitrogen cyclers of hypereutrophic Lake Taihu. These studies focused on physical and chemical parameters to examine how environmental and biological variables were associated with the cyanobacteria blooms.
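The Spearman rank correlation mentioned above can be illustrated with a short sketch. It is only a minimal illustration, not the analysis of Graham et al.: the function names are hypothetical, the lake values are invented for the example, and the significance test at α = 0.05 is left out.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / (sx * sy)


# Hypothetical values for five lakes (illustrative only, not study data).
total_phosphorus = [12.0, 30.0, 45.0, 80.0, 150.0]  # ug/L
microcystin = [0.0, 0.1, 0.1, 0.9, 2.3]             # ug/L
rho = spearman_rho(total_phosphorus, microcystin)
print(f"Spearman rho = {rho:.3f}")
```

Because the statistic depends only on ranks, it captures monotonic relationships such as toxin concentration rising with trophic status without assuming linearity.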
Early studies of cyanobacteria blooms were characterized by the examination of environmental parameters such as water temperature, solar radiation, precipitation, water transparency, pH, DO, and nutrients such as TN, TP, DN, DP, PO4-P, NH4-N and NO3-N (e.g., [11]). Hydrodynamic modelling, statistical methods and machine learning approaches have facilitated the research [8] [31] [44] [45] [46]. Wu et al. [31] performed regression and multivariate analyses, using principal component and classification analysis, in a study of the cyanobacterial toxin microcystin in 30 subtropical shallow lakes in the middle and lower reaches of the Yangtze River area in China. Li et al. [8] used a coupled hydrodynamic-algal biomass model to forecast short-term cyanobacterial blooms in Lake Taihu; the model was applied to predict the occurrence of algal blooms over the next 3 days in Lake Taihu from April to September in 2009 and 2010. They reported that independent evaluations from remote sensing images and boat survey data showed that the accuracy of these bloom forecasts was more than 80%. Yabunaka et al. [11] applied machine learning techniques to model algal bloom dynamics. Artificial Neural Network models, through genetic programming, were used to model and predict the blooms in Tolo Harbour, Hong Kong. The input variables were 10 parameters: four nutrients (PO4-P, NH4-N, NO3-N, and Si), four physicochemical conditions (water temperature, transparency, DO, and pH), and two zooplankton species. The output variables were the chlorophyll a concentration or the biomass of specific phytoplankton species. Vilán et al. used support vector machines and multilayer perceptron networks from cyanobacterial concentrations determined experimentally in the Trasona reservoir in Northern Spain to build a cyanotoxin diagnostic model [46]. They reported that the SVR (support vector regression) and MLP (multilayer perceptron) techniques predicted the cyanobacteria blooms actually observed from 2006 to 2010 more effectively and accurately than traditional regression models. In this research, the output variable was cyanotoxins and the input variables were a number of biological and physicochemical variables. The biological parameters included Microcystis aeruginosa, Woronichinia naegeliana, other cyanobacteria species, diatoms, chrysophytes, chlorophytes and other phytoplankton species. The physicochemical variables were water temperature, ambient temperature, Secchi disk depth, turbidity, total phosphorus concentration, total nitrogen concentration, nitrate concentration, nitrite concentration, ammonium ion concentration, dissolved oxygen concentration, conductivity, alkalinity, calcium concentration, and pH.
Zhou et al. [12], in their study of the influence of turbulence on MC concentrations in Lake Taihu during cyanobacterial bloom periods, investigated how toxic Microcystis and MC production may be affected by wind-driven turbulence. The study aimed to deliver deeper insight into the competition of toxic Microcystis and the regulation of MCs, and to understand the coupling of MC production and turbulence. A 6-day mesocosm experiment was carried out to evaluate the effects of wind-wave turbulence on the competition of toxic Microcystis and MC production in highly eutrophic and turbulent Lake Taihu, China. Under turbulent conditions, MC concentrations (both total and extracellular) increased significantly and reached a maximum level 3.4 times higher than in calm water. Specifically, short-term (about 3 days) turbulence favored the growth of toxic Microcystis species, allowing biomass to accumulate, which also triggered the increase in MC toxicity. Moreover, intense turbulence raises the shear stress and could cause mechanical cell damage or cellular lysis, resulting in cell breakage and leakage of intracellular materials, including the toxins. The results indicate that short-term (about 3 days) turbulence is beneficial for MC production and release, which increases the potential exposure of aquatic organisms and humans. This study suggests the importance of water turbulence in the competition of toxic Microcystis and MC production, and provides new perspectives for the control of toxins in CyanoHAB-infested lakes.
Although the studies represented by the aforementioned cases have produced encouraging outcomes, their focus on environmental parameters alone remains inadequate for a deep and holistic examination of the bloom phenomenon, especially in terms of the diversity and composition or structure of the microbial community of the bloom-forming cyanobacteria. Recent research has demonstrated that, in addition to the conventional methods, it is possible to use pre-bloom sequence data to predict the number of days until a bloom event occurs with good accuracy; sequence data appear to be a strong predictor, similar to or better than environmental variables [25].

Next Generation Sequencing Techniques Based on the System of 16S rRNA Gene for Microbial Community Profiling
The application of metagenomics [47] has been accompanied by high-throughput Next-Generation Sequencing (NGS), which surpasses the traditional Sanger approach for DNA isolation and sequencing. The development of high-throughput DNA sequencing techniques has advanced microbial community profiling using 16S rRNA [34]-[36], [48] for analysing the population structure of cyanobacterial blooms. High-throughput DNA sequencing techniques speed up the analysis by bypassing the need to isolate or cultivate the microorganisms [49]-[50], i.e., the cyanobacteria concerned. Next-generation sequencing technology offers advantages in its high throughput, short test period, low cost and repeatability [51]-[52]. This culture-independent, molecular way of analysing environmental samples of cohabiting microbial populations has opened up fresh perspectives on microbiology [53]. Figure 1 shows a diagram of a microbial community analysis pipeline.

Applying Metagenomics to Characterize the Structure and Function of Microbial Community in Fresh Water Ecosystem
Some examples are examined in this section, with special attention paid to lakes in China and North America because of their significant influence on large human populations.

Microbial Community Structures
Through a metagenomic approach, Xie et al. investigated the relationship between Microcystis and its associated bacteria [10]. They analyzed cyanobacteria-dominated bloom communities from Lake Taihu, China, applying a visualization-enhanced binning method that they developed. Analysis of the metabolic pathways of the microbial community indicated cooperative interactions among the complex species. The study revealed that while all heterotrophic bacteria were dependent upon Microcystis for carbon and energy, vitamin B12 biosynthesis, which Microcystis requires for growth, was accomplished in a cooperative fashion among the bacteria. The analysis also suggested that individual bacteria in the colony community contributed a complete pathway for the degradation of benzoate, which is inhibitory to cyanobacterial growth.
Next-generation sequencing has also been used to study microbial diversity in the lake-river ecotone of Poyang Lake, China [9]. The authors aimed to identify the microbial diversity in different lake-river ecotones, and to explore the evolution and adaptation of the microbial population to changing environmental conditions. The results showed that the main Poyang Lake had the largest microbial population, followed by Yao Lake, Ganjiang River and Raohe River. Based on the Shannon and Simpson indices, the main Poyang Lake had the largest biodiversity of microbial communities, followed by Ganjiang River, Yao Lake, and Raohe River. Microbial characteristics varied with the TN and TP concentrations; for instance, the nitrifying bacteria were relatively rich in the Yao Lake and Ganjiang River ecotones, and the polyphosphate-accumulating organisms (PAO) in Raohe
In summer, Cyanobacteria were dominant, which may result from the strong co-occurrence pattern, suitable temperature and eutrophication. The bacterial community within a module maintained similar ecological niches. The analysis of the relationships between the module eigengenes and environmental variables provided a highly simplified version of the complex effects of environmental variables on the bacterial communities. Module eigengene analysis indicated that temperature affected only some Cyanobacteria members, while others were mainly affected by nitrogen-associated factors. Overall, this study applied network analysis to better understand the associations of bacterioplankton communities in freshwater lakes.
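The Shannon and Simpson indices used above to rank the biodiversity of the Poyang Lake sites can be computed directly from OTU counts. The sketch below is a minimal illustration under stated assumptions: the counts are invented, the function names are hypothetical, and the Simpson index is given in its 1 - sum(p_i^2) form, which may differ from the variant reported in [9].

```python
from math import log


def shannon_index(counts):
    """Shannon index H' = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one read")
    return -sum((c / total) * log(c / total) for c in counts if c > 0)


def simpson_index(counts):
    """Simpson diversity 1 - sum(p_i^2): chance that two random reads differ in taxon."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one read")
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Hypothetical OTU counts for two samples (illustrative only).
even_sample = [25, 25, 25, 25]   # four equally abundant OTUs
skewed_sample = [97, 1, 1, 1]    # one dominant OTU, as in a bloom

for name, counts in (("even", even_sample), ("skewed", skewed_sample)):
    print(name, round(shannon_index(counts), 3), round(simpson_index(counts), 3))
```

Both indices fall when a single taxon dominates, which is why they are useful for comparing sites or tracking a community through a bloom.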
In a comparative study, Steffen et al. highlighted the utility of metagenomics as a tool for the exploration of microbial communities, providing microbial snapshots of three separate toxic cyanobacterial blooms, in Lake Erie (North America), Lake Tai (Taihu, China), and Grand Lake St. Marys (OH, USA), using comparative metagenomics [21]. They concluded that, despite being single samples, these metagenomes provided a unique snapshot of the microbial community associated with toxic cyanobacterial blooms. They noticed that sequences of the Microcystis phage Ma-LMM01 were detected at all three lakes, which is especially worth noting given the importance of phage in bloom dynamics and termination. Their findings included the presence of the mlrC gene in both Taihu and Erie. This gene is involved in the microbial degradation of microcystin, and its presence warrants further inquiry into potentially important microcystin degraders in these lakes. In their observations, key functional genes, such as those involved in nitrogen assimilation, appeared to be more informative than standard 16S rDNA gene analysis, and demonstrated that within two similar biological events (blooms in Lake Erie and Taihu) the analogous processes were likely carried out by different members of the community. With this approach, they were able to identify potentially divergent pathways of assimilated nitrogen through the microbial communities of three different blooms. The genomic contribution of heterotrophic bacteria to nitrogen assimilation in Taihu represented a potentially critical contribution of heterotrophic bacteria in driving toxic freshwater blooms.

Environmental Variables and the Microbial Community Structures
Cao et al. conducted a study in 21 freshwater lakes in Yunnan Province, China [14]. In their study, two hypothesized structural equation models were used to explore how the bacterial community structure responded to environmental variables in the investigated plateau lakes. The models highlighted the role of the physical environment, land use, lake morphology and nutrients in influencing the bacterial community structure. Water transparency was shown to be a major driving force in determining the taxon composition of the bacterial community. In contrast with the response of the cyanobacteria community to lake morphology, a relatively weak relationship was observed between the bacterioplankton community and lake morphology, especially lake depth. In addition, the models showed that TN was more significant than TP in determining the bacterioplankton community structure. The threshold analyses for nonlinear responses suggested that substantial changes in the bacterioplankton community structure occurred at 7.36 for pH and at 25.6% for the percentage of agricultural area, while the distinct change point of the cyanobacteria community structure in response to pH was at 7.74. Finally, further analyses indicated an apparent shift in dominance from Proteobacteria to Cyanobacteria with increasing nutrient loads.
Using weekly data from western Lake Erie in 2014, Berry et al. investigated how the cyanobacterial community varied over space and time, and whether the bloom affected non-cyanobacterial (nc-bacterial) diversity and composition [24]. In the study, extracted DNA was amplified using the primer set 515f/806r, which targets the V4 hypervariable region of the 16S rRNA gene. Both microbial community parameters and environmental parameters were examined. They found that the bacterial community exhibited changes in diversity and composition during the bloom season, and that the evenness of Alphaproteobacteria and Betaproteobacteria showed differential responses to algal pigment levels, suggesting that the bloom affected niche diversity for these phylogenetic groups. Their observations supported a link between CHABs and disturbances to bacterial community diversity and composition. They concluded that changes in community composition could be represented in three coordinates, with the first coordinate associated most strongly with bloom measures, the second with temperature, and the third with physical water mass movements. These results supported work by others demonstrating that bacterial communities are impacted by CHABs, and identified the acI clade as a particularly affected group. The short recovery of many taxa after the bloom indicates that bacterial communities may exhibit resilience to CHABs.
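Ordinations of community composition, such as the three coordinates described above, typically start from a matrix of pairwise dissimilarities between samples. The sketch below computes Bray-Curtis dissimilarity, which is one common choice and is assumed here for illustration; it is not necessarily the measure used by Berry et al. The sample names and counts are invented.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors.

    Returns 0.0 for identical samples and 1.0 for samples with no shared taxa.
    """
    if len(a) != len(b):
        raise ValueError("samples must cover the same taxa in the same order")
    total = sum(a) + sum(b)
    if total == 0:
        raise ValueError("at least one sample must contain reads")
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def dissimilarity_matrix(samples):
    """Pairwise Bray-Curtis dissimilarities, keyed by (sample, sample) name pairs."""
    names = list(samples)
    return {(p, q): bray_curtis(samples[p], samples[q]) for p in names for q in names}


# Hypothetical counts for four taxa at three time points (illustrative only).
samples = {
    "pre_bloom": [40, 30, 20, 10],
    "bloom": [5, 5, 10, 80],
    "post_bloom": [35, 30, 25, 10],
}
matrix = dissimilarity_matrix(samples)
print(matrix[("pre_bloom", "bloom")], matrix[("pre_bloom", "post_bloom")])
```

In this invented example the post-bloom sample is much closer to the pre-bloom sample than to the bloom sample, which is the kind of pattern that would be read as community recovery.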
Tromas et al. used a deep 16S amplicon sequencing approach to profile the bacterial community in eutrophic Lake Champlain over time, to characterize the composition and repeatability of cyanobacterial blooms, and to determine the potential for blooms to be predicted from time-course sequence data [25]. The analysis, based on 143 samples collected between 2006 and 2013, spans multiple bloom events. They found that the microbial community varied substantially over months and seasons, while remaining stable from year to year. Bloom events significantly altered the bacterial community but did not reduce overall diversity, suggesting that a distinct microbial community, including non-cyanobacteria, prospers during the bloom. Blooms tended to be dominated by one or two genera of cyanobacteria: Microcystis or Dolichospermum. Blooms were thus relatively repeatable at the genus level, but more unpredictable at finer taxonomic scales. They classified their samples into bloom or non-bloom bins, achieving up to 92% accuracy. They confirmed that cyanobacterial blooms respond significantly to total phosphorus and total nitrogen, as previously described. Temperature was also an important factor shaping the lake microbial community, as previously documented. However, they observed that these predictors explained only part of the variation between bloom and non-bloom samples. Other predictors might include water column stability and mixing, and the interactions of predictors, especially nutrients and temperature.
In addition to environmental factors, they showed that biological factors, in the form of bacterial OTUs or genera, could also help to characterize the bloom. They indicated that cyanobacterial blooms alter the local environment, likely altering the surrounding microbial community. As a result, these assemblages likely included bacteria that were reliant on cyanobacterial metabolites and biomass. Using symbolic regression, they were able to predict the start date of a bloom with 78-91% explained variance over tested data (depending on the data used for model training). They stated that sequence data appeared to be a strong predictor, similar to or better than environmental variables. This showed that, although blooms in Lake Champlain (and other temperate lakes) were clearly correlated with seasonality (i.e., blooms occur mainly during summer, at warmer temperatures), the state of the microbial community may contain more information than environmental factors alone about the likelihood of an impending bloom. This could be because one microbial taxon contains information about numerous environmental parameters, resulting in parsimonious predictive models based on a small number of taxonomic biomarkers.
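The "explained variance" figures quoted above summarize how closely predicted bloom timing matches observed timing. As a rough illustration, the sketch below computes the coefficient of determination (R^2), which is one common reading of explained variance and is assumed here; the exact metric and the data used by Tromas et al. are not reproduced, and the values below are invented.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_obs = sum(observed) / len(observed)
    ss_total = sum((o - mean_obs) ** 2 for o in observed)
    if ss_total == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1.0 - ss_residual / ss_total


# Hypothetical days-until-bloom for five held-out samples (illustrative only).
days_observed = [30, 21, 14, 7, 2]
days_predicted = [27, 22, 16, 6, 3]
score = r_squared(days_observed, days_predicted)
print(f"R^2 = {score:.3f}")
```

A value of 1 means perfect prediction, while 0 means the model does no better than always predicting the mean of the observed values.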

Microcystis Appears in Most of the Studied Lakes
Microcystins (MCs) are the most common and potent cyanotoxins in freshwater systems worldwide. Most of the lakes examined in Section 4 were associated with the genus Microcystis. Lake Taihu has experienced Microcystis bloom events for decades, and Xie et al. [10] and Steffen et al. [21] focused their studies on this lake; Cao et al. [14] reported that genus-level analysis of Cyanobacteria identified Microcystis as among the most abundant genera in the 21 plateau lakes in Yunnan, China; Steffen et al. [21] stated that Microcystis-dominated blooms had been observed in the western basin of Lake Erie annually since the 1990s; the study by Berry et al. [24] in western Lake Erie also showed that cyanobacterial community composition fluctuated dynamically during the bloom, but was dominated by Microcystis and Synechococcus OTUs; Tromas et al. [25] reported that blooms at their study site (Lake Champlain, North America) tended to be dominated by one or two genera of cyanobacteria: Microcystis or Dolichospermum; in the study of Touzet [19]

Machine Learning (ML) Approaches Possess Dual Significance in the Metagenomics and Cyanobacteria Blooming Study
Metagenomic data analysis aims to identify the taxonomic composition of microbes and their relative counts, to annotate the functional roles encoded by microbiomes, and to find associations between microbes and their functional metadata phenotypes [55] [56]. Microbial communities, or their associated functional conditions, can be differentiated by analysing relative OTU abundance across metagenomic samples and the relationships among them. Machine learning (ML) techniques are a powerful tool in metagenomic data analysis [57] [58], which depends on computational tools for analysing very large data sets and extracting information about the microbial community. This is reflected in the aforementioned case studies. In the early studies focusing on environmental parameters, by contrast, machine learning was applied to establish relationships between physicochemical factors and bloom occurrence. Machine learning therefore has a dual significance in the study of cyanobacteria blooms. Researchers wish to improve the methodology used both in metagenomic analysis and in physicochemically oriented ML modelling approaches.
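The idea of differentiating samples by relative OTU abundance can be illustrated with a deliberately simple classifier. The sketch below uses a nearest-centroid rule, chosen here only because it is easy to follow; it is not the method of any study cited above, and the OTU columns, counts and function names are hypothetical.

```python
def relative_abundance(counts):
    """Convert raw OTU counts to proportions that sum to 1."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("sample has no reads")
    return [c / total for c in counts]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_centroids(samples, labels):
    """Average the relative-abundance profiles of each labelled class."""
    grouped = {}
    for counts, label in zip(samples, labels):
        grouped.setdefault(label, []).append(relative_abundance(counts))
    return {label: centroid(profiles) for label, profiles in grouped.items()}


def predict(centroids, counts):
    """Assign a sample to the class whose centroid is closest to its profile."""
    profile = relative_abundance(counts)
    return min(centroids, key=lambda label: squared_distance(centroids[label], profile))


# Hypothetical training data; columns: Microcystis, Dolichospermum, acI, other.
training_counts = [
    [800, 50, 50, 100],
    [600, 200, 60, 140],
    [50, 20, 500, 430],
    [80, 10, 450, 460],
]
training_labels = ["bloom", "bloom", "non-bloom", "non-bloom"]
model = fit_centroids(training_counts, training_labels)

print(predict(model, [700, 100, 40, 160]))
print(predict(model, [30, 30, 480, 460]))
```

Working on proportions rather than raw counts keeps samples with different sequencing depths comparable. The same feature table could be passed to more capable models, such as the SVM or neural network approaches discussed earlier.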

Summary Remarks
Studies of cyanobacteria blooms have been undertaken for decades because of the harm the blooms cause to the environment and to human beings. Conventional studies mainly investigate the environmental factors influencing the blooms and are limited by their lack of insight into the microbial population of the bloom. Metagenomic studies provide insight into the internal community structure of the cyanobacteria during a bloom, and some researchers have reported that sequence data were a better predictor than environmental factors, which further underlines the significance of metagenomic study. However, a large number of metagenomic studies appear to be confined to presenting a snapshot of microbial community diversity and structure. This type of investigation has been valuable and important. An emerging trend, and a challenge for the field, is to integrate the conventional approaches, which largely focus on control by environmental factors, with the metagenomic approaches, which reveal microbial community structure and diversity, through machine learning techniques, in order to gain a holistic and more comprehensive insight into the cause and control of cyanobacteria blooms.
River were richer than those in Ganjiang River. In a study carried out by Zhao et al., high-throughput sequencing was employed to investigate the seasonal variations in the composition of bacterioplankton communities in six eutrophic urban lakes of Nanjing City, China [54]. The results showed that temperature, pH and NO3−-N were the most important factors influencing the composition of the bacterioplankton community. The length and direction of the temperature arrow suggested a strong impact on the summer community. Temperature was orthogonal to the other two arrows (pH and NO3−-N), suggesting that temperature explains variation not explained by pH and NO3−-N. The results demonstrated that co-occurrence in freshwater bacterioplankton communities within the six urban lakes varied between seasons. Moreover, Cyanobacteria played different roles in the ecological network of each season.
Actinobacteria and Bacteroidetes showed a sharp decrease and increase across the change point along the gradient of agricultural area. A study undertaken by Touzet et al. investigated the dynamics in summer diversity of planktonic cyanobacterial communities and microcystin toxin concentrations in two interconnected lakes in the west of Ireland, Lough Corrib and Ballyquirke Lough [19]. Phytoplankton biomass was estimated through chlorophyll-a analysis, and cyanobacterial community fingerprinting was examined by 16S rDNA DGGE analysis. The quantitative variables analyzed included temperature, Secchi depth, chlorophyll-a concentration, dissolved inorganic nitrogen and phosphorus, microcystin concentrations and a DGGE-based estimate of cyanobacterial abundance. They observed community change throughout the summer, and identified cyanobacterial genotypes both unique to each lake and shared between the two. Microcystin concentrations were greater in August than in July and June in both lakes. They indicated that this coincided with the increased occurrence of Microcystis, as evidenced by DGGE band excision and subsequent sequencing and BLAST analysis. RFLP analysis of PCR-amplified mcy-A/E genes clustered together the August samples of both lakes, highlighting a potential change in microcystin producers across the two lakes. The multiple factor analysis of the combined environmental data set for the two lakes highlighted the expected pattern opposing greater water temperature and chlorophyll concentration against macronutrient concentrations, but also indicated a negative relationship between microcystin concentration and cyanobacterial diversity, possibly pointing to allelopathic interactions. Despite some element of connectivity, the dissimilarity in the composition of the cyanobacterial assemblages and in the timing of community change in the two lakes likely reflected niche differences determined by meteorologically forced variation in physicochemical parameters in the two water bodies.

Table 1. Some environmental parameters seen in Cyanobacteria bloom study.

Table 2. Summary of the lakes examined in this section.
et al. Microcystins were extracted from the samples of Lough Corrib and Ballyquirke Lough. Microcystins (MCs), which are predominantly produced by Microcystis spp., are considered a serious health hazard because of their potent liver toxicity and carcinogenic potential, and have attracted serious concern.