A Review on Phylogenetic Analysis: A Journey through Modern Era

Phylogenetic analysis may be considered to be a highly reliable and important bioinformatics tool. The importance of phylogenetic analysis lies in its simple manifestation and easy handling of data. The simple tree representation of the evolution makes the phylogenetic analysis easier to comprehend and represent as well. The varied applications of phylogenetics in different fields of biology make this analysis an absolute necessity. The different aspects of phylogenetic analysis have been described in a comprehensive manner. This review may be useful to those who would like to have a firsthand knowledge of phylogenetics.


Introduction
The term "evolution" can be defined in various ways in different contexts. From a biologist's point of view, evolution is the development of a biological form from other pre-existing forms, or from its origin to its current form, through natural selection and modification, i.e., change across successive generations. The driving force behind evolution is natural selection, in which "unfit" forms are eliminated through changes in environmental conditions or through sexual selection, so that only the fittest are selected (Darwinism). The underlying mechanism of evolution is genetic mutation, which occurs spontaneously. Mutations in the genetic material provide the biological diversity within a population and hence the variability that allows individuals to survive successfully in a given environment. Genetic diversity thus provides the raw material on which natural selection acts.
The term "phylogenetics" is derived from the Greek terms "phyle" and "phylon", meaning "tribe" and "race", and "genetikos", meaning "relative to birth", from "genesis", i.e., "birth". Phylogenetics is the study of evolutionary relatedness among groups of organisms (e.g., species, populations). In other words, the phylogenetic analysis of a family aims to determine how the family might have been derived during evolution.

Representation of Phylogenetic Relationship: Phylogenetic Tree
A phylogenetic tree is a two-dimensional representation of the relatedness among various biological species. It is a line drawing that provides a visual representation of a group of sequences or species and indicates the order in time in which they originated. A phylogenetic tree can be represented in three forms: phylogram, dendrogram, and cladogram.
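The difference between two of these forms can be illustrated with a small sketch. It assumes the usual convention that a phylogram carries branch lengths proportional to the amount of change, whereas a cladogram shows only the branching order. The nested-tuple tree, the taxon names A, B and C, the branch lengths, and the helper to_newick are all hypothetical and serve only to illustrate the idea; the output uses the widely used Newick text format.

```python
# Each node is (name, branch_length, children); all values are illustrative.
def to_newick(node, with_lengths=True):
    """Serialise a nested-tuple tree to Newick text (without the final ';')."""
    name, length, children = node
    text = name
    if children:
        inner = ",".join(to_newick(child, with_lengths) for child in children)
        text = f"({inner}){name}"
    # Branch lengths are written only for the phylogram-style output.
    if with_lengths and length is not None:
        text += f":{length}"
    return text

tree = ("", None, [
    ("", 0.5, [("A", 1.0, []), ("B", 1.5, [])]),
    ("C", 2.0, []),
])

phylogram = to_newick(tree) + ";"                      # topology plus branch lengths
cladogram = to_newick(tree, with_lengths=False) + ";"  # topology only
print(phylogram)
print(cladogram)
```

Both strings describe the same branching order; only the first retains the information needed to draw branches to scale.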

Merits and Demerits of Tree Building Methods
A phylogenetic tree may be built mainly by either distance-based methods or character-based methods.
Character-based methods derive trees that optimize the distribution of the actual data pattern for each character. The most commonly used character-based methods are the Maximum Parsimony (MP) method [5] and the Maximum Likelihood (ML) method [6].
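As an illustration of how a character-based criterion scores a tree, the following minimal sketch counts the minimum number of changes required on a fixed rooted binary tree using Fitch's small-parsimony algorithm. It is not the implementation of any particular package; the four-taxon alignment, the tuple-based tree encoding, and the function names are assumptions made for the example.

```python
def fitch(node, leaf_states):
    """Return (possible_states, changes) for one site.

    node is either a leaf name (str) or a (left, right) tuple.
    """
    if isinstance(node, str):
        return {leaf_states[node]}, 0
    left_set, left_cost = fitch(node[0], leaf_states)
    right_set, right_cost = fitch(node[1], leaf_states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # Children disagree: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the Fitch change counts over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for site in range(n_sites):
        states = {name: seq[site] for name, seq in alignment.items()}
        total += fitch(tree, states)[1]
    return total

# Hypothetical alignment of four taxa, three sites.
alignment = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CAT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
score_1 = parsimony_score(tree_1, alignment)
score_2 = parsimony_score(tree_2, alignment)
print(score_1, score_2)
```

For this toy data the first topology needs 2 changes and the second needs 4, so MP would prefer the first. A full MP analysis repeats this scoring over many candidate topologies.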
Established tree-building methods can be compared on several important criteria: computational speed, consistency of the estimated topology, statistical consistency of phylogenetic trees, probability of obtaining the correct topology, and reliability of estimated branch lengths. The computational speed of a tree-building method depends on the algorithm it uses. By this criterion, the neighbor-joining (NJ) method is superior to the other tree-building methods currently in use. It can handle a large number of sequences, together with bootstrap tests, with ease, whereas the MP, minimum evolution (ME), and ML methods examine all possible topologies in searching for the MP, ME, and ML trees, respectively. The number of possible topologies increases sharply with the number of input sequences, so these methods become hard to use when the number of sequences is high. Simplified algorithms for these methods are therefore desirable. For ME, simplified algorithms have been developed that obtain the correct tree in an acceptable time, and for MP the branch-and-bound method is often used when the number of sequences is relatively high. The algorithm suggested by Rzhetsky & Nei in the 1990s may also be used to determine trees rapidly. A tree-building method is considered a "consistent estimator" if it tends to give the correct topology as the amount of sequence data examined tends to infinity. If the distances are estimated without bias, the NJ and ME methods are consistent in estimating trees, whereas MP is often inconsistent. ML methods, on the other hand, have the additional advantage of being more flexible in the choice of evolutionary model, but they are lengthy and time consuming.
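How sharply the number of topologies grows can be made concrete with the standard counting formulas for strictly bifurcating trees: (2n - 5)!! unrooted and (2n - 3)!! rooted topologies for n taxa. The short sketch below evaluates them; the function names are chosen here for illustration.

```python
def double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1 or 2."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def count_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating topologies: (2n - 5)!! for n >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa for an unrooted tree")
    return double_factorial(2 * n_taxa - 5)

def count_rooted_trees(n_taxa):
    """Number of rooted bifurcating topologies: (2n - 3)!! for n >= 2."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa for a rooted tree")
    return double_factorial(2 * n_taxa - 3)

for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

With only 10 sequences there are already 2,027,025 unrooted topologies, which is why exhaustive searches quickly become impractical and heuristic or branch-and-bound searches are used instead.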

Dimension of Evolution: Evolutionary Time
The time taken for a group of proteins or DNA sequences to evolve from a common ancestor is called the evolutionary time. The number of changes that occurred during evolution can be estimated by phylogenetic analysis methods, which estimate the number of changes, i.e., the number of mutations, in a protein sequence. Multiple sequence alignment is the first step, so the estimate is based on distance scores or sequence similarity scores.
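A distance score of this kind can be sketched as follows for a pair of aligned nucleotide sequences. The example first computes the observed proportion of differing sites (the p-distance) and then applies the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), to account for multiple substitutions at the same site. The choice of the Jukes-Cantor model, the skipping of gapped columns, and the two short sequences are assumptions made for illustration; protein data would need a different model.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring columns that contain a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance (substitutions per site)."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Hypothetical aligned sequences differing at one of ten sites.
p = p_distance("ACGTACGTAC", "ACGTACGAAC")
d = jukes_cantor_distance(p)
print(p, round(d, 4))
```

The corrected distance is always at least as large as the observed proportion of differences, because some substitutions are hidden by later changes at the same site.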

Molecular Clock Hypothesis [7]-[9]
In the 1960s, Zuckerkandl & Pauling proposed the molecular clock hypothesis, which changed the concepts of modern evolutionary biology. It proposes that genes and gene products evolve at rates that are roughly constant over time and across evolutionary lineages, and it gives an idea of the time scales of natural events even in the absence of fossil evidence. Under the hypothesis, nucleic acids and proteins evolve at rates that are constant over time; this evolution relates to mutations that an organism passes on to the next generation without loss of function, i.e., mutations that are not lethal.
The molecular clock simply aims to relate the number of mutations in a given protein to the time it has taken to evolve, on the assumption that rates of evolution are constant, i.e., mutations occur at the same rate in all branches and at all positions along the sequence. A protein that keeps time well as a molecular clock is alpha globin, although at the structural level this clock does not tick without variation.
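Under a strict clock, the arithmetic is simple: two lineages that diverged T time units ago, each accumulating substitutions at rate r per site, are separated by a distance d = 2rT. The sketch below calibrates r from one pair with a known divergence date and then dates a second pair. All numbers are hypothetical and the function names are chosen for this example.

```python
def calibrate_rate(distance, known_divergence_time):
    """Rate per site per unit time per lineage, from d = 2 * r * T."""
    return distance / (2.0 * known_divergence_time)

def estimate_divergence_time(distance, rate):
    """Divergence time from d = 2 * r * T, assuming a strict clock."""
    return distance / (2.0 * rate)

# Hypothetical calibration: 0.20 substitutions per site between two taxa
# whose split is dated at 100 million years (Myr) from independent evidence.
rate = calibrate_rate(0.20, 100.0)

# Hypothetical query pair separated by 0.05 substitutions per site.
time_myr = estimate_divergence_time(0.05, rate)
print(rate, time_myr)
```

Here the calibrated rate is about 0.001 substitutions per site per Myr per lineage, and the query pair is dated at about 25 Myr. The estimate is only as good as the assumption that the calibration rate applies to the query lineages.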

Divergence of Molecular Clock Hypothesis
The difference in the rate of molecular evolution among lineages is only one of the potential problems faced by the evolutionary biologist interested in using molecular clocks to date divergence events. All molecular clocks must be calibrated using independent evidence, such as dates of speciation events inferred from the fossil record or dates estimated for particular biogeographic events. Attempts to estimate divergence times are obviously simpler when the taxa in question share a similar rate of molecular evolution. In practice, however, researchers are often faced with rate variation among lineages. A number of methods are available to address this problem. Methods such as the linearized tree method [10] [11] and the quartet method [12] estimate divergence times after removing the non-clock-like subsets of the data. These methods have been used in diverse cases such as avian biogeography [13], molecular evolution [14], and mammalian diversification [15] [16]. The quartet method identifies pairs of taxa that have good fossil data with which the absolute rate of molecular evolution between the pair can be calibrated. These pairs can in turn be assembled into quartets consisting of two pairs of taxa, each of which has a known fossil date of divergence. The problem of handling non-clock-like data was addressed by two further methods [16] [17], namely nonparametric rate smoothing (NPRS) and penalized likelihood. These two are distinct from the previous methods because, rather than throwing out non-clock-like data, they estimate local rates, i.e., rates for specific branches or clades. This is possible because they place constraints on how the rate of molecular evolution can vary among lineages.

Evolution of Function [18]-[22]
Natural selection of advantageous mutations, i.e., positive selection, is an exciting field for evolutionary biologists, because adaptive changes in genes are ultimately responsible for evolutionary innovation. The study of natural selection has therefore become a powerful approach for molecular biologists, biochemists, and virologists seeking to understand the functions of new genes. Studies using phylogenetic approaches have identified a number of genes under positive selection, especially genes involved in host-pathogen interactions. A remarkable study published in PNAS used phylogenetic sequence comparison to identify a small segment of the primate TRIM5α protein as being under positive selection, and functional analysis using mutagenesis confirmed the importance of the segment in species-specific retroviral inhibition.
So we can say that information about protein sequences of ancestral organisms is important for identifying critical amino acid substitutions that have caused the functional change of proteins in evolution.

Ancestral Sequences Prediction [23]-[25]
The prediction of ancestral protein sequences from multiple sequence alignments is useful for many bioinformatics analyses. It is not a simple procedure, and it depends on the accuracy of the alignment and on a proper phylogenetic analysis. Several algorithms exist based on Maximum Parsimony or Maximum Likelihood methods, but many current implementations are unable to process positions with gaps, which may represent insertion/deletion (INDEL) events or sequence fragments. Many evolutionary sequence analyses require such predictions in order to map substitutions to a particular lineage. In other situations, the predicted ancestral sequence alone may provide a more representative functional sequence than a simple consensus sequence constructed from an alignment. Strict consensus methods are quick but can suffer from overrepresentation of larger clades of related sequences, which contribute more sequences to the consensus than more sparsely populated clades. The Maximum Parsimony (MP) method overcomes this problem by minimising mutational steps rather than maximising agreement with the terminal sequences. MP, however, cannot distinguish between several equally parsimonious predictions. More sophisticated likelihood-based methods can give probabilities for different ancestral sequences, and implementations such as CODEML and FASTML provide a good balance between speed and accuracy.
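The overrepresentation problem of consensus methods can be seen in a toy example. The sketch below builds a simple column-wise majority consensus, which is used here as a stand-in for consensus methods in general; the five short protein sequences and their assignment to three clades are hypothetical.

```python
from collections import Counter

def majority_consensus(sequences):
    """Most frequent residue in each alignment column."""
    consensus = []
    for column in zip(*sequences):
        residue, _count = Counter(column).most_common(1)[0]
        consensus.append(residue)
    return "".join(consensus)

# Hypothetical alignment: one densely sampled clade (three sequences with K
# at position 2) and two sparsely sampled clades (one sequence each, with R).
all_sequences = ["MKVL", "MKVL", "MKVL", "MRVL", "MRVL"]
one_per_clade = ["MKVL", "MRVL", "MRVL"]

print(majority_consensus(all_sequences))
print(majority_consensus(one_per_clade))
```

With every sequence counted, the consensus follows the densely sampled clade (K at position 2); with one representative per clade it follows the majority of clades (R). Tree-aware methods such as MP or ML avoid this dependence on sampling density.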

Amino Acid Sites under Positive Selection: Prediction of Adaptive Evolution [26]-[29]
Researchers in molecular evolutionary genomics are increasingly interested in detecting positive selection on protein-coding DNA sequences. Nucleotide substitutions in protein-coding genes can be either synonymous, i.e., silent substitutions in which the amino acid remains the same, or non-synonymous, in which the amino acid changes. Most non-synonymous changes would usually be expected to be eliminated by purifying selection, but under certain conditions Darwinian selection may allow their retention. Estimating synonymous and non-synonymous substitution rates is important for revealing the dynamics of molecular evolution. In parsimony methods, substitutions are determined using parsimony reconstruction of ancestral sequences, and an excess of non-synonymous substitutions is tested independently for each site. The parsimony approaches differ in detail: one first estimates the average ratio of the non-synonymous rate (dN) to the synonymous rate (dS), i.e., dN/dS, along the sequence and then compares the non-synonymous/synonymous rate ratio at each site against this average. The likelihood method is a two-step procedure in which a likelihood ratio test for positive selection is first carried out on the whole gene. If this test gives statistical evidence for a proportion of sites evolving under positive selection, identification of putative positively selected sites can then proceed. The likelihood methods are implemented in the PAML package.
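The distinction between the two kinds of substitution can be sketched with the standard genetic code. The example below classifies codon differences between two aligned coding sequences as synonymous or non-synonymous. It only counts differing codons; it is not an estimator of dN or dS, which additionally requires counting synonymous and non-synonymous sites and correcting for multiple substitutions. The two nine-base sequences and the function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def classify_codon_change(codon_before, codon_after):
    """Label a codon pair as identical, synonymous or non-synonymous."""
    if codon_before == codon_after:
        return "identical"
    if CODON_TABLE[codon_before] == CODON_TABLE[codon_after]:
        return "synonymous"
    return "non-synonymous"

def count_codon_changes(seq_a, seq_b):
    """Count differing codons of each kind between two aligned coding sequences."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must have equal length divisible by 3")
    counts = {"synonymous": 0, "non-synonymous": 0}
    for pos in range(0, len(seq_a), 3):
        kind = classify_codon_change(seq_a[pos:pos + 3], seq_b[pos:pos + 3])
        if kind != "identical":
            counts[kind] += 1
    return counts

# Hypothetical sequences: ATG CTT GAA (M L E) versus ATG CTC GAT (M L D).
counts = count_codon_changes("ATGCTTGAA", "ATGCTCGAT")
print(counts)
```

In this example CTT to CTC leaves leucine unchanged (synonymous), whereas GAA to GAT replaces glutamate with aspartate (non-synonymous).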

Simulation of Molecular Evolution [30]-[41]
What is the origin of life? This remains a highly debated question, and computer simulation has come to play an important role in addressing it. The idea is that there was once a prehistoric stage in which RNA carried both the genetic function and the catalytic function, named "the RNA World". A question still remains, however: how did the RNA World arise? A relatively direct and simple view is that the RNA World originated de novo from the non-living world in several stages: stage 1, prebiotic synthesis of nucleotides; stage 2, prebiotic formation of polynucleotides from the nucleotides; stage 3, emergence of special RNA molecules catalyzing their own replication, the primordial "RNA replicases"; stage 4, evolution of the primordial replicases towards more efficient ones; stage 5, emergence and evolution of other catalytic RNA molecules favoring replication or existence against the background of natural selection. However, experimental evidence in this field remains at the level of individual stages, i.e., mineral-catalyzed synthesis of polynucleotides and non-enzymatic template-directed ligation of oligoribonucleotides or polymerization; RNA-catalyzed template-directed ligation or polymerization and recreation of RNA replicases via in vitro directed molecular evolution; and artificial construction of an auto-evolving replicase system. Researchers seem to have outlined all the basic reaction mechanisms of these stages, but it is not certain whether the stages could happen as a continuous and integrated process. This is the point at which computer simulation provides assistance. Monte Carlo simulation is a kind of computer simulation that mimics random events in reality by determining their relative probabilities according to definite rules. For instance, the scenario concerning the genesis of the widely accepted RNA World remains blurry, although some circumstantial evidence and fragmentary knowledge have been gathered on several supposed stages, including formation of polynucleotides from a prebiotic nucleotide pool, emergence of RNA replicases (RNA molecules catalyzing their own replication), and evolution of RNA replicases. It is highly valuable to simulate the stages as a continuous process in order to evaluate the plausibility of the supposition and to study the rules involved.
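The flavour of such a simulation can be conveyed by a deliberately simplified Monte Carlo sketch of one stage only, the formation of polynucleotides from a nucleotide pool. At each step a random event is chosen according to relative weights: either two chains are ligated or one chain is cleaved. The event rules, the weights, the pool size, and the function name are all assumptions made for illustration and do not reproduce any published model.

```python
import random

def simulate_chain_growth(n_monomers=100, steps=2000,
                          ligation_weight=3.0, cleavage_weight=1.0, seed=1):
    """Return the list of chain lengths after a number of random events."""
    rng = random.Random(seed)          # seeded for reproducibility
    chains = [1] * n_monomers          # start from free nucleotides
    p_ligation = ligation_weight / (ligation_weight + cleavage_weight)
    for _ in range(steps):
        if rng.random() < p_ligation:
            # Ligation: join two randomly chosen chains into one.
            if len(chains) < 2:
                continue
            i, j = sorted(rng.sample(range(len(chains)), 2))
            joined = chains[i] + chains[j]
            chains.pop(j)              # remove the higher index first
            chains.pop(i)
            chains.append(joined)
        else:
            # Cleavage: break one randomly chosen chain at a random bond.
            breakable = [k for k, length in enumerate(chains) if length > 1]
            if not breakable:
                continue
            k = rng.choice(breakable)
            length = chains.pop(k)
            cut = rng.randint(1, length - 1)
            chains.extend([cut, length - cut])
    return chains

chains = simulate_chain_growth()
print(len(chains), max(chains))
```

The total number of nucleotides is conserved throughout, and changing the relative weights shifts the balance between long and short chains. A realistic model would add further event types, such as template-directed replication, on top of this skeleton.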

Modern Trends in Phylogenetics [42]-[57]
With third-generation sequencing technology rapidly approaching, it will become more feasible to obtain large multi-locus data sets to infer evolutionary relationships (Genome 10K Community of Scientists 2009). These enormous quantities of data have spawned the development of several new programs for phylogenetic inference on such highly heterogeneous data sets. From multiple sequence alignment (MSA) to species tree construction, these new methods are changing the way we gather and manipulate data and analyze and interpret results. Following the construction of an MSA in the traditional two-step MSA and phylogeny estimation procedure, the researcher is left with the decision of how to handle the gaps inserted into the data set by the MSA algorithm to account for INDEL events. For most traditional maximum parsimony (MP) analyses, gaps have been either coded as missing data (most cases) or coded as a fifth character state. Both of these methods are potentially problematic: the former completely discards relevant evolutionary information, whereas the latter assumes that gaps represent independent evolutionary events, a highly unlikely scenario. These issues also extend to probabilistic phylogenetic inference, in that parameters have been estimated without taking indel events into account. An alternative to constructing an MSA prior to phylogenetic inference is to use direct optimization (DO) procedures. DO differs from other approaches in that the alignment and the phylogenetic tree are estimated simultaneously. Optimization can be performed either under parsimony or under a probabilistic framework. The program POY, for example, estimates both the phylogenetic tree and the best alignment based on the MP criterion. Previous versions of POY were also able to implement DO in a likelihood framework. Newer programs such as StatAlign, BAli-Phy, and BEAST incorporate models of sequence evolution to estimate the posterior distribution of a set of trees and alignments based on Bayesian inference (BI). The BAli-Phy software shows exceptional promise in that its models allow for nested or overlapping indel events, whereas other methods use the more common TKF1 and TKF2 indel models. However, joint estimation of alignment and phylogeny in a probabilistic framework is currently computationally intensive and feasible only with smaller data sets. These methods also fit a single model to the data, which may not be justified with multi-locus data sets. As multi-locus data sets become the norm across laboratories, some of the most commonly employed techniques for both MSA and tree reconstruction will no longer be adequate for generating phylogenetic hypotheses. Instead, alternative and more sophisticated search algorithms are required in order to fully exploit the information contained in these large quantities of data. As highly heterogeneous data sets become available, testing the accuracy of both modern alignment algorithms and DO methods through simulation will become even more important. For traditional phylogenetic inference, MP analysis will no doubt continue to play a role. In this regard, TNT (Tree Analysis Using New Technology) is showing promise for dealing with difficult phylogenetic problems. Furthermore, model-based concatenation methods using mixture models in BayesPhylogenies seem promising for multi-locus data sets. However, there have been few simulations to quantify the accuracy of the model compared with other methods, including direct species tree inference.
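The effect of the two traditional gap codings can be seen by counting differences between two aligned sequences under each scheme. The sketch below is a minimal illustration with hypothetical sequences and a hypothetical function name; real parsimony programs apply the coding per character across a whole tree rather than to a single pair.

```python
def count_differences(seq_a, seq_b, gap_mode="missing"):
    """Count differing columns, treating '-' as missing data or as a fifth state."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if gap_mode not in ("missing", "fifth_state"):
        raise ValueError("gap_mode must be 'missing' or 'fifth_state'")
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if gap_mode == "missing" and "-" in (a, b):
            continue                  # gapped columns carry no information
        if a != b:
            differences += 1          # under 'fifth_state' a gap is a state
    return differences

# Hypothetical pair: one three-base deletion plus one substitution.
seq_1 = "ACGTTTGCA"
seq_2 = "ACG---GCT"
as_missing = count_differences(seq_1, seq_2, gap_mode="missing")
as_fifth_state = count_differences(seq_1, seq_2, gap_mode="fifth_state")
print(as_missing, as_fifth_state)
```

Coding gaps as missing data gives 1 difference and ignores the deletion entirely, whereas coding them as a fifth state gives 4 and counts what was probably a single three-base deletion as three independent events. Neither matches the single indel event plus one substitution that most plausibly occurred.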