A Brief Review of the Common Tree-building Methods Used in Phylogenetic Inference.

Patricia Strickler

Phylogenetic inference can be defined as the process of determining the estimated evolutionary history by analysis of a given data set (Swofford 1996). Increasingly, molecular data sets, such as DNA and protein sequences, are used to develop these phylogenies. The reasons for building a phylogenetic tree are as diverse as the methods used to produce the trees. The process of phylogenetic analysis can be summarized in five steps. The first two steps are preparatory for the subsequent steps that involve tree-building and evaluation of the resultant tree. The first step is the alignment of either the nucleotide or the amino acid sequences for the taxa of interest. It is generally agreed upon that amino acid sequences produce a tree closest to the true tree (Felsenstein 1996, Hershkovitz 1998, Russo, 1996). This is due to the higher rate of conservation of amino acid sequences and protein structure (Felsenstein 1996). Manual alignment editing is recommended over fully computational multiple alignment as the algorithms and programs are not yet optimal for phylogenetic alignment (Hershkovitz 1998). Regardless of the method for alignment, the final alignment should be carefully scrutinized with any independent phylogenetic evidence and other assumptions of structure and function. Once one proceeds to tree-building, the computer generated alignment will be blind to any errors in alignment.

The second step will be to determine the presence of a phylogenetic signal. Most of the sequence analyses fall between two extremes: identical sequences and sequences which have become so divergent as to become randomized in relation to the phylogenetic history. The former case dictates no further analysis. The latter will result in an inferred phylogeny, though the randomness of the resulting phylogeny may not be worth the effort. Those sequence alignments that fall in between will have a mixture of conserved and random positions and will be the most useful in phylogenetic inference (Hillis 1993). Once the alignment is complete, the next steps in phylogenetic inference are to decide the most appropriate tree-building method for a specified data set followed by choosing a strategy to find the best tree under the selected optimality criterion. Finally, the tree obtained must be scrutinized to determine the level of confidence that can be placed on the results (Hillis 1993).

One of the most complex issues faced during the process of phylogenetic inference is choosing the tree- building method. Tree-building methods can be classified in two ways (Swofford 1996, Hershkovitz 1998). The first way to classify these methods is to define them as either algorithm-based or criterion-based. Though the procedure involved in each of these methods is different, the same algorithm could potentially be used in either method. An algorithm-based method generates a tree by following a series of steps, whereas criterion-based methods define an optimality criterion for comparing alternative phylogenies to one another and deciding which one is better. Therefore, there is a big advantage when working with criteria-based methods because scores are assigned to every examined tree and can be used to rank the resultant phylogenies in order of preference. This provides the user with immediate knowledge about the strength of support for that tree. Strictly-algorithmic methods are computationally much faster than the criteria-based methods because they do not require evaluation of a large number of competing trees. Due to the large number of possible solutions, criteria-based methods do not produce exact results for data sets with more than 8-20 taxa.

Alternatively, tree-building methods can be classified into distance-based versus character-based methods. A distance-based method computes pairwise distances according to some measure. Then, the actual data is discarded and the fixed distances are used in the derivation of trees. Trees derived by way of a character-based method have been optimized according to the distribution of actual data patterns in relation to a specified character.

Cluster analysis and neighbor joining are examples of methods defined solely on the basis of an algorithm or of methods that are unable to separate the task of finding an optimal tree from that of evaluating a specific tree, unlike the criteria-based methods. Cluster analysis constructs trees by linking the least distant pairs of taxa, followed by successively more distant taxa, or groups of taxa. Once two taxa are linked, they lose their individual identities and are subsequently referred to as a single cluster. Neighbor joining is related to this traditional cluster analysis except it removes the assumption that data are ultrameric. The tree is constructed by linking the least distant pairs of nodes as defined by a modified matrix. The modified distance matrix is constructed by adjusting the separation between each pair of nodes on the basis of the average divergence from all other nodes (Swofford 1996, Hershkovitz 1998).

There are three common types of optimality criteria that will be briefly discussed including parsimony, likelihood, and pairwise distance. Both maximum parsimony and maximum likelihood use the original data set for inference. Maximum parsimony falls under the philosophy of " the simpler hypotheses are preferable to the complicated ones" (Swofford 1996). It works in such a way as to choose from the alternative trees, the one with the fewest character-state transformations, thus minimizing homoplasy (e.g. convergence, reversal). Thus, this optimality criterion operates by selecting trees that minimize the total tree length. This method tends to yield numerous trees with the same score, which is not characteristic of other methods such as distance or maximum likelihood. Parsimony is less dependent upon assumptions about the sequence evolution than some other methods and amenable to weighting in order to accommodate any substitution bias. The drawbacks for parsimony include slowed computation with weighting and poor performance when there is substantial among-site rate heterogeneity. Though there are several modifications that can correct for heterogeneity, such as modifying the data set or reweighting positions according to their propensity to change (successive approximation), this could potentially lead to errors in a prior step if the preliminary tree contains any errors.

An area of trouble using maximum parsimony is referred to as the Felsenstein zone. This zone is created when there exists strongly unequal rates of change along different branches of a tree or even with equal rates of change in cases of long-branch attraction. Long-branch refers to a lineage that evolved so much between nodes in the phylogeny that its character states have been effectively randomized with respect to the other taxa. Once in the Felsenstein zone misleading inferences are produced. At the ends of these long branches, character states are exhibited that no longer retain genealogical information leading to a distortion in the inference. Particularly deceptive are taxa on long branches that have converged on character states present in other taxa within the analysis. This appears as a false phylogenetic signal, obscuring the true signal (Lyons-Weiler 1997). There are several ways to fix this problem including weighted parsimony, the use of Relative Apparent Synapomorphy Analysis (RASA), or using maximum likelihood that incorporates models of evolutionary change (Swofford 1996, Lyons-Weiler1997).

Parsimony once took the lead as the most favored method, however, maximum likelihood appears to be replacing parsimony, particularly as this method becomes better defined. The critical difference between these two methods is that parsimony minimizes the amount of evolutionary change required for data explanation, while maximum likelihood attempts to estimate the actual amount of change according to the evolutionary model in place. Maximum likelihood works with a prior nucleotide substitution model to compute a likelihood score for each tree given the original data. Before beginning, either an evolutionary model must be specified that can account for the conversion of one sequence into another or parameters must be selected that can be estimated from the data,. Then the maximum likelihood approach evaluates the probability that the selected evolutionary model will have generated the observed sequences. The trees yielding the highest likelihoods are used to infer phylogeny. The substitution model should be optimized to fit the observed data as modifying the substitution parameters modify the likelihood of the data associated with particular trees. The greatest drawback to using maximum likelihood is the vast amount of computation time required (Swofford 1996, Hershkovitz 1998).

Maximum likelihood is not always available for use. An alternative method that also minimizes the impact of the underestimation problem present in parsimony is the pairwise distance method. It works on the idea that corrected distances to account for superimposed changes can be obtained by estimating the number of unseen events using the same models used with maximum likelihood. The corrected distances are estimates of the true evolutionary distance. The drawback for this method is the loss of data during the process (Swofford 1996).

Because many of the more complex models require an enormous amount of time to complete the computations, heuristic methods are often selected in place of the alternative search methods (Swofford 1996, Lewis 1998, Mau 1999). A heuristic method does not guarantee finding the optimal solution, but does provide a large increase in speed. Such is the case with maximum likelihood. Likelihood has several advantages including consistency, lower variance in estimation, and its robustness to violations of its assumptions, so many attempts have been made to incorporate maximum likelihood into a heuristic method that will simultaneously optimize the substitution model and the tree for a give data set.

In applying any of these methods, there are two key ideas that have been emphasized in the literature throughout the development of phylogenetics. The importance of the starting data cannot be stressed enough. The type of data not only determines which method to choose for analysis, but also determines the validity of the results obtained from a selected method (Cavilla-Sforza 1967, Swofford 1996, Hershkovitz 1998). It follows, then, that the validity of the resulting tree is dependent upon the appropriateness of the model used in tree generation. Phylogenetic inference methods are under continual evaluation and improvement upon the accuracy and speed of computation in these methods is a continual process. In several cases, it has been determined that trees obtained from the simpler methods produce as good results as those obtained by more sophisticated methods. Though this may be an accurate assessment, the more complex methods have the advantage of producing several alternatives which allow for the different topologies to be evaluated for statistical significance (Russo 1996, Nei 1998). There is no agreement, yet, on which method is the best method, and more than likely there never will be as the best method is dependent upon the type of data with which one begins. As long as the method has a theoretical justification, then it is more important to choose a good gene or a large number of amino acids than it is to choose a particular tree-building method (Russo 1996).

Literature Cited

Cavilli-Sforza, L.L. and Edwards, A.W.F, 1967. Phylogenetic analysis: models and estimation procedures. Evol. 21: 550-570.

Felsenstein, J. (1996) Inferring phylogenies from protein sequences by parsimony, distance, and likelihood methods. Meth. in Enzym. 266: 419-427.

Hershkovitz, M.A. and Leipe, D.D. in "Bioinformatics: A practical guide to the analysis of genes and proteins." (Baxevanis, A.D. and Ouellette, B.F.F. eds.). Wiley Interscience, New York, 1998.

Hillis, D.M., Allard, M.W., Miyamoto, M.M. (1993) Analysis of DNA sequence data: phylogenetic inference. Meth. in Enzym. 224: 456-487.

Lewis, P.O. 1998. A genetic algorithm for maximum-likelihood phylogeny inference using nucleotide sequence data. Mol. Biol. Evol. 15(3): 277-283.

Lyons-Weiler, J. and Hoelzer, G.A. 1997. Escaping from the Felsenstein Zone by detecting long branches in phylogenetic data. Mol. Phyl. Evol. 8(3): 375-384.

Mau, B., Newton, M.A., Larget, B. 1999. Bayesian phylogenetic inference via Markov Chain Monte Carlo methods. Biometrics. 55: 1-12.

Nei, M., Kumar, S., Takahashi, K. (1998) The optimization principle in phylogenetic analysis tends to give incorrect topologies when the number of nucleotides or amino acids used is small. Proc. Natl. Acad. Sci. USA. 95: 1239-12397.

Russo, C.A.M., Takezaki, N., and Nei, M. (1996) Efficiencies of different genes and different tree-building methods in recovering a known vertebrate phylogeny. Mol. Biol. Evol. 13(3): 525-536.

 

Swofford, D.J. and G.J. Olsen, G.J., in "Molecular Systematics" (D.M. Hillis and C. Moritz, eds.), Sinauer Associates, Dunderland, Massachusetts, 1996.