Distance-based Phylogeny Reconstruction (Optimal Radius)
Abstract
A phylogeny is an evolutionary tree tracing the shared history, including common ancestors, of a set of extant taxa. Phylogenies have been historically reconstructed using character-based (parsimony) methods, but in recent years the advent of DNA sequencing, along with the development of large databases of molecular data, has led to more involved methods. Sophisticated techniques such as likelihood and Bayesian methods are used to estimate phylogenies with sound statistical justifications. However, these statistical techniques suffer from the discrete nature of tree topology space. Since the number of tree topologies increases exponentially as a function of the number of taxa, and each topology requires separate likelihood calculation, it is important to restrict the search space and to design efficient heuristics. Distance methods for phylogeny reconstruction serve this purpose by inferring trees in a fraction of the time required for the more statistically rigorous methods. They allow dealing with thousands of taxa, while the current implementations of statistical approaches are limited to a few hundreds, and distance methods also provide fairly accurate starting trees to be further refined by more sophisticated methods. Moreover, the input of distance methods is the matrix of pairwise evolutionary distances among taxa, which are estimated by maximum likelihood, so that distance methods also have sound statistical justifications.
Mathematically, a phylogenetic tree is a triple T =(V,E,l) where V is the set of nodes representing extant taxa and ancestral species, E is the set of edges (branches), and l is a function that assigns positive lengths to each edge in E. Evolution proceeds through the tree structure as a stochastic process with a finite state space corresponding to the DNA bases or amino acids present in the DNA or protein sequences, respectively.
Any phylogenetic tree T defines a metric DT on its leaf set L(T) : let PT (u,v) define the unique path through T from u to v, then the distance from u to v is set to DT(u,v)= ∑l(e). e∈P (u,v)
Distance methods for phylogeny reconstruction rely on the observation [13] that the map T → DT is reversible; i.e., a tree T can be reconstructed from its tree metric. While in practice DT is not known, by using models of evolution (e.g. [10], reviewed in [5]) one can use molecular sequence data to estimate a distance matrix D that approximates DT . As the amount of sequence data increases, the consistency of the various models of sequence evolution implies that D should converge to DT . Thus for a distance method to be consistent, it is necessary that for any tree T, and for distance matrices D “close enough” to DT , the algorithm will output T.
The present chapter deals with the question of when any distance algorithm for phylogeny reconstruction can be guaranteed to output the correct phylogeny as a function of the divergence between the metric underlying the true phylogeny and the metric estimated from the data. Atteson [1] demonstrated that this consistency can be shown for Neighbor Joining (NJ) [11], the most popular distance method, and a number of NJ’s variants.
Origin | Files produced by the author(s) |
---|
Loading...