Phylogenetic networks: what can we reconstruct?
Abstract
Phylogenies are used to describe the history of evolutionarily related biological entities (e.g. genes, individuals, species) and are central in many biological applications, including functional genomics, epidemiology and biodiversity assessment. Many methods for reconstructing and studying phylogenies have been proposed, almost all of which use trees to represent them. Although trees are a reasonable representation in many cases, in many others phylogenies should be represented as networks (more precisely, directed acyclic graphs). The need for networks arises from a number of biological phenomena collectively known as reticulation events, whereby a species or a gene inherits genetic material from more than one parent organism. Such events include hybridization (e.g. in plants), horizontal gene transfer (e.g. in bacteria) and recombination (e.g. in viruses or in genomes of sexually reproducing species).
Network inference methods are in their infancy, but they are almost invariably based on the following idea: the goodness of a candidate network is evaluated on the basis of how well the subtrees it contains fit the data. This poses a problem: different networks may contain exactly the same set of subtrees (up to isomorphism), meaning that these networks will be considered "indistinguishable" by most network inference methods, no matter the input data. We propose a novel definition of what constitutes a "uniquely reconstructible" network: for each class of indistinguishable networks, we define a canonical form. Under mild assumptions, the canonical form is unique. Given data coming from any phylogenetic network, only its canonical equivalent can be uniquely reconstructed. This is a fundamental limitation that implies a drastic reduction of the solution space in phylogenetic network inference.
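As a small illustration of the indistinguishability problem, the following Python sketch enumerates the tree topologies contained in a toy rooted network and shows that a network with one reticulation can contain exactly the same trees as a plain tree. It is a minimal sketch under stated assumptions, not the method of the paper: it assumes that the subtrees a network contains are obtained by keeping exactly one incoming edge for every reticulation node, it compares topologies only (branch lengths are ignored), and the function names and toy networks are hypothetical.

```python
from itertools import product


def displayed_trees(edges, root):
    """Return the set of tree topologies contained in a rooted network.

    edges: list of (parent, child) pairs forming a directed acyclic graph.
    Each topology is encoded as a nested frozenset of leaf labels.
    """
    parents = {}
    for p, c in edges:
        parents.setdefault(c, []).append(p)
    # Leaves of the network: nodes that never appear as a parent.
    leaves = {c for _, c in edges} - {p for p, _ in edges}
    nodes = sorted(parents)  # every node except the root
    trees = set()
    # Assumption: one tree per choice of a single incoming edge for each node
    # (only reticulation nodes offer more than one option).
    for choice in product(*(parents[n] for n in nodes)):
        children = {}
        for n, p in zip(nodes, choice):
            children.setdefault(p, []).append(n)
        shape = _shape(root, children, leaves)
        if shape is not None:
            trees.add(shape)
    return trees


def _shape(node, children, leaves):
    """Canonical form of the subtree below node, or None if it has no leaf."""
    if node in leaves:
        return node
    subs = [_shape(c, children, leaves) for c in children.get(node, [])]
    subs = [s for s in subs if s is not None]
    if not subs:
        return None  # dead end left behind by a discarded reticulation edge
    if len(subs) == 1:
        return subs[0]  # suppress a node with a single remaining child
    return frozenset(subs)


# Hypothetical toy inputs over the leaves a, b, c.
tree = [("r", "u"), ("r", "c"), ("u", "a"), ("u", "b")]
# Reticulation h has parents u and v, both inside the clade of a and b.
hidden = [("r", "u"), ("r", "c"), ("u", "v"), ("u", "h"),
          ("v", "h"), ("v", "a"), ("h", "b")]
# Reticulation h has parents u (next to a) and w (next to c).
visible = [("r", "u"), ("r", "w"), ("w", "c"), ("w", "h"),
           ("u", "a"), ("u", "h"), ("h", "b")]

same = displayed_trees(tree, "r") == displayed_trees(hidden, "r")
different = displayed_trees(tree, "r") != displayed_trees(visible, "r")
```

Under these assumptions, `tree` and `hidden` contain the same single topology, so any criterion that scores a network only through the trees it contains cannot tell them apart, whereas `visible` contains a second topology and can in principle be distinguished from both.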