Identifiability of phylogenetic networks
Abstract
Phylogenies are almost always represented as trees. This is often reasonable, but in many cases a phylogeny is better represented as a network (more precisely, a directed acyclic graph). The reason is a group of biological phenomena collectively known as reticulation events, in which a species or a gene inherits genetic material from more than one parent organism; examples include hybrid speciation, introgression, horizontal gene transfer, and recombination. Methods for inferring phylogenetic networks are still in their infancy, but nearly all of them rest on the same idea: a candidate network is scored by how well the trees it contains fit the data. This poses a problem: different networks may contain exactly the same set of trees, so most network inference methods will treat them as "indistinguishable", whatever the input data. We propose a novel definition of what constitutes a "uniquely reconstructible" network: for each class of indistinguishable networks, we define a canonical form. Under mild assumptions, the canonical form is unique. Given data generated by any phylogenetic network, only its canonical equivalent can be uniquely reconstructed. This is a fundamental limitation that implies a drastic reduction of the solution space in phylogenetic network inference.
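The claim that different networks may contain the same set of trees can be illustrated with a small sketch. It assumes a simplified, topology-only notion of "the trees a network contains": for each reticulation node, keep one incoming edge, drop the others, and suppress the resulting unary nodes. Branch lengths are ignored, so this may not coincide with the formal definition used in the paper. The function `displayed_trees`, the child-list encoding, and the three toy networks are hypothetical and chosen only for illustration.

```python
from itertools import product


def displayed_trees(children, root):
    """Return the set of tree topologies obtained from a rooted network.

    `children` maps each internal node to a list of its children; nodes
    that never appear as keys are treated as labelled leaves. A tree is
    encoded as a nested frozenset of leaf labels (assumed encoding).
    """
    # Build the parent lists so reticulations (in-degree > 1) can be found.
    parents = {}
    for u, kids in children.items():
        for v in kids:
            parents.setdefault(v, []).append(u)
    retics = sorted(v for v, ps in parents.items() if len(ps) > 1)

    trees = set()
    # Each combination keeps exactly one parent edge per reticulation.
    for choice in product(*(parents[r] for r in retics)):
        kept = dict(zip(retics, choice))

        def build(u):
            if not children.get(u):
                return u  # labelled leaf
            # Keep the edge u -> v unless v is a reticulation switched elsewhere.
            kids = [v for v in children[u] if kept.get(v, u) == u]
            subtrees = [t for t in map(build, kids) if t is not None]
            if not subtrees:
                return None  # internal node left without leaves: prune it
            if len(subtrees) == 1:
                return subtrees[0]  # unary node: suppress it
            return frozenset(subtrees)

        trees.add(build(root))
    return trees


# Toy network with one reticulation r whose parents are u and v, with v below u.
triangle = {"root": ["a", "u"], "u": ["v", "r"], "v": ["r", "b"], "r": ["c"]}
# A plain tree on the same leaves.
plain_tree = {"root": ["a", "x"], "x": ["b", "c"]}
# A network whose reticulation b sits between two separate lineages.
hybrid = {"root": ["p", "q"], "p": ["a", "r"], "q": ["r", "c"], "r": ["b"]}

same = displayed_trees(triangle, "root") == displayed_trees(plain_tree, "root")
n_hybrid = len(displayed_trees(hybrid, "root"))
```

In this sketch `triangle` and `plain_tree` are different graphs yet yield the same single topology, so a method that scores networks only through their trees could not tell them apart under this simplified notion. The `hybrid` network yields two distinct topologies and is therefore distinguishable from both.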