Computing Phylo-k-mers

Finding the correct position of new sequences within an established phylogenetic tree is an increasingly relevant problem in evolutionary bioinformatics and metagenomics. Recently, alignment-free approaches for this task have been proposed. One such approach is based on the concept of phylogenetically informative k-mers, or phylo-k-mers for short. In practice, phylo-k-mers are inferred from a set of related reference sequences and are equipped with scores expressing the probability of their appearance in different locations within the input reference phylogeny. Computing phylo-k-mers, however, represents a computational bottleneck to their applicability in real-world problems such as the phylogenetic analysis of metabarcoding reads and the detection of novel recombinant viruses. Here we consider the problem of phylo-k-mer computation: how can we efficiently find all k-mers whose probability lies above a given threshold for a given tree node? We describe and analyze algorithms for this problem, relying on branch-and-bound and divide-and-conquer techniques. We exploit the redundancy of adjacent windows of the alignment to save on computation. Besides computational complexity analyses, we provide an empirical evaluation of the relative performance of their implementations on simulated and real-world data. The divide-and-conquer algorithms are found to surpass the branch-and-bound approach, especially when many phylo-k-mers are found.
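To make the problem statement concrete, here is a minimal Python sketch of the branch-and-bound idea for a single window of k alignment columns. It is not the authors' implementation. It assumes that the probability of a k-mer at a node is the product of independent per-column letter probabilities, a modelling choice the abstract does not spell out. The names `phylo_kmers_branch_and_bound`, `window`, and `best_suffix` are hypothetical, and the numbers in the usage example are made up for illustration.

```python
ALPHABET = "ACGT"


def phylo_kmers_branch_and_bound(window, threshold):
    """Return {k-mer: probability} for every k-mer scoring >= threshold.

    `window` is a list of k dicts, one per column, mapping each letter
    to its probability at that column (assumed independent columns).
    """
    k = len(window)

    # best_suffix[i] is the largest product achievable over columns i..k-1.
    # It gives an upper bound on how much any prefix can still gain.
    best_suffix = [1.0] * (k + 1)
    for i in range(k - 1, -1, -1):
        best_suffix[i] = best_suffix[i + 1] * max(window[i].values())

    results = {}

    def extend(prefix, prob, i):
        if i == k:
            # Every complete k-mer reaching this point passed the bound
            # with best_suffix[k] == 1.0, so prob >= threshold.
            results[prefix] = prob
            return
        for letter in ALPHABET:
            p = prob * window[i][letter]
            # Prune: even the best possible completion stays below threshold.
            if p * best_suffix[i + 1] >= threshold:
                extend(prefix + letter, p, i + 1)

    extend("", 1.0, 0)
    return results


# Usage with an illustrative 3-column window (made-up probabilities).
window = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.05, "C": 0.05, "G": 0.80, "T": 0.10},
]
hits = phylo_kmers_branch_and_bound(window, threshold=0.1)
print(sorted(hits))
```

With these values, only the four k-mers of the form A?G clear the threshold, and every branch starting with C, G, or T is cut at the first column. The sketch treats each window in isolation, so it does not reuse work across adjacent windows as the abstract describes.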

Domains

Bioinformatics [q-bio.QM]

Main file

Computing_phylo_k_mers_TCBB-arxiv.pdf (1.4 MB)

Origin	Files produced by the author(s)
License	HAL authorization

https://hal-lirmm.ccsd.cnrs.fr/lirmm-03778953

Submitted on: Monday, 15 May 2023, 12:22:25

Last modified on: Tuesday, 2 December 2025, 03:18:10

Long-term archiving on: Friday, 31 October 2025, 13:13:17

Dates and versions

lirmm-03778953, version 1 (16-09-2022)

lirmm-03778953, version 2 (15-05-2023)

Identifiers

HAL Id: lirmm-03778953, version 2
arXiv: 2209.09242
DOI: 10.1109/TCBB.2023.3278049
PubMed: 37204943
WOS: 001084646300026

Cite

Nikolai Romashchenko, Benjamin Linard, Eric Rivals, Fabio Pardi. Computing Phylo-k-mers. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2023, 20 (5), pp.2889-2897. ⟨10.1109/TCBB.2023.3278049⟩. ⟨lirmm-03778953v2⟩