RedOak: a reference-free and alignment-free structure for indexing a collection of similar genomes
Abstract
Background: As the cost of DNA sequencing decreases, high-throughput sequencing technologies become increasingly accessible to many laboratories. Consequently, new issues emerge that require new algorithms, including tools for indexing and compressing hundreds to thousands of complete genomes.
Results: This paper presents RedOak, a reference-free and alignment-free software package for indexing a large collection of similar genomes. RedOak can also be applied to reads from unassembled genomes, and it provides a nucleotide sequence query function. The software is based on a k-mer approach and has been designed to be highly parallelized and distributed across several nodes of a cluster. The source code of RedOak is available at https://gitlab.info-ufr.univ-montp2.fr/DoccY/RedOak.
Conclusions: RedOak may be useful for biologists and bioinformaticians who need to extract information from large sequence datasets.
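
To make the k-mer approach mentioned in the Results more concrete, the following is a minimal sketch of a reference-free, alignment-free presence/absence index over a small collection of sequences, together with a simple sequence query. It is an illustration only and not RedOak's actual data structure or API: the function names (`kmers`, `build_index`, `query`), the dictionary-of-sets layout, and the toy sequences are all assumptions, and no compression, parallelism, or distribution is attempted.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(genomes, k):
    """Map each k-mer to the set of genome names in which it occurs.

    `genomes` is a dict {name: nucleotide string}. This layout is a
    hypothetical choice for illustration, not the one used by RedOak.
    """
    index = defaultdict(set)
    for name, seq in genomes.items():
        for kmer in kmers(seq.upper(), k):
            index[kmer].add(name)
    return index


def query(index, names, sequence, k):
    """Return, for each genome name, the fraction of the query's k-mers
    that are present in that genome."""
    query_kmers = list(kmers(sequence.upper(), k))
    hits = {name: 0 for name in names}
    for kmer in query_kmers:
        # .get avoids creating empty entries in the defaultdict
        for name in index.get(kmer, ()):
            hits[name] += 1
    total = len(query_kmers)
    return {name: (count / total if total else 0.0)
            for name, count in hits.items()}


# Toy data (invented for this example)
genomes = {"g1": "ACGTACGT", "g2": "ACGTTTGT"}
index = build_index(genomes, k=4)
result = query(index, genomes.keys(), "ACGTA", k=4)
print(result)
```

In this toy case the query `ACGTA` contains two 4-mers, `ACGT` and `CGTA`; both occur in `g1` and only the first occurs in `g2`, so the reported fractions are 1.0 and 0.5. No reference genome and no alignment step are involved: the query is answered purely from shared k-mers.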