Shortest DNA cyclic cover in compressed space

Bastien Cazaux 1, 2, * Rodrigo Cánovas 1, 2 Eric Rivals 1, 2
* Auteur correspondant
2 MAB - Méthodes et Algorithmes pour la Bioinformatique
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier
Abstract : For a set of input words, finding a superstring (a string containing each word of the set as a substring) of minimal length is hard. Most approximation algorithms solve the Shortest Cyclic Cover problem before merging the cyclic strings into a linear superstring. A cyclic cover is a set of cyclic strings in which the input words occur as a substring. We investigate a variant of the Shortest Cyclic Cover problem for the case of DNA. Because the two strands that compose DNA have a reverse complementary sequence, and because the sequencing process often overlooks the strand of a read, each read or its reverse complement must occur as a substring in a cyclic cover. We exhibit a linear time algorithm based on graphs for solving the Shortest DNA Cyclic Cover problem and propose compressed data structures for storing the underlying graphs. All results and algorithms can be adapted to the case where strings are simply reversed but not complemented (e.g. in pattern recognition). Introduction Computing a shortest superstring for a set of input words is a way to compress a language. Indeed, overlaps between the words allow to save characters. The Shortest Superstring problem is crucial in text compression. Unfortunately, its hardness prevents us from finding, not only an optimal, but even an approximate solution that is arbitrarily near to optimal [5, 1]. Rather than for a single linear superstring, it is thus better to opt for a compressed version that consists in a set of cyclic superstrings, whose total length is minimal. This representation is called a shortest cyclic cover. Indeed, the length of a shortest cyclic cover is smaller than that of a shortest linear superstring. These problems are related to DNA or genome assembly in bioinformatics: A highly redundant set of sequencing reads is given as input and one wishes to reconstruct the DNA sequence. As DNA is made of two complementary and reverse strands, a read can come from any strand, and thus either word or its reverse complement should occur in the superstring/genome. Moreover, chromosomes can be linear or cyclic molecules, and a sequenced sample may contain a mixture of cyclic genomes. We investigate the Shortest DNA Cyclic Cover problem and present a linear time algorithm for it. We also propose a compressed data structure for storing the underlying graphs used to solve the problem. All results described here can be adapted to the case where strings are simply reversed, and applied for instance to pattern matching of 2D shapes [2].
Type de document :
Communication dans un congrès
Data Compression Conference, Mar 2016, Snowbird, United States
Liste complète des métadonnées

Littérature citée [14 références]  Voir  Masquer  Télécharger

https://hal-lirmm.ccsd.cnrs.fr/lirmm-01314330
Contributeur : Bastien Cazaux <>
Soumis le : mercredi 11 mai 2016 - 11:22:25
Dernière modification le : jeudi 24 mai 2018 - 15:59:22
Document(s) archivé(s) le : mercredi 16 novembre 2016 - 00:43:36

Fichier

DCC-RR.pdf
Fichiers produits par l'(les) auteur(s)

Identifiants

  • HAL Id : lirmm-01314330, version 1

Collections

Citation

Bastien Cazaux, Rodrigo Cánovas, Eric Rivals. Shortest DNA cyclic cover in compressed space. Data Compression Conference, Mar 2016, Snowbird, United States. 〈lirmm-01314330〉

Partager

Métriques

Consultations de la notice

294

Téléchargements de fichiers

268