Estimation of Sequence Errors and Prediction Capacity in Transcriptomic and DNA-Protein Interaction Assays

Eric Rivals

Conference paper. Year: 2009


Role: Corresponding author
PersonId: 2002
IdHAL: eric-rivals
ORCID: 0000-0003-3791-3973
IdRef: 118021850


Méthodes et Algorithmes pour la Bioinformatique

Abstract

Next-generation sequencing technologies, which can yield millions of sequences in a single run, make it possible to interrogate the transcriptome or to assay protein-DNA interactions (by chromatin immunoprecipitation followed by sequencing, or ChIP-seq) at a genome-wide scale. These assays yield short sequences (<40 bp), called tags, that need to be mapped to the genome sequence. Each tag is associated with its occurrence number: the number of times the same sequence has been detected experimentally. In transcriptomic assays, for instance, a tag with a high occurrence number is likely the biologically valid signature of an abundant transcript, while a tag with a low occurrence number may either result from a sequencing error or identify a rare RNA. Mapping is a compulsory step to first predict, and then annotate, regions of interest on the genome. Usually, only genomic locations to which a tag maps unambiguously are analysed further. These high-throughput assays are intended to predict as many genomic locations of interest as possible. This creates a trade-off between the number of mapped tags and the number of tags that map to a unique genomic location, and the trade-off is controlled by the tag length. The sequencing technique generally dictates the tag length. Nevertheless, once a certain length has been sequenced (e.g., 36 bp with a Solexa/Illumina 1G machine), it is still possible to map only sub-parts of the tags (a prefix, a suffix, or a substring) to the genome, thereby artificially reducing the tag length and shifting the trade-off. At present, we lack a statistical method to evaluate how tag length influences the prediction capacity for different assays and sequencing techniques, and to assess the importance of sequence errors. Our contribution is threefold. First, based on word statistics, we design a program that computes, for a given tag length, the theoretical probability of mapping a genomic location by chance, which provides a background distribution.
Second, using mpscan, an efficient algorithm for mapping short tags to a complete genome sequence, we investigate how the prediction capacity varies with tag length. Finally, we propose a method to estimate the probability that a tag is altered by a sequencing error. We apply it to derive the probability of an erroneous nucleotide at a given position in the tag for the Sanger and Solexa sequencing techniques, and for both transcriptomic and ChIP-seq experiments. This enables a technical assessment of such assays and an indirect measurement of the impact of some biological phenomena, such as SNPs.
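
To make the role of tag length concrete, here is a minimal sketch. It is not the author's program and not mpscan. It assumes a deliberately crude background model (independent, uniformly distributed bases, both strands searched) rather than the word statistics used in the work, and the function names chance_match_probability and occurrence_numbers are hypothetical. It shows how the probability that a random tag matches a genome by chance drops as the tag gets longer, and how occurrence numbers can be counted after truncating tags to a prefix.

```python
import math
from collections import Counter


def chance_match_probability(tag_length, genome_length, strands=2):
    """Probability that a random tag occurs at least once in a random genome.

    Assumption (not from the source): bases are i.i.d. and uniform over
    A, C, G, T, and candidate positions are treated as independent.
    """
    if tag_length < 1:
        raise ValueError("tag_length must be at least 1")
    # Number of candidate start positions, counted on each strand.
    positions = strands * max(genome_length - tag_length + 1, 0)
    # Probability that the tag matches at one given position.
    p_site = 4.0 ** (-tag_length)
    # 1 - (1 - p_site) ** positions, computed in a numerically stable way.
    return -math.expm1(positions * math.log1p(-p_site))


def occurrence_numbers(tags, prefix_length=None):
    """Count identical tags, optionally after truncating each to a prefix."""
    if prefix_length is None:
        return Counter(tags)
    return Counter(tag[:prefix_length] for tag in tags)


if __name__ == "__main__":
    genome_length = 3_000_000_000  # illustrative order of magnitude only
    for k in (10, 14, 18, 22, 36):
        p = chance_match_probability(k, genome_length)
        print(f"tag length {k:2d}: chance-match probability {p:.3g}")
```

Under this simplified model, short tags match somewhere in a large genome almost surely, while long tags almost never match by chance. Real genomes are repetitive and compositionally biased, so a background derived from word statistics, as the abstract proposes, will differ from these figures.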

Keywords

Next Generation Sequencers; ChIP-seq; Digital Transcriptomics; Digital Gene Expression; sequence analysis; bioinformatics

Domains

Bioinformatics [q-bio.QM]; Bioinformatics, Systems Biology [q-bio.QM]


https://hal-lirmm.ccsd.cnrs.fr/lirmm-00375010

Submitted on: Friday, April 10, 2009, 18:14:26

Last modified on: Friday, March 24, 2023, 14:52:51

Dates and versions

lirmm-00375010, version 1 (10-04-2009)

Identifiers

HAL Id: lirmm-00375010, version 1

Cite

Eric Rivals. Estimation of Sequence Errors and Prediction Capacity in Transcriptomic and DNA-Protein Interaction Assays. Journée Thématique : Nouveaux Séquenceurs (NGS), Apr 2009, Rennes, France. ⟨lirmm-00375010⟩


Collections

CNRS MAB LIRMM MIPS UNIV-MONTPELLIER
