Estimation of Sequence Errors and Prediction Capacity in Transcriptomic and DNA-Protein Interaction Assays

Abstract: Next-generation sequencing technologies, which yield millions of sequences in a single run, make it possible to interrogate the transcriptome or to assay protein-DNA interactions (by Chromatin ImmunoPrecipitation followed by sequencing, or ChIP-seq) at a genome-wide scale. These assays yield short sequences (<40 bp), called tags, that need to be mapped to the genome sequence. Each tag is associated with the number of times the same sequence has been experimentally detected: its occurrence number. In transcriptomic assays, for instance, a tag with a high occurrence number is likely the biologically valid signature of an abundant transcript, while a tag with a low occurrence number may either result from a sequencing error or identify a rare RNA. Mapping is a compulsory step, first to predict and then to annotate regions of interest on the genome. Usually, only genomic locations that are unambiguously mapped by a tag are analysed further. These high-throughput assays aim to predict as many genomic locations of interest as possible. This creates a trade-off between the number of mapped tags and the number of tags that map to a unique genomic location, and this trade-off is controlled by the tag length. The sequencing technique generally dictates the tag length. Nevertheless, once a certain length is sequenced (e.g., 36 bp with a Solexa/Illumina 1G machine), it is still possible to map only sub-parts of the tags (a prefix, a suffix, a substring) to the genome, thereby artificially reducing the tag length and shifting the trade-off. At present, we lack a statistical method to evaluate how the tag length influences the prediction capacity for different assays and sequencing techniques, and how much sequence errors matter. Our contribution is threefold. First, based on word statistics, we design a program that computes the theoretical probability of mapping a genomic location by chance for a given tag length. Second, using an efficient algorithm to map short tags to a complete genome sequence, we investigate how the prediction capacity varies with tag length. Finally, we propose a method to estimate the probability that a tag is altered by a sequencing error. We apply it to derive the probability of having an erroneous nucleotide at a given position in the tag for the Sanger and Solexa sequencing techniques, and for both transcriptomic and ChIP-seq experiments. This enables a technical assessment of such assays and the indirect measurement of the impact of some biological phenomena.
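As a rough illustration of the first and third contributions, the sketch below computes a chance-mapping probability and a tag-alteration probability under deliberately simple assumptions. It is not the authors' program: it assumes a single-strand genome of independent, uniformly distributed nucleotides (each with probability 1/4), ignores word self-overlap, and assumes independent per-base errors at a constant rate. All function names and parameters are hypothetical.

```python
import math


def expected_chance_hits(tag_length: int, genome_size: int) -> float:
    """Expected number of exact occurrences of a random tag in a random genome.

    Assumption (not from the abstract): i.i.d. uniform nucleotides, one strand.
    """
    if tag_length < 1:
        raise ValueError("tag_length must be at least 1")
    positions = max(genome_size - tag_length + 1, 0)
    return positions * 0.25 ** tag_length


def prob_chance_mapping(tag_length: int, genome_size: int) -> float:
    """Probability that a random tag matches at least one genomic position.

    Approximation: positions are treated as independent, so the result is
    1 - (1 - 4^-k)^(G - k + 1), evaluated with log1p/expm1 for stability.
    """
    if tag_length < 1:
        raise ValueError("tag_length must be at least 1")
    positions = max(genome_size - tag_length + 1, 0)
    p_match = 0.25 ** tag_length
    return -math.expm1(positions * math.log1p(-p_match))


def prob_tag_altered(tag_length: int, per_base_error: float) -> float:
    """Probability that at least one base of the tag is erroneous.

    Assumption: errors are independent with the same rate at every position,
    giving 1 - (1 - e)^k.
    """
    if not 0.0 <= per_base_error < 1.0:
        raise ValueError("per_base_error must be in [0, 1)")
    return -math.expm1(tag_length * math.log1p(-per_base_error))


if __name__ == "__main__":
    genome_size = 3_000_000_000  # illustrative order of magnitude only
    for k in (14, 20, 36):
        print(k, prob_chance_mapping(k, genome_size), prob_tag_altered(k, 0.01))
```

Under these assumptions, shortening the tag raises the chance-mapping probability while lowering the probability that the tag contains an error, which is the trade-off the abstract describes. A real genome is far from i.i.d. (repeats, skewed composition), so such figures are only a baseline.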
Document type:
Conference paper
SMPGD'09: Statistical Methods for Post-genomic Data Workshop, Jan 2009, Paris, France. pp.3, 2009, 〈http://www.math.univ-toulouse.fr/biostat/6-17723-SMPGD09.php〉

https://hal-lirmm.ccsd.cnrs.fr/lirmm-00375005
Contributor: Eric Rivals
Submitted on: Friday, April 10, 2009 - 18:02:24
Last modified on: Wednesday, June 13, 2018 - 18:36:02

Identifiers

  • HAL Id : lirmm-00375005, version 1

Citation

Nicolas Philippe, Anthony Boureux, Laurent Brehelin, Jorma Tarhio, Thérèse Commes, et al. Estimation of Sequence Errors and Prediction Capacity in Transcriptomic and DNA-Protein Interaction Assays. SMPGD'09: Statistical Methods for Post-genomic Data Workshop, Jan 2009, Paris, France. pp.3, 2009, 〈http://www.math.univ-toulouse.fr/biostat/6-17723-SMPGD09.php〉. 〈lirmm-00375005〉
