Estimation of Sequence Errors and Prediction Capacity in Transcriptomic and DNA-Protein Interaction Assays - LIRMM - Laboratoire d’Informatique, de Robotique et de Microélectronique de Montpellier
Conference paper. Year: 2009

Estimation of Sequence Errors and Prediction Capacity in Transcriptomic and DNA-Protein Interaction Assays


Next-generation sequencing technologies, which can yield millions of sequences in a single run, make it possible to interrogate the transcriptome or to assay protein-DNA interactions (by Chromatin ImmunoPrecipitation followed by sequencing, or ChIP-seq) at a genome-wide scale. These assays yield short sequences (<40 bp), called tags, that need to be mapped to the genome sequence. Each tag is associated with its occurrence number: the number of times that the same sequence has been experimentally detected. In transcriptomic assays, for instance, a tag with a high occurrence number is likely the biologically valid signature of an abundant transcript, while a tag with a low occurrence number may either result from a sequencing error or identify a rare RNA. Mapping is a compulsory step to first predict, and then annotate, regions of interest on the genome. Usually, only genomic locations to which a tag maps unambiguously are analysed further. These high-throughput assays are intended to predict as many genomic locations of interest as possible. This creates a trade-off between the number of tags that can be mapped and the number of tags that map to a unique genomic location, and this trade-off is controlled by the tag length. The sequencing technique generally dictates the tag length. Nevertheless, once a certain length has been sequenced (e.g., 36 bp with a Solexa/Illumina 1G machine), it is still possible to map only sub-parts of the tags (a prefix, a suffix, or a substring) to the genome, thereby artificially reducing the tag length and shifting the trade-off. At present, there is no statistical method to evaluate how the tag length influences the prediction capacity for different assays and sequencing techniques, or how much sequence errors matter. Our contribution is threefold. First, based on word statistics, we design a program that computes the theoretical probability of mapping a genomic location by chance for a given tag length, which provides a background distribution. Second, using mpscan, an efficient algorithm for mapping short tags to a complete genome sequence, we investigate how the prediction capacity varies with tag length. Finally, we propose a method to estimate the probability that a tag is altered by a sequencing error. We apply it to derive the probability of having an erroneous nucleotide at a given position in the tag for the Sanger and Solexa sequencing techniques, and for both transcriptomic and ChIP-seq experiments. This enables a technical assessment of such assays and an indirect measurement of the impact of some biological phenomena, such as SNPs.
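To make the role of tag length concrete, the following Python sketch illustrates two ideas from the abstract under simplifying assumptions. It is not the authors' program and it does not use mpscan. The function `chance_mapping_probability` assumes a uniform, independent nucleotide model and treats genome positions as independent, which is only a rough stand-in for the word statistics mentioned above. The function `classify_by_prefix_length` maps tag prefixes by exact matching on the forward strand of a toy genome string. All function names, parameters, and the toy data are hypothetical.

```python
import math


def chance_mapping_probability(tag_length, genome_length, both_strands=True):
    """Approximate probability that a random tag occurs at least once.

    Assumption: nucleotides are uniform and independent, and candidate
    positions are treated as independent trials (overlaps are ignored).
    """
    positions = max(genome_length - tag_length + 1, 0)
    if both_strands:
        positions *= 2
    p_match = 0.25 ** tag_length
    # 1 - (1 - p)^N, computed with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(positions * math.log1p(-p_match))


def count_occurrences(genome, word):
    """Count exact, possibly overlapping, occurrences of word in genome."""
    count = 0
    start = genome.find(word)
    while start != -1:
        count += 1
        start = genome.find(word, start + 1)
    return count


def classify_by_prefix_length(tags, genome, length):
    """Map the prefix of each tag and tally unmapped/unique/multiple hits.

    Assumption: exact matching on the forward strand only.
    """
    summary = {"unmapped": 0, "unique": 0, "multiple": 0}
    for tag in tags:
        hits = count_occurrences(genome, tag[:length])
        if hits == 0:
            summary["unmapped"] += 1
        elif hits == 1:
            summary["unique"] += 1
        else:
            summary["multiple"] += 1
    return summary


if __name__ == "__main__":
    # Hypothetical genome size of 3 billion bp, for illustration only.
    for k in (12, 16, 20, 36):
        print(k, chance_mapping_probability(k, 3_000_000_000))

    toy_genome = "ACGTACGTTTGA"
    toy_tags = ["ACGTT", "TTGAC", "GGGGG"]
    for k in (4, 5):
        print(k, classify_by_prefix_length(toy_tags, toy_genome, k))
```

On the toy data, shortening the prefix from 5 to 4 lets one more tag map but makes another one ambiguous, which is the trade-off described in the abstract. A real analysis would replace the uniform model with genome-specific word statistics and the naive search with a dedicated mapping tool.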

Dates and versions

lirmm-00375010, version 1 (10-04-2009)




Eric Rivals. Estimation of Sequence Errors and Prediction Capacity in Transcriptomic and DNA-Protein Interaction Assays. Journée Thématique : Nouveaux Séquenceurs (NGS), Apr 2009, Rennes, France. ⟨lirmm-00375010⟩