Using reads to annotate the genome: influence of length, background distribution, and sequence errors on prediction capacity

Nicolas Philippe; Anthony Boureux; Laurent Brehelin; Jorma Tarhio; Thérèse Commes; Eric Rivals

doi:10.1093/nar/gkp492

Journal article, Nucleic Acids Research, Year: 2009


Nicolas Philippe

Role: Author
PersonId : 938668

Méthodes et Algorithmes pour la Bioinformatique

Anthony Boureux

Role: Author
PersonId : 740617
IdHAL : anthony-boureux
IdRef : 253129192

Institut de génétique humaine

Laurent Brehelin

Role: Author
PersonId : 21572
IdHAL : laurent-brehelin
ORCID : 0000-0002-2582-2831
IdRef : 068625626

Méthodes et Algorithmes pour la Bioinformatique

Jorma Tarhio

Role: Author

Laboratory of Software Technology

Thérèse Commes

Role: Author
PersonId : 743232
IdHAL : therese-commes
ORCID : 0000-0002-7918-0176
IdRef : 112246419

Institut de génétique humaine

Eric Rivals

Role: Corresponding author
PersonId : 2002
IdHAL : eric-rivals
ORCID : 0000-0003-3791-3973
IdRef : 118021850

Méthodes et Algorithmes pour la Bioinformatique

Abstract

Ultra high-throughput sequencing is used to analyse the transcriptome or interactome at unprecedented depth on a genome-wide scale. These techniques yield short sequence reads that are then mapped to a genome sequence to predict putatively transcribed or protein-interacting regions. We argue that factors such as background distribution, sequence errors, and read length affect the prediction capacity of sequence census experiments. Here we suggest a computational approach to measure these factors and analyse their influence on both transcriptomic and epigenomic assays. This investigation provides new clues on both methodological and biological issues. For instance, by analysing chromatin immunoprecipitation read sets, we estimate that 4.6% of reads are affected by SNPs. We show that, although the nucleotide error probability is low, it significantly increases with the position in the sequence. Choosing a read length above 19 bp practically eliminates the risk of finding irrelevant positions, while above 20 bp the number of uniquely mapped reads decreases. With our procedure, we obtain 0.6% false positives among genomic locations. Hence, even rare signatures should identify biologically relevant regions, if they are mapped on the genome. This indicates that digital transcriptomics may help to characterize the wealth of yet undiscovered, low-abundance transcripts.
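The dependence of mapping on read length described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the function names, the toy genome, and the toy reads are all hypothetical, matching is exact only, and the reverse strand and sequencing errors are ignored. Each read is truncated to its first k bases, and the truncated read is classified by how many times it occurs in the genome.

```python
from collections import Counter


def count_occurrences(genome: str, pattern: str) -> int:
    """Count exact occurrences of pattern in genome, overlaps included."""
    count = 0
    start = genome.find(pattern)
    while start != -1:
        count += 1
        start = genome.find(pattern, start + 1)
    return count


def classify_prefixes(genome: str, reads: list[str], k: int) -> dict[str, int]:
    """Truncate each read to k bases and tally unmapped/unique/multiple hits."""
    tally = Counter(unmapped=0, unique=0, multiple=0)
    for read in reads:
        if len(read) < k:
            continue  # read too short to be truncated to length k
        hits = count_occurrences(genome, read[:k])
        if hits == 0:
            tally["unmapped"] += 1
        elif hits == 1:
            tally["unique"] += 1
        else:
            tally["multiple"] += 1
    return dict(tally)


# Toy data, invented for illustration only.
genome = "ACGTACGTTTGACCA"
reads = ["ACGTT", "TTGAC", "ACGTA", "GGGGG"]

for k in (3, 5):
    print(k, classify_prefixes(genome, reads, k))
```

On this toy input, the 3-base prefixes are mostly ambiguous, whereas the 5-base prefixes that map at all map to a single location. A real analysis would use an indexed genome and a proper read mapper rather than repeated `str.find` calls.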

Keywords

Next Generation Sequencers ChIP-seq DGE transcriptomics SAGE tags bioinformatics

Domains

Bioinformatics [q-bio.QM]; Bioinformatics, Systems Biology [q-bio.QM]; Genomics, Transcriptomics and Proteomics [q-bio.GN]

Main file

e104.pdf (240.12 KB)

Origin: Publisher files allowed on an open archive

Contributor: Nicolas Philippe

https://hal-lirmm.ccsd.cnrs.fr/lirmm-00415899

Submitted on: Friday, 11 September 2009, 14:27:38

Last modified on: Tuesday, 11 June 2024, 03:26:35

Long-term archiving on: Tuesday, 15 June 2010, 21:43:37

Dates and versions

lirmm-00415899 , version 1 (11-09-2009)

Identifiers

HAL Id : lirmm-00415899 , version 1
DOI : 10.1093/nar/gkp492

Cite

Nicolas Philippe, Anthony Boureux, Laurent Brehelin, Jorma Tarhio, Thérèse Commes, et al. Using reads to annotate the genome: influence of length, background distribution, and sequence errors on prediction capacity. Nucleic Acids Research, 2009, 37 (15 e104), pp.11. ⟨10.1093/nar/gkp492⟩. ⟨lirmm-00415899⟩

Collections

CNRS MAB LIRMM MIPS BS UNIV-MONTPELLIER FRANCE-GENOMIQUE MGX

423 views

229 downloads
