Estimation of sequence errors and capacity of genomic annotation in transcriptomic and DNA-protein interaction assays based on next generation sequencers
Abstract
Next generation sequencers make it possible to analyse the transcriptome or the interactome at unprecedented depth. These techniques yield short sequence reads that are then mapped onto a genome sequence to predict putatively transcribed or protein-interacting regions. We argue that factors such as false locations, sequence errors, and read length affect the mapping prediction capacity of these short reads. Here we propose a computational approach to measure these factors and analyse their influence on both transcriptomic and epigenomic assays. This investigation provides new clues on both methodological and biological issues. First, we estimate that 4.6% of reads are affected by SNPs. Second, we show that the nucleotide error probability is low and that it increases significantly with the position in the sequence. Third, choosing a read length above 19 bp practically eliminates the risk of finding irrelevant positions. However, the number of uniquely mapped reads decreases for sequences above 20 bp. Following our procedure, we obtain 0.6% of false positives among genomic locations. Therefore, even rare signatures, if they are mapped on the genome, should identify biologically relevant regions. This indicates that digital transcriptomics may help to characterise the wealth of yet undiscovered, low-abundance transcripts.
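As a rough illustration of why read length matters for the risk of irrelevant positions, the sketch below computes how many times a fixed read would be expected to occur by chance in a genome, assuming a uniform random-sequence model with independent, equiprobable nucleotides. It is not the procedure used in this study: the function name `expected_random_hits`, the 3 Gbp genome size, and the random-sequence model are all illustrative assumptions, and real genomes contain repeats that this model ignores.

```python
def expected_random_hits(read_length: int, genome_size: int,
                         both_strands: bool = True) -> float:
    """Expected chance occurrences of one fixed read in a random genome.

    Assumes (hypothetically) that each of the four nucleotides is
    equiprobable and independent at every genome position.
    """
    # Number of start positions at which a read of this length fits.
    positions = genome_size - read_length + 1
    if positions <= 0:
        return 0.0
    # A read may match either the forward or the reverse strand.
    strands = 2 if both_strands else 1
    # Each position matches a fixed read with probability 4 ** -read_length.
    return strands * positions / 4 ** read_length


if __name__ == "__main__":
    genome_size = 3_000_000_000  # illustrative assumption, not from the study
    for length in (14, 16, 18, 20, 22):
        hits = expected_random_hits(length, genome_size)
        print(f"{length} bp: {hits:.4g} expected chance hits")
```

Under these assumptions the expected count drops by a factor of four with each added base, which is the kind of behaviour behind a minimum read length threshold. The model says nothing about the loss of uniquely mapped reads at longer lengths, which depends on sequence errors and SNPs.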