An integrated approach to read analysis
Abstract
Next-generation sequencing technologies are presently being used to answer key biological questions at the scale of the entire genome and with unprecedented depth. Whether determining genetic or genomic variations, cataloguing transcripts and assessing their expression levels, finding recurrent mutations in cancer, identifying DNA-protein interactions or chromatin modifications, or surveying the species diversity in an environmental sample, all these tasks are now tackled with High Throughput Sequencing (HTS). For genomics and transcriptomics data sets, the current paradigm for analysing large read sets consists of: 1. mapping the reads contiguously to a reference genome, allowing as many differences as are expected to be necessary to accommodate sequence errors and small polymorphisms; 2. using uniquely mapped reads to determine covered genomic regions, either to compute a local coverage for predicting SNPs and filtering out sequence errors (cf. program ERANGE), or to approximately delimit expressed exons (with RNA-seq; cf. programs TopHat, GMORSE); 3. re-aligning the reads that could not be mapped contiguously in step 1, to reveal exon boundaries or larger indels. As shown by the results of approaches following this paradigm, a number of pitfalls must be dealt with: mapping errors induce false predictions at later steps, indels larger than 4 bp are not handled, SNPs cannot be distinguished from sequence errors at the mapping stage, exon boundaries are located imprecisely, etc. On the other hand, we have developed an exact mapper for short reads, called MPSCAN [2], and analysed its performance in detecting uniquely mapped regions as a function of tag length [1]. We showed that one can estimate, depending on the genome length, a substring length k that will on average point to a single genomic location. Building on this work, we have conceived a new approach to analyse today's longer reads (> 50 bp).
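As a rough illustration of how a substring length k can be tied to the genome length, the sketch below assumes a uniform, independent base composition, a deliberately simplistic model that is not the analysis of [1], which also accounts for background distribution and sequence errors. The function names and the 0.01 threshold are hypothetical choices made for this example only.

```python
def expected_random_hits(k: int, genome_length: int, both_strands: bool = True) -> float:
    """Expected number of chance occurrences of one fixed k-mer in a random genome.

    Assumption: the four bases are equiprobable and independent, so a given
    k-mer matches any given position with probability 1 / 4**k.
    """
    positions = genome_length - k + 1
    if positions <= 0:
        return 0.0
    strands = 2 if both_strands else 1
    return strands * positions / 4 ** k


def min_unique_k(genome_length: int,
                 max_expected_hits: float = 0.01,
                 both_strands: bool = True) -> int:
    """Smallest k whose expected number of chance hits is at most the threshold."""
    k = 1
    while expected_random_hits(k, genome_length, both_strands) > max_expected_hits:
        k += 1
    return k


if __name__ == "__main__":
    # Illustrative genome sizes (approximate orders of magnitude).
    for name, size in [("bacterial-size genome", 4_600_000),
                       ("human-size genome", 3_000_000_000)]:
        print(name, size, "->", min_unique_k(size))
```

Real genomes contain repeats and biased composition, so a k chosen this way should be read as a lower bound under the stated assumption rather than a guarantee of uniqueness.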
For all the k-mers along a read, we record their matching genomic positions and their number of occurrences in the reads, and then jointly analyse these profiles to determine whether the read can be mapped contiguously, or to detect the various causes of alignment disruption: large indels, introns, rearrangements. In this talk, we will present this procedure and the underlying data structures, and show that it distinguishes SNPs from sequence errors and combines sensitivity and specificity in the prediction of exon boundaries, indels, and rearrangements.
Work in collaboration with M. Salson (U. Rouen), N. Philippe and T. Commes (U. Montpellier 2).
Related publications:
[1] Using reads to annotate the genome: influence of length, background distribution, and sequence errors on prediction capacity. N. Philippe*, A. Boureux*, L. Bréhèlin, J. Tarhio, T. Commes, E. Rivals. Nucleic Acids Research (NAR), 2009. doi:10.1093/nar/gkp492.
[2] MPSCAN: fast localisation of multiple reads in genomes. E. Rivals, L. Salmela, P. Kiiskinen, P. Kalsi, J. Tarhio. Proc. 9th Workshop on Algorithms in Bioinformatics. Lecture Notes in BioInformatics (LNBI), Springer-Verlag, Vol. 5724, p. 246-260, 2009.
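As an illustration of the k-mer profile described in the abstract, the toy sketch below records, for each k-mer of a read, its genomic positions and its number of occurrences across the reads, then applies a very coarse classification. It is not the procedure or the data structures presented in the talk: the plain dictionary index, the function names, and the three labels are assumptions made for this example, and the read-support counts are only recorded, not used to separate SNPs from sequence errors.

```python
from collections import Counter, defaultdict


def index_kmers(seq, k):
    """Map every k-mer of seq to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def count_read_kmers(reads, k):
    """Count how many times each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def read_profile(read, genome_index, read_kmer_counts, k):
    """For each k-mer of the read: (genomic positions, support in the reads)."""
    profile = []
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        profile.append((genome_index.get(kmer, []), read_kmer_counts[kmer]))
    return profile


def classify(profile):
    """Coarse, hypothetical classification based on uniquely located k-mers.

    A 'diagonal' is genomic position minus offset in the read; k-mers of a
    contiguously mapped read all share the same diagonal.
    """
    diagonals = [locs[0] - i if len(locs) == 1 else None
                 for i, (locs, _support) in enumerate(profile)]
    located = [d for d in diagonals if d is not None]
    if not located:
        return "unmapped"
    if len(set(located)) > 1:
        return "disrupted"  # candidate indel, intron, or rearrangement
    if len(located) == len(diagonals):
        return "contiguous"
    return "contiguous with mismatch"  # candidate SNP or sequence error


if __name__ == "__main__":
    k = 4
    genome = "ACGTTGCAAGGCTTAACCGGATCG"  # toy sequence, all 4-mers unique
    reads = ["GTTGCAAGGC",     # exact copy of a genomic segment
             "GTTGCTAGGC",     # one substituted base
             "GTTGCAAACCGG"]   # six genomic bases skipped in the middle
    g_index = index_kmers(genome, k)
    support = count_read_kmers(reads, k)
    for r in reads:
        print(r, "->", classify(read_profile(r, g_index, support, k)))
```

In a fuller treatment, the support values would be examined at the k-mers that fail to locate, since a variant shared by many reads and an isolated sequencing error leave different support patterns.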