Annotation of Human Transcriptome Using Tandem SAGE Tags
Abstract
The sequential analysis of several million expressed gene signatures (tags) has revealed an increasing number of different sequences, far exceeding the number of annotated gene loci in mammalian genomes. Serial Analysis of Gene Expression (SAGE) has the potential to reveal new polyadenylated RNAs transcribed from previously unrecognized chromosomal sites. However, conventional SAGE tags are not long enough to unambiguously identify unique sites in large genomes. In this work, we designed a novel strategy in which two SAGE libraries are built in parallel from the same polyadenylated RNA sample, with tags anchored on two different restriction sites of the cDNAs. New transcripts are then tentatively defined by the two SAGE tags in tandem and by the spanning sequence read on the genome between the two tagged sites. Having developed a new algorithm to obtain these tag-delimited genomic sequences (TDGS), we evaluated its capacity to recognize already known genes and its ability to reveal new transcripts. This new strategy extends the power of tag-based approaches when dealing with complex genomes.
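The idea of a tag-delimited genomic sequence can be illustrated with a minimal Python sketch. It is not the algorithm developed in this work, only a toy version under stated assumptions: matching is exact and restricted to the forward strand, introns are ignored, and the function names (`find_all`, `tag_delimited_sequences`), the `max_span` cutoff, the anchoring sites (CATG and GATC) and the toy sequences are all hypothetical choices made for illustration.

```python
def find_all(sequence, pattern):
    """Return every start index of pattern in sequence (overlaps allowed)."""
    positions = []
    start = sequence.find(pattern)
    while start != -1:
        positions.append(start)
        start = sequence.find(pattern, start + 1)
    return positions


def tag_delimited_sequences(genome, tag_a, tag_b, max_span=1000):
    """Pair occurrences of two tags and return the spanning sequences.

    Assumptions (illustrative only): exact matches, forward strand,
    and a hypothetical max_span cutoff on the distance between tags.
    Each result is (start, end, sequence) with end exclusive.
    """
    hits_a = find_all(genome, tag_a)
    hits_b = find_all(genome, tag_b)
    spans = []
    for a in hits_a:
        for b in hits_b:
            # The span runs from the first base of the upstream tag
            # to the last base of the downstream tag.
            start = min(a, b)
            end = max(a + len(tag_a), b + len(tag_b))
            if a != b and end - start <= max_span:
                spans.append((start, end, genome[start:end]))
    return spans


# Toy genome: each tag is a 4-base anchoring site plus 10 following bases.
tag_a = "CATG" + "AAACCCGGGT"   # hypothetical tag anchored on CATG
tag_b = "GATC" + "TTTGGGCCCA"   # hypothetical tag anchored on GATC
genome = "TTTT" + tag_a + "ACGTACGT" + tag_b + "TTTT"

spans = tag_delimited_sequences(genome, tag_a, tag_b)
for start, end, sequence in spans:
    print(start, end, sequence)
```

A single tag of this length may match many genomic positions; requiring both tags to occur within a bounded distance of each other is what narrows the candidates, which is the intuition behind using the two tags in tandem.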