Detecting microsatellites within genomes: no exact solution?
Abstract
Microsatellites are short tandem repeats (period of 1 to 6 bp) that are present in the genomes of all living organisms. In some species they account for a significant proportion of the DNA, for example approximately 3% of the Homo sapiens genome (International HGSC, 2001). Some of these elements show remarkable hypermutability, with an average mutation rate on the order of 0.001 in humans, caused primarily by the insertion or deletion of one or more repeats. These length variations result from a specific molecular mechanism called DNA slippage, which is not yet well understood (Goldstein & Schlotterer, 1999). Microsatellites have been widely used as molecular markers for many years, but the study of their evolution began only about a dozen years ago. One technique is to compare theoretical length distributions (generated from mutation models) with observed distributions. The latter are obtained from known microsatellite loci (Jarne et al. 1998, Dettmann & Taylor 2004), or by extraction from genomic sequences, either with custom algorithms (Kruglyak et al. 1998, Dieringer & Schlotterer 2003, Sainudiin et al. 2004) or with dedicated software based on advanced algorithmic notions. About a dozen such algorithms have been published since 1997, and they can be grouped into four major classes: (1) methods based on alignment against a consensus sequence, (2) combinatorial algorithms for repeat identification, (3) heuristic approaches based on statistical criteria, and (4) methods based on the compressibility of repeated sequences. Here we present some of these algorithms and compare their major differences. Four programs were chosen, each representing one of the above classes: RepeatMasker (http://repeatmasker.org) for sequence alignment, Mreps (Kolpakov et al. 2003) for the combinatorial method, TRF (Benson 1999) for the statistical method and STAR (Delgrange & Rivals 2004) for the compression method.
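To make the object of study concrete, the following is a minimal sketch of a naive scanner for perfect tandem repeats with a period of 1 to 6 bp. It is not the method used by RepeatMasker, Mreps, TRF or STAR, nor by any of the cited authors; the function name, the thresholds `min_copies` and `min_length`, and the 0-based half-open coordinates are assumptions made for illustration only. It tolerates no mutation at all, which is precisely the limitation that the four classes of algorithms address in different ways.

```python
def find_perfect_microsatellites(seq, max_period=6, min_copies=3, min_length=8):
    """Naive scan for perfect tandem repeats (illustrative sketch only).

    Returns a list of (start, end, motif, copies) tuples, where start/end
    are 0-based, half-open coordinates. All thresholds are hypothetical.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    i = 0
    while i < n:
        best = None  # (length, period) of the best candidate starting at i
        for period in range(1, max_period + 1):
            if i + period > n:
                break
            # Extend while each base equals the base one period earlier.
            j = i + period
            while j < n and seq[j] == seq[j - period]:
                j += 1
            length = j - i
            if length >= min_copies * period and length >= min_length:
                # Prefer the longest run; on ties keep the smallest period,
                # so "CACACA" is reported with motif "CA", not "CACA".
                if best is None or length > best[0]:
                    best = (length, period)
        if best is None:
            i += 1
        else:
            length, period = best
            hits.append((i, i + length, seq[i:i + period], length // period))
            # Jump past the repeat; overlapping repeats are ignored here.
            i += length
    return hits


if __name__ == "__main__":
    example = "GGT" + "CA" * 6 + "TTG" + "A" * 10 + "C"
    for start, end, motif, copies in find_perfect_microsatellites(example):
        print(start, end, motif, copies)
```

Even in this toy version the result depends on the chosen thresholds: raising `min_length` or `min_copies` removes short repeats from the output.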
Each program has its own parameters, constraints and output formats, so the data must be normalized before any inter-algorithm comparison. These comparisons are based on four microsatellite features: their length, their degree of perfection (i.e. the percentage of mutations), the repeat length and the chromosomal position. First observations show that, within a single algorithm, the choice of parameters can have a significant influence on the distributions of detected microsatellites. For example, the number of TRF detections can vary by a factor of 20 simply by changing the minimum score parameter. To take these variations into account in the inter-algorithm comparison, we chose the STAR distribution as a reference (STAR takes no parameters), and we calibrated the parameters of each other algorithm to obtain the distribution closest to this reference. Results of the inter-algorithm comparison on the human X chromosome show a significant divergence in detection. TRF and Mreps detect many more tandem repeats than STAR and RepeatMasker, particularly at small lengths. On the other hand, STAR and TRF are more stringent for highly degraded microsatellites. This study highlights the fact that the way microsatellites are detected can change the biological models fitted to them, and can ultimately lead to mistaken interpretations.
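The normalization and calibration steps described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the record layout, the binning, or the measure of closeness between distributions, so the `Repeat` class, the 5 bp bins and the L1 distance on per-bin counts are all hypothetical choices, and the run labels in the usage example are arbitrary strings rather than actual command-line options of any program.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Repeat:
    """Common record for one detected microsatellite (hypothetical layout)."""
    start: int         # chromosomal position, 0-based (assumed convention)
    end: int           # exclusive end
    period: int        # repeat unit length (our reading of "repeat length")
    perfection: float  # percentage of positions matching the perfect repeat

    @property
    def length(self) -> int:
        return self.end - self.start


def length_distribution(repeats, bin_size=5):
    """Count detections per length bin (bin_size is an assumed value)."""
    return Counter((r.length // bin_size) * bin_size for r in repeats)


def distribution_distance(p, q):
    """L1 distance between two binned count distributions (assumed measure)."""
    return sum(abs(p.get(b, 0) - q.get(b, 0)) for b in set(p) | set(q))


def calibrate(reference, runs):
    """Return the label of the run whose length distribution is closest
    to the reference distribution."""
    ref = length_distribution(reference)
    return min(
        runs,
        key=lambda label: distribution_distance(ref, length_distribution(runs[label])),
    )


if __name__ == "__main__":
    # Toy data only; these are not results from the study.
    reference = [Repeat(0, 12, 2, 100.0), Repeat(50, 62, 2, 100.0), Repeat(100, 130, 3, 90.0)]
    runs = {
        "permissive": [Repeat(s, s + 12, 2, 95.0) for s in (0, 20, 50, 80)]
        + [Repeat(s, s + 8, 1, 100.0) for s in (200, 300)]
        + [Repeat(100, 130, 3, 90.0)],
        "stringent": [Repeat(0, 12, 2, 100.0), Repeat(50, 63, 2, 96.0), Repeat(100, 130, 3, 90.0)],
    }
    print(calibrate(reference, runs))
```

Comparing raw counts rather than relative frequencies is a deliberate choice in this sketch, because it keeps differences in the total number of detections visible; a frequency-based distance would hide them.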