Skip to Main content Skip to Navigation
Poster communications

Search algorithms for dinucleotide Position Weight Matrices

Abstract : Transcription regulation is an important cellular process. Specialized proteins, called Transcription Factors (TF), bind on short, specific, DNA sequences to regulate the expression of nearby genes. The sequences recognized by a TF in the vicinity of different genes are not identical, but similar. One captures the similarity of those binding site in different representations, which are generally called /motifs/. The most widely used sort of motifs are Position Weight Matrix (PWM) (also known as a position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM)). A PWM is built from a multiple alignment of "true" binding sequences and capture the observed variation of nucleotides at the different positions. Several databases (JASPAR, TRANSFAC, etc.) collect PWMs for known TFs. Those PWMs are used to scan new DNA sequences to find putative binding sites and possibly to annotate them. In the case of complete genomes, the scanning procedure for many PWM may last a long time \cite{Pizzi2011}. PWM assume that the distinct positions of the bound sequence are independent of each other. However, several works have observed that a mutation at given position influence the probability of mutation at neighboring positions. To overcome this limitation of PWMs, Kulakovskiy et al. have proposed a more complex sort of motif, called di-PWMs, which model the frequency of occurrence of dinucleotides in the binding sites (instead of mononucleotides for PWMs) \cite{KULAKOVSKIY2013}. Their studies show that di-PWMs improve in sensitivity compared to PWMs, and thus produce less false positives when scanning a sequence. Our aim is to design new search algorithms for di-PWMs, either online or offline, and to implement them. Our online scanning algorithm computes a partial score for some positions in the current window, and estimates the maximum achievable score for the whole window. If this score does not match requested threshold, the window can be discarded. Our offline approach works in two steps. As for read mapping, the genome is first preprocessed to produce an index data structure. Then, for any given motif and a threshold score, we enumerate potential matching words (that word whose score lies above the threshold) and search their occurrences in the index in optimal time. The difficulty is to design an efficient enumeration algorithm. Such an approach was developed for PWMs in the MOTIF software \cite{MOTIF}. If time suffices, we plan to compare experimentally our implementations to that of an existing search tool for di-PWMs, called SPRY-SARUS \cite{Pizzi2011}.
Document type :
Poster communications
Complete list of metadata

https://hal-lirmm.ccsd.cnrs.fr/lirmm-03442434
Contributor : Eric Rivals Connect in order to contact the contributor
Submitted on : Tuesday, November 23, 2021 - 10:34:59 AM
Last modification on : Wednesday, November 24, 2021 - 3:48:31 AM

Licence


Distributed under a Creative Commons Attribution 4.0 International License

Identifiers

  • HAL Id : lirmm-03442434, version 1

Collections

Citation

Marie Mille, Julie Ripoll, Bastien Cazaux, Eric Rivals. Search algorithms for dinucleotide Position Weight Matrices. Société Française de Bioinformatique (S.F.B.I.). Journées Ouvertes en Biologie, Informatique et Mathématiques 2021, Jul 2021, Paris, France. pp.100, 2021, Actes des Journées Ouvertes en Biologie, Informatique et Mathématiques 2021. ⟨lirmm-03442434⟩

Share

Metrics

Record views

13

Files downloads

2