Search algorithms for dinucleotide Position Weight Matrices

Transcription regulation is an important cellular process. Specialized proteins, called Transcription Factors (TF), bind on short, specific, DNA sequences to regulate the expression of nearby genes. The sequences recognized by a TF in the vicinity of different genes are not identical, but similar. One captures the similarity of those binding site in different representations, which are generally called /motifs/. The most widely used sort of motifs are Position Weight Matrix (PWM) (also known as a position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM)). A PWM is built from a multiple alignment of "true" binding sequences and capture the observed variation of nucleotides at the different positions. Several databases (JASPAR, TRANSFAC, etc.) collect PWMs for known TFs. Those PWMs are used to scan new DNA sequences to find putative binding sites and possibly to annotate them. In the case of complete genomes, the scanning procedure for many PWM may last a long time \cite{Pizzi2011}. PWM assume that the distinct positions of the bound sequence are independent of each other. However, several works have observed that a mutation at given position influence the probability of mutation at neighboring positions. To overcome this limitation of PWMs, Kulakovskiy et al. have proposed a more complex sort of motif, called di-PWMs, which model the frequency of occurrence of dinucleotides in the binding sites (instead of mononucleotides for PWMs) \cite{KULAKOVSKIY2013}. Their studies show that di-PWMs improve in sensitivity compared to PWMs, and thus produce less false positives when scanning a sequence. Our aim is to design new search algorithms for di-PWMs, either online or offline, and to implement them. Our online scanning algorithm computes a partial score for some positions in the current window, and estimates the maximum achievable score for the whole window. If this score does not match requested threshold, the window can be discarded. Our offline approach works in two steps. As for read mapping, the genome is first preprocessed to produce an index data structure. Then, for any given motif and a threshold score, we enumerate potential matching words (that word whose score lies above the threshold) and search their occurrences in the index in optimal time. The difficulty is to design an efficient enumeration algorithm. Such an approach was developed for PWMs in the MOTIF software \cite{MOTIF}. If time suffices, we plan to compare experimentally our implementations to that of an existing search tool for di-PWMs, called SPRY-SARUS \cite{Pizzi2011}.

Mots clés

Domaines

Bio-informatique [q-bio.QM]

Fichier principal

JOBIM2021_paper_173.pdf (176.04 Ko)

JOBIM2021_poster_173.png (4.98 Mo)

Origine	Fichiers produits par l'(les) auteur(s)
Licence	CC BY 4.0 - Attribution

Licence	CC BY 4.0 - Attribution

Connectez-vous pour contacter le contributeur

https://hal-lirmm.ccsd.cnrs.fr/lirmm-03442434

Soumis le : mardi 23 novembre 2021-10:34:59

Dernière modification le : mardi 16 janvier 2024-10:33:27

Archivage à long terme le : jeudi 24 février 2022-19:28:34

Dates et versions

lirmm-03442434 , version 1 (23-11-2021)

Licence

CC BY 4.0 - Attribution

Identifiants

HAL Id : lirmm-03442434 , version 1

Citer

Marie Mille, Julie Ripoll, Bastien Cazaux, Eric Rivals. Search algorithms for dinucleotide Position Weight Matrices. Société Française de Bioinformatique (S.F.B.I.). Journées Ouvertes en Biologie, Informatique et Mathématiques 2021, Jul 2021, Paris, France. , pp.100, 2021, Actes des Journées Ouvertes en Biologie, Informatique et Mathématiques 2021. ⟨lirmm-03442434⟩

Exporter

Collections

270 Consultations

145 Téléchargements