Search algorithms for dinucleotide Position Weight Matrices

Marie Mille; Julie Ripoll; Bastien Cazaux; Eric Rivals

Conference poster, Year: 2021

Marie Mille

Role: Author
PersonId: 1117720

Méthodes et Algorithmes pour la Bioinformatique

Julie Ripoll

Role: Author
PersonId: 746444
IdHAL: julie-ripoll
ORCID: 0000-0003-1590-8313
IdRef: 19036775X

Méthodes et Algorithmes pour la Bioinformatique

Bastien Cazaux

Role: Author
PersonId: 3129
IdHAL: cazaux-bastien
IdRef: 226877523

Méthodes et Algorithmes pour la Bioinformatique

Eric Rivals

Role: Author
PersonId: 2002
IdHAL: eric-rivals
ORCID: 0000-0003-3791-3973
IdRef: 118021850

Méthodes et Algorithmes pour la Bioinformatique

Abstract

Transcription regulation is an important cellular process. Specialized proteins, called Transcription Factors (TFs), bind to short, specific DNA sequences to regulate the expression of nearby genes. The sequences recognized by a TF in the vicinity of different genes are not identical, but similar. The similarity of those binding sites is captured in different representations, generally called /motifs/. The most widely used kind of motif is the Position Weight Matrix (PWM), also known as a position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM). A PWM is built from a multiple alignment of "true" binding sequences and captures the observed variation of nucleotides at the different positions. Several databases (JASPAR, TRANSFAC, etc.) collect PWMs for known TFs. Those PWMs are used to scan new DNA sequences to find putative binding sites and possibly to annotate them. For complete genomes, scanning with many PWMs may take a long time \cite{Pizzi2011}. PWMs assume that the distinct positions of the bound sequence are independent of each other. However, several works have observed that a mutation at a given position influences the probability of mutation at neighboring positions. To overcome this limitation of PWMs, Kulakovskiy et al. have proposed a more complex kind of motif, called di-PWMs, which model the frequency of occurrence of dinucleotides in the binding sites (instead of mononucleotides for PWMs) \cite{KULAKOVSKIY2013}. Their studies show that di-PWMs improve in sensitivity compared to PWMs, and thus produce fewer false positives when scanning a sequence. Our aim is to design new search algorithms for di-PWMs, either online or offline, and to implement them. Our online scanning algorithm computes a partial score for some positions in the current window, and estimates the maximum achievable score for the whole window. If this estimate does not reach the requested threshold, the window can be discarded.
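The online idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a di-PWM for a motif of length m is stored as a list of m-1 columns, each a dictionary mapping a dinucleotide to a score, that the score of a window is the sum of the column scores of its overlapping dinucleotides, and that the text contains only the letters A, C, G and T. The function names (`dipwm_score`, `suffix_max_scores`, `scan`) are hypothetical, and the left-to-right evaluation order is only one possible choice of "some positions".

```python
def dipwm_score(matrix, word):
    """Sum the column scores of the overlapping dinucleotides of `word`.

    Assumption: matrix[i] maps each dinucleotide to its score at column i,
    and len(word) == len(matrix) + 1.
    """
    return sum(col[word[i:i + 2]] for i, col in enumerate(matrix))


def suffix_max_scores(matrix):
    """best[i] is an upper bound on the score contributed by columns i..end.

    The bound takes the best dinucleotide of each column independently,
    ignoring that consecutive dinucleotides must overlap, so it may be loose.
    """
    best = [0.0] * (len(matrix) + 1)
    for i in range(len(matrix) - 1, -1, -1):
        best[i] = best[i + 1] + max(matrix[i].values())
    return best


def scan(matrix, text, threshold):
    """Return (position, score) for every window whose score is >= threshold."""
    m = len(matrix) + 1                     # motif length in nucleotides
    best = suffix_max_scores(matrix)
    hits = []
    for start in range(len(text) - m + 1):
        partial = 0.0
        for i, col in enumerate(matrix):
            partial += col[text[start + i:start + i + 2]]
            # Even the best possible remainder cannot reach the threshold:
            # discard the window without scoring its remaining columns.
            if partial + best[i + 1] < threshold:
                break
        else:
            hits.append((start, partial))   # all columns scored, window kept
    return hits
```

Because `best` is an upper bound, a window is discarded only when it cannot reach the threshold, so the result is the same as scoring every window in full; the gain is that low-scoring windows are abandoned early.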
Our offline approach works in two steps. As in read mapping, the genome is first preprocessed to produce an index data structure. Then, for any given motif and threshold score, we enumerate the potential matching words (those words whose score lies above the threshold) and search for their occurrences in the index in optimal time. The difficulty is to design an efficient enumeration algorithm. Such an approach was developed for PWMs in the MOTIF software \cite{MOTIF}. If time permits, we plan to compare our implementations experimentally to an existing search tool for di-PWMs, called SPRY-SARUS \cite{Pizzi2011}.
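The two offline steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' method. The di-PWM is assumed to be a list of columns, each a dictionary from dinucleotide to score. The enumeration is a simple depth-first search that cuts a branch when an upper bound on its final score falls below the threshold. The index is a naively built suffix array used as a stand-in for whatever index the authors use; its binary search is not optimal-time. All function names are hypothetical.

```python
def enumerate_words(matrix, threshold, alphabet="ACGT"):
    """Yield (word, score) for every word of length len(matrix) + 1
    whose di-PWM score is >= threshold (depth-first, with pruning)."""
    # best[i]: upper bound on the score contributed by columns i..end.
    best = [0.0] * (len(matrix) + 1)
    for i in range(len(matrix) - 1, -1, -1):
        best[i] = best[i + 1] + max(matrix[i].values())

    def extend(prefix, score):
        i = len(prefix) - 1                 # next column to consume
        if i == len(matrix):
            yield prefix, score
            return
        for c in alphabet:
            s = score + matrix[i][prefix[-1] + c]
            if s + best[i + 1] >= threshold:   # branch can still succeed
                yield from extend(prefix + c, s)

    for first in alphabet:
        yield from extend(first, 0.0)


def build_suffix_array(text):
    """Naive construction by sorting suffixes; adequate only for toy inputs."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def occurrences(text, sa, word):
    """Return the sorted start positions of `word`, by binary search on `sa`."""
    m = len(word)
    lo, hi = 0, len(sa)
    while lo < hi:                          # first suffix with m-prefix >= word
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < word:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                          # first suffix with m-prefix > word
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= word:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


def offline_search(matrix, text, threshold):
    """Index the text once, then look up every word scoring >= threshold."""
    sa = build_suffix_array(text)
    hits = []
    for word, score in enumerate_words(matrix, threshold):
        hits.extend((pos, score) for pos in occurrences(text, sa, word))
    return sorted(hits)
```

In practice the index would be built once per genome and reused across motifs and thresholds. The number of enumerated words can grow quickly as the threshold decreases, which is why the quality of the pruning bound matters.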

Keywords

PWM; PSSM; online search; offline search; index; DNA; RNA; genomics; pattern matching; transcription factor; binding site; motif

Domains

Bioinformatics [q-bio.QM]

Main file

JOBIM2021_paper_173.pdf (176.04 KB)

JOBIM2021_poster_173.png (4.98 MB)

Origin: Files produced by the author(s)

Contributor: Eric Rivals

https://hal-lirmm.ccsd.cnrs.fr/lirmm-03442434

Submitted on: Tuesday, November 23, 2021, 10:34:59

Last modified on: Tuesday, January 16, 2024, 10:33:27

Long-term archiving on: Thursday, February 24, 2022, 19:28:34

Dates and versions

lirmm-03442434, version 1 (23-11-2021)

License

Attribution

Identifiers

HAL Id: lirmm-03442434, version 1

Cite

Marie Mille, Julie Ripoll, Bastien Cazaux, Eric Rivals. Search algorithms for dinucleotide Position Weight Matrices. Société Française de Bioinformatique (S.F.B.I.). Journées Ouvertes en Biologie, Informatique et Mathématiques 2021, Jul 2021, Paris, France. pp.100, 2021, Actes des Journées Ouvertes en Biologie, Informatique et Mathématiques 2021. ⟨lirmm-03442434⟩

Collections

CNRS MAB LIRMM MIPS UNIV-MONTPELLIER ANR NUMEV
