Counting overlapping pairs of strings

Eric Rivals; Pengfei Wang

Conference paper, Year: 2024


Eric Rivals

Role: Author
PersonId : 2002
IdHAL : eric-rivals
ORCID : 0000-0003-3791-3973
IdRef : 118021850

LIRMM | MAB - Méthodes et Algorithmes pour la Bioinformatique

Pengfei Wang

Role: Author
PersonId : 1165647
ORCID : 0000-0001-8172-5270

LIRMM | MAB - Méthodes et Algorithmes pour la Bioinformatique

Abstract

A word $u$ overlaps a word $v$ if a suffix of $u$ equals a prefix of $v$. The shared suffix-prefix is called a border of the ordered pair of words $(u, v)$ (other authors call this a right border, see [1]). If $(u, v)$ has no border, it is said to be unbordered. These notions generalize to pairs of words the well-studied notions of border, bordered word, and unbordered word, which were originally defined for single words. Example: consider the binary alphabet {a, b} and the three words u = abaaa, v = aaabb, and w = abbbb. The pairs (u, v) and (v, w) both have a longest border of length 3, but (u, v) has three distinct non-empty borders, aaa, aa, and a, while (v, w) has only one, abb. The pairs (v, u) and (w, v) have no border, which illustrates the asymmetry of this notion.
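The definition above can be checked mechanically. The following is a minimal Python sketch, not taken from the paper; the function name `borders` is hypothetical, and it assumes that a border may be as long as the shorter word (so a pair of identical words has the whole word as a border).

```python
def borders(u: str, v: str) -> list[str]:
    """Return every non-empty border of the ordered pair (u, v), longest first.

    A border is a string that is both a suffix of u and a prefix of v.
    Assumption: the full word counts as a border when it matches.
    """
    max_len = min(len(u), len(v))
    return [u[-k:] for k in range(max_len, 0, -1) if u[-k:] == v[:k]]


u, v, w = "abaaa", "aaabb", "abbbb"
print(borders(u, v))  # ['aaa', 'aa', 'a']
print(borders(v, w))  # ['abb']
print(borders(v, u))  # [] : the notion is not symmetric
print(borders(w, v))  # []
```

The test is a direct quadratic-time comparison of suffixes and prefixes, chosen for readability rather than speed.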

Overlapping and unbordered words are central in many applications: bioinformatics, pattern matching, code design, or word statistics, among others.

Other authors have proposed encoding the starting positions of such overlaps in a binary vector called a correlation [2]. In our example, the correlation of the pair (u, v) is 00111, while that of (v, w) is 00100. For any word z, the correlation of (z, z) is called the autocorrelation of z. Clearly, multiple pairs can have the same correlation, and hence there are fewer correlations of length n than pairs of words of length n.

Recently, Gabric [1] gave three recurrences to count bordered, mutually bordered, and mutually unbordered pairs of words of length n over a k-ary alphabet. In his conclusion, he raised two challenging open questions: 1/ how many pairs have a longest border of length i (with 0 < i < n), and 2/ what is the expected length of the longest border of a pair of words?

Here, we exhibit two solutions to compute the population size of any correlation, that is, the number of pairs of words that share this correlation. For this, we exploit two recurrences that compute the population size of autocorrelations [2, 3]. With this in hand, we derive a formula answering open question 1/ and show that the expected length of the longest border of a pair of words of length n diverges asymptotically (open question 2/). We also provide bounds on the asymptotics of the population ratio of any correlation, which extend the result known for autocorrelations [2]. An article presenting these results is available online on arXiv [4].
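To make the notions of correlation and population size concrete, here is a small brute-force Python sketch. It is not the method of the paper, which relies on recurrences; it simply enumerates all pairs, so it is only usable for very small n. The names `correlation`, `populations`, and `longest_border_counts` are hypothetical, and both words are assumed to have the same length n.

```python
from collections import Counter
from itertools import product


def correlation(u: str, v: str) -> str:
    """Bit i is 1 when the suffix of u starting at position i is a prefix of v.

    Assumption: u and v have the same length n.
    """
    n = len(u)
    return "".join("1" if u[i:] == v[: n - i] else "0" for i in range(n))


def populations(n: int, alphabet: str = "ab") -> Counter:
    """Count, by exhaustive enumeration, the ordered pairs per correlation."""
    words = ["".join(t) for t in product(alphabet, repeat=n)]
    return Counter(correlation(u, v) for u in words for v in words)


def longest_border_counts(n: int, alphabet: str = "ab") -> Counter:
    """Tally pairs by the length of their longest border (0 if unbordered)."""
    counts = Counter()
    for corr, size in populations(n, alphabet).items():
        first = corr.find("1")  # leftmost overlap start gives the longest border
        counts[n - first if first >= 0 else 0] += size
    return counts


print(correlation("abaaa", "aaabb"))  # 00111
print(correlation("aaabb", "abbbb"))  # 00100
print(populations(2))
print(longest_border_counts(2))
```

For binary words of length 2, the enumeration gives populations of 6, 6, 2, and 2 for the correlations 00, 01, 10, and 11, which sum to the 16 possible ordered pairs. The running time grows as k^(2n) for a k-ary alphabet, which is precisely the cost that closed recurrences avoid.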


Main file

correlations_abstract_seqbim.pdf (120.31 KB)

2024_comb_ov_pairs_e.pdf (524.32 KB)

Origin	Files produced by the author(s)
License	CC BY-NC-ND 4.0 - Attribution - NonCommercial - NoDerivatives


https://hal-lirmm.ccsd.cnrs.fr/lirmm-04831119

Submitted on: Wednesday, December 11, 2024, 14:08:49

Last modified on: Tuesday, December 2, 2025, 03:18:10

Dates and versions

lirmm-04831119 , version 1 (11-12-2024)


Identifiers

HAL Id : lirmm-04831119 , version 1
ARXIV : 2405.09393

Cite

Eric Rivals, Pengfei Wang. Counting overlapping pairs of strings. Workshop SeqBIM 2025, IRISA, Univ Rennes, Nov 2024, Rennes, France. ⟨lirmm-04831119⟩
