Conference Papers Year : 2024

Counting overlapping pairs of strings

Abstract

A word $u$ overlaps a word $v$ if a suffix of $u$ equals a prefix of $v$. The shared suffix-prefix is called a border for the ordered pair of words $(u, v)$ (note that other authors call this a right border, see [1]). If $(u, v)$ has no border it is said unbordered. These notions generalize to pairs of words, the well studied notions of border, bordered and unbordered words, that were originally defined for single words. Example: Consider the binary alphabet {a, b} and the following three words denoted by u, v, w: abaaa, aaabb, and abbbb. The pairs (u, v) and (v, w) both have a longest border of length 3, but (u, v) has 3 distinct non empty borders aaa, aa, and a, while (v, w) has only one abb. The pairs (v, u) and (w, v) have no borders, which illustrates the asymmetry of this notion.

Overlapping and unbordered words are central in many applications: bioinformatics, pattern matching, code design, or word statistics, among others.

Other authors have proposed to encode the starting position of such overlaps in a binary vector called a correlation [2]. In our example, the correlation of the pair (u, v) is 00111, while that of (v, w) is 00100. For any word z, the correlation of (z, z) is called the autocorrelation of z. Clearly, multiple pairs can have the same correlation, and hence there are less correlations of length n than pairs of words of length n. Recently, Gabric [1] gave three recurrences to count bordered, mutually bordered, mutually unbordered pairs of words of length n over a k-ary alphabet [1]. In his conclusion, he raised challenging open questions: 1/ count the number of pairs having the longest border of length i (with i satisfying 0 < i < n), and 2/ what is the expected length of the longest border of a pair of words? Here, we exhibit two solutions to compute the population size of any correlation, that is the number of pairs of words having the same correlation. For this, we exploit two recurrences to compute the population size of autocorrelations [2, 3]. With this in hand, we derive a formula for the abovementioned open question 1/ and show that the expected length of the longest border of words of length n asymptotically diverges (open question 2/). Besides this, we provide bounds for the asymptotic of the population ratio of any correlation, which extend the result known for autocorrelations [2]. An article presenting these results is available on ArXiV online [4].

Fichier principal
Vignette du fichier
correlations_abstract_seqbim.pdf (120.31 Ko) Télécharger le fichier
2024_comb_ov_pairs_e.pdf (524.32 Ko) Télécharger le fichier

Dates and versions

lirmm-04831119 , version 1 (11-12-2024)

Licence

Identifiers

Cite

Eric Rivals, Pengfei Wang. Counting overlapping pairs of strings. Workshop SeqBIM 2025, IRISA, Univ Rennes, Nov 2024, Rennes, France. ⟨lirmm-04831119⟩
7 View
5 Download

Altmetric

Share

More