Counting overlapping pairs of strings
Abstract
Overlapping and unbordered words are central in many applications: bioinformatics, pattern matching, code design, or word statistics, among others.
Other authors have proposed to encode the starting position of such overlaps in a binary vector called a correlation [2]. In our example, the correlation of the pair (u, v) is 00111, while that of (v, w) is 00100. For any word z, the correlation of (z, z) is called the autocorrelation of z. Clearly, multiple pairs can have the same correlation, and hence there are less correlations of length n than pairs of words of length n. Recently, Gabric [1] gave three recurrences to count bordered, mutually bordered, mutually unbordered pairs of words of length n over a k-ary alphabet [1]. In his conclusion, he raised challenging open questions: 1/ count the number of pairs having the longest border of length i (with i satisfying 0 < i < n), and 2/ what is the expected length of the longest border of a pair of words? Here, we exhibit two solutions to compute the population size of any correlation, that is the number of pairs of words having the same correlation. For this, we exploit two recurrences to compute the population size of autocorrelations [2, 3]. With this in hand, we derive a formula for the abovementioned open question 1/ and show that the expected length of the longest border of words of length n asymptotically diverges (open question 2/). Besides this, we provide bounds for the asymptotic of the population ratio of any correlation, which extend the result known for autocorrelations [2]. An article presenting these results is available on ArXiV online [4].
Origin | Files produced by the author(s) |
---|---|
Licence |