Incremental Computation of the Set of Period Sets

Eric Rivals

doi:10.1007/978-3-031-82697-9_19

Résumé

Überschneidungen zwischen Wörtern sind in vielen Bereichen der Informatik von entscheidender Bedeutung, beispielsweise im Code-Design, in der Stringologie und in der Bioinformatik. Ein sich selbst überschneidendes Wort zeichnet sich durch seine Perioden und Borders aus. Eine Periode eines Wortes $u$ ist die Startposition eines Suffixes von $u$, das gleichzeitig ein Präfix von $u$ ist, und ein solches Suffix wird als Border bezeichnet. Jedes Wort der Länge $n>0$ hat eine Menge von Perioden, aber nicht alle Kombinationen von ganzen Zahlen sind Mengen von Perioden. Die Berechnung der Periodenmenge eines Wortes $u$ erfordert lineare Zeit in der Länge von $u$. Wir befassen uns mit der Frage der Berechnung der Menge $\Gamma_n$ aller Periodenmengen von Wörtern der Länge $n$. Obwohl Periodensätze charakterisiert wurden, gibt es keine Formel zur Berechnung der Kardinalität von $\Gamma_n$ (die exponentiell in $n$ ist), und der bekannte dynamische Programmieralgorithmus zur Aufzählung von $\Gamma_n$ leidet unter seiner Raumkomplexität. Wir stellen einen inkrementellen Algorithmus zur Berechnung von $\Gamma_n$ aus $\Gamma_{n-1}$ vor, der die Raumkomplexität reduziert, und anschließend einen konstruktiven Zertifizierungsalgorithmus, der für Verifizierungszwecke nützlich ist. Der inkrementelle Ansatz definiert eine Eltern-Kind-Beziehung zwischen Mengen in $\Gamma_{n-1}$ und $\Gamma_n$, wodurch man die Dynamik von Periodenmengen und ihre interessanten statistischen Eigenschaften untersuchen kann. Darüber hinaus ist die Periodenmengen eines Wortes $u$ der Schlüssel zur Berechnung der Abwesenheitswahrscheinlichkeit von $u$ in zufälligen Texten. Daher ist die Kenntnis von $\Gamma_n$ nützlich, um die Bedeutung von Wortstatistiken, wie beispielsweise die Anzahl fehlender Wörter in einem zufälligen Text, zu bewerten.

Overlaps between words are crucial in many areas of computer science, such as code design, stringology, and bioinformatics. A self overlapping word is characterized by its periods and borders. A period of a word $u$ is the starting position of a suffix of $u$ that is also a prefix $u$, and such a suffix is called a border. Each word of length, say $n>0$, has a set of periods, but not all combinations of integers are sets of periods. Computing the period set of a word $u$ takes linear time in the length of $u$. We address the question of computing, the set, denoted $\Gamma_n$, of all period sets of words of length $n$. Although period sets have been characterized, there is no formula to compute the cardinality of $\Gamma_n$ (which is exponential in $n$), and the known dynamic programming algorithm to enumerate $\Gamma_n$ suffers from its space complexity. We present an incremental approach to compute $\Gamma_n$ from $\Gamma_{n-1}$, which reduces the space complexity, and then a constructive certification algorithm useful for verification purposes. The incremental approach defines a parental relation between sets in $\Gamma_{n-1}$ and $\Gamma_n$, enabling one to investigate the dynamics of period sets, and their intriguing statistical properties. Moreover, the period set of a word $u$ is the key for computing the absence probability of $u$ in random texts. Thus, knowing $\Gamma_n$ is useful to assess the significance of word statistics, such as the number of missing words in a random text.

Les chevauchements entre les mots sont essentiels dans de nombreux domaines de l'informatique, tels que la conception de code, l'algorithmique du texte et la bio-informatique. Un mot auto-chevauchant est caractérisé par ses périodes et ses bords. Une période d'un mot $u$ est la position de départ d'un suffixe de $u$ qui est également un préfixe $u$, et un tel suffixe est appelé un bord de $u$. Chaque mot de longueur, disons $n>0$, possède un ensemble de périodes, mais toutes les combinaisons d'entiers ne sont pas des ensembles de périodes. Le calcul de l'ensemble des périodes d'un mot $u$ prend un temps linéaire par rapport à la longueur de $u$. Nous abordons la question du calcul de l'ensemble, noté $\Gamma_n$, de tous les ensembles de périodes de mots de longueur $n$. Bien que les ensembles de périodes aient été caractérisés, il n'existe aucune formule pour calculer la cardinalité de $\Gamma_n$ (qui est exponentielle en $n$), et l'algorithme de programmation dynamique connu pour énumérer $\Gamma_n$ souffre de sa complexité en espace. Nous présentons une approche incrémentale pour calculer $\Gamma_n$ à partir de $\Gamma_{n-1}$, approche qui réduit la complexité en espace, puis un algorithme de certification constructif utile à des fins de vérification. L'approche incrémentale définit une relation parentale entre les ensembles dans $\Gamma_{n-1}$ et $\Gamma_n$, ce qui permet d'étudier la dynamique des ensembles de périodes et leurs propriétés statistiques intrigantes. De plus, l'ensemble de périodes d'un mot $u$ est la clé pour calculer la probabilité d'absence de $u$ dans des textes aléatoires, ce qui souligne son importance pratique.

Origine	Fichiers produits par l'(les) auteur(s)
Licence	CC BY-NC-SA 4.0 - Attribution - Utilisation non commerciale - Partage dans les mêmes conditions

Inkrementelle Berechnung der Menge aller Periodenmengen

Incremental Computation of the Set of Period Sets

Calcul incrémental de l'ensemble des ensembles de périodes

Résumé

Mots clés

Domaines

Cite 10.5281/zenodo.13826259 Jeu de données Rivals, E. (2024). Sets of period sets for words of length n. (Version 1.0.1) [Data set]. Zenodo. https://doi.org/10.5281/ZENODO.13826259

Dates et versions

Licence

Identifiants

Citer

Exporter

Collections

Altmetric

Partager