
A Zipf-preserving, long-range correlated surrogate for written language and other symbolic sequences


Marcelo A. Montemurro, Mirko Degli Esposti

arXiv:2603.02213v1

Abstract: Symbolic sequences such as written language and genomic DNA display characteristic frequency distributions and long-range correlations extending over many symbols. In language, this takes the form of Zipf's law for word frequencies together with persistent correlations spanning hundreds or thousands of tokens, while in DNA it is reflected in nucleotide composition and long-memory walks under purine-pyrimidine mappings. Existing surrogate models usually preserve either the frequency distribution or the correlation properties, but not both simultaneously. We introduce a surrogate model that retains both constraints: it preserves the empirical symbol frequencies of the original sequence and reproduces its long-range correlation structure, quantified by the detrended fluctuation analysis (DFA) exponent. Our method generates surrogates of symbolic sequences by mapping fractional Gaussian noise (FGN) onto the empirical histogram through a frequency-preserving assignment. The resulting surrogates match the original in first-order statistics and long-range scaling while randomising short-range dependencies. We validate the model on representative texts in English and Latin, and illustrate its broader applicability with genomic DNA, showing that base composition and DFA scaling are reproduced. This approach provides a principled tool for disentangling structural features of symbolic systems and for testing hypotheses on the origin of scaling laws and memory effects across language, DNA, and other symbolic domains.

Executive Summary

This article introduces a surrogate model for symbolic sequences, such as written language and genomic DNA, that preserves both the symbol frequency distribution and the long-range correlation structure of the original sequence, the latter quantified by the detrended fluctuation analysis (DFA) exponent. The model maps fractional Gaussian noise (FGN) onto the empirical histogram through a frequency-preserving assignment, so the surrogates match the original in first-order statistics and long-range scaling while short-range dependencies are randomised. The authors validate the model on representative texts in English and Latin, and illustrate its broader applicability with genomic DNA, where base composition and DFA scaling are reproduced. The approach offers a tool for disentangling structural features of symbolic systems and for testing hypotheses on the origin of scaling laws and memory effects.

Key Points

  • The surrogate model preserves both frequency distribution and long-range correlation structure of the original sequence.
  • The model uses fractional Gaussian noise mapped onto an empirical histogram.
  • The surrogates match the original sequence in first-order statistics and long-range scaling.
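
The mapping summarised in the key points can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the noise is sampled or how symbols are assigned to noise values, so the Davies-Harte sampler, the rank-based assignment rule, and the names `fgn` and `rank_mapped_surrogate` are all hypothetical choices made here. The Hurst exponent `hurst` is simply passed in; how it should be tuned to match a target DFA exponent is left open.

```python
import numpy as np
from collections import Counter


def fgn(n, hurst, rng):
    """Sample n points of unit-variance fractional Gaussian noise.

    Uses circulant embedding (Davies-Harte); this sampler is an assumption,
    the source does not state which FGN generator is used.
    """
    k = np.arange(n + 1, dtype=float)
    # Autocovariance of FGN at lags 0..n.
    gamma = 0.5 * ((k + 1) ** (2 * hurst)
                   - 2 * k ** (2 * hurst)
                   + np.abs(k - 1) ** (2 * hurst))
    # First row of the 2n x 2n circulant matrix embedding the covariance.
    row = np.concatenate([gamma, gamma[-2:0:-1]])
    m = row.size
    # Eigenvalues of the circulant; clip tiny negative round-off values.
    lam = np.clip(np.fft.fft(row).real, 0.0, None)
    w = rng.standard_normal(m) + 1j * rng.standard_normal(m)
    z = np.fft.fft(np.sqrt(lam / m) * w)
    return z.real[:n]


def rank_mapped_surrogate(tokens, hurst, seed=0):
    """Build a surrogate with exactly the same symbol counts as `tokens`.

    Hypothetical assignment rule: positions are ranked by noise value, and
    the most frequent symbol fills the highest-ranked positions, the next
    most frequent symbol the following ones, and so on.
    """
    rng = np.random.default_rng(seed)
    # Pool of symbols, most frequent first, each repeated by its count.
    pool = [sym for sym, c in Counter(tokens).most_common() for _ in range(c)]
    noise = fgn(len(tokens), hurst, rng)
    order = np.argsort(-noise)  # positions from largest to smallest noise
    out = [None] * len(tokens)
    for pos, sym in zip(order, pool):
        out[pos] = sym
    return out
```

Because every symbol is placed exactly as many times as it occurs in the input, the histogram (and hence the rank-frequency curve) is preserved by construction. Whether this particular assignment reproduces a given DFA exponent would need to be checked empirically.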

Merits

Strength in Preserving Frequency Distribution

The surrogate preserves the empirical symbol frequencies of the original sequence, so first-order statistics such as Zipf's law for word frequencies in text or base composition in DNA carry over to the surrogate.

Strength in Capturing Long-Range Correlations

The surrogate reproduces the long-range correlation structure of the original sequence, quantified by the detrended fluctuation analysis (DFA) exponent, which allows memory effects to be studied separately from short-range structure.
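
For readers unfamiliar with the DFA exponent, the following Python sketch shows one common way to estimate it from a numeric series (first-order DFA with non-overlapping windows). It is an illustration only: the function name `dfa_exponent`, the default window sizes, and the detrending order are assumptions, and a symbolic sequence must first be mapped to numbers (for DNA the abstract mentions purine-pyrimidine mappings; the mapping used for text is not stated in the source).

```python
import numpy as np


def dfa_exponent(x, scales=None, order=1):
    """Estimate the DFA scaling exponent of a 1-D numeric series."""
    x = np.asarray(x, dtype=float)
    # Integrated, mean-subtracted profile of the series.
    profile = np.cumsum(x - x.mean())
    n = profile.size
    if scales is None:
        # Log-spaced window sizes from 16 up to n/8 (an arbitrary default).
        scales = np.unique(
            np.logspace(np.log10(16), np.log10(n // 8), 12).astype(int))
    flucts = []
    for s in scales:
        n_win = n // s
        # Non-overlapping windows of length s, one window per column.
        segs = profile[:n_win * s].reshape(n_win, s).T
        t = np.arange(s)
        # Least-squares polynomial trend fitted in every window at once.
        coeffs = np.polyfit(t, segs, order)
        trend = np.vander(t, order + 1) @ coeffs
        # Root-mean-square fluctuation around the local trends.
        flucts.append(np.sqrt(np.mean((segs - trend) ** 2)))
    # Slope of log F(s) against log s is the DFA exponent.
    return np.polyfit(np.log(scales), np.log(flucts), 1)[0]
```

Uncorrelated noise gives an exponent near 0.5, while persistent long-range correlations push it above 0.5; comparing the value for an original sequence and for its surrogate is the kind of check the summary refers to.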

Demerits

Limitation in Handling Short-Range Dependencies

The model randomises short-range dependencies. This is part of its design, but it means the surrogates cannot be used to study short-range correlations or local patterns in the original sequence.
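
One simple way to see how much local structure a surrogate retains is to compare adjacent-pair (bigram) counts between the original and the surrogate. The diagnostic below is a hypothetical illustration, not something taken from the paper; the name `bigram_overlap` and the choice of statistic are assumptions.

```python
from collections import Counter


def bigram_overlap(original, surrogate):
    """Fraction of the original's adjacent pairs also present in the surrogate.

    Pairs are counted with multiplicity; 1.0 means every bigram occurrence
    in the original is matched by one in the surrogate.
    """
    a = Counter(zip(original, original[1:]))
    b = Counter(zip(surrogate, surrogate[1:]))
    shared = sum((a & b).values())  # multiset intersection of bigram counts
    return shared / max(1, sum(a.values()))
```

A sequence compared with itself scores 1.0; a surrogate with randomised short-range dependencies would be expected to score lower, even though its symbol frequencies are identical.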

Limitation in Generalizability

The model's performance on diverse symbolic sequences, including non-textual data, requires further investigation to establish its broader applicability.

Expert Commentary

This article contributes to symbolic sequence analysis a method for generating surrogate sequences that preserve both the frequency distribution and the long-range correlation structure of the original, two constraints that existing surrogate models usually satisfy only one at a time. Its strengths are the preservation of empirical symbol frequencies and the reproduction of long-range scaling as measured by the DFA exponent. Its treatment of short-range dependencies and its generalizability beyond the tested English, Latin, and DNA sequences require further investigation. Potential applications include natural language processing, bioinformatics, and other domains that rely on symbolic sequence analysis.

Recommendations

  • Future research should focus on extending the model to handle short-range dependencies and explore its applicability to diverse symbolic sequences.
  • The development of more sophisticated surrogate models that can capture complex patterns and structures in symbolic sequences is an important area for further research.
