Repetition Without Exclusivity: Scale Sensitivity of Referential Mechanisms in Child-Scale Language Models

Jon-Paul Cacioli

arXiv:2603.13696v1. Abstract: We present the first systematic evaluation of mutual exclusivity (ME) -- the bias to map novel words to novel referents -- in text-only language models trained on child-directed speech. We operationalise ME as referential suppression: when a familiar object is relabelled in a two-referent discourse context, ME predicts decreased probability of the labelled noun at a subsequent completion position. Three pilot findings motivate a pre-registered scale-sensitivity experiment: (1) a masked language model (BabyBERTa) is entirely insensitive to multi-sentence referential context; (2) autoregressive models show robust repetition priming -- the opposite of ME -- when familiar nouns are re-labelled; and (3) a novel context-dependence diagnostic reveals that apparent ME-like patterns with nonce tokens are fully explained by embedding similarity, not referential disambiguation. In the confirmatory experiment, we train 45 GPT-2-architecture models (2.9M, 8.9M, and 33.5M parameters; 5, 10, and 20 epochs on AO-CHILDES; 5 seeds each) and evaluate on a pre-registered ME battery. Anti-ME repetition priming is significant in all 9 cells (85-100% of items; all p < 2.4 x 10^-13). Priming attenuates with improved language modelling (Spearman rho = -0.533, p = 0.0002) but never crosses zero across a 3.8x perplexity range. The context-dependence diagnostic replicates in all 9 cells, and dose-response priming increases with repetitions in 8/9 cells (all trend p < 0.002). These findings indicate that distributional learning on child-directed speech produces repetition-based reference tracking rather than lexical exclusivity. We connect this to the grounded cognition literature and argue that referential grounding may be a necessary ingredient for ME -- an empirical claim about required input structure, not a nativist one.

Executive Summary

This study presents a rigorous, pre-registered evaluation of mutual exclusivity (ME) in text-only language models trained on child-directed speech. The authors operationalize ME as referential suppression: when a familiar object is relabelled in a two-referent discourse context, ME predicts a drop in the probability of the familiar noun at a subsequent completion position. Contrary to that prediction, all 45 GPT-2 variants, spanning three parameter sizes and three training durations, exhibit robust repetition priming, the opposite of ME. Anti-ME priming is statistically significant in every experimental cell (all p < 2.4 x 10^-13), and priming attenuates as language-modelling quality improves (Spearman rho = -0.533, p = 0.0002) without ever crossing zero. A context-dependence diagnostic consistently shows that apparent ME-like effects with nonce tokens are artifactual, attributable to embedding similarity rather than referential disambiguation. Together these results challenge the assumption that distributional learning on child-directed speech yields lexical exclusivity, supporting instead an account based on repetition-driven reference tracking. The work bridges cognitive science and NLP, proposing that referential grounding, not an innate bias, may be a necessary ingredient for ME.

Key Points

  • ME operationalized as referential suppression: reduced probability of a familiar noun after relabelling in a two-referent context
  • Repetition priming observed across all model scales and training durations, contradicting ME predictions
  • Context-dependence diagnostic exposes apparent ME-like patterns as embedding-similarity artifacts
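The suppression measure in the points above reduces to a log-probability contrast. The sketch below is an illustrative reconstruction, not the authors' released code; the prompt and the probability values are invented for demonstration.

```python
import math

def suppression_effect(logp_relabel: float, logp_baseline: float) -> float:
    """Referential-suppression score: log P(noun | relabelling context)
    minus log P(noun | neutral context). Mutual exclusivity predicts a
    negative score (the familiar noun is suppressed after relabelling);
    repetition priming predicts a positive score."""
    return logp_relabel - logp_baseline

# Toy log-probabilities (hypothetical, not taken from the paper):
# context: "Look at the ball. This is a dax. Can you see the ___?"
logp_after_relabel = math.log(0.30)  # P("ball") after the noun appeared in context
logp_neutral = math.log(0.10)        # P("ball") in a context that never mentions it

effect = suppression_effect(logp_after_relabel, logp_neutral)
print(round(effect, 3))  # -> 1.099, i.e. positive: anti-ME repetition priming
```

In a real evaluation the two log-probabilities would come from the same model's next-token distribution under the two contexts; the paper's finding is that this contrast stays positive across the full perplexity range tested.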

Merits

Strength

Pre-registered methodology enhances credibility; use of parameter-scale variation and replication across multiple models strengthens generalizability.

Strength

Integration of context-dependence diagnostics adds a novel analytical layer, offering empirical clarity on artifactual vs. genuine referential effects.

Demerits

Limitation

Findings are confined to text-only models; whether they extend to multimodal or interactive language acquisition remains untested.

Limitation

No longitudinal or developmental trajectory analysis—results reflect static model behavior, not evolving cognitive capacities.

Expert Commentary

The study is a significant contribution to computational accounts of word learning. For decades, developmental psychologists have debated whether children's lexical systems are predisposed toward exclusivity, a bias that guides novel-word mapping. The authors' replication of anti-ME repetition priming across model scales and training durations shows that distributional learning on child-directed speech alone does not produce this bias. Their use of embedding similarity as a confound diagnostic is a methodological strength: it separates distributional proximity from genuine referential disambiguation, offering a template for future computational cognition studies. Notably, the claim that referential grounding may be necessary for ME is framed as an empirical claim about input structure, not a nativist one. This shifts the burden of proof: rather than assuming an innate bias, researchers must demonstrate how grounded input enables exclusive mapping. That reframing opens new avenues for modelling child cognition computationally, particularly in multimodal settings, and the results are relevant to NLP, developmental psychology, and cognitive science alike.
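The embedding-similarity diagnostic discussed above can be illustrated with a minimal cosine-similarity check: if a nonce token's apparent "ME effect" is predicted by its embedding's proximity to the familiar noun, the pattern is distributional rather than referential. The vectors and token names below are hypothetical, chosen only to show the mechanics.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical static embeddings (3-d for illustration only)
emb = {
    "ball": [0.9, 0.1, 0.2],
    "dax":  [0.8, 0.2, 0.1],  # nonce token whose embedding drifted near "ball"
    "wug":  [0.1, 0.9, 0.3],  # nonce token far from "ball"
}

for nonce in ("dax", "wug"):
    print(nonce, round(cosine(emb[nonce], emb["ball"]), 3))

# The diagnostic: regress each item's apparent ME effect on this similarity.
# If similarity fully accounts for the effect, it is an embedding artifact,
# not evidence of referential disambiguation.
```

In the paper's version of the diagnostic, the embeddings come from the trained models themselves and the test asks whether the nonce-token effect survives once similarity is controlled for.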

Recommendations

  • Future studies should extend this work to multimodal input (e.g., video + text) to test whether grounding effects persist across sensory modalities.
  • Develop standardized ME-compatible benchmarks for evaluating distributional vs. exclusivity learning in language models.
