Perturbation: A simple and efficient adversarial tracer for representation learning in language models
arXiv:2603.23821v1
Abstract: Linguistic representation learning in deep neural language models (LMs) has been studied for decades, for both practical and theoretical reasons. However, finding representations in LMs remains an unsolved problem, in part due to a dilemma between enforcing implausible constraints on representations (e.g., linearity; Arora et al., 2024) and trivializing the notion of representation altogether (Sutter et al., 2025). Here we escape this dilemma by reconceptualizing representations not as patterns of activation but as conduits for learning. Our approach is simple: we perturb an LM by fine-tuning it on a single adversarial example and measure how this perturbation "infects" other examples. Perturbation makes no geometric assumptions, and unlike other methods, it does not find representations where it should not (e.g., in untrained LMs). But in trained LMs, perturbation reveals structured transfer at multiple linguistic grain sizes, suggesting that LMs both generalize along representational lines and acquire linguistic abstractions from experience alone.
Executive Summary
This article introduces Perturbation, a novel approach to finding linguistic representations in deep neural language models (LMs). Rather than treating representations as patterns of activation, the authors reconceptualize them as conduits for learning: an LM is fine-tuned on a single adversarial example, and the method traces how that perturbation "infects" other examples. Because it makes no geometric assumptions, Perturbation sidesteps the dilemma between enforcing implausible constraints on representations and trivializing the notion of representation altogether. In trained LMs it reveals structured transfer at multiple linguistic grain sizes, suggesting that LMs both generalize along representational lines and acquire linguistic abstractions from experience alone.
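The protocol is easy to restate as code. Below is a minimal, illustrative sketch of the perturb-and-trace idea: compute baseline losses on a set of probe sentences, fine-tune briefly on one adversarial example, then measure how each probe's loss shifts. The model name (gpt2), the adversarial and probe sentences, the learning rate, and the step count are all assumptions made for illustration, not the authors' actual setup.

```python
# Minimal sketch of the perturb-and-trace protocol described in the abstract.
# Model, sentences, and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def example_loss(model, tokenizer, text, device):
    """Mean token-level cross-entropy of `text` under `model`."""
    batch = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(**batch, labels=batch["input_ids"])
    return out.loss.item()

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)  # assumed model
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# One adversarial example (hypothetical): an agreement error the LM should resist.
adversarial = "The keys to the cabinet is on the table."
# Probe examples at different linguistic grain sizes (hypothetical).
probes = [
    "The keys to the cabinet are on the table.",  # same construction
    "The books on the shelf are dusty.",          # same abstraction, new lexemes
    "She walked to the store yesterday.",         # unrelated control
]

baseline = [example_loss(model, tokenizer, p, device) for p in probes]

# Perturb: a few gradient steps on the single adversarial example.
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # assumed hyperparameters
batch = tokenizer(adversarial, return_tensors="pt").to(device)
for _ in range(10):  # assumed number of steps
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
model.eval()

# Trace the "infection": loss shift on each probe after the perturbation.
for probe, before in zip(probes, baseline):
    after = example_loss(model, tokenizer, probe, device)
    print(f"{after - before:+.4f}  {probe}")
```

A loss shift on probes that share the adversarial example's construction, but not on unrelated controls, would be the kind of structured transfer the abstract describes.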
Key Points
- ▸ Perturbation is a simple, efficient adversarial tracer for representations in LMs: fine-tune a model on a single adversarial example and measure how the perturbation "infects" other examples.
- ▸ In trained LMs, the perturbation transfers in a structured way at multiple linguistic grain sizes; in untrained LMs it does not, so the method does not find representations where none should exist.
- ▸ Because it makes no geometric assumptions, Perturbation escapes the dilemma between enforcing implausible constraints on representations and trivializing the notion of representation altogether.
Merits
Strength
Perturbation is simple and efficient: it requires only a brief fine-tuning pass on a single example, so it can be integrated into existing LM training and evaluation pipelines with little effort.
Strength
Perturbation's ability to reveal structured transfer at multiple linguistic grain sizes provides insights into the generalization and abstraction capabilities of LMs.
Demerits
Limitation
Perturbation's reliance on adversarial examples may limit its applicability to real-world scenarios, where data may not be as carefully curated.
Limitation
The article does not provide a comprehensive evaluation of Perturbation's performance on a diverse range of LMs and tasks.
Expert Commentary
The article's central contribution is its reconceptualization of representations as conduits for learning rather than patterns of activation, a fresh perspective on a long-standing problem. The finding that perturbations transfer along linguistic lines in trained LMs, but not in untrained ones, suggests the method tracks genuine abstractions rather than artifacts. That said, the reliance on adversarial examples and the absence of a broad evaluation across models and tasks mean further research is needed before Perturbation's full potential is clear. Even so, the study's findings and approach are significant contributions to representation learning in LMs.
Recommendations
- ✓ Future research should focus on evaluating Perturbation's performance on a diverse range of LMs and tasks, including those that are more representative of real-world scenarios.
- ✓ Investigating the applicability of Perturbation to other areas of deep learning, such as computer vision and reinforcement learning, could provide further insights into its potential and limitations.
Sources
Original: arXiv - cs.CL