
Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover


Indranil Halder, Annesya Banerjee, Cengiz Pehlevan

arXiv:2603.11331v1. Abstract: Adversarial attacks can reliably steer safety-aligned large language models toward unsafe behavior. Empirically, we find that adversarial prompt-injection attacks can amplify attack success rate from the slow polynomial growth observed without injection to exponential growth with the number of inference-time samples. To explain this phenomenon, we propose a theoretical generative model of proxy language in terms of a spin-glass system operating in a replica-symmetry-breaking regime, where generations are drawn from the associated Gibbs measure and a subset of low-energy, size-biased clusters is designated unsafe. Within this framework, we analyze prompt injection-based jailbreaking. Short injected prompts correspond to a weak magnetic field aligned towards unsafe cluster centers and yield a power-law scaling of attack success rate with the number of inference-time samples, while long injected prompts, i.e., strong magnetic field, yield exponential scaling. We derive these behaviors analytically and confirm them empirically on large language models. This transition between two regimes is due to the appearance of an ordered phase in the spin chain under a strong magnetic field, which suggests that the injected jailbreak prompt enhances adversarial order in the language model.

Executive Summary

This article proposes a theoretical generative model to explain the crossover from polynomial to exponential scaling of attack success rate with the number of inference-time samples under prompt injection. The authors model generation as a spin-glass system in a replica-symmetry-breaking regime: samples are drawn from the associated Gibbs measure, and a subset of low-energy, size-biased clusters is designated unsafe. Within this framework, a short injected prompt acts as a weak magnetic field aligned with unsafe cluster centers and yields power-law scaling of attack success rate, while a long injected prompt acts as a strong field and yields exponential scaling. The authors derive both behaviors analytically and confirm them empirically on large language models, attributing the transition to the appearance of an ordered phase in the spin chain under a strong field, which suggests the injected jailbreak prompt enhances adversarial order in the model. This work advances the understanding of adversarial attacks on large language models and has significant implications for their safety and security.

Key Points

  • Theoretical generative model of proxy language in terms of a spin-glass system
  • Analytical derivation of the polynomial-to-exponential crossover in attack-success-rate scaling with inference-time samples
  • Empirical confirmation of analytical results on large language models
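The two regimes in the key points above can be illustrated with a deliberately simplified caricature (not the paper's actual derivation). Assume each of n inference-time samples independently lands in an unsafe cluster with probability p, so the attack fails only if all n samples miss. In the strong-field regime we treat p as fixed, giving exponentially decaying failure probability (1 - p)^n. In the weak-field regime we let the per-prompt unsafe mass p be random with a Beta(alpha, 1) law concentrated near zero; this Beta choice is our own stand-in for the size-biased RSB cluster masses, picked because the average failure probability then has a closed form that decays as a power law, roughly Gamma(1 + alpha) * n^(-alpha):

```python
import math

def failure_fixed_p(p: float, n: int) -> float:
    """Strong-field caricature: each sample hits the unsafe cluster with
    fixed probability p, so failure = (1 - p)^n (exponential decay in n)."""
    return (1.0 - p) ** n

def failure_beta_p(alpha: float, n: int) -> float:
    """Weak-field caricature: p ~ Beta(alpha, 1), i.e. density alpha * p^(alpha-1)
    with most mass near p = 0. Averaging (1 - p)^n over p gives
        E[(1 - p)^n] = Gamma(1 + alpha) * Gamma(n + 1) / Gamma(n + 1 + alpha),
    which behaves like Gamma(1 + alpha) * n^(-alpha) for large n (power-law decay).
    Log-gamma is used to avoid overflow at large n."""
    return math.exp(
        math.lgamma(1 + alpha) + math.lgamma(n + 1) - math.lgamma(n + 1 + alpha)
    )

if __name__ == "__main__":
    # Exponential regime: doubling n squares the failure probability.
    print(failure_fixed_p(0.1, 10), failure_fixed_p(0.1, 20))
    # Power-law regime: n^alpha * failure stays roughly constant.
    for n in (100, 1000, 10000):
        print(n, failure_beta_p(0.5, n) * n ** 0.5)
```

Equivalently, attack success rate 1 - failure approaches 1 exponentially fast in the strong-field case but only polynomially fast in the weak-field case, which is the crossover the paper formalizes.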

Merits

Strength of Theoretical Framework

The article proposes a novel theoretical framework that provides a clear and coherent explanation for the observed crossover from polynomial to exponential scaling of attack success rate with the number of inference-time samples.

Empirical Validation

The authors provide empirical evidence to support their analytical results, demonstrating the robustness of their theoretical framework and its ability to accurately predict the behavior of large language models.

Demerits

Limited Scope

The article focuses on a specific type of adversarial attack (prompt injection-based jailbreaking) and may not be generalizable to other types of attacks or scenarios.

Complexity of Theoretical Framework

The theoretical framework proposed in the article is complex and may be challenging for non-experts to understand, potentially limiting the article's accessibility and impact.

Expert Commentary

This article makes a significant contribution to the study of adversarial attacks on large language models, providing a novel theoretical framework that explains the observed crossover from polynomial to exponential scaling of attack success rate under prompt injection. The empirical validation of the analytical results is a major strength, and the implications for the safety and security of artificial intelligence are far-reaching. However, the framework's complexity and the article's focus on a single attack type may limit its accessibility and generality. Nevertheless, this work is a crucial step toward robust defenses against adversarial attacks on large language models, and its findings have significant practical and policy implications.

Recommendations

  • Future research should focus on generalizing the theoretical framework proposed in this article to other types of adversarial attacks and scenarios.
  • Developers of large language models should prioritize the implementation of robust and effective defenses against adversarial attacks, including those proposed in this article.
