From Refusal Tokens to Refusal Control: Discovering and Steering Category-Specific Refusal Directions

arXiv:2603.13359v1 Announce Type: new

Abstract: Language models are commonly fine-tuned for safety alignment to refuse harmful prompts. One approach fine-tunes them to generate categorical refusal tokens that distinguish different refusal types before responding. In this work, we leverage a version of Llama 3 8B fine-tuned with these categorical refusal tokens to enable inference-time control over fine-grained refusal behavior, improving both safety and reliability. We show that refusal token fine-tuning induces separable, category-aligned directions in the residual stream, which we extract and use to construct categorical steering vectors with a lightweight probe that determines whether to steer toward or away from refusal during inference. In addition, we introduce a learned low-rank combination that mixes these category directions in a whitened, orthonormal steering basis, resulting in a single controllable intervention under activation-space anisotropy, and show that this intervention is transferable across same-architecture model variants without additional training. Across benchmarks, both categorical steering vectors and the low-rank combination consistently reduce over-refusals on benign prompts while increasing refusal rates on harmful prompts, highlighting their utility for multi-category refusal control.

Executive Summary

This article introduces an approach to refusal control in language models that leverages categorical refusal tokens for fine-grained control over refusal behavior. The authors demonstrate that refusal-token fine-tuning induces separable, category-aligned directions in the residual stream, which they extract to construct categorical steering vectors; a lightweight probe then decides whether to steer toward or away from refusal during inference. A learned low-rank combination further mixes these category directions in a whitened, orthonormal basis, yielding a single controllable intervention that transfers across same-architecture model variants without additional training. Across benchmarks, the approach reduces over-refusals on benign prompts while increasing refusal rates on harmful prompts.
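The core extraction step described above can be illustrated with a minimal sketch: one direction per refusal category, computed as a difference of mean residual-stream activations between refusing and complying runs. The category names, dimensions, and random stand-in activations below are purely illustrative, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # residual-stream width (illustrative; Llama 3 8B uses 4096)
categories = ["violence", "privacy", "self_harm"]

def refusal_direction(refusal_acts, compliance_acts):
    """Difference-of-means direction in the residual stream, unit-normalized."""
    direction = refusal_acts.mean(axis=0) - compliance_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

# Toy activations standing in for cached residual-stream states on
# refused vs. complied prompts of each category.
directions = {
    c: refusal_direction(rng.normal(size=(32, d_model)),
                         rng.normal(size=(32, d_model)))
    for c in categories
}
for c, v in directions.items():
    print(c, v.shape, round(float(np.linalg.norm(v)), 3))
```

In practice the activations would come from a forward pass over labeled prompts at a chosen layer; the difference-of-means construction is a common way to obtain such concept directions, though the paper may use a different estimator.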

Key Points

  • Categorical refusal tokens enable fine-grained control over refusal behavior
  • Refusal-token fine-tuning induces separable, category-aligned directions in the residual stream
  • Categorical steering vectors, gated by a lightweight probe, steer the model toward or away from refusal at inference time
  • A learned low-rank combination in a whitened, orthonormal basis yields a single intervention that transfers across same-architecture variants
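The probe-gated steering in the points above can be sketched as follows. A linear probe on the residual stream decides the sign of the intervention: toward refusal if the probe flags the prompt, away from refusal otherwise. The probe weights, steering strength `alpha`, and toy vectors are hypothetical, not the paper's values.

```python
import numpy as np

def probe_gated_steer(residual, direction, probe_w, probe_b=0.0, alpha=4.0):
    """Add alpha * direction if the probe fires (push toward refusal),
    otherwise subtract it (suppress an over-refusal)."""
    score = float(residual @ probe_w + probe_b)
    sign = 1.0 if score > 0.0 else -1.0
    return residual + sign * alpha * direction

d = 8
direction = np.zeros(d); direction[0] = 1.0   # unit refusal direction
probe_w = np.zeros(d);   probe_w[1] = 1.0     # probe reads feature 1
harmful = np.zeros(d);   harmful[1] = 2.0     # probe fires -> steer toward refusal
benign = np.zeros(d);    benign[1] = -2.0     # probe silent -> steer away

print(probe_gated_steer(harmful, direction, probe_w)[0])   # 4.0
print(probe_gated_steer(benign, direction, probe_w)[0])    # -4.0
```

In a real model this function would run inside a forward hook on the chosen layer's residual stream, applied per token or per prompt.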

Merits

Improved Safety and Reliability

By raising refusal rates on harmful prompts while reducing over-refusals on benign ones, the approach lowers the risk of harmful responses without making the model needlessly conservative

Flexibility and Customizability

Per-category steering vectors, and a learned mixture of them, allow fine-grained control over refusal behavior that can be tuned to specific use cases
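The abstract's whitened, orthonormal steering basis can be sketched as below: whiten the category directions with an inverse square root of the activation covariance (correcting for activation-space anisotropy), orthonormalize them, then mix with learned weights into a single intervention vector. Dimensions, the mixture weights, and the toy activations are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 3                               # residual width, num. categories
acts = rng.normal(size=(256, d))           # stand-in residual activations
cov = np.cov(acts, rowvar=False) + 1e-3 * np.eye(d)  # regularized covariance

# Sigma^{-1/2} via eigendecomposition: the whitening transform.
evals, evecs = np.linalg.eigh(cov)
whiten = evecs @ np.diag(evals ** -0.5) @ evecs.T

dirs = rng.normal(size=(k, d))             # per-category refusal directions
Q, _ = np.linalg.qr((dirs @ whiten).T)     # orthonormal basis, shape (d, k)
weights = np.array([0.5, 0.3, 0.2])        # illustrative learned mixture
combined = Q @ weights                     # single steering intervention
print(combined.shape, np.allclose(Q.T @ Q, np.eye(k)))
```

The learned part in the paper is the low-rank mixture over this basis; here the weights are fixed by hand just to show the shapes involved.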

Demerits

Limited Transferability

Transfer is demonstrated only across same-architecture model variants; applying the approach to different architectures or domains may require additional training or fine-tuning

Increased Complexity

Categorical steering vectors and low-rank combinations add moving parts to the inference pipeline, potentially impacting performance or interpretability

Expert Commentary

The article presents a significant advancement in refusal control for language models, offering a more nuanced and customizable approach to safety alignment. The use of categorical steering vectors and low-rank combinations enables fine-grained control over refusal behavior, which can be critical in high-stakes applications. However, further research is needed to address potential limitations, such as transferability and complexity. As the field continues to evolve, it is essential to prioritize transparency, explainability, and adversarial robustness in AI decision-making, ensuring that refusal behavior is not only accurate but also trustworthy and reliable.

Recommendations

  • Further research should investigate the transferability of the approach to different model architectures and domains
  • Developers should prioritize transparency and explainability in refusal behavior, enabling better understanding and trust in AI decision-making
