GR-SAP: Generative Replay for Safety Alignment Preservation during Fine-Tuning

Zhouxiang Fang, Jiawei Zhou, Hanjie Chen

arXiv:2603.10243v1 Announce Type: new Abstract: Recent studies show that the safety alignment of large language models (LLMs) can be easily compromised even by seemingly non-adversarial fine-tuning. To preserve safety alignment during fine-tuning, a widely used strategy is to jointly optimize safety and task objectives by mixing in the original alignment data, which is typically inaccessible even for open-weight LLMs. Inspired by generative replay in continual learning, we propose Generative Replay for Safety Alignment Preservation (GR-SAP), a unified framework that synthesizes domain-specific alignment data from LLMs and integrates it during downstream adaptation to preserve safety alignment. Theoretical and empirical analyses demonstrate that this synthetic data serves as a reliable proxy for the original alignment data. Experiments across various models and downstream tasks show that GR-SAP substantially mitigates fine-tuning-induced safety degradation while maintaining comparable downstream performance. Our code is available at https://github.com/chili-lab/gr-sap.

Executive Summary

The paper introduces GR-SAP, a framework for preserving safety alignment in large language models during fine-tuning by leveraging generative replay principles. Because safety alignment is susceptible to degradation even under non-adversarial fine-tuning, the authors propose synthesizing domain-specific alignment data from the LLMs themselves as a proxy for the inaccessible original alignment data. Theoretical and empirical evaluations indicate that this synthetic data effectively mitigates safety degradation without compromising downstream task performance. The work addresses a critical gap in fine-tuning safety preservation, particularly where original alignment data is unavailable, offering a scalable and practical solution.

Key Points

  • GR-SAP uses generative replay to synthesize alignment data
  • Synthetic data serves as a proxy for inaccessible original alignment data
  • Empirical results show mitigation of safety degradation without performance loss
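
The replay-and-mix idea behind these points can be sketched in a few lines. This is a hypothetical minimal illustration under simplifying assumptions, not the authors' implementation (see their repository for the real code): `generate_fn` stands in for the still-aligned model's generation call, and the mixing ratio is an illustrative hyperparameter.

```python
import random

def generate_replay_data(prompts, generate_fn):
    """Generative replay: query the still-aligned model on domain-relevant
    safety prompts so its own safe completions serve as a proxy for the
    inaccessible original alignment data."""
    return [{"prompt": p, "response": generate_fn(p)} for p in prompts]

def mix_batches(task_data, replay_data, replay_ratio=0.2, seed=0):
    """Blend synthetic alignment examples into the fine-tuning stream, so a
    single training objective covers both the downstream task and safety."""
    rng = random.Random(seed)
    n_replay = int(len(task_data) * replay_ratio)  # replay examples to add
    mixed = list(task_data) + rng.choices(replay_data, k=n_replay)
    rng.shuffle(mixed)  # interleave task and replay examples
    return mixed

# Toy usage: 10 task examples plus 20% synthetic alignment data.
task = [{"prompt": f"task-{i}", "response": "answer"} for i in range(10)]
replay = generate_replay_data(
    ["harmful request"], lambda p: "I can't help with that."
)
mixed = mix_batches(task, replay, replay_ratio=0.2)
```

In practice the mixed stream would feed a standard fine-tuning loop; the key design choice is that the replay data is generated on demand from the model itself rather than retrieved from the original (unavailable) alignment corpus.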

Merits

Innovative Solution

GR-SAP introduces a creative approach to preserve safety alignment using available resources without requiring external alignment data.

Empirical Validation

The framework is validated through rigorous experiments across multiple models and tasks, demonstrating consistent effectiveness.

Demerits

Assumption Dependency

The efficacy of GR-SAP hinges on the ability of LLMs to generate reliable synthetic alignment data, which may vary depending on model architecture and training nuances.

Generalizability Concern

Results may not fully translate to all fine-tuning scenarios, particularly those involving highly specialized or domain-specific models with unique safety constraints.

Expert Commentary

GR-SAP represents a significant advance on the ongoing challenge of preserving safety alignment during LLM adaptation. Generating synthetic alignment data via generative replay is a sophisticated yet pragmatic solution that circumvents the logistical barrier of unavailable original alignment data. The authors’ integration of theoretical analysis with empirical results strengthens the credibility of their claims. Notably, the framework’s applicability across diverse models and tasks suggests broader utility beyond the specific datasets tested. While the dependency on synthetic data generation introduces a layer of complexity, the demonstrated mitigation of safety degradation without compromising downstream metrics is compelling. This work fills a critical void in the current LLM adaptation literature and offers a replicable methodology that may inform future research on safety-aware model adaptation.

Recommendations

  • Researchers should validate GR-SAP across additional model variants and fine-tuning pipelines to assess scalability.
  • Academic and industry stakeholders should consider incorporating GR-SAP into standard fine-tuning protocols as a baseline safety preservation mechanism.
