Residual Stream Analysis of Overfitting and Structural Disruptions
arXiv:2603.13318v1. Abstract: Ensuring that large language models (LLMs) remain both helpful and harmless poses a significant challenge: fine-tuning on repetitive safety datasets, …
Quan Liu, Han Zhou, Wenquan Wu, Hua Wu, Sen Su