Steering Code LLMs with Activation Directions for Language and Library Control
arXiv:2603.23629v1 Announce Type: new Abstract: Code LLMs often default to particular programming languages and libraries under neutral prompts. We investigate whether these preferences are encoded as approximately linear directions in activation space that can be manipulated at inference time. Using a difference-in-means method, we estimate layer-wise steering vectors for five language/library pairs and add them to model hidden states during generation. Across three open-weight code LLMs, these interventions substantially increase generation toward the target ecosystem under neutral prompts and often remain effective even when prompts explicitly request the opposite choice. Steering strength varies by model and target, with common ecosystems easier to induce than rarer alternatives, and overly strong interventions can reduce output quality. Overall, our results suggest that code-style preferences in LLMs are partly represented by compact, steerable structure in activation space.
Executive Summary
This study investigates whether programming-language and library preferences in large language models (LLMs) can be influenced by manipulating activation directions in their hidden states. The researchers find that these preferences are encoded as approximately linear directions in activation space that can be steered at inference time: layer-wise steering vectors, estimated with a difference-in-means method, are added to model hidden states during generation and substantially increase generation toward the target ecosystem, often even when prompts explicitly request the opposite choice. The study highlights the potential of this approach for controlling LLMs' coding preferences, but also notes that overly strong interventions can reduce output quality. The findings suggest that code-style preferences in LLMs are partly represented by compact, steerable structure in activation space.
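The core of the difference-in-means method can be sketched as follows. The idea, as described in the abstract, is to collect hidden states at a given layer for two contrastive prompt sets (one eliciting the target ecosystem, one the default) and take the difference of their means as the steering vector. This is a minimal illustrative sketch in pure Python; the function names and toy 4-dimensional activations are assumptions, not the paper's actual code, which would operate on real model activations.

```python
# Difference-in-means steering vector: v = mean(target acts) - mean(default acts).
# Activations here are plain lists standing in for layer-l hidden-state vectors.

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def diff_in_means_direction(target_acts, default_acts):
    """Estimate a steering direction from two contrastive activation sets."""
    mu_target = mean(target_acts)
    mu_default = mean(default_acts)
    return [t - d for t, d in zip(mu_target, mu_default)]

# Toy activations for two prompts per condition (illustration only).
target_acts = [[1.0, 0.0, 2.0, 0.0], [3.0, 0.0, 2.0, 0.0]]
default_acts = [[0.0, 1.0, 2.0, 0.0], [0.0, 3.0, 2.0, 0.0]]
v = diff_in_means_direction(target_acts, default_acts)
print(v)  # [2.0, -2.0, 0.0, 0.0]
```

In a real setting this would be computed per layer from cached model activations, yielding one steering vector per layer as the paper describes.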
Key Points
- ▸ Code LLMs often default to specific programming languages and libraries under neutral prompts.
- ▸ Activation directions in LLMs' hidden states can be manipulated to steer the model towards target ecosystems.
- ▸ Steering strength varies by model and target, with common ecosystems easier to induce than rarer alternatives.
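The intervention itself is an additive edit at inference time: at the chosen layer, the hidden state becomes h + alpha * v for each generated token, where alpha scales the steering strength. The sketch below illustrates this under assumed names; in practice the addition would be installed as a forward hook on a transformer layer, and, as the key points note, too large an alpha can degrade output quality.

```python
# Additive activation steering: h' = h + alpha * v at a chosen layer.
# alpha is the steering strength; its useful range varies by model and target.

def steer(hidden, direction, alpha):
    """Add a scaled steering direction to one hidden-state vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

h = [0.5, -0.5, 1.0]      # toy hidden state
v = [1.0, -1.0, 0.0]      # toy steering vector
print(steer(h, v, 2.0))   # [2.5, -2.5, 1.0]
```

Sweeping alpha trades off steering success against generation quality, matching the paper's observation that overly strong interventions reduce output quality.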
Merits
Strength
The study provides empirical evidence that activation directions in LLMs' hidden states can be manipulated to steer the model towards target ecosystems, offering a potential solution for controlling code language and library preferences.
Demerits
Limitation
The study relies on a difference-in-means method, which assumes each preference is captured by a single linear direction per layer and may not reflect the full variability of the models' behavior.
Generalization
The findings are specific to the three open-weight code LLMs and five language/library pairs studied, and may not transfer to other models, ecosystems, or tasks.
Expert Commentary
This study makes a significant contribution to the field of natural language processing by providing empirical evidence that activation directions in LLMs' hidden states can be manipulated to steer the model towards target ecosystems. The findings have implications for the development of more controllable and transparent LLMs, which can be used in a wider range of applications. However, the study's reliance on a difference-in-means method and the potential for overly strong interventions to reduce output quality are limitations that need to be addressed in future research. Furthermore, the study's findings may have implications for the regulation of LLMs, as the ability to control their behavior could raise questions about accountability and bias.
Recommendations
- ✓ Future research should aim to replicate the study's findings using a wider range of models and tasks to improve the generalizability of the results.
- ✓ Developers and regulators should consider the potential implications of LLMs' controllability for accountability and bias, and explore ways to mitigate these risks.
Sources
Original: arXiv - cs.LG