Instruction-Tuned, but Not More Verifiable Instruction-Following: A Cross-Task Diagnosis for LoRA Adapters

Junyi Zou

arXiv:2603.22379v1 Announce Type: new

Abstract: Adapters are often selected and deployed based on nominal labels (e.g., instruction-tuned), which implicitly suggest what capability improves after adaptation. We test whether nominal training objectives reliably align with realized cross-task capability gains by evaluating the same LoRA adapter across tasks. Our strongest evidence is tied to strict, automatically verifiable instruction following as measured by IFEval: across multiple seeds, base models, and LoRA settings, nominal labels recurrently but not universally fail to predict improvements on this verifiable target, with clear configuration sensitivity including a near-zero or negative case. As an illustrative strongest-case example in a controlled instruction-versus-numeric setting, an instruction-tuned adapter substantially improves off-target NM-based numeric benchmark performance from 0.133 to 0.632 while not improving verifiable instruction following on IFEval (ILA: 0.313 to 0.271; PLA: 0.250 to 0.143; values rounded to three decimals). We refer to this nominal-versus-realized mismatch pattern as capability drift as a descriptive label. The mismatch is visible in the raw cross-task performance matrix; we use a drift score only as a compact summary in the same units as the underlying metrics, not as a new formal metric contribution. Evidence from broader instruction-following benchmarks is benchmark-dependent and mixed, reflecting heterogeneity in how instruction following is operationalized; we therefore do not treat cross-benchmark agreement as a premise. Overall, the practical takeaway is to perform routine cross-task evaluation before deployment and to avoid treating nominal labels as reliable capability proxies.
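For readers unfamiliar with the two IFEval numbers quoted above: instruction-level accuracy (ILA) counts individual verifiable constraints, while prompt-level accuracy (PLA) requires every constraint in a prompt to pass at once, so PLA can never exceed ILA. A minimal Python sketch of that aggregation, assuming per-prompt lists of boolean verdicts (the data layout and function name here are ours, not the paper's):

```python
from typing import List, Tuple

def ifeval_accuracies(results: List[List[bool]]) -> Tuple[float, float]:
    """Aggregate per-instruction pass/fail verdicts into the two
    IFEval-style accuracies quoted in the abstract.

    results[i] holds one boolean per verifiable instruction in prompt i.
    """
    total_instructions = sum(len(r) for r in results)
    passed_instructions = sum(sum(r) for r in results)
    # Instruction-level accuracy (ILA): fraction of instructions satisfied.
    ila = passed_instructions / total_instructions
    # Prompt-level accuracy (PLA): fraction of prompts where *every*
    # instruction is satisfied, hence PLA <= ILA.
    pla = sum(all(r) for r in results) / len(results)
    return ila, pla

# Toy example: 4 prompts with mixed outcomes.
demo = [[True, False], [True, True], [False], [True, False, True]]
print(ifeval_accuracies(demo))  # (0.625, 0.25)
```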

Executive Summary

This article examines the relationship between nominal training objectives and realized cross-task capability gains in LoRA adapters. The authors evaluate the same adapter across multiple tasks and find that nominal labels recurrently, though not universally, fail to predict improvements on strictly verifiable instruction following as measured by IFEval. They use 'capability drift' as a descriptive label for this nominal-versus-realized mismatch. The study argues for routine cross-task evaluation before deployment and cautions against treating nominal labels as reliable capability proxies. The findings have implications for how adapters are selected and deployed in natural language processing applications.

Key Points

  • The paper tests whether nominal training objectives align with realized cross-task capability gains in LoRA adapters.
  • Instruction-tuned adapters recurrently, though not universally, fail to improve verifiable instruction following on IFEval, even while substantially improving off-target tasks.
  • Routine cross-task evaluation is essential before deployment to detect capability drift.

Merits

Methodological Rigor

The study employs a robust evaluation framework, including multiple seeds, base models, and LoRA settings, to assess the generalizability of the findings.
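To make "multiple seeds, base models, and LoRA settings" concrete, the sketch below enumerates such an evaluation grid. The specific seeds, model names, ranks, and the `evaluate` stub are illustrative placeholders, since the abstract does not list the paper's exact configurations:

```python
import itertools

# Illustrative grid only; the paper's actual seeds, base models, and
# LoRA hyperparameters are not enumerated in the abstract.
SEEDS = [0, 1, 2]
BASE_MODELS = ["base-model-A", "base-model-B"]
LORA_RANKS = [8, 16]
TASKS = ["ifeval", "numeric"]

def evaluate(model: str, seed: int, rank: int, task: str) -> float:
    # Stand-in for a real benchmark run; wire an actual harness in here.
    return 0.0

for model, seed, rank in itertools.product(BASE_MODELS, SEEDS, LORA_RANKS):
    for task in TASKS:
        score = evaluate(model, seed, rank, task)
        print(f"{model} seed={seed} rank={rank} {task}: {score:.3f}")
```

Sweeping the full grid rather than a single configuration is what lets the authors observe configuration sensitivity, including the near-zero or negative case the abstract mentions.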

Cross-Task Evaluation

The authors' emphasis on evaluating adapters across multiple tasks provides a comprehensive understanding of their capabilities and limitations.
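As a sketch of what such a cross-task view looks like, the snippet below tabulates base versus adapted scores with a per-task delta in the metric's own units, reusing the instruction-versus-numeric numbers quoted in the abstract. The abstract does not give the drift score's exact formula, so the simple adapted-minus-base delta is our assumption:

```python
# Base vs. adapted scores per task; the two rows reuse numbers quoted in
# the abstract (numeric benchmark and IFEval ILA). The dict layout is ours.
scores = {
    "numeric benchmark (off-target)": {"base": 0.133, "adapted": 0.632},
    "IFEval ILA (nominal target)":    {"base": 0.313, "adapted": 0.271},
}

for task, s in scores.items():
    # Compact drift summary: signed change in the metric's own units.
    delta = s["adapted"] - s["base"]
    print(f"{task:32s} base={s['base']:.3f} "
          f"adapted={s['adapted']:.3f} delta={delta:+.3f}")
```

A negative delta on the nominal target alongside a large positive delta off-target is exactly the mismatch the authors label capability drift.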

Demerits

Limited Generalizability

The study focuses on LoRA adapters, so the findings may not extend to other adaptation methods or model families.

Lack of Formal Metric Contribution

The drift score is offered only as a compact summary in the same units as the underlying metrics, not as a formally validated metric, which may limit its utility as a standalone tool for evaluating adapters.

Expert Commentary

This study offers a timely diagnosis of the gap between nominal training objectives and realized cross-task capability gains in LoRA adapters. Its practical message is clear: nominal labels such as 'instruction-tuned' are not reliable capability proxies, and cross-task evaluation should be routine before deployment. The exclusive focus on LoRA adapters does limit generalizability, but the descriptive framing of capability drift and the emphasis on strictly verifiable, benchmark-grounded evidence make this a useful contribution to work on model evaluation and transparency. Developers, deployers, and researchers in the field should take note.

Recommendations

  • Developers and deployers of adapters should prioritize routine cross-task evaluation before deployment.
  • Researchers should explore formal metrics for quantifying capability drift, since the paper deliberately stops short of proposing one.

Sources

Original: arXiv - cs.LG