
Explain in Your Own Words: Improving Reasoning via Token-Selective Dual Knowledge Distillation


Minsang Kim, Seung Jun Baek

arXiv:2603.13260v1 Announce Type: new Abstract: Knowledge Distillation (KD) can transfer the reasoning abilities of large models to smaller ones, which can reduce the costs to generate Chain-of-Thoughts for reasoning tasks. KD methods typically ask the student to mimic the teacher's distribution over the entire output. However, a student with limited capacity can be overwhelmed by such extensive supervision, causing a distribution mismatch, especially in complex reasoning tasks. We propose Token-Selective Dual Knowledge Distillation (TSD-KD), a framework for student-centric distillation. TSD-KD focuses on distilling important tokens for reasoning and encourages the student to explain reasoning in its own words. TSD-KD combines indirect and direct distillation. Indirect distillation uses a weak form of feedback based on preference ranking. The student proposes candidate responses generated on its own; the teacher re-ranks those candidates as indirect feedback without enforcing its entire distribution. Direct distillation uses distribution matching; however, it selectively distills tokens based on the relative confidence between teacher and student. Finally, we add entropy regularization to maintain the student's confidence during distillation. Overall, our method provides the student with targeted and indirect feedback to support its own reasoning process and to facilitate self-improvement. The experiments show the state-of-the-art performance of TSD-KD on 10 challenging reasoning benchmarks, outperforming the baseline and runner-up in accuracy by up to 54.4% and 40.3%, respectively. Notably, a student trained by TSD-KD even outperformed its own teacher model in four cases by up to 20.3%. The source code is available at https://github.com/kmswin1/TSD-KD.

Executive Summary

The article introduces Token-Selective Dual Knowledge Distillation (TSD-KD), a novel framework addressing the limitations of traditional Knowledge Distillation (KD) in reasoning tasks. While conventional KD forces student models to mimic the teacher's full output distribution—often overwhelming smaller models with excessive supervision—TSD-KD takes a student-centric approach by selectively distilling only the tokens critical for reasoning. It combines indirect distillation via preference-based feedback (the teacher re-ranks candidates the student generated itself) with direct distillation via selective token matching based on the relative confidence between teacher and student. Additionally, entropy regularization is employed to preserve the student's confidence during training. Experimental results demonstrate TSD-KD's superior performance across 10 reasoning benchmarks, achieving up to 54.4% higher accuracy than baselines and, in four cases, outperforming its own teacher by up to 20.3%. These findings indicate a significant advancement in scalable, effective reasoning transfer.
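To make the direct-distillation idea concrete, here is a minimal, hypothetical sketch of confidence-based token selection with an entropy regularizer. It is not the authors' implementation (the paper's exact selection criterion and regularizer weighting may differ): a position is distilled only when the teacher is more confident about the target token than the student, and an entropy penalty discourages the student's distribution from flattening out.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    # KL(p || q) for one token position's distributions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def tsd_direct_loss(teacher_logits, student_logits, targets, beta=0.01):
    """Hypothetical token-selective distillation loss.

    Distills only positions where the teacher assigns higher probability
    to the target token than the student does; adds beta * entropy of the
    student distribution so that minimizing the loss keeps the student
    confident (peaked) rather than flattened by soft teacher targets.
    """
    total, selected = 0.0, 0
    for t_log, s_log, y in zip(teacher_logits, student_logits, targets):
        p_t, p_s = softmax(t_log), softmax(s_log)
        if p_t[y] > p_s[y]:                 # teacher relatively more confident
            total += kl(p_t, p_s)           # match distributions on this token only
            total += beta * entropy(p_s)    # entropy penalty maintains confidence
            selected += 1
    return total / max(selected, 1), selected
```

With a toy two-token sequence over a three-word vocabulary, only the position where the teacher is more confident contributes to the loss; the other position is skipped entirely, which is the "token-selective" part of the method.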

Key Points

  • TSD-KD shifts from full distribution mimicry to token-selective distillation
  • Combines indirect (preference-based) and direct (confidence-based) distillation mechanisms
  • Entropy regularization supports student confidence maintenance
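The indirect half of the method can be sketched just as briefly. In this hypothetical illustration (not the paper's code), the student proposes its own candidate responses and the teacher only re-ranks them with some scalar scorer, such as teacher log-likelihood; the ranking yields a preferred/dispreferred pair that can feed a preference-learning objective, without ever imposing the teacher's full distribution:

```python
def rank_candidates(candidates, teacher_score):
    """Teacher re-ranks student-generated candidates (weak, indirect feedback).

    `teacher_score` is any scalar scoring function of a candidate, e.g. the
    teacher's log-likelihood of the response; here it is a stand-in.
    Returns a (preferred, dispreferred) pair for preference-based training.
    """
    ranked = sorted(candidates, key=teacher_score, reverse=True)
    return ranked[0], ranked[-1]
```

The key design point, as the abstract describes, is that the supervision signal is a ranking over the student's own outputs, so the student learns to "explain reasoning in its own words" rather than imitate the teacher token by token.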

Merits

Performance Superiority

TSD-KD outperforms baselines, and in several benchmarks even its own teacher, by substantial margins.

Demerits

Complexity in Implementation

The selective token matching and confidence-differential computation may add implementation complexity for practitioners unfamiliar with dual distillation paradigms.

Expert Commentary

The innovation in TSD-KD lies in its nuanced recognition that forcing full distribution alignment is counterproductive for constrained student architectures. By shifting focus to token-level relevance in reasoning contexts—particularly through confidence-weighted selection—the authors align the distillation process with the student’s capacity constraints. The integration of indirect feedback via preference ranking is particularly elegant, as it avoids over-constraining while still providing directional guidance. Furthermore, the entropy regularization component is a sophisticated yet practical mechanism to counteract the common phenomenon of confidence erosion during knowledge transfer. This framework represents a paradigm shift: rather than replicating teacher outputs, it empowers students to reason independently with targeted, context-aware support. The empirical validation across diverse benchmarks confirms the robustness of the approach. Notably, the fact that students occasionally outperform their teachers—a phenomenon rarely documented—suggests TSD-KD may unlock latent reasoning potential in compressed models. This work is poised to influence both research and deployment strategies in AI reasoning.

Recommendations

  • Researchers should integrate TSD-KD into benchmark evaluations for reasoning tasks as a standard comparison
  • Practitioners deploying student models should consider TSD-KD as a preferred distillation protocol when both computational efficiency and accuracy are required

Sources