QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions
arXiv:2603.12165v1 Announce Type: new Abstract: Synthetic data has become essential for training code generation models, yet it introduces significant noise and hallucinations that are difficult to detect with current metrics. Existing data selection methods like Instruction-Following Difficulty (IFD) typically assess how hard a model generates an answer given a query ($A|Q$). However, this metric is ambiguous on noisy synthetic data, where low probability cannot distinguish between intrinsic task complexity and model-generated hallucinations. Here, we propose QAQ, a novel data selection framework that evaluates data quality from the reverse direction: how well can the answer predict the query ($Q|A$)? We define Reverse Mutual Information (RMI) to quantify the information gain about the query conditioned on the answer. Our analyses reveal that both extremes of RMI signal quality issues: low RMI indicates semantic misalignment, while excessively high RMI may contain defect patterns that LLMs easily recognize. Furthermore, we introduce a selection strategy based on the disagreement between strong and weak models to identify samples that are valid yet challenging. Experiments on the WarriorCoder dataset demonstrate that selecting just 25% of data using stratified RMI achieves comparable performance to full-data training, significantly outperforming existing data selection methods. Our approach highlights the importance of bidirectional semantic coherence in synthetic data curation, offering a scalable pathway to reduce computational costs without sacrificing model capability.
Executive Summary
The article introduces QAQ, a novel framework for improving synthetic code data quality by shifting the evaluation axis from model generation difficulty (A|Q) to answer-to-query predictability (Q|A). Using Reverse Mutual Information (RMI) as its core metric, QAQ flags quality issues at both extremes: low RMI signals semantic misalignment, while excessively high RMI signals defect patterns that LLMs easily recognize. The authors also propose a selection strategy that leverages disagreement between strong and weak models to isolate samples that are valid yet challenging. Empirical results on the WarriorCoder dataset show that stratified RMI-based selection of just 25% of the data achieves performance comparable to full-data training, offering a scalable, cost-effective alternative to conventional data selection methods.
Key Points
- ▸ Introduction of Reverse Mutual Information (RMI) as a new metric for data quality assessment
- ▸ Shift from generation-based (A|Q) to prediction-based (Q|A) evaluation
- ▸ Identification of quality indicators at both ends of RMI spectrum—low and high RMI
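The abstract defines RMI as the information gain about the query conditioned on the answer but does not give an estimator. A minimal sketch, assuming RMI takes the pointwise form log P(Q|A) − log P(Q) under a scoring LLM (the function name, signature, and toy log-probabilities below are illustrative, not the paper's):

```python
def rmi(logp_q_given_a: float, logp_q: float) -> float:
    """Reverse Mutual Information (assumed pointwise form):
    the gain in query log-likelihood when the scoring model
    also conditions on the answer."""
    return logp_q_given_a - logp_q

# Toy sequence log-probabilities standing in for LLM scores.
# Seeing the answer makes the query far more predictable -> high RMI.
aligned = rmi(logp_q_given_a=-12.0, logp_q=-35.0)

# Seeing the answer barely helps -> low RMI (semantic misalignment).
misaligned = rmi(logp_q_given_a=-34.0, logp_q=-35.0)

assert aligned > misaligned
```

In practice both terms would come from a causal LM's summed token log-likelihoods over the query, with and without the answer prepended to the context.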
Merits
Innovative Metric
QAQ’s RMI metric represents a paradigm shift in detecting synthetic data hallucinations by focusing on predictability rather than generation difficulty, offering greater sensitivity to semantic coherence.
Efficiency Gains
The selection strategy reduces training data usage by 75% without compromising model performance, demonstrating significant computational savings.
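The abstract only names the strategy "stratified RMI"; one plausible reading, sketched below under assumption, is to discard both RMI extremes and then draw evenly from strata of the mid-range until the 25% budget is met (the tail fraction, stratum count, and function name are all hypothetical):

```python
def stratified_rmi_select(scores, keep_frac=0.25, n_strata=4, tail_frac=0.1):
    """Hypothetical stratified selection over per-sample RMI scores.
    Drops both extremes (low RMI = misaligned, very high RMI =
    trivially recognizable defects), then takes an equal number of
    samples from each stratum of the remaining mid-range.
    Returns indices of kept samples."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    lo = int(len(order) * tail_frac)
    mid = order[lo:len(order) - lo]          # strip both RMI tails
    n_keep = int(len(scores) * keep_frac)
    per_stratum = max(1, n_keep // n_strata)
    stratum_size = max(1, len(mid) // n_strata)
    kept = []
    for s in range(n_strata):
        stratum = mid[s * stratum_size:(s + 1) * stratum_size]
        kept.extend(stratum[:per_stratum])
    return kept[:n_keep]

# 100 samples whose RMI happens to equal their index:
kept = stratified_rmi_select(list(range(100)), keep_frac=0.25)
assert all(10 <= i < 90 for i in kept)   # no extreme-RMI samples kept
assert len(kept) <= 25                   # within the 25% budget
```

A real pipeline would sample within strata rather than slicing, but the extreme-trimming plus even-strata coverage is the essence of the stated design.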
Demerits
Complexity of Implementation
Calibrating RMI thresholds for different LLMs and code domains may require additional tuning and domain-specific validation, potentially adding overhead in practical deployment.
Expert Commentary
This work marks a significant advancement in the field of synthetic data curation. The shift from evaluating model generation to evaluating answer-to-query predictability is both theoretically sound and empirically validated. The use of RMI as a proxy for bidirectional semantic coherence introduces a level of precision previously absent in synthetic data selection. Moreover, the disagreement-based selection mechanism cleverly exploits model uncertainty as a signal for sample quality—a nuanced insight that aligns with Bayesian principles of uncertainty quantification. The finding that selecting just 25% of the data matches full-data performance suggests that QAQ may become a new standard in code generation training pipelines. Notably, while the authors acknowledge the calibration challenge, their empirical validation on a domain-specific dataset (WarriorCoder) lends credibility to the approach's generalizability potential. This paper should be considered a foundational contribution in the evolution of synthetic data evaluation.
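The disagreement-based mechanism discussed above can be illustrated with a minimal sketch. Assuming per-sample RMI scores from a strong and a weak model, one hypothetical rule (the margin threshold and function name are assumptions, not the paper's specification) keeps samples the strong model finds coherent but the weak model does not, treating those as valid yet challenging:

```python
def disagreement_filter(strong_rmi, weak_rmi, margin=5.0):
    """Hypothetical strong/weak disagreement filter: keep indices
    where the strong model's RMI exceeds the weak model's by at
    least `margin`, i.e. samples whose query-answer coherence only
    a capable model can recognize."""
    return [i for i, (s, w) in enumerate(zip(strong_rmi, weak_rmi))
            if s - w >= margin]

# Sample 1 is coherent to the strong model but opaque to the weak
# one, so it alone survives the filter.
kept = disagreement_filter([10.0, 20.0, 6.0], [9.0, 5.0, 5.0])
assert kept == [1]
```

Samples where both models agree (either both high or both low) are, under this reading, either too easy or genuinely noisy, and are excluded.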
Recommendations
- ✓ Researchers should incorporate RMI as a benchmark metric in synthetic data evaluation studies for code generation models.
- ✓ Engineers deploying LLM-based code assistants should consider integrating QAQ-style selection strategies to optimize training data efficiency without sacrificing accuracy.