From Stochastic Answers to Verifiable Reasoning: Interpretable Decision-Making with LLM-Generated Code
arXiv:2603.13287v1 Announce Type: new

Abstract: Large language models (LLMs) are increasingly used for high-stakes decision-making, yet existing approaches struggle to reconcile scalability, interpretability, and reproducibility. Black-box models obscure their reasoning, while recent LLM-based rule systems rely on per-sample evaluation, causing costs to scale with dataset size and introducing stochastic, hallucination-prone outputs. We propose reframing LLMs as code generators rather than per-instance evaluators. A single LLM call generates executable, human-readable decision logic that runs deterministically over structured data, eliminating per-sample LLM queries while enabling reproducible and auditable predictions. We combine code generation with automated statistical validation using precision lift, binomial significance testing, and coverage filtering, and apply cluster-based gap analysis to iteratively refine decision logic without human annotation. We instantiate this framework in venture capital founder screening, a rare-event prediction task with strong interpretability requirements. On VCBench, a benchmark of 4,500 founders with a 9% base success rate, our approach achieves 37.5% precision and an F0.5 score of 25.0%, outperforming GPT-4o (at 30.0% precision and an F0.5 score of 25.7%) while maintaining full interpretability. Each prediction traces to executable rules over human-readable attributes, demonstrating verifiable and interpretable LLM-based decision-making in practice.
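The validation pipeline the abstract describes — precision lift over the base rate, a one-sided binomial significance test, and coverage filtering — can be sketched as follows. This is a minimal illustration, not the paper's implementation; all thresholds (`min_lift`, `alpha`, `min_coverage`) are assumed values chosen for the example.

```python
from math import comb

def binom_sf(k, n, p):
    """One-sided tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def validate_rule(hits, successes, dataset_size, base_rate,
                  min_lift=2.0, alpha=0.05, min_coverage=0.005):
    """Keep a candidate rule only if it is precise, statistically
    significant, and fires on enough of the dataset.
    Thresholds here are illustrative, not the paper's values."""
    if hits == 0:
        return False
    precision = successes / hits
    lift = precision / base_rate                    # precision lift over base rate
    p_value = binom_sf(successes, hits, base_rate)  # binomial significance test
    coverage = hits / dataset_size                  # coverage filtering
    return lift >= min_lift and p_value <= alpha and coverage >= min_coverage

# Hypothetical rule firing on 40 of 4,500 founders, 15 of them successes,
# against the 9% base success rate reported for VCBench.
print(validate_rule(hits=40, successes=15, dataset_size=4500, base_rate=0.09))
```

A rule passing all three filters is retained in the decision logic; failing any one rejects it, which is what keeps the final rule set both precise and auditable.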
Executive Summary
This paper proposes a novel approach to interpretable decision-making that reframes large language models (LLMs) as code generators rather than per-instance evaluators. A single LLM call produces executable, human-readable decision logic that runs deterministically over structured data, yielding reproducible and auditable predictions. Demonstrated on venture capital founder screening, the approach achieves 37.5% precision and an F0.5 score of 25.0%, outperforming GPT-4o on precision (30.0%) at a comparable F0.5 score (25.7%) while maintaining full interpretability, with each prediction tracing to executable rules over human-readable attributes.
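For readers unfamiliar with the F0.5 metric cited above: it is the F-beta score with beta = 0.5, which weights precision twice as heavily as recall — a sensible choice for rare-event screening where false positives are costly. A quick sketch (the recall value below is an assumption chosen to reproduce the reported score, not a figure from the paper):

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# With the reported 37.5% precision, a recall around 10.7% (assumed,
# not reported) yields an F0.5 close to the stated 25.0%.
print(round(f_beta(0.375, 0.107), 3))
```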
Key Points
- ▸ LLMs are used as code generators rather than per-instance evaluators
- ▸ The approach enables reproducible and auditable predictions
- ▸ The method is demonstrated in venture capital founder screening with improved performance and interpretability
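The first key point — decision logic generated once, then executed deterministically — can be illustrated with a sketch of the kind of code a single LLM call might emit. The attribute names and thresholds below are hypothetical, not taken from VCBench or the paper:

```python
# Hypothetical decision logic of the kind a single LLM call might generate:
# plain, human-readable rules over structured founder attributes.

def predict_success(founder: dict) -> bool:
    """Executable, auditable decision rule; runs with no further LLM queries."""
    return bool(
        founder.get("prior_exits", 0) >= 1
        or (founder.get("years_experience", 0) >= 10
            and founder.get("technical_degree", False))
    )

founders = [
    {"prior_exits": 1, "years_experience": 4, "technical_degree": False},
    {"prior_exits": 0, "years_experience": 12, "technical_degree": True},
    {"prior_exits": 0, "years_experience": 3, "technical_degree": True},
]
# Deterministic screening pass: the same inputs always yield the same labels,
# and each positive prediction traces to a specific rule clause.
print([predict_success(f) for f in founders])  # [True, True, False]
```

Because the logic is ordinary code, screening 4,500 founders costs one LLM call rather than 4,500, and every decision can be audited by reading the rule that fired.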
Merits
Improved Interpretability
The approach provides verifiable and interpretable LLM-based decision-making, with each prediction tracing to executable rules over human-readable attributes.
Reproducibility and Auditability
The use of executable, human-readable decision logic enables reproducible and auditable predictions. Because a single LLM call generates the logic, inference cost no longer scales with dataset size, and the stochastic, hallucination-prone outputs of per-sample LLM queries are eliminated.
Demerits
Limited Generalizability
The approach is demonstrated in a specific domain (venture capital founder screening), and its applicability to other domains and tasks is not fully explored.
Expert Commentary
The article presents a significant contribution to the field of interpretable AI, demonstrating the potential of LLMs as code generators to produce verifiable and reproducible decision logic. The approach addresses key challenges in AI decision-making, including scalability, reliability, and interpretability. While the results are promising, further research is needed to fully explore the generalizability and applicability of the approach to diverse domains and tasks. The implications of this work are far-reaching, with potential applications in high-stakes decision-making and contributions to ongoing policy discussions around AI regulation and accountability.
Recommendations
- ✓ Further research is needed to explore the generalizability of the approach to other domains and tasks.
- ✓ The development of more robust evaluation metrics and benchmarks is necessary to fully assess the performance and reliability of the approach.