From Stochastic Answers to Verifiable Reasoning: Interpretable Decision-Making with LLM-Generated Code
arXiv:2603.13287v1 Announce Type: new

Abstract: Large language models (LLMs) are increasingly used for high-stakes decision-making, yet existing approaches struggle to reconcile scalability, interpretability, and reproducibility. Black-box models obscure their reasoning, while recent LLM-based rule systems rely on per-sample evaluation, causing costs to scale with dataset size and introducing stochastic, hallucination-prone outputs. We propose reframing LLMs as code generators rather than per-instance evaluators. A single LLM call generates executable, human-readable decision logic that runs deterministically over structured data, eliminating per-sample LLM queries while enabling reproducible and auditable predictions. We combine code generation with automated statistical validation using precision lift, binomial significance testing, and coverage filtering, and apply cluster-based gap analysis to iteratively refine decision logic without human annotation. We instantiate this framework in venture capital founder screening, a rare-event prediction task with strong interpretability requirements. On VCBench, a benchmark of 4,500 founders with a 9% base success rate, our approach achieves 37.5% precision and an F0.5 score of 25.0%, outperforming GPT-4o (at 30.0% precision and an F0.5 score of 25.7%) while maintaining full interpretability. Each prediction traces to executable rules over human-readable attributes, demonstrating verifiable and interpretable LLM-based decision-making in practice.
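The validation pipeline the abstract describes — precision lift over the base rate, a one-sided binomial significance test, and coverage filtering — can be sketched as follows. This is a minimal illustration, not the paper's implementation; all thresholds (`min_lift`, `alpha`, `min_coverage`) are assumed values chosen for the example.

```python
from math import comb

def binom_sf(k, n, p):
    """One-sided tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def validate_rule(hits, successes, dataset_size, base_rate,
                  min_lift=2.0, alpha=0.05, min_coverage=0.005):
    """Keep a candidate rule only if it is precise, statistically
    significant, and fires on enough of the dataset.
    Thresholds here are illustrative, not the paper's values."""
    if hits == 0:
        return False
    precision = successes / hits
    lift = precision / base_rate                    # precision lift over base rate
    p_value = binom_sf(successes, hits, base_rate)  # binomial significance test
    coverage = hits / dataset_size                  # coverage filtering
    return lift >= min_lift and p_value <= alpha and coverage >= min_coverage

# Hypothetical rule firing on 40 of 4,500 founders, 15 of them successes,
# against the 9% base success rate reported for VCBench.
print(validate_rule(hits=40, successes=15, dataset_size=4500, base_rate=0.09))
```

A rule passing all three filters is retained in the decision logic; failing any one rejects it, which is what keeps the final rule set both precise and auditable.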
Executive Summary
This paper proposes a novel approach to interpretable decision-making that reframes large language models (LLMs) as code generators rather than per-instance evaluators. A single LLM call produces executable, human-readable decision logic that runs deterministically over structured data, yielding reproducible and auditable predictions. Demonstrated on venture capital founder screening, the approach achieves 37.5% precision and an F0.5 score of 25.0%, outperforming GPT-4o on precision (30.0%) at a comparable F0.5 score (25.7%) while maintaining full interpretability, with each prediction tracing to executable rules over human-readable attributes.
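For readers unfamiliar with the F0.5 metric cited above: it is the F-beta score with beta = 0.5, which weights precision twice as heavily as recall — a sensible choice for rare-event screening where false positives are costly. A quick sketch (the recall value below is an assumption chosen to reproduce the reported score, not a figure from the paper):

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# With the reported 37.5% precision, a recall around 10.7% (assumed,
# not reported) yields an F0.5 close to the stated 25.0%.
print(round(f_beta(0.375, 0.107), 3))
```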
Key Points
- ▸ LLMs are used as code generators rather than per-instance evaluators
- ▸ The approach enables reproducible and auditable predictions
- ▸ The method is demonstrated in venture capital founder screening with improved performance and interpretability
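The first key point — decision logic generated once, then executed deterministically — can be illustrated with a sketch of the kind of code a single LLM call might emit. The attribute names and thresholds below are hypothetical, not taken from VCBench or the paper:

```python
# Hypothetical decision logic of the kind a single LLM call might generate:
# plain, human-readable rules over structured founder attributes.

def predict_success(founder: dict) -> bool:
    """Executable, auditable decision rule; runs with no further LLM queries."""
    return bool(
        founder.get("prior_exits", 0) >= 1
        or (founder.get("years_experience", 0) >= 10
            and founder.get("technical_degree", False))
    )

founders = [
    {"prior_exits": 1, "years_experience": 4, "technical_degree": False},
    {"prior_exits": 0, "years_experience": 12, "technical_degree": True},
    {"prior_exits": 0, "years_experience": 3, "technical_degree": True},
]
# Deterministic screening pass: the same inputs always yield the same labels,
# and each positive prediction traces to a specific rule clause.
print([predict_success(f) for f in founders])  # [True, True, False]
```

Because the logic is ordinary code, screening 4,500 founders costs one LLM call rather than 4,500, and every decision can be audited by reading the rule that fired.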
Merits
Improved Interpretability
The approach provides verifiable and interpretable LLM-based decision-making, with each prediction tracing to executable rules over human-readable attributes.
Reproducibility and Auditability
The use of executable, human-readable decision logic enables reproducible and auditable predictions. Because a single LLM call generates the logic, inference cost no longer scales with dataset size, and the stochastic, hallucination-prone outputs of per-sample LLM queries are eliminated.
Demerits
Limited Generalizability
The approach is demonstrated in a specific domain (venture capital founder screening), and its applicability to other domains and tasks is not fully explored.
Expert Commentary
The article presents a significant contribution to the field of interpretable AI, demonstrating the potential of LLMs as code generators to produce verifiable and reproducible decision logic. The approach addresses key challenges in AI decision-making, including scalability, reliability, and interpretability. While the results are promising, further research is needed to fully explore the generalizability and applicability of the approach to diverse domains and tasks. The implications of this work are far-reaching, with potential applications in high-stakes decision-making and contributions to ongoing policy discussions around AI regulation and accountability.
Recommendations
- ✓ Further research is needed to explore the generalizability of the approach to other domains and tasks.
- ✓ The development of more robust evaluation metrics and benchmarks is necessary to fully assess the performance and reliability of the approach.