CODE-GEN: A Human-in-the-Loop RAG-Based Agentic AI System for Multiple-Choice Question Generation

Xiaojing Duan, Frederick Nwanganga, Chaoli Wang

arXiv:2604.03926v1

Abstract: We present CODE-GEN, a human-in-the-loop, retrieval-augmented generation (RAG)-based agentic AI system for generating context-aligned multiple-choice questions to develop student code reasoning and comprehension abilities. CODE-GEN employs an agentic AI architecture in which a Generator agent produces multiple-choice coding comprehension questions aligned with course-specific learning objectives, while a Validator agent independently assesses content quality across seven pedagogical dimensions. Both agents are augmented with specialized tools that enhance computational accuracy and verify code outputs. To evaluate the effectiveness of CODE-GEN, we conducted an evaluation study involving six human subject-matter experts (SMEs) who judged 288 AI-generated questions. The SMEs produced a total of 2,016 human-AI rating pairs, indicating agreement or disagreement with the Validator's assessments, along with 131 instances of qualitative feedback. Analyses of SME judgments show strong system performance, with human-validated success rates ranging from 79.9% to 98.6% across the seven pedagogical dimensions. The analysis of qualitative feedback reveals that CODE-GEN achieves high reliability on dimensions well suited to computational verification and explicit criteria matching, including question clarity, code validity, concept alignment, and correct answer validity. In contrast, human expertise remains essential for dimensions requiring deeper instructional judgment, such as designing pedagogically meaningful distractors and providing high-quality feedback that reinforces understanding. These findings inform the strategic allocation of human and AI effort in AI-assisted educational content generation.

Executive Summary

The article introduces CODE-GEN, a human-in-the-loop, RAG-based agentic AI system designed to generate context-aligned multiple-choice questions that develop students' code reasoning and comprehension. The system employs a dual-agent architecture (a Generator and a Validator), with both agents augmented by specialized tools that ensure computational accuracy and verify code outputs. An evaluation involving six subject-matter experts (SMEs) covered 288 AI-generated questions, yielding 2,016 human-AI rating pairs and 131 instances of qualitative feedback. Results show strong system performance, with human-validated success rates ranging from 79.9% to 98.6% across seven pedagogical dimensions. Notably, the system excels on computationally verifiable dimensions (e.g., question clarity, code validity) but requires human oversight for pedagogically nuanced tasks (e.g., distractor design, feedback quality). The findings underscore the value of strategically dividing effort between AI and human expertise in educational content generation.
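
A minimal sketch of how the described pipeline could be wired together appears below. The paper does not publish its implementation, so every identifier here (retrieve_context, generator_agent, validator_agent, the MCQ fields) is a hypothetical stand-in, and only six of the seven pedagogical dimensions are named in the abstract; the stubs mark where real retrieval and LLM calls would go.

```python
# Hypothetical sketch of the Generator -> Validator -> human flow described
# in the paper. Names and structure are assumptions, not the authors' code.
from dataclasses import dataclass

# Six of the seven pedagogical dimensions are named in the abstract;
# the seventh is not, so it is omitted here.
DIMENSIONS = [
    "question_clarity", "code_validity", "concept_alignment",
    "correct_answer_validity", "distractor_quality", "feedback_quality",
]

@dataclass
class MCQ:
    stem: str
    code: str
    options: list[str]
    answer_index: int

def retrieve_context(objective: str) -> str:
    """RAG step: fetch course material relevant to a learning objective.
    (Placeholder -- a real system would query a vector store.)"""
    return f"retrieved passages for: {objective}"

def generator_agent(objective: str) -> MCQ:
    """Generator agent: drafts an MCQ grounded in retrieved context.
    An LLM call would go here; this stub returns a fixed question."""
    _context = retrieve_context(objective)
    return MCQ(stem="What does this code print?",
               code="print(sum(range(4)))",
               options=["6", "10", "4", "Error"],
               answer_index=0)

def validator_agent(question: MCQ) -> dict[str, bool]:
    """Validator agent: judges the question on each dimension.
    Stubbed verdicts; the real agent uses tools plus an LLM."""
    return {dim: True for dim in DIMENSIONS}

# Human-in-the-loop: an SME sees the question and the Validator's verdicts
# and agrees or disagrees, producing one rating pair per dimension.
question = generator_agent("loop and accumulator semantics")
verdicts = validator_agent(question)
print(question.stem)
print(verdicts)
```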

Key Points

  • CODE-GEN leverages a dual-agent RAG-based architecture (Generator and Validator) to produce and assess multiple-choice coding comprehension questions, enhancing alignment with course-specific learning objectives.
  • Human-in-the-loop evaluation by six SMEs revealed strong system performance, with human-validated success rates of 79.9% to 98.6% across seven pedagogical dimensions, based on 2,016 rating pairs and 131 qualitative feedback instances.
  • The study highlights the complementary roles of AI and human expertise: AI excels in computationally verifiable dimensions (e.g., question clarity, code validity), while human judgment is critical for pedagogically complex tasks (e.g., distractor design, feedback quality).

Merits

Innovative Agentic AI Architecture

The dual-agent system (Generator and Validator) with specialized tools for computational verification represents a significant advancement in AI-assisted educational content generation, ensuring both efficiency and reliability.
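
To make the "specialized tools" concrete: one natural tool of this kind executes a question's code snippet and checks that the keyed answer matches the actual output. The sketch below is an assumed reconstruction, not the authors' tooling; a deployed verifier would sandbox execution rather than call exec() directly.

```python
# Assumed sketch of a code-output verification tool: run the snippet,
# capture stdout, and confirm the keyed option matches reality.
import contextlib
import io

def run_snippet(code: str) -> str:
    """Execute a trusted Python snippet and capture what it prints.
    Illustration only -- production code must sandbox this."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()

def verify_answer_key(code: str, options: list[str], answer_index: int) -> bool:
    """True iff the option marked correct equals the code's real output."""
    return options[answer_index].strip() == run_snippet(code)

# sum(range(4)) == 0 + 1 + 2 + 3 == 6, so option 0 checks out.
print(verify_answer_key("print(sum(range(4)))", ["6", "10", "4", "Error"], 0))
```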

Empirical Rigor and Scalability

The study's evaluation framework, encompassing 288 questions, 2,016 human-AI rating pairs, and 131 instances of qualitative feedback, demonstrates both methodological rigor and the scalability of the generation pipeline, providing credible evidence of CODE-GEN's effectiveness.
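
The arithmetic is worth making explicit: 288 questions times seven dimensions is exactly 2,016 rating pairs, i.e., one human verdict per question per dimension, which implies each question was reviewed by a single SME (48 questions per expert). The snippet below sketches how per-dimension success rates like the reported 79.9% to 98.6% fall out of such data; the records are illustrative, since the study's raw ratings are not published.

```python
# Illustrative aggregation of human-AI rating pairs into per-dimension
# success rates. The four records below are made up, not the study's data.
from collections import defaultdict

assert 288 * 7 == 2016  # questions x dimensions = rating pairs in the paper

# Each record: (question_id, dimension, human_judged_success)
ratings = [
    (1, "code_validity", True),
    (1, "distractor_quality", False),
    (2, "code_validity", True),
    (2, "distractor_quality", True),
]

totals = defaultdict(int)
successes = defaultdict(int)
for _qid, dim, ok in ratings:
    totals[dim] += 1
    successes[dim] += ok  # bool counts as 0/1

for dim, n in totals.items():
    print(f"{dim}: {100 * successes[dim] / n:.1f}%")
# code_validity: 100.0%
# distractor_quality: 50.0%
```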

Strategic Human-AI Collaboration

The findings offer a nuanced understanding of where AI and human expertise are most effectively deployed, optimizing the allocation of resources in educational content generation.

Demerits

Limited Generalizability

The evaluation was conducted with a small cohort of six SMEs and focused on a specific domain (coding comprehension), which may limit the generalizability of the results to other subjects or educational contexts.

Dependence on High-Quality Inputs

The system’s performance relies heavily on the quality of course-specific learning objectives and the availability of subject-matter experts for validation, which may pose challenges in resource-constrained settings.

Potential Over-Reliance on Computational Verification

While the system handles computationally verifiable dimensions well, it may inadvertently prioritize these metrics over more subjective, pedagogically critical aspects, such as the effectiveness of feedback.

Expert Commentary

The CODE-GEN system represents a significant leap forward in AI-assisted educational content generation, demonstrating how agentic architectures and RAG can be harnessed to produce high-quality, context-aligned multiple-choice questions. The study’s rigorous evaluation framework and the nuanced findings regarding the strengths and limitations of AI versus human judgment are particularly noteworthy. The delineation of where AI excels (e.g., computational verification) and where human expertise is irreplaceable (e.g., pedagogical nuance) offers a compelling model for future AI-human collaborations in education. However, the study’s limited sample size and domain specificity suggest that further research is needed to validate these findings across diverse educational contexts. Additionally, the potential for over-reliance on computationally verifiable metrics warrants careful consideration to ensure that pedagogical richness is not sacrificed for efficiency. Overall, CODE-GEN sets a high standard for integrating AI into educational practices while underscoring the indispensable role of human expertise.

Recommendations

  • Expand the evaluation to include a broader range of subjects and educational levels to assess the generalizability of CODE-GEN’s performance and adaptability to different pedagogical contexts.
  • Develop mechanisms to capture and incorporate qualitative pedagogical feedback more systematically, ensuring that the system evolves to address the nuances of instructional design beyond computationally verifiable dimensions.
  • Establish ethical guidelines and best practices for the deployment of AI in educational content generation, emphasizing transparency, accountability, and the preservation of human judgment in pedagogically critical tasks.

Sources

Original: arXiv - cs.AI