
Prompt Engineering for Scale Development in Generative Psychometrics


Lara Lee Russell-Lasalandra, Hudson Golino

arXiv:2603.15909v1

Abstract: This Monte Carlo simulation examines how prompt engineering strategies shape the quality of large language model (LLM)-generated personality assessment items within the AI-GENIE framework for generative psychometrics. Item pools targeting the Big Five traits were generated using multiple prompting designs (zero-shot, few-shot, persona-based, and adaptive), model temperatures, and LLMs, then evaluated and reduced using network psychometric methods. Across all conditions, AI-GENIE reliably improved structural validity following reduction, with the magnitude of its incremental contribution inversely related to the quality of the incoming item pool. Prompt design exerted a substantial influence on both pre- and post-reduction item quality. Adaptive prompting consistently outperformed non-adaptive strategies by sharply reducing semantic redundancy, elevating pre-reduction structural validity, and preserving substantially larger item pools, particularly when paired with newer, higher-capacity models. These gains were robust across temperature settings for most models, indicating that adaptive prompting mitigates common trade-offs between creativity and psychometric coherence. An exception was observed for the GPT-4o model at high temperatures, suggesting model-specific sensitivity to adaptive constraints at elevated stochasticity. Overall, the findings demonstrate that adaptive prompting is the strongest approach in this context, and that its benefits scale with model capability, motivating continued investigation of model-prompt interactions in generative psychometric pipelines.

Executive Summary

This study investigates the impact of prompt engineering strategies on the quality of LLM-generated personality assessment items within the AI-GENIE framework. Utilizing Monte Carlo simulations across zero-shot, few-shot, persona-based, and adaptive prompting designs, the research evaluates structural validity pre- and post-reduction via network psychometric methods. Findings indicate that adaptive prompting significantly enhances item quality by reducing semantic redundancy, improving pre-reduction structural validity, and preserving larger item pools, especially when paired with newer, higher-capacity models. These benefits are generally robust across temperature settings, with an exception noted for GPT-4o at high temperatures. The study concludes that adaptive prompting is the most effective strategy for generative psychometrics, particularly as model capabilities advance.
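To make the pipeline concrete, here is a minimal Python sketch of the generate-then-reduce loop the summary describes. It is not the AI-GENIE implementation: call_llm and its stubbed output are placeholders, the bag-of-words embedding is a toy stand-in for a sentence-embedding model, and the cosine-similarity pruning is a crude proxy for the network psychometric reduction reported in the study; the 0.8 threshold is likewise an illustrative assumption.

    import numpy as np

    def call_llm(prompt: str, temperature: float = 0.7) -> list[str]:
        """Placeholder for an LLM call (the study used several models,
        e.g. GPT-4o). Stubbed so the sketch runs without an API key."""
        return [
            "I enjoy meeting new people.",
            "I really enjoy meeting new people.",
            "I keep my belongings tidy and organized.",
        ]

    def embed(items: list[str]) -> np.ndarray:
        """Toy bag-of-words vectors; in practice a sentence-embedding
        model would be used to capture semantic similarity."""
        vocab = sorted({w for item in items for w in item.lower().split()})
        index = {w: i for i, w in enumerate(vocab)}
        vectors = np.zeros((len(items), len(vocab)))
        for row, item in enumerate(items):
            for w in item.lower().split():
                vectors[row, index[w]] += 1.0
        return vectors

    def prune_redundant(items: list[str], threshold: float = 0.8) -> list[str]:
        """Greedy pruning of near-duplicate items by pairwise cosine
        similarity: a crude stand-in for the study's network
        psychometric reduction step."""
        if not items:
            return []
        vecs = embed(items)
        norms = np.linalg.norm(vecs, axis=1, keepdims=True)
        unit = vecs / np.where(norms == 0, 1.0, norms)
        sims = unit @ unit.T
        kept: list[int] = []
        for i in range(len(items)):
            if all(sims[i, j] < threshold for j in kept):
                kept.append(i)
        return [items[i] for i in kept]

    pool = call_llm("Write Big Five extraversion self-report items.")
    print(prune_redundant(pool))  # drops the near-duplicate second item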

Key Points

  • Adaptive prompting outperforms non-adaptive strategies
  • Reduction in semantic redundancy enhances structural validity
  • Benefits scale with model capability

Merits

Strength of Adaptive Prompting

Adaptive prompting consistently reduced semantic redundancy and improved pre-reduction structural validity, with advantages that scaled with model capability.
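The abstract does not spell out the adaptive mechanism, so the loop below is only one plausible reading, reusing the call_llm and prune_redundant helpers from the earlier sketch: each round rebuilds the prompt around the accepted pool so the model is steered away from near-duplicates. The template wording, round and batch counts, and threshold are all illustrative assumptions.

    def adaptive_generate(trait: str, rounds: int = 3, per_round: int = 10,
                          temperature: float = 0.7,
                          threshold: float = 0.8) -> list[str]:
        """One plausible adaptive loop: each round, rebuild the prompt to
        steer the model away from items already in the accepted pool."""
        accepted: list[str] = []
        for _ in range(rounds):
            avoid = "\n".join(f"- {item}" for item in accepted)
            prompt = (
                f"Write {per_round} new {trait} self-report items.\n"
                f"Do not paraphrase any of these existing items:\n{avoid}"
            )
            candidates = call_llm(prompt, temperature=temperature)
            # Prune near-duplicates so the next round's prompt only
            # carries distinct items forward.
            accepted = prune_redundant(accepted + candidates, threshold)
        return accepted

Under this reading the redundancy filter does double duty: it trims the current pool and, through the rebuilt prompt, constrains the next round of generation, which is consistent with the abstract's claim that adaptive prompting sharply reduces semantic redundancy.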

Robustness Across Conditions

Findings showed consistent gains across temperature settings for most models, indicating the generalizability of adaptive prompting's effects across decoding conditions.

Demerits

Model-Specific Sensitivity

An exception was observed with GPT-4o at high temperatures, suggesting potential model-specific limitations under elevated stochasticity.

Expert Commentary

The article presents a compelling empirical validation of adaptive prompting’s superiority in the context of generative psychometrics. The empirical rigor of using Monte Carlo simulations across multiple prompting modalities and network psychometric evaluation methods lends substantial credibility to the conclusions. Importantly, the observed trade-off dynamics—particularly the mitigation of the creativity-coherence tension through adaptive prompting—align with broader computational linguistics theories on constraint-induced coherence. The exception noted with GPT-4o at high temperatures warrants deeper scrutiny; it may reflect either architectural constraints in the model’s stochasticity tolerance or a latent interaction between prompt complexity and model-specific activation patterns. This nuanced finding elevates the study beyond a generalizable claim to a more sophisticated, model-aware recommendation. As the field moves toward scalable, high-capacity generative models, the implications extend beyond item generation to broader psychometric infrastructure design, including validation protocols and item bank construction. The study sets a new benchmark for evidence-based prompting in AI-mediated psychological assessment.

Recommendations

  • Adopt adaptive prompting as the default strategy in LLM-based personality assessment item generation.
  • Conduct targeted investigations into model-specific sensitivities to adaptive constraints, particularly with high-capacity variants like GPT-4o at elevated temperatures (a starter temperature sweep is sketched below).
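As a starting point for that second recommendation, the sketch below sweeps the sampling temperature with the helpers defined earlier and tracks how much of each raw pool the redundancy filter discards. The temperature grid and the single-model loop are illustrative assumptions; extending the grid across models would mean swapping the backend inside call_llm.

    # Probe temperature sensitivity: generate a raw pool at each setting
    # and record how much of it the redundancy filter discards. With the
    # stubbed call_llm the numbers repeat; a real backend would vary.
    for temp in (0.2, 0.7, 1.0, 1.3):
        raw = call_llm("Write 20 extraversion self-report items.",
                       temperature=temp)
        kept = prune_redundant(raw)
        rate = 1.0 - len(kept) / max(len(raw), 1)
        print(f"T={temp}: generated={len(raw)}, kept={len(kept)}, "
              f"redundancy={rate:.2f}")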

Sources

  • arXiv:2603.15909v1: https://arxiv.org/abs/2603.15909