
In harmony with gpt-oss

arXiv:2604.00362v1 Announce Type: new Abstract: No one has independently reproduced OpenAI's published scores for gpt-oss-20b with tools, because the original paper discloses neither the tools nor the agent harness. We reverse-engineered the model's in-distribution tools: when prompted without tool definitions, gpt-oss still calls tools from its training distribution with high statistical confidence -- a strong prior, not a hallucination. We then built a native harmony agent harness (https://github.com/borislavmavrin/harmonyagent.git) that encodes messages in the model's native format, bypassing the lossy Chat Completions conversion. Together, these yield the first independent reproduction of OpenAI's published scores: 60.4% on SWE Verified HIGH (published 60.7%), 53.3% MEDIUM (53.2%), and 91.7% on AIME25 with tools (90.4%).

Borislav Mavrin


Executive Summary

This article reviews a study of OpenAI's gpt-oss-20b that reverse-engineers the model's in-distribution tools and builds a native harmony agent harness. With these two components, the authors reproduce OpenAI's published benchmark scores, providing the first independent verification of the model's reported tool-use performance. The study highlights the importance of transparency and tool disclosure in AI research, and its findings argue for more open, collaborative approaches to model evaluation and assessment.

Key Points

  • Reverse-engineering of in-distribution tools enables independent reproduction of OpenAI's published scores.
  • Native harmony agent harness encodes messages in the model's native format, bypassing the lossy Chat Completions conversion that degrades benchmark scores.
  • Study demonstrates the importance of transparency and access to tools in AI research and development.
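The second key point above, encoding messages natively rather than round-tripping through a Chat Completions-style API, can be illustrated with a minimal sketch. The special tokens (`<|start|>`, `<|channel|>`, `<|message|>`, `<|end|>`) follow OpenAI's published harmony format; the message contents and the helper function name are illustrative, not taken from the authors' harness:

```python
# Minimal sketch: render a conversation directly in the harmony format,
# instead of converting through a Chat Completions-style message schema.
# Special-token strings follow OpenAI's published harmony spec; the
# messages themselves are invented for illustration.

def render_harmony(messages):
    """Render a list of {role, content, channel?} dicts as a harmony prompt string."""
    parts = []
    for m in messages:
        header = m["role"]
        if m.get("channel"):  # harmony assistant turns carry a channel tag
            header += f"<|channel|>{m['channel']}"
        parts.append(f"<|start|>{header}<|message|>{m['content']}<|end|>")
    # Leave the prompt open for the assistant's next turn.
    parts.append("<|start|>assistant")
    return "".join(parts)

prompt = render_harmony([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Solve the task."},
])
print(prompt)
```

Rendering turns this way preserves structure (channels, recipients) that a generic Chat Completions conversion would flatten or drop, which is the loss the authors' harness avoids.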

Merits

Strength: the study's methodology and results.

The authors employ a robust and transparent approach to reverse-engineering the model's tools, providing a rigorous evaluation of the model's performance.
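The paper does not publish its extraction code, but the reverse-engineering step described above, prompting the model without tool definitions and tallying which tools it still calls, might be sketched as follows. The `to=functions.NAME` recipient syntax comes from the harmony format; the sample completions and the regex are assumptions for illustration:

```python
import re
from collections import Counter

# Sketch: count which tools the model calls when prompted *without* any
# tool definitions. A sharply concentrated count suggests a strong
# training-distribution prior rather than random hallucination.
# The sample completions below are invented.
TOOL_CALL = re.compile(
    r"<\|start\|>assistant<\|channel\|>commentary to=functions\.(\w+)"
)

completions = [
    "<|start|>assistant<|channel|>commentary to=functions.python<|message|>{...}<|call|>",
    "<|start|>assistant<|channel|>commentary to=functions.python<|message|>{...}<|call|>",
    "<|start|>assistant<|channel|>analysis<|message|>reasoning only, no tool call<|end|>",
]

counts = Counter(
    name for c in completions for name in TOOL_CALL.findall(c)
)
print(counts.most_common())
```

Over many sampled completions, this kind of tally is what would let the authors claim "high statistical confidence" that the calls reflect a trained-in tool distribution.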

Demerits

Limitation: generalizability to other models.

The study focuses on a specific model, gpt-oss-20b, and may not be representative of other AI models or architectures.

Expert Commentary

This study marks a significant milestone for reproducibility in AI research. By recovering the undisclosed tools and harness needed to match OpenAI's published numbers, the authors show that independent verification of frontier-model claims is possible, but only with substantial reverse-engineering effort that upfront disclosure would have made unnecessary. Their commitment to replicability is essential for advancing scientific knowledge and establishing the reliability of AI models, and the findings strengthen the case for sharing knowledge and expertise across the field. As AI plays an increasingly important role in society, researchers and developers must prioritize transparency, accountability, and explainability so that reported results can be trusted and independently confirmed.

Recommendations

  • Researchers and developers should prioritize transparency and access to tools in AI research and development.
  • Policy and regulatory frameworks should be developed to promote openness and accountability in AI development.

Sources

Original: arXiv - cs.AI