Quality follows upgrading

Abhishek Chandwani, Ishan Gupta

Academic · 1 min

Beyond Binary Correctness: Scaling Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks

arXiv:2603.22744v1 (announce type: new)

Abstract: Large language models excel on objectively verifiable tasks such as math and programming, where evaluation reduces to unit tests or …

Mar 25

JCG, PC

HSOLLC Co., Ltd.