LLM Olympiad: Why Model Evaluation Needs a Sealed Exam
arXiv:2603.23292v1 Announce Type: new Abstract: Benchmarks and leaderboards are how NLP most often communicates progress, but in the LLM era they are increasingly easy to …
Quality follows upgrading
Category
arXiv:2603.23292v1 Announce Type: new Abstract: Benchmarks and leaderboards are how NLP most often communicates progress, but in the LLM era they are increasingly easy to …
arXiv:2603.23244v1 Announce Type: new Abstract: When learning a novel complex task, people often form efficient reusable abstractions that simplify future work, despite uncertainty about the …
arXiv:2603.23234v1 Announce Type: new Abstract: Large language model (LLM)-based agents rely on memory mechanisms to reuse knowledge from past problem-solving experiences. Existing approaches typically construct …
arXiv:2603.23231v1 Announce Type: new Abstract: Empowering large language models with long-term memory is crucial for building agents that adapt to users' evolving needs. However, prior …
arXiv:2603.23178v1 Announce Type: new Abstract: Deepfakes generated by modern generative models pose a serious threat to information integrity, digital identity, and public trust. Existing detection …
arXiv:2603.23149v1 Announce Type: new Abstract: Deploying safety-critical agents requires anticipating the consequences of actions before they are executed. While world models offer a paradigm for …
arXiv:2603.23114v1 Announce Type: new Abstract: A human's moral decision depends heavily on the context. Yet research on LLM morality has largely studied fixed scenarios. We …
arXiv:2603.23085v1 Announce Type: new Abstract: Vision-Language Models (VLMs) have enabled interpretable medical diagnosis by integrating visual perception with linguistic reasoning. Yet, existing medical chain-of-thought (CoT) …
arXiv:2603.23059v1 Announce Type: new Abstract: Recent advances in game AI, such as AlphaZero and Ath\'enan, have achieved superhuman performance across a wide range of board …
arXiv:2603.23004v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated great capabilities across diverse natural language tasks; yet their ability to solve abstraction and …
arXiv:2603.23003v1 Announce Type: new Abstract: The comparison of dental records is a standardized technique in forensic dentistry used to speed up the identification of individuals …
arXiv:2603.22978v1 Announce Type: new Abstract: In the maintenance of complex systems, fault trees are used to locate problems and provide targeted solutions. To enable fault …