OpenSage: Self-programming Agent Generation Engine
arXiv:2602.16891v1 Announce Type: new Abstract: Agent development kits (ADKs) provide effective platforms and tooling for constructing agents, and their designs are critical to the constructed …
Quality follows upgrading
Academic
arXiv:2602.16891v1 Announce Type: new Abstract: Agent development kits (ADKs) provide effective platforms and tooling for constructing agents, and their designs are critical to the constructed …
arXiv:2602.16901v1 Announce Type: new Abstract: LLM agents are increasingly deployed in long-horizon, complex environments to solve challenging problems, but this expansion exposes them to long-horizon …
arXiv:2602.16902v1 Announce Type: new Abstract: We introduce LLM-Wikirace, a benchmark for evaluating planning, reasoning, and world knowledge in large language models (LLMs). In LLM-Wikirace, models …
arXiv:2602.16931v1 Announce Type: new Abstract: Lifelong multimodal agents must continuously adapt to new tasks through post-training, but this creates fundamental tension between acquiring capabilities and …
arXiv:2602.16935v1 Announce Type: new Abstract: While Large Language Model (LLM) capabilities have scaled, safety guardrails remain largely stateless, treating multi-turn dialogues as a series of …
arXiv:2602.16942v1 Announce Type: new Abstract: Large language models (LLMs) increasingly answer queries by citing web sources, but existing evaluations emphasize answer correctness rather than evidence …
arXiv:2602.16943v1 Announce Type: new Abstract: Large language models deployed as agents increasingly interact with external systems through tool calls--actions with real-world consequences that text outputs …
arXiv:2602.16953v1 Announce Type: new Abstract: Execution-aware LLM agents offer a promising paradigm for learning from tool feedback, but such feedback is often expensive and slow …
arXiv:2602.16958v1 Announce Type: new Abstract: Agent hijacking, highlighted by OWASP as a critical threat to the Large Language Model (LLM) ecosystem, enables adversaries to manipulate …
arXiv:2602.16976v1 Announce Type: new Abstract: Here's the corrected paragraph with all punctuation and formatting issues fixed: Financial risk systems usually follow a two-step routine: a …
arXiv:2602.16984v1 Announce Type: new Abstract: Black-box safety evaluation of AI systems assumes model behavior on test distributions reliably predicts deployment performance. We formalize and challenge …
arXiv:2602.16990v1 Announce Type: new Abstract: Most recommendation benchmarks evaluate how well a model imitates user behavior. In financial advisory, however, observed actions can be noisy …