Academic

Generating Expressive and Customizable Evals for Timeseries Data Analysis Agents with AgentFuel

arXiv:2603.12483v1 Announce Type: new Abstract: Across many domains (e.g., IoT, observability, telecommunications, cybersecurity), there is an emerging adoption of conversational data analysis agents that enable users to "talk to your data" to extract insights. Such data analysis agents operate on timeseries data models; e.g., measurements from sensors or events monitoring user clicks and actions in product analytics. We evaluate 6 popular data analysis agents (both open-source and proprietary) on domain-specific data and query types, and find that they fail on stateful and incident-specific queries. We observe two key expressivity gaps in existing evals: domain-customized datasets and domain-specific query types. To enable practitioners in such domains to generate customized and expressive evals for such timeseries data agents, we present AgentFuel. AgentFuel helps domain experts quickly create customized evals to perform end-to-end functional tests. We show that AgentFuel's benchmarks expose key directions for improvement in existing data agent frameworks. We also present anecdotal evidence that using AgentFuel can improve agent performance (e.g., with GEPA). AgentFuel benchmarks are available at https://huggingface.co/datasets/RockfishData/TimeSeriesAgentEvals.

Executive Summary

The article addresses a critical gap in the evaluation of conversational data analysis agents for time-series data, particularly in domains like IoT, observability, telecommunications, and cybersecurity. The authors' evaluation of six popular agents, both open-source and proprietary, reveals consistent failures on stateful and incident-specific queries, and they identify two core expressivity gaps in existing evals: the lack of domain-customized datasets and the absence of domain-specific query types. To address these gaps, the authors introduce AgentFuel, a tool designed to let domain experts generate expressive, customized benchmarks for end-to-end functional testing. The benchmarks, publicly available via Hugging Face, provide a reusable resource for evaluating agent performance and identifying systemic areas for improvement. The work contributes meaningfully by bridging a practical evaluation void and offering a scalable approach to tailored testing.
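
For readers who want to inspect the released benchmarks directly, the following is a minimal sketch that loads the dataset from the Hugging Face URL given in the abstract using the `datasets` library. The split and field names are assumptions, since the schema is not spelled out here.

```python
# Minimal sketch: load the published AgentFuel benchmarks from Hugging Face.
# The dataset ID comes from the paper's URL; the splits and field names
# discovered below are assumptions to be checked against the real schema.
from datasets import load_dataset

benchmarks = load_dataset("RockfishData/TimeSeriesAgentEvals")

# Print the available splits and one record to discover the actual fields.
print(benchmarks)
first_split = next(iter(benchmarks))
print(benchmarks[first_split][0])
```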

Key Points

  • Evaluation of six popular data analysis agents (open-source and proprietary), showing failures on stateful and incident-specific queries
  • Identification of two expressivity gaps in current evals: domain-customized datasets and domain-specific query types
  • Introduction of AgentFuel, which enables domain experts to create customized, expressive benchmarks for end-to-end functional testing
  • Public availability of AgentFuel benchmarks for broader use and validation

Merits

Addressing a Critical Evaluation Gap

AgentFuel directly responds to a documented limitation in existing agent evaluations by introducing a tailored tool for domain-specific benchmark creation.

Scalability and Accessibility

By making benchmarks publicly accessible, the authors enable broader adoption and validation across the research and practitioner communities.

Demerits

Limited Scope of Agent Coverage

The evaluation of only six agents may limit the generalizability of findings to other agents not included in the study.

Potential for Overgeneralization

Anecdotal evidence of improved performance (e.g., with GEPA) may not be statistically robust or reproducible without further validation.

Expert Commentary

The work represents a significant step forward in the evolution of agent-based data analysis. Historically, the lack of standardized, domain-specific evaluation frameworks has made it difficult to assess agent capabilities beyond generic, general-purpose metrics. AgentFuel fills this void by allowing experts to operationalize their domain knowledge into evaluative constructs, aligning evaluation criteria with real-world use cases. This move from abstract, generic testing to context-aware, incident-specific validation marks a meaningful shift. Moreover, the public availability of the benchmarks demonstrates a commendable commitment to reproducibility and collaborative advancement. While the limited number of agents evaluated presents a minor constraint, the methodological clarity and practical utility of AgentFuel position it as a foundational contribution to the field. The broader implications extend beyond data agent evaluation into the design of more effective, domain-aware AI systems in general.

Recommendations

  • Researchers and practitioners should consider integrating AgentFuel benchmarks into their evaluation pipelines to better assess agent performance in domain-specific contexts (a minimal integration sketch follows this list).
  • Future work should expand the agent coverage to include a broader range of open-source and proprietary systems to enhance generalizability.
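
The sketch below shows what such an integration could look like, assuming a Python-based pipeline. Here `run_agent` is a hypothetical adapter around whichever agent is under test, and the field names `query` and `expected_answer` are guesses at the benchmark schema rather than documented fields; the exact-match scoring is likewise only a stand-in for AgentFuel's own end-to-end functional grading.

```python
# Minimal sketch of wiring the AgentFuel benchmarks into an evaluation loop.
# run_agent() is a placeholder for the data analysis agent under test; the
# "test" split and the "query"/"expected_answer" fields are assumptions.
from datasets import load_dataset

def run_agent(query: str) -> str:
    """Hypothetical adapter: call the agent under evaluation and return its answer."""
    raise NotImplementedError("Plug in the agent's API call here.")

def evaluate(split: str = "test") -> float:
    data = load_dataset("RockfishData/TimeSeriesAgentEvals", split=split)
    correct = 0
    for example in data:
        prediction = run_agent(example["query"])
        # Exact-match scoring is a placeholder; replace with the grading
        # appropriate for the query type (e.g., numeric tolerance).
        correct += int(prediction.strip() == str(example["expected_answer"]).strip())
    return correct / len(data)
```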

Sources