BenchBrowser -- Collecting Evidence for Evaluating Benchmark Validity
arXiv:2603.18019v1 Announce Type: new Abstract: Do language model benchmarks actually measure what practitioners intend them to? High-level metadata is too coarse to convey the …
Harshita Diddee, Gregory Yauney, Swabha Swayamdipta, Daphne Ippolito