
Swiss-Bench SBP-002: A Frontier Model Comparison on Swiss Legal and Regulatory Tasks


Fatih Uenal

arXiv:2603.23646v1 Announce Type: new Abstract: While recent work has benchmarked large language models on Swiss legal translation (Niklaus et al., 2025) and academic legal reasoning from university exams (Fan et al., 2025), no existing benchmark evaluates frontier model performance on applied Swiss regulatory compliance tasks. I introduce Swiss-Bench SBP-002, a trilingual benchmark of 395 expert-crafted items spanning three Swiss regulatory domains (FINMA, Legal-CH, EFK), seven task types, and three languages (German, French, Italian), and evaluate ten frontier models from March 2026 using a structured three-dimension scoring framework assessed via a blind three-judge LLM panel (GPT-4o, Claude Sonnet 4, Qwen3-235B) with majority-vote aggregation and weighted kappa = 0.605, with reference answers validated by an independent human legal expert on a 100-item subset (73% rated Correct, 0% Incorrect, perfect Legal Accuracy). Results reveal three descriptive performance clusters: Tier A (35-38% correct), Tier B (26-29%), and Tier C (13-21%). The benchmark proves difficult: even the top-ranked model (Qwen 3.5 Plus) achieves only 38.2% correct, with 47.3% incorrect and 14.4% partially correct. Task type difficulty varies widely: legal translation and case analysis yield 69-72% correct rates, while regulatory Q&A, hallucination detection, and gap analysis remain below 9%. Within this roster (seven open-weight, three closed-source), an open-weight model leads the ranking, and several open-weight models match or outperform their closed-source counterparts. These findings provide an initial empirical reference point for assessing frontier model capability on Swiss regulatory tasks under zero-retrieval conditions.
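The abstract describes a blind three-judge LLM panel (GPT-4o, Claude Sonnet 4, Qwen3-235B) aggregated by majority vote. A minimal sketch of that aggregation step, assuming each judge emits one of three verdicts per item; the tie-break rule below is an assumption, not taken from the paper:

```python
from collections import Counter

# Assumed verdict categories, matching the paper's reported outcome labels.
VERDICTS = ("correct", "partially_correct", "incorrect")

def panel_verdict(votes: list[str]) -> str:
    """Majority label across the three judges; falls back to
    'partially_correct' when all three disagree (assumed tie-break)."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else "partially_correct"

print(panel_verdict(["correct", "correct", "incorrect"]))            # -> correct
print(panel_verdict(["correct", "partially_correct", "incorrect"]))  # -> partially_correct
```

With three judges and three categories a two-vote majority exists unless all judges disagree, so the tie-break only matters in that one configuration.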

Executive Summary

This article introduces Swiss-Bench SBP-002, a benchmark for evaluating frontier model performance on applied Swiss regulatory compliance tasks. It comprises 395 expert-crafted items spanning three regulatory domains (FINMA, Legal-CH, EFK), seven task types, and three languages (German, French, Italian). Ten frontier models are assessed with a structured three-dimension scoring framework and a blind three-judge LLM panel using majority-vote aggregation. Performance varies widely by task type: legal translation and case analysis reach 69-72% correct, while regulatory Q&A, hallucination detection, and gap analysis stay below 9%. Even the top-ranked model, Qwen 3.5 Plus, achieves only 38.2% correct, underscoring the benchmark's difficulty, and several open-weight models match or outperform closed-source counterparts. The results offer an initial empirical reference point for frontier model capability on Swiss regulatory tasks under zero-retrieval conditions.

Key Points

  • Swiss-Bench SBP-002 is a novel benchmark for evaluating frontier model performance on applied Swiss regulatory compliance tasks.
  • The benchmark consists of 395 expert-crafted items across three regulatory domains, seven task types, and three languages.
  • The assessment framework combines a structured three-dimension scoring framework and a blind three-judge LLM panel.
  • Results show wide variation across task types: legal translation and case analysis yield 69-72% correct rates, while regulatory Q&A, hallucination detection, and gap analysis fall below 9%.
  • The study highlights the difficulty of the benchmark and the potential of open-weight models to match or outperform closed-source counterparts.

Merits

Comprehensive benchmark design

The benchmark covers a wide range of regulatory domains, task types, and languages, providing a comprehensive evaluation of frontier model capabilities.

Structured assessment framework

The combination of a structured three-dimension scoring framework and a blind three-judge LLM panel with majority-vote aggregation (weighted kappa = 0.605) supports a reproducible and reasonably reliable assessment of model performance.
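The reported inter-judge reliability (weighted kappa = 0.605) is an instance of Cohen's weighted kappa over ordinal verdicts. A minimal sketch assuming a quadratic weighting scheme over the three ordered verdict categories; the paper's exact weighting is not specified in this summary:

```python
def quadratic_weighted_kappa(a, b, categories):
    """Quadratic-weighted Cohen's kappa between two raters' ordinal labels.
    1.0 = perfect agreement, 0.0 = chance-level agreement."""
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(a)
    # quadratic disagreement weights: 0 on the diagonal, 1 at the extremes
    w = [[(i - j) ** 2 / (k - 1) ** 2 for j in range(k)] for i in range(k)]
    # observed joint distribution of the two raters' labels
    obs = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        obs[idx[x]][idx[y]] += 1.0 / n
    # marginal distributions feed the chance-agreement (expected) term
    pa = [sum(row) for row in obs]
    pb = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    num = sum(w[i][j] * obs[i][j] for i in range(k) for j in range(k))
    den = sum(w[i][j] * pa[i] * pb[j] for i in range(k) for j in range(k))
    return 1.0 if den == 0 else 1.0 - num / den

cats = ["incorrect", "partially_correct", "correct"]
judge1 = ["correct", "incorrect", "partially_correct", "correct"]
judge2 = ["correct", "incorrect", "partially_correct", "correct"]
print(quadratic_weighted_kappa(judge1, judge2, cats))  # -> 1.0
```

For a three-judge panel, a per-pair kappa would be computed for each of the three judge pairings and then averaged; whether the reported 0.605 is a pairwise average is an assumption here.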

Demerits

Limited generalizability

The study focuses on Swiss regulatory compliance tasks, which may limit the generalizability of the findings to other jurisdictions and regulatory contexts.

Dependence on human validation

Reference answers were validated by a single independent legal expert, and only on a 100-item subset of the 395 items, so validation coverage is partial and rests on one individual's judgment.

Expert Commentary

The article provides a valuable contribution to the field of AI and regulation, highlighting the need for more comprehensive benchmarking frameworks and for model interpretability. With even the best model below 40% correct, the findings suggest that frontier models are not yet ready for widespread adoption in regulatory applications, and policymakers should exercise caution when considering their use. The study's emphasis on open-weight models is also noteworthy: that an open-weight model leads the ranking underscores the value of transparency and accessibility in AI systems. Overall, the article offers a timely and thought-provoking analysis of the role of AI in regulatory compliance.

Recommendations

  • Future studies should focus on developing more comprehensive benchmarking frameworks that account for the complexities of regulatory applications.
  • Researchers should prioritize model interpretability and transparency to ensure that frontier models are reliable and trustworthy in regulatory contexts.

Sources

Original: arXiv - cs.CL