OmniCompliance-100K: A Multi-Domain, Rule-Grounded, Real-World Safety Compliance Dataset

arXiv:2603.13933v1

Abstract: Ensuring the safety and compliance of large language models (LLMs) is of paramount importance. However, existing LLM safety datasets often rely on ad-hoc taxonomies for data generation and suffer from a significant shortage of rule-grounded, real-world cases that are essential for robustly protecting LLMs. In this work, we address this critical gap by constructing a comprehensive safety dataset from a compliance perspective. Using a powerful web-searching agent, we collect a rule-grounded, real-world case dataset OmniCompliance-100K, sourced from multi-domain authoritative references. The dataset spans 74 regulations and policies across a wide range of domains, including security and privacy regulations, content safety and user data privacy policies from leading AI companies and social media platforms, financial security requirements, medical device risk management standards, educational integrity guidelines, and protections of fundamental human rights. In total, our dataset contains 12,985 distinct rules and 106,009 associated real-world compliance cases. Our analysis confirms a strong alignment between the rules and their corresponding cases. We further conduct extensive benchmarking experiments to evaluate the safety and compliance capabilities of advanced LLMs across different model scales. Our experiments reveal several interesting findings that have great potential to offer valuable insights for future LLM safety research.

Executive Summary

The article introduces OmniCompliance-100K, a rule-grounded, multi-domain safety compliance dataset comprising 12,985 distinct rules and 106,009 associated real-world compliance cases, collected by a web-searching agent from authoritative references. Spanning 74 regulations and policies across security, privacy, content safety, finance, medical devices, education, and human rights, the dataset moves away from ad-hoc taxonomies toward structured, rule-based compliance data. The authors report a strong alignment between the rules and their corresponding cases, and their benchmarking experiments examine the safety and compliance capabilities of LLMs across model scales. The work contributes to LLM safety research by grounding safety data in actual regulations and policies.
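To put the reported totals in proportion, the short Python sketch below derives two averages from the figures in the abstract: cases per rule and rules per regulation. The `Rule` and `Case` record types and their field names are hypothetical, introduced here only to illustrate how a rule-grounded entry might be organized; the abstract does not describe the dataset's actual schema, and the sample values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List

# Totals as reported in the abstract.
NUM_REGULATIONS = 74
NUM_RULES = 12_985
NUM_CASES = 106_009


@dataclass
class Case:
    """Hypothetical record for one real-world compliance case."""
    case_id: str
    description: str


@dataclass
class Rule:
    """Hypothetical record for one rule and the cases grounded in it."""
    rule_id: str
    regulation: str  # source regulation or policy the rule comes from
    domain: str      # e.g. privacy, finance, medical devices
    text: str
    cases: List[Case] = field(default_factory=list)


def average(total: int, groups: int) -> float:
    """Return the mean number of items per group."""
    if groups <= 0:
        raise ValueError("groups must be positive")
    return total / groups


cases_per_rule = average(NUM_CASES, NUM_RULES)
rules_per_regulation = average(NUM_RULES, NUM_REGULATIONS)

# Placeholder entry (not taken from the dataset) showing the rule-to-case link.
example_rule = Rule(
    rule_id="rule-0",
    regulation="<regulation name>",
    domain="<domain>",
    text="<rule text>",
)
example_rule.cases.append(Case(case_id="case-0", description="<case summary>"))

print(f"cases per rule: {cases_per_rule:.2f}")
print(f"rules per regulation: {rules_per_regulation:.1f}")
```

On these totals, each rule has about 8 associated cases on average and each regulation or policy contributes about 175 rules. These are averages only; the abstract does not report how evenly the cases are distributed across rules or domains.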

Key Points

  • Construction of a rule-grounded, multi-domain compliance dataset
  • Coverage of 74 regulations and policies across multiple domains
  • Benchmarking experiments on LLM compliance capabilities

Merits

Methodological Innovation

Using a web-searching agent to collect cases from authoritative references yields a rule-aligned, real-world dataset that is more relevant, and potentially more reproducible, than prior datasets built on ad-hoc taxonomies.

Demerits

Coverage Limitation

Despite its size, the dataset may miss niche or emerging regulatory frameworks that the web-searching agent did not capture, which could limit its applicability where compliance requirements change quickly.

Expert Commentary

OmniCompliance-100K represents a substantive advance in the empirical foundation for LLM safety. The shift from subjective, ad-hoc taxonomies to a rule-based, real-world compliance dataset is both necessary and overdue. The dataset’s breadth, which spans financial, medical, content safety, and human rights domains, reflects the complexity of the regulatory ecosystem. The reported alignment between rules and cases also adds credibility to the findings, although the abstract does not describe how that alignment was measured. The automated collection mechanism raises questions about potential bias or omission, but the overall contribution to safety research is significant. Future work should explore keeping the dataset current as regulations change and harmonizing rules across jurisdictions. This dataset could become a cornerstone for evaluating LLM compliance in real-world applications.

Recommendations

  • Encourage open-access release of the dataset with accompanying annotation tools to facilitate reproducible research.
  • Initiate collaborative efforts with regulatory agencies to validate and expand the dataset’s regulatory coverage over time.
