Structured Semantic Cloaking for Jailbreak Attacks on Large Language Models
arXiv:2603.16192v1. Abstract: Modern LLMs employ safety mechanisms that extend beyond surface-level input filtering to latent semantic representations and generation-time reasoning, enabling them to recover obfuscated malicious intent during inference and refuse accordingly; this renders many surface-level obfuscation jailbreak attacks ineffective. We propose Structured Semantic Cloaking (S2C), a novel multi-dimensional jailbreak attack framework that manipulates how malicious semantic intent is reconstructed during model inference. S2C strategically distributes and reshapes semantic cues so that full intent consolidation requires multi-step inference and long-range co-reference resolution within deeper latent representations. The framework comprises three complementary mechanisms: (1) Contextual Reframing, which embeds the request within a plausible high-stakes scenario to bias the model toward compliance; (2) Content Fragmentation, which disperses the semantic signature of the request across disjoint prompt segments; and (3) Clue-Guided Camouflage, which disguises residual semantic cues while embedding recoverable markers that guide output generation. By delaying and restructuring semantic consolidation, S2C degrades safety triggers that depend on coherent or explicitly reconstructed malicious intent at decoding time, while preserving sufficient instruction recoverability for functional output generation. We evaluate S2C across multiple open-source and proprietary LLMs using HarmBench and JBB-Behaviors, where it improves Attack Success Rate (ASR) by 12.4% and 9.7%, respectively, over the current SOTA. Notably, S2C achieves substantial gains on GPT-5-mini, outperforming the strongest baseline by 26% on JBB-Behaviors. We also analyse which mechanism combinations perform best against broad families of models, and characterise the trade-off between the extent of obfuscation and input recoverability in jailbreak success.
Executive Summary
This article proposes Structured Semantic Cloaking (S2C), a novel multi-dimensional jailbreak attack framework designed to manipulate how malicious semantic intent is reconstructed during model inference in large language models (LLMs). S2C targets the safety mechanisms of modern LLMs, which rely on latent semantic representations and generation-time reasoning to recover obfuscated malicious intent and refuse accordingly. By strategically distributing and reshaping semantic cues, S2C degrades safety triggers and improves Attack Success Rate (ASR) by 12.4% and 9.7% over the current state-of-the-art (SOTA) on HarmBench and JBB-Behaviors, respectively. The framework's three complementary mechanisms, Contextual Reframing, Content Fragmentation, and Clue-Guided Camouflage, delay and restructure semantic consolidation so that safety triggers which depend on coherent, explicitly reconstructed malicious intent at decoding time fail to fire, while the instruction remains recoverable enough to produce functional output.
Key Points
- S2C is a novel multi-dimensional jailbreak attack framework that manipulates how malicious semantic intent is reconstructed during model inference.
- The framework consists of three complementary mechanisms: Contextual Reframing, Content Fragmentation, and Clue-Guided Camouflage.
- S2C improves Attack Success Rate (ASR) by 12.4% and 9.7% over the current SOTA on HarmBench and JBB-Behaviors, respectively.
Merits
Effective Exploitation of Safety Mechanisms
S2C exploits a vulnerability in the safety mechanisms of modern LLMs: because these defenses rely on latent semantic representations and generation-time reasoning to reconstruct intent, an attack that delays and disperses intent consolidation can slip past triggers that surface-level filtering would never have caught.
Improved Attack Success Rate
S2C achieves significant gains in Attack Success Rate (ASR) over the current SOTA: 12.4% on HarmBench and 9.7% on JBB-Behaviors, with a notable 26% gain over the strongest baseline on GPT-5-mini for JBB-Behaviors.
Demerits
Vulnerability to Advanced Safety Mechanisms
S2C may lose effectiveness against future safety mechanisms that perform deeper latent-space intent reconstruction or explicitly resolve long-range co-references across prompt segments before generation.
Potential for Over-Obfuscation
S2C's effectiveness may be compromised if the semantic cues are over-obfuscated, leading to reduced instruction recoverability and output quality.
Expert Commentary
The proposed S2C framework is a significant contribution to the field of adversarial attacks on LLMs. The framework's effectiveness in exploiting vulnerabilities in safety mechanisms and improving Attack Success Rate is impressive. However, the potential limitations of S2C, such as vulnerability to advanced safety mechanisms and over-obfuscation, highlight the need for more comprehensive safety and security measures in AI systems. As LLMs continue to play an increasingly important role in high-stakes applications, it is essential to prioritize the development of more robust safety mechanisms and defense strategies against adversarial attacks.
Recommendations
- Future research should focus on safety mechanisms that consolidate distributed semantic cues before generation, so that fragmented or reframed malicious intent is reconstructed and refused rather than bypassed.
- Researchers should prioritize defense strategies that can detect and mitigate S2C-style attacks at inference time, for example by flagging prompts whose intent only becomes coherent through multi-step reasoning.