98$\times$ Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router

arXiv:2603.12646v1 Announce Type: new Abstract: System-level routers that intercept LLM requests for safety classification, domain routing, and PII detection must be both fast and operationally lightweight: they should add minimal latency to every request, yet not require a dedicated GPU -- an expensive resource better used for LLM inference itself. When the router co-locates on the same GPU as vLLM serving instances, standard attention's $O(n^2)$ memory makes long-context classification (8K--32K tokens) impossible: at 8K tokens, three concurrent classifiers need ${\sim}$4.5\,GB for attention masks alone, far exceeding the memory left by vLLM. We present three staged optimizations for the vLLM Semantic Router, benchmarked on AMD Instinct MI300X, that solve both the latency and the memory problem. \emph{Stage~1}: a custom CK Flash Attention operator for ONNX Runtime on ROCm reduces attention memory from $O(n^2)$ to $O(n)$ and end-to-end (E2E) latency from 4{,}918\,ms to 127\,ms (\textbf{38.7$\times$}), enabling 8K--32K tokens where SDPA OOMs. \emph{Stage~2}: classical NLP prompt compression (TextRank, position weighting, TF-IDF, and novelty scoring) reduces all inputs to ${\sim}$512 tokens without neural inference, capping both latency and GPU memory at a constant regardless of original prompt length (E2E 127$\to$62\,ms, \textbf{2.0$\times$}). \emph{Stage~3}: near-streaming body processing with adaptive chunking and zero-copy JSON eliminates serialization overhead (E2E 62$\to$50\,ms, \textbf{1.2$\times$}). Cumulatively: \textbf{98$\times$} improvement (4{,}918\,ms to 50\,ms), 16K-token routing in 108\,ms, and a total router GPU footprint under 800\,MB -- small enough to share a GPU with LLM serving and removing the need for a dedicated accelerator. Stage~1 targets AMD ROCm (NVIDIA GPUs already have FlashAttention via cuDNN); Stages~2 and~3 are hardware-agnostic.
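The abstract's ~4.5 GB figure for attention masks at 8K tokens can be sanity-checked with back-of-the-envelope arithmetic. The head count and dtype below are illustrative assumptions (not stated in the abstract), chosen to show how a quadratic score matrix reaches that scale:

```python
# Sanity check of the O(n^2) attention-memory claim from the abstract.
# Assumed (not from the paper): 12 attention heads, fp16 (2 bytes/element).
seq_len = 8192          # tokens
heads = 12              # assumed head count
bytes_per_el = 2        # fp16
classifiers = 3         # safety, domain routing, PII detection

# Full n x n attention score matrix per head, per classifier:
quadratic = seq_len ** 2 * heads * bytes_per_el * classifiers
print(f"O(n^2) masks: {quadratic / 2**30:.1f} GiB")   # 4.5 GiB

# Flash Attention keeps only O(n) running statistics per head:
linear = seq_len * heads * bytes_per_el * classifiers
print(f"O(n) footprint: {linear / 2**20:.2f} MiB")    # ~0.56 MiB
```

Under these assumed parameters the quadratic term lands exactly on the paper's ~4.5 GB, and the linear footprint is four orders of magnitude smaller, which is why co-location with vLLM becomes feasible.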

Executive Summary

The article presents a staged optimization framework for low-latency LLM routing without a dedicated GPU, addressing critical bottlenecks in long-context semantic routing. By introducing a Flash Attention operator (Stage 1) that reduces attention memory complexity from O(n²) to O(n), classical prompt compression (Stage 2) that caps input length without neural overhead, and near-streaming body processing (Stage 3) that eliminates serialization latency, the authors achieve a cumulative 98× speedup (from 4,918 ms to 50 ms) while keeping the router's GPU footprint under 800 MB. This enables 16K-token routing in 108 ms on a GPU shared with vLLM serving, removing the need for a dedicated accelerator. Stage 1 is validated on AMD Instinct MI300X and targets ROCm; Stages 2 and 3 are hardware-agnostic, suggesting broad applicability across inference platforms.

Key Points

  • Stage 1: a custom CK Flash Attention operator for ONNX Runtime on ROCm reduces attention memory from O(n²) to O(n), cutting E2E latency from 4,918 ms to 127 ms (38.7×)
  • Stage 2: classical NLP compression (TextRank, position weighting, TF-IDF, novelty scoring) reduces all inputs to ~512 tokens without neural inference (127 → 62 ms, 2.0×)
  • Stage 3: near-streaming body processing with adaptive chunking and zero-copy JSON eliminates serialization overhead (62 → 50 ms, 1.2×)
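The Stage 2 idea, extractive compression with no neural inference, can be sketched with stdlib-only code. The equal weighting of the TF-IDF and position terms, the sentence splitter, and the whitespace-token budget below are illustrative assumptions, not the authors' implementation (which also uses TextRank and novelty scoring):

```python
import math
import re
from collections import Counter

def compress_prompt(text: str, budget: int = 512) -> str:
    """Extractive compression sketch: score sentences by TF-IDF salience
    plus a position prior, then keep the top-scoring sentences (in their
    original order) within a whitespace-token budget."""
    sents = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    docs = [Counter(re.findall(r"\w+", s.lower())) for s in sents]
    n = len(sents)
    df = Counter(w for d in docs for w in d)      # document frequency per word

    def score(i: int) -> float:
        tfidf = sum(tf * math.log(n / df[w]) for w, tf in docs[i].items())
        position = 1.0 / (1 + i)                  # earlier sentences weigh more
        return tfidf + position                   # assumed equal weighting

    ranked = sorted(range(n), key=score, reverse=True)
    kept, used = set(), 0
    for i in ranked:
        tokens = len(sents[i].split())
        if used + tokens <= budget:
            kept.add(i)
            used += tokens
    return " ".join(sents[i] for i in sorted(kept))
```

Because every step is counting and sorting, latency is bounded by prompt length in a single CPU pass and no GPU memory is consumed, which is the property the paper exploits to cap both cost terms at a constant.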

Merits

Innovation in Architecture

The three-stage pipeline attacks the problem from complementary angles: Stage 1 removes the quadratic memory bottleneck in attention, Stage 2 bounds input length (and thus both latency and memory) before any model runs, and Stage 3 removes CPU-side serialization cost. Together these allow routing and inference to coexist on shared hardware without compromising performance.

Practical Impact

By eliminating the need for dedicated GPUs for routing, the solution directly reduces operational cost and improves scalability for cloud-scale LLM systems, aligning with industry trends toward efficient resource utilization.

Technical Elegance

The combination of algorithmic optimization (Flash Attention), semantic preprocessing (prompt compression), and systems-level efficiency (streaming) demonstrates a holistic, end-to-end solution that avoids trade-offs between speed, memory, and functionality.

Demerits

Platform Dependency

Stage 1 is AMD ROCm-specific; NVIDIA GPU users must rely on existing FlashAttention implementations (e.g., cuDNN), creating potential fragmentation in cross-vendor deployment.

Prompt Compression Limitations

Classical NLP methods (TextRank, TF-IDF) may introduce selection bias or discard semantic nuance in highly technical or domain-specific prompts, potentially affecting classification accuracy in niche applications.

Expert Commentary

This work marks a notable shift in the design of auxiliary inference infrastructure for large language models. The authors identify the dual constraints of memory and latency in routing contexts and address them not with incremental improvements but with three complementary innovations. Flash Attention's reduction of memory complexity from O(n²) to O(n) is particularly noteworthy: it makes long-context routing feasible on a GPU already occupied by vLLM serving. The prompt compression stage, leveraging classical NLP without neural inference, is a model of engineering pragmatism: it caps latency and memory at a constant regardless of prompt length while adding no GPU work of its own. Stage 3's near-streaming architecture demonstrates that efficiency gains multiply when layered intelligently. Importantly, the decision to validate on AMD hardware while noting that NVIDIA GPUs already have FlashAttention via cuDNN reveals a mature understanding of deployment realities. This is more than an optimization: it is a blueprint for co-locating auxiliary services with LLM serving. The implications extend beyond LLM routing; any auxiliary service requiring low-latency, low-memory processing on GPU-constrained systems can adopt this modular architecture.

Recommendations

  • Adopt the Flash Attention operator in vLLM-based routing pipelines where GPU memory is constrained.
  • Integrate prompt compression pipelines into preprocessing stages of LLM inference workflows as a standard best practice for efficiency.
  • Evaluate near-streaming architectures for auxiliary services in hybrid GPU/CPU inference stacks to identify latency/memory bottlenecks.
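The third recommendation can be prototyped without touching the router itself. The sketch below shows the adaptive-chunking half of the Stage 3 idea: read the request body in growing chunks and decode as soon as the buffer looks complete, so routing fields are available without waiting for framework-level buffering. The chunk schedule, the completeness heuristic, and the "model" field name are illustrative assumptions, not the paper's implementation:

```python
import io
import json

def route_early(body_stream, start_chunk=1024, max_chunk=65536):
    """Read a request body in adaptively growing chunks and attempt JSON
    decoding as soon as the buffer plausibly holds a complete object, so
    routing fields (here an assumed "model" key) surface early."""
    buf = bytearray()
    chunk = start_chunk
    while True:
        piece = body_stream.read(chunk)
        if not piece:
            break
        buf += piece
        chunk = min(chunk * 2, max_chunk)   # adaptive: grow for large bodies
        if buf[-1:] != b"}":                # cheap completeness heuristic
            continue
        try:
            obj = json.loads(buf)           # json.loads accepts bytearray
            return obj.get("model"), obj
        except json.JSONDecodeError:
            continue                        # ended on a nested '}': keep reading
    obj = json.loads(buf)                   # final attempt; raises if truncated
    return obj.get("model"), obj
```

Python's json copies during decode, so true zero-copy parsing (borrowing string slices from the receive buffer, as the paper's Stage 3 does) would require a systems language; this sketch captures only the chunking and early-decode behavior worth benchmarking in a hybrid stack.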