FlashSampling: Fast and Memory-Efficient Exact Sampling

arXiv:2603.15854v1 Announce Type: new Abstract: Sampling from a categorical distribution is mathematically simple, but in large-vocabulary decoding, it often triggers extra memory traffic and extra kernels after the LM head. We present FlashSampling, an exact sampling primitive that fuses sampling into the LM-head matmul and never materializes the logits tensor in HBM. The method is simple: compute logits tile-by-tile on chip, add Gumbel noise, keep only one maximizer per row and per vocabulary tile, and finish with a small reduction over tiles. The fused tiled kernel is exact because $\argmax$ decomposes over a partition; grouped variants for online and tensor-parallel settings are exact by hierarchical factorization of the categorical distribution. Across H100, H200, B200, and B300 GPUs, FlashSampling speeds up kernel-level decode workloads, and in end-to-end vLLM experiments, it reduces time per output token by up to $19\%$ on the models we test. These results show that exact sampling, with no approximation, can be integrated into the matmul itself, turning a bandwidth-bound postprocessing step into a lightweight epilogue. Project Page: https://github.com/FlashSampling/FlashSampling.

Executive Summary

FlashSampling introduces an exact sampling primitive for categorical distributions in large-vocabulary decoding, eliminating the need to materialize logits in high-bandwidth memory (HBM). By fusing sampling into the language model (LM) head matrix multiplication (matmul) via a tiled computation approach—computing logits tile-by-tile on-chip, adding Gumbel noise, and identifying maximizers per tile—it reduces memory traffic and kernel overhead. The method achieves exact sampling through hierarchical factorization and decomposition over partitioned tiles, supporting grouped variants for online and tensor-parallel settings. Empirical evaluation on modern GPUs (H100, H200, B200, B300) demonstrates significant performance gains, including up to 19% reduction in time per output token in vLLM benchmarks. This innovation transforms a traditionally bandwidth-bound postprocessing step into an efficient epilogue, offering substantial speedups without approximation.
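
The correctness of the approach rests on the Gumbel-max trick: taking the argmax of logits perturbed by i.i.d. Gumbel(0, 1) noise produces an exact draw from the softmax distribution. A minimal NumPy sketch (illustrative numbers, not from the paper; `gumbel_max_sample` is a hypothetical name):

```python
import numpy as np

def gumbel_max_sample(logits, rng):
    """One exact categorical draw: argmax of logits plus Gumbel(0, 1)
    noise is distributed as softmax(logits) (the Gumbel-max trick)."""
    return int(np.argmax(logits + rng.gumbel(size=logits.shape)))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.0, -1.0])
target = np.exp(logits) / np.exp(logits).sum()   # softmax probabilities

# Empirical sample frequencies should approach the softmax distribution.
counts = np.zeros(logits.size)
for _ in range(100_000):
    counts[gumbel_max_sample(logits, rng)] += 1
empirical = counts / counts.sum()
print(np.round(empirical, 3))  # close to np.round(target, 3)
```

Because the sample is obtained purely through an argmax, the sampling step inherits argmax's algebraic structure, which is what makes the tile-wise fusion exact rather than approximate.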

Key Points

  • Exact sampling without approximation is achieved by leveraging the decomposability of the argmax operation over partitioned logits tiles, ensuring mathematical correctness.
  • Memory efficiency is significantly enhanced by avoiding the materialization of the full logits tensor in HBM, reducing memory bandwidth bottlenecks during decoding.
  • Performance improvements are demonstrated across multiple GPU architectures, with up to 19% reduction in time per output token in end-to-end vLLM experiments, highlighting practical scalability.
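
The first key point, argmax decomposing over a partition, can be checked directly: keeping one (value, index) maximizer per vocabulary tile and then reducing over tiles must return the same index as a single global argmax over the perturbed logits. A sketch under that assumption (`tiled_argmax` is an illustrative name, not the paper's kernel API):

```python
import numpy as np

def tiled_argmax(keys, tile_size):
    """Per-tile maximizer followed by a small reduction over tiles.
    Exact because argmax decomposes over any partition of the axis."""
    best_val, best_idx = -np.inf, -1
    for start in range(0, len(keys), tile_size):
        tile = keys[start:start + tile_size]
        j = int(np.argmax(tile))              # one maximizer per tile
        if tile[j] > best_val:
            best_val, best_idx = tile[j], start + j
    return best_idx

rng = np.random.default_rng(0)
logits = rng.normal(size=1024)
keys = logits + rng.gumbel(size=1024)         # Gumbel-perturbed logits
print(tiled_argmax(keys, 128) == int(np.argmax(keys)))  # True
```

In the fused kernel the same reduction runs over on-chip tile results, so only one scalar pair per tile, never the full logits row, reaches the reduction stage.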

Merits

Mathematical Rigor and Exactness

FlashSampling preserves exact sampling by exploiting the decomposability of the argmax operation over partitioned tiles and hierarchical factorization, ensuring no approximation errors in contrast to many prior approximate sampling methods.
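
The hierarchical factorization behind the grouped variants can be verified numerically: writing P(token) = P(tile) × P(token | tile), with tile weights given by a log-sum-exp over each tile's logits, recomposes exactly to the global softmax. A small sketch under that reading (variable names are illustrative):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=12)
tiles = logits.reshape(3, 4)                       # 3 tiles of 4 tokens

# Stage 1: one weight per tile via log-sum-exp over the tile's logits.
tile_logits = np.log(np.exp(tiles).sum(axis=1))
p_tile = softmax(tile_logits)                      # P(tile)

# Stage 2: conditional softmax within each tile.
p_within = np.exp(tiles) / np.exp(tiles).sum(axis=1, keepdims=True)

# P(tile) * P(token | tile) recovers the global softmax exactly.
hierarchical = (p_tile[:, None] * p_within).ravel()
print(np.allclose(hierarchical, softmax(logits)))  # True
```

This identity is why sampling a tile first and then a token within it (as in online or tensor-parallel settings) introduces no approximation error.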

Memory and Compute Efficiency

By fusing sampling into the LM-head matmul and avoiding logits materialization, the method reduces HBM traffic and kernel overhead, addressing a critical bottleneck in large-vocabulary decoding.
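
A back-of-envelope calculation shows why logits materialization is costly; the numbers below are illustrative, not taken from the paper:

```python
# Illustrative decode step: batch 256, vocab 131072, fp16 logits.
batch, vocab, bytes_per_elem = 256, 131_072, 2
logits_bytes = batch * vocab * bytes_per_elem
print(f"{logits_bytes / 2**20:.0f} MiB")  # 64 MiB per decode step

# An unfused pipeline writes this tensor to HBM and reads it back in
# one or more sampling kernels; the fused epilogue avoids both trips.
```

At decode rates of hundreds of steps per second, avoiding that round trip removes gigabytes per second of HBM traffic that contributes nothing to the final sampled token.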

Cross-Generation Hardware Scalability

The approach demonstrates consistent performance gains across NVIDIA's Hopper (H100, H200) and Blackwell (B200, B300) generations, suggesting the design is robust to evolving GPU hardware rather than tuned to a single architecture.

Integration with Existing Frameworks

Compatibility with vLLM and support for grouped variants (online and tensor-parallel) suggest seamless integration potential with modern inference engines and distributed training frameworks.

Demerits

Complexity in Implementation

The tiled computation and hierarchical factorization introduce significant complexity in kernel design and optimization, potentially limiting adoption to teams with advanced GPU programming expertise.

Dependence on Hardware-Specific Optimizations

While results span multiple GPUs, the method may rely on specific hardware features (e.g., on-chip memory capacity, tensor cores) that could limit portability to non-NVIDIA architectures or future hardware with reduced on-chip memory.

Limited Evaluation Scope

The reported improvements are based on a subset of models and benchmarks; broader validation across diverse model families, languages, and use cases (e.g., real-time systems) is necessary to confirm generalizability.

Expert Commentary

FlashSampling represents a paradigm shift in how we approach exact sampling in large-vocabulary decoding. The authors elegantly exploit mathematical properties of the argmax operation to decompose a seemingly intractable problem—exact sampling without materializing logits—into a series of tractable, tile-wise computations. This is not merely an engineering optimization but a rethinking of the sampling pipeline's role within the decoder architecture. The fusion of sampling into the matmul kernel is particularly noteworthy, as it addresses a critical bottleneck in modern LLMs where memory bandwidth often throttles performance. While the method's reliance on hierarchical factorization and tiled computation may pose implementation challenges, the demonstrated speedups across multiple GPU generations underscore its robustness and future-proofing potential. Moreover, the alignment with hardware trends—such as increasing on-chip memory capacity—suggests that FlashSampling could become a cornerstone technique in next-generation inference engines. For practitioners, the key takeaway is that exactness and efficiency are not mutually exclusive; with careful algorithmic design, we can achieve both.

Recommendations

  • Researchers should explore extending FlashSampling to other probabilistic operations (e.g., top-p sampling, beam search) to assess its broader applicability beyond exact argmax sampling.
  • Engineering teams should invest in prototyping and benchmarking FlashSampling within their inference stacks, particularly for high-throughput or real-time applications where decoding latency is critical.
  • Hardware architects should evaluate the feasibility of dedicated accelerators or ISA extensions to natively support tiled, fused operations like those in FlashSampling, potentially unlocking further performance gains.
  • The academic community should develop standardized benchmarks that isolate and measure the impact of sampling techniques on end-to-end LLM performance, enabling fairer comparisons across methods.
