BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design
arXiv:2604.03957v1 Announce Type: new Abstract: Ultra low-bit quantization brings substantial efficiency for Transformer-based models, but the accuracy degradation and limited GPU support hinder its wide usage. In this paper, we analyze zero-point distortion in binarization and propose a Binary Weights & Ternary Activations (BWTA) quantization scheme, which projects tiny values to zero and preserves the accuracy of extremely low-bit models. For training, we propose Smooth Multi-Stage Quantization, combining a Levelwise Degradation Strategy and a Magnitude-Alignment Projection Factor to enable stable and fast convergence. For inference, we develop a BWTA MatMul CUDA kernel with instruction-level parallel bit-packing and comprehensive binary/ternary MatMul implementations for both linear and attention operators, allowing seamless integration across Transformer architectures. Experiments show that BWTA approaches full-precision performance for BERT, with an average 3.5% drop on GLUE and less than 2% drop on five tasks, and achieves comparable perplexity and accuracy for LLMs. In efficiency, it delivers 16 to 24 times kernel-level speedup over FP16 on NVIDIA GPUs, and 216 to 330 tokens/s end-to-end prefill speedup with lower memory footprint on LLMs. As an algorithm-hardware co-design, BWTA demonstrates practical, low-latency ultra-low-bit inference without sacrificing model quality.
Executive Summary
This paper proposes an algorithm-hardware co-design framework, Binary Weights & Ternary Activations (BWTA), which aims to bridge the accuracy-efficiency gap in ultra-low-bit quantization for Transformer-based models. BWTA preserves the accuracy of extremely low-bit models by projecting tiny activation values to zero, avoiding the zero-point distortion of pure binarization. The framework combines a Smooth Multi-Stage Quantization method for training with an optimized MatMul CUDA kernel for inference. Experiments show near-full-precision accuracy (an average 3.5% drop on GLUE for BERT) alongside 16-24x kernel-level speedups over FP16. This work has the potential to enable practical, low-latency ultra-low-bit inference across Transformer architectures without sacrificing model quality.
Key Points
- ▸ BWTA proposes a novel quantization scheme that preserves accuracy by projecting tiny values to zero.
- ▸ The framework combines Smooth Multi-Stage Quantization for training and optimized MatMul CUDA kernel for inference.
- ▸ Experiments demonstrate near-full-precision accuracy (average 3.5% GLUE drop for BERT) with 16-24x kernel-level speedups over FP16 on NVIDIA GPUs.
Merits
Improved Accuracy
BWTA's unique quantization scheme preserves accuracy by projecting tiny values to zero, enabling efficient ultra-low-bit inference without sacrificing model quality.
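The core idea, binary weights paired with ternary activations that map near-zero values to exactly zero, can be illustrated with a small NumPy sketch. The per-tensor scales (mean absolute value) and the threshold `delta` below are common choices in binary/ternary networks and are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def binarize_weights(w):
    """Binarize weights to alpha * {-1, +1}.

    alpha is a per-tensor scale set to mean |w| (a standard choice in
    binary networks; the paper's exact scaling rule is an assumption here).
    """
    alpha = np.abs(w).mean()
    # Map exact zeros to +1 so sign() never emits 0 for weights.
    return alpha * np.sign(np.where(w == 0, 1, w))

def ternarize_activations(x, delta=0.05):
    """Ternarize activations to beta * {-1, 0, +1}.

    Values with |x| <= delta are projected to zero, which is the paper's
    remedy for zero-point distortion in pure binarization. delta and the
    scale beta (mean |x| over the surviving entries) are hypothetical
    illustrative choices.
    """
    keep = np.abs(x) > delta
    beta = np.abs(x[keep]).mean() if keep.any() else 1.0
    return beta * np.sign(x) * keep
```

Tiny activations such as 0.01 thus quantize to an exact 0 instead of being forced to -1 or +1, which is the source of the accuracy preservation claimed above.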
Efficiency Gains
The BWTA MatMul CUDA kernel, with instruction-level parallel bit-packing, delivers 16-24x kernel-level speedups over FP16 and 216-330 tokens/s end-to-end prefill speedup on LLMs, while Smooth Multi-Stage Quantization enables stable, fast convergence during training.
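The speedups come from replacing floating-point multiply-accumulates with XOR and popcount on packed bitmasks. The scalar Python sketch below shows the arithmetic for one ternary-activation x binary-weight dot product; a real kernel would operate on 32/64-bit lanes with hardware popcount (e.g. CUDA's `__popc`), and the two-bitplane encoding used here (a nonzero mask plus a sign plane) is an illustrative assumption, not the paper's exact layout.

```python
import numpy as np

def pack_bits(v):
    """Pack a vector into a Python int bitmask: bit i is set iff v[i] > 0."""
    bits = 0
    for i, s in enumerate(v):
        if s > 0:
            bits |= 1 << i
    return bits

def ternary_binary_dot(x_tern, w_bin):
    """Dot product of a ternary vector (values in {-1, 0, +1}) with a
    binary vector (values in {-1, +1}) via XOR + popcount.

    Encoding (assumed for illustration): a mask plane marks nonzero
    activations; sign planes mark +1 entries. Positions where the signs
    differ contribute -1, positions where they agree contribute +1, and
    masked-out (zero) positions contribute nothing.
    """
    mask = pack_bits(np.abs(x_tern))   # bit set where activation != 0
    sx = pack_bits(x_tern)             # sign plane of activations
    sw = pack_bits(w_bin)              # sign plane of weights
    diff = (sx ^ sw) & mask            # disagreeing signs -> -1 each
    same = mask & ~diff                # agreeing signs    -> +1 each
    return bin(same).count("1") - bin(diff).count("1")
```

Because ternary zeros are simply masked out, the sparsity introduced by the zero projection costs nothing extra at inference time; the per-output-scalar work stays at a couple of bitwise ops plus two popcounts regardless of how many activations are zero.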
Demerits
Limited GPU Support
The framework's reliance on optimized CUDA kernels may limit its compatibility with non-NVIDIA GPUs.
Additional Training Complexity
The Smooth Multi-Stage Quantization method may introduce additional complexity during training, requiring careful tuning of hyperparameters.
Expert Commentary
BWTA's novel approach to ultra-low-bit quantization and optimized inference framework demonstrate a significant step forward in addressing the accuracy-efficiency tradeoff in Transformer-based models. However, the framework's reliance on optimized CUDA kernels and potential additional training complexity require careful consideration. As the field continues to evolve, the integration of BWTA with other quantization schemes and frameworks will be crucial for achieving further efficiency gains and improved model accuracy.
Recommendations
- ✓ Future work should focus on exploring the compatibility of BWTA with non-NVIDIA GPUs and developing strategies to mitigate potential training complexity issues.
- ✓ The framework's potential implications for policy and emerging technologies warrant further investigation, particularly in the context of responsible AI development and deployment.
Sources
Original: arXiv - cs.LG