Presentation
Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization
Description
Large language models (LLMs) have demonstrated impressive abilities in various domains, but their inference cost is high. Many previous studies exploit quantization to reduce LLM inference cost by reducing storage and accelerating computation. State-of-the-art methods apply 2-bit quantization to mainstream LLMs (\textit{e.g.}, Llama2-7b). However, challenges remain in reducing LLM inference cost with 2-bit quantization: \textbf{(1) Nonnegligible accuracy loss for 2-bit quantization.} Weights are quantized in groups, and the range of weights within some groups is large, resulting in large quantization errors and nonnegligible accuracy loss (\textit{e.g.}, >3\% for Llama2-7b with 2-bit quantization in GPTQ and Greenbit). \textbf{(2) Limited accuracy improvement from adding 4-bit weights.} Spending 10\% extra average bits on more 4-bit weights yields <0.5\% accuracy improvement on a quantized Llama2-7b model. \textbf{(3) Time-consuming dequantization operations on GPUs.} Mainstream methods require a dequantization operation to compute on the quantized weights, and a 2-order dequantization is applied because the scales of the groups are themselves quantized. These dequantization operations account for >50\% of execution time, hindering the potential of reducing LLM inference cost.
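To make challenge (1) concrete, here is a minimal NumPy sketch, not the authors' code: the group size of 128 and the value distributions are illustrative assumptions. It shows how a single wide-range group inflates group-wise 2-bit quantization error:

```python
import numpy as np

def quantize_group(w, bits):
    """Uniform asymmetric round-trip quantization of one weight group."""
    qmax = 2 ** bits - 1
    scale = (w.max() - w.min()) / qmax   # wide range -> large scale step
    zero = w.min()
    q = np.clip(np.round((w - zero) / scale), 0, qmax)
    return q * scale + zero              # dequantized values

rng = np.random.default_rng(0)
narrow = rng.normal(0.0, 0.01, 128)                                # typical group
wide = np.concatenate([rng.normal(0.0, 0.01, 126), [0.5, -0.5]])   # two large weights

for name, group in [("narrow-range group", narrow), ("wide-range group", wide)]:
    err = np.abs(quantize_group(group, bits=2) - group).mean()
    print(f"{name}: mean |error| at 2-bit = {err:.5f}")
```

With only four quantization levels, the two large weights stretch the scale step for the entire group, so the error of the wide-range group is orders of magnitude larger than that of the narrow one.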
To tackle these challenges and enable fast and low-cost LLM inference on GPUs, we propose the following techniques in this paper. \textbf{(1) Range-aware quantization with memory alignment.} We point out that the range of weights varies across groups. Thus, we quantize only the small fraction of groups with larger ranges using 4-bit, with memory alignment on GPUs taken into consideration. \textbf{(2) Accuracy-aware sparse outlier.} We point out that the distribution of sparse outliers with larger weights differs between 2-bit and 4-bit groups, and only a small fraction of outliers need to be kept in 16-bit. This design leads to >0.5\% accuracy improvement with a <3\% increase in average bits for Llama2-7b. \textbf{(3) Asynchronous dequantization.} We point out that calculating the scales of each group is independent of loading the weights of each group. Thus, we design an asynchronous dequantization scheme on GPUs, leading to up to 3.92$\times$ speedup. We conduct extensive experiments on different model families and model sizes. We achieve 2.85 average bits per weight, counting all scales/zeros, for different models. The end-to-end speedup for Llama2-7b is 1.74$\times$ over the original model, and we reduce runtime cost and hardware cost by up to 2.70$\times$ and 2.81$\times$, respectively, with fewer GPUs required. A sketch of the first two techniques follows below.
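Below is a minimal NumPy sketch of the offline side of techniques (1) and (2), not the authors' implementation: the group size, the 10\% 4-bit group fraction, and the 1\% outlier fraction are illustrative assumptions, and the 2-order quantization of the scales themselves is omitted. The comment in `dequantize` marks the independence that technique (3) exploits on the GPU:

```python
import numpy as np

GROUP = 128          # assumed group size
FRAC_4BIT = 0.10     # assumed fraction of groups promoted to 4-bit
FRAC_OUT = 0.01      # assumed fraction of weights kept as 16-bit outliers

def quantize(W):
    groups = W.reshape(-1, GROUP).copy()
    flat = groups.reshape(-1)

    # (2) Accuracy-aware sparse outlier: extract the largest-magnitude
    # weights into a sparse 16-bit structure before picking bit-widths.
    k = int(FRAC_OUT * flat.size)
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    outliers = (idx, flat[idx].astype(np.float16))
    flat[idx] = 0.0

    # (1) Range-aware bit allocation: the few groups whose remaining
    # range is widest get 4 bits; all other groups get 2 bits.
    ranges = groups.max(axis=1) - groups.min(axis=1)
    bits = np.full(len(groups), 2)
    bits[np.argsort(ranges)[-int(FRAC_4BIT * len(groups)):]] = 4

    q = np.empty_like(groups)
    scales = np.empty(len(groups))
    zeros = np.empty(len(groups))
    for i, g in enumerate(groups):
        qmax = 2 ** bits[i] - 1
        scales[i] = (g.max() - g.min()) / qmax or 1.0   # avoid zero scale
        zeros[i] = g.min()
        q[i] = np.clip(np.round((g - zeros[i]) / scales[i]), 0, qmax)
    return q, scales, zeros, outliers

def dequantize(q, scales, zeros, outliers, shape):
    # (3) In the paper's GPU kernel, computing each group's scale (itself
    # quantized, hence "2-order" dequantization) is independent of loading
    # that group's weights, so the two are overlapped asynchronously;
    # this offline sketch simply performs them in sequence.
    W = q * scales[:, None] + zeros[:, None]
    flat = W.reshape(-1)
    idx, vals = outliers
    flat[idx] = vals.astype(W.dtype)   # restore the sparse 16-bit outliers
    return flat.reshape(shape)

W = np.random.default_rng(0).normal(0.0, 0.02, (256, 256))
q, scales, zeros, outliers = quantize(W)
W_hat = dequantize(q, scales, zeros, outliers, W.shape)
print("mean |W_hat - W| =", np.abs(W_hat - W).mean())
```

Extracting the outliers before measuring group ranges matters: a single large weight would otherwise stretch its whole group's range and force it into the 4-bit bucket, which is why the sparse-outlier and range-aware steps are complementary.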
Event Type
Work-in-Progress Poster
Time
Tuesday, June 25, 6:00pm - 7:00pm PDT
Location
Level 2 Lobby
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security