

QUQ: Quadruplet Uniform Quantization for Efficient Vision Transformer Inference
Description
While exhibiting superior performance in many tasks, vision transformers (ViTs) face challenges in quantization. Some existing low-bit-width quantization techniques cannot effectively cover the whole inference process of ViTs, leading to additional memory overhead (22.3%-172.6%) compared with the corresponding fully quantized models. To address this issue, we propose quadruplet uniform quantization (QUQ) to handle data of various distributions in ViTs. QUQ divides the entire data range into at most four subranges, each uniformly quantized with its own scale factor. To determine the partition scheme and quantization parameters, an efficient relaxation algorithm is proposed accordingly. Moreover, dedicated encoding and decoding strategies are devised to facilitate the design of an efficient accelerator. Experimental results show that QUQ surpasses state-of-the-art quantization techniques; it is the first viable scheme that can fully quantize ViTs to 6-bit with acceptable accuracy. Compared with conventional uniform quantization, QUQ yields not only higher accuracy but also an accelerator with lower area and power.
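The abstract describes splitting a tensor's value range into at most four subranges, each quantized uniformly with its own scale factor. The paper's partition/relaxation algorithm and encoding scheme are not given here; the sketch below is only a minimal illustration of that general idea, assuming hypothetical quantile-based breakpoints and a shared per-subrange code width (the function name, `bits=6`, and the quantile partition are placeholder assumptions, not the authors' method).

```python
# Minimal sketch of up-to-four-subrange piecewise uniform quantization.
# NOTE: breakpoints chosen by simple quantiles as a placeholder assumption;
# the paper's relaxation algorithm for the partition is not reproduced here.
import numpy as np

def quadruplet_uniform_quantize(x, bits=6, num_ranges=4):
    """Quantize x by splitting its range into up to `num_ranges` subranges,
    each uniformly quantized with its own scale factor, then dequantize."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, num_ranges + 1))
    levels = 2 ** bits
    x_q = np.empty_like(x, dtype=np.float64)
    for i in range(num_ranges):
        lo, hi = edges[i], edges[i + 1]
        # Include the right edge only in the last subrange so every value
        # falls into exactly one subrange.
        mask = (x >= lo) & (x <= hi) if i == num_ranges - 1 else (x >= lo) & (x < hi)
        span = max(hi - lo, 1e-12)
        scale = span / (levels - 1)            # per-subrange scale factor
        codes = np.round((x[mask] - lo) / scale)  # integer codes within subrange
        x_q[mask] = codes * scale + lo            # dequantized values
    return x_q

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts = rng.standard_normal(10_000)         # stand-in for ViT activations
    deq = quadruplet_uniform_quantize(acts, bits=6)
    print("mean abs error:", np.mean(np.abs(acts - deq)))
```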
Event Type
Research Manuscript
Time
Tuesday, June 25, 11:00am - 11:15am PDT
Location
3002, 3rd Floor
Topics
Design
Keywords
AI/ML System and Platform Design