

QUQ: Quadruplet Uniform Quantization for Efficient Vision Transformer Inference
Description
While exhibiting superior performance in many tasks, vision transformers (ViTs) face challenges in quantization. Some existing low-bit-width quantization techniques cannot effectively cover the whole inference process of ViTs, leading to additional memory overhead (22.3%-172.6%) compared with the corresponding fully quantized models. To address this issue, we propose quadruplet uniform quantization (QUQ) to handle data of various distributions in ViTs. QUQ divides the entire data range into at most four subranges, each uniformly quantized with its own scale factor. To determine the partition scheme and quantization parameters, an efficient relaxation algorithm is proposed accordingly. Moreover, dedicated encoding and decoding strategies are devised to facilitate the design of an efficient accelerator. Experimental results show that QUQ surpasses state-of-the-art quantization techniques; it is the first viable scheme that can fully quantize ViTs to 6-bit with acceptable accuracy. Compared with conventional uniform quantization, QUQ yields not only higher accuracy but also an accelerator with lower area and power.
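The abstract describes splitting a tensor's value range into at most four subranges, each quantized uniformly with its own scale factor. The paper's partition/relaxation algorithm and encoding scheme are not given here; the sketch below is only a minimal illustration of that general idea, assuming hypothetical quantile-based breakpoints and a shared per-subrange code width (the function name, `bits=6`, and the quantile partition are placeholder assumptions, not the authors' method).

```python
# Minimal sketch of up-to-four-subrange piecewise uniform quantization.
# NOTE: breakpoints chosen by simple quantiles as a placeholder assumption;
# the paper's relaxation algorithm for the partition is not reproduced here.
import numpy as np

def quadruplet_uniform_quantize(x, bits=6, num_ranges=4):
    """Quantize x by splitting its range into up to `num_ranges` subranges,
    each uniformly quantized with its own scale factor, then dequantize."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, num_ranges + 1))
    levels = 2 ** bits
    x_q = np.empty_like(x, dtype=np.float64)
    for i in range(num_ranges):
        lo, hi = edges[i], edges[i + 1]
        # Include the right edge only in the last subrange so every value
        # falls into exactly one subrange.
        mask = (x >= lo) & (x <= hi) if i == num_ranges - 1 else (x >= lo) & (x < hi)
        span = max(hi - lo, 1e-12)
        scale = span / (levels - 1)            # per-subrange scale factor
        codes = np.round((x[mask] - lo) / scale)  # integer codes within subrange
        x_q[mask] = codes * scale + lo            # dequantized values
    return x_q

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts = rng.standard_normal(10_000)         # stand-in for ViT activations
    deq = quadruplet_uniform_quantize(acts, bits=6)
    print("mean abs error:", np.mean(np.abs(acts - deq)))
```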
Event Type
Research Manuscript
Time
Tuesday, June 25, 11:00am - 11:15am PDT
Location
3002, 3rd Floor
Topics
Design
Keywords
AI/ML System and Platform Design