
APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models
Description
Large Language Models have greatly advanced the natural language processing paradigm. However, their high computational load and huge model sizes pose a significant challenge for deployment on edge devices. To this end, we propose APTQ (Attention-aware Post-Training Mixed-Precision Quantization) for LLMs, which considers not only the second-order information of each layer's weights, but also, for the first time, the nonlinear effect of attention outputs on the entire model. We leverage the Hessian trace as a sensitivity metric for mixed-precision quantization. Experiments show that APTQ attains state-of-the-art zero-shot accuracy of 68.24% and 70.48% at an average bitwidth of 3.8 on LLaMa-7B and LLaMa-13B, respectively.
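To illustrate the idea of a Hessian-trace sensitivity metric, the Python sketch below (an illustrative assumption, not the authors' implementation) estimates each layer's Hessian trace with a Hutchinson-style probe and uses the scores to assign higher bitwidths to more sensitive layers. The toy model, calibration batch, and 2/4-bit split are placeholders for an LLM and its calibration data.

# A minimal sketch (not the authors' released code): per-layer Hessian-trace
# sensitivity via a Hutchinson-style estimator, followed by a simple bitwidth
# assignment. Model, data, and the 2/4-bit split are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for an LLM block and a calibration batch (assumption).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
loss = nn.functional.cross_entropy(model(x), y)

def hessian_trace(loss, param, n_probes=8):
    # Hutchinson estimator: trace(H) ~= E[v^T H v] with Rademacher probes v.
    grad = torch.autograd.grad(loss, param, create_graph=True)[0]
    estimate = 0.0
    for _ in range(n_probes):
        v = torch.randint_like(param, 2) * 2.0 - 1.0  # entries in {-1, +1}
        hv = torch.autograd.grad((grad * v).sum(), param, retain_graph=True)[0]
        estimate += (v * hv).sum().item()
    return estimate / n_probes

# Sensitivity score per weight matrix: Hessian trace normalized by size.
scores = {name: hessian_trace(loss, p) / p.numel()
          for name, p in model.named_parameters() if p.dim() == 2}

# Toy allocation rule (assumption): the more sensitive half of the layers
# keeps 4 bits, the rest drops to 2 bits, approximating a low average budget.
ranked = sorted(scores, key=scores.get, reverse=True)
bits = {name: (4 if i < len(ranked) // 2 else 2) for i, name in enumerate(ranked)}
print(bits)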
Event Type
Research Manuscript
Time
Tuesday, June 25, 2:24pm - 2:42pm PDT
Location
3001, 3rd Floor
Topics
AI
Keywords
AI/ML Algorithms