
APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models
Description
Large Language Models have greatly advanced the natural language processing paradigm. However, their high computational load and huge model sizes pose a significant challenge for deployment on edge devices. To this end, we propose APTQ (Attention-aware Post-Training Mixed-Precision Quantization) for LLMs, which considers not only the second-order information of each layer's weights, but also, for the first time, the nonlinear effect of attention outputs on the entire model. We leverage the Hessian trace as a sensitivity metric for mixed-precision quantization. Experiments show that APTQ attains state-of-the-art zero-shot accuracy of 68.24% and 70.48% at an average bitwidth of 3.8 on LLaMa-7B and LLaMa-13B, respectively.
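To illustrate the idea of a Hessian-trace sensitivity metric, the Python sketch below (an illustrative assumption, not the authors' implementation) estimates each layer's Hessian trace with a Hutchinson-style probe and uses the scores to assign higher bitwidths to more sensitive layers. The toy model, calibration batch, and 2/4-bit split are placeholders for an LLM and its calibration data.

# A minimal sketch (not the authors' released code): per-layer Hessian-trace
# sensitivity via a Hutchinson-style estimator, followed by a simple bitwidth
# assignment. Model, data, and the 2/4-bit split are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for an LLM block and a calibration batch (assumption).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
loss = nn.functional.cross_entropy(model(x), y)

def hessian_trace(loss, param, n_probes=8):
    # Hutchinson estimator: trace(H) ~= E[v^T H v] with Rademacher probes v.
    grad = torch.autograd.grad(loss, param, create_graph=True)[0]
    estimate = 0.0
    for _ in range(n_probes):
        v = torch.randint_like(param, 2) * 2.0 - 1.0  # entries in {-1, +1}
        hv = torch.autograd.grad((grad * v).sum(), param, retain_graph=True)[0]
        estimate += (v * hv).sum().item()
    return estimate / n_probes

# Sensitivity score per weight matrix: Hessian trace normalized by size.
scores = {name: hessian_trace(loss, p) / p.numel()
          for name, p in model.named_parameters() if p.dim() == 2}

# Toy allocation rule (assumption): the more sensitive half of the layers
# keeps 4 bits, the rest drops to 2 bits, approximating a low average budget.
ranked = sorted(scores, key=scores.get, reverse=True)
bits = {name: (4 if i < len(ranked) // 2 else 2) for i, name in enumerate(ranked)}
print(bits)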
Event Type
Research Manuscript
Time
Tuesday, June 25, 2:24pm - 2:42pm PDT
Location
3001, 3rd Floor
Topics
AI
Keywords
AI/ML Algorithms