

Oltron: Algorithm-Hardware Co-design for Outlier-Aware Quantization of LLMs with Inter-/Intra-Layer Adaptation
Description
The recent breakthroughs in the field of large language models (LLMs) owe much of their accomplishments to the exponential growth in model size (240× every two years), creating a significant challenge in computation and memory complexity for today's hardware. Quantization has emerged as a critical technique for reducing these complexities. However, existing approaches mainly employ fixed quantization schemes, which are inefficient in that they require more bits to maintain model accuracy. In this work, we delve into the dynamics and heterogeneity present in both inter- and intra-layer distributions, focusing in particular on the highly dynamic range and composition of the extremely large values, commonly referred to as outliers.
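To make the outlier problem concrete, the following Python sketch (illustrative only, not from the paper; all values are made up) shows how a single extreme value stretches the scale of a fixed low-bit quantizer, leaving too few levels for the many small values and inflating the reconstruction error:

```python
# Minimal sketch: why a fixed low-bit scheme struggles with outliers.
# One large value stretches the symmetric INT4 range, so the remaining
# small values land on only a few quantization levels.
import numpy as np

def quantize_symmetric(x, bits):
    """Uniform symmetric quantization with a single per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for INT4
    scale = np.abs(x).max() / qmax      # scale is set by the largest magnitude
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale                    # dequantized values

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 0.1, size=1024)   # typical small-valued activations
with_outlier = normal.copy()
with_outlier[0] = 8.0                      # one extreme outlier

for name, x in [("no outlier", normal), ("with outlier", with_outlier)]:
    err = np.mean((x - quantize_symmetric(x, 4)) ** 2)
    print(f"{name:>12}: INT4 MSE = {err:.6f}")
```

Running the sketch shows the mean-squared error jumping by orders of magnitude once the outlier dominates the scale, which is the inefficiency a fixed scheme can only fix by spending more bits.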
We propose Oltron, an algorithm/hardware co-design solution for outlier-aware quantization of LLMs with inter-/intra-layer adaptation. Oltron employs a holistic quantization framework with three key innovations. First, we propose a novel quantization algorithm capable of determining the optimal composition ratio of outliers across different layers and across channel groups within a layer. Second, we propose a reconfigurable architecture that can adapt its computation fabric to inter- and intra-layer distributions. Third, we propose a tile-based dataflow optimizer that meticulously plans the complicated computation and memory access schedules for mixed-precision tensors. Oltron is demonstrated to surpass the existing outlier-aware accelerator OliVe with a 1.9× performance improvement and a 1.6× energy efficiency improvement, while achieving superior model accuracy.
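The sketch below is a simplified, hypothetical rendering of the first idea, per-group outlier-ratio selection; it is not Oltron's actual algorithm, and the function names, candidate ratios, and tolerance are assumptions. Each channel group keeps its largest-magnitude values in high precision and low-bit quantizes the rest, choosing the smallest outlier ratio that meets an error tolerance, so groups with heavy-tailed distributions receive a larger outlier budget than well-behaved ones:

```python
# Minimal sketch (assumed, simplified): choose a per-group outlier ratio by
# keeping the top-|x| values in high precision and low-bit quantizing the rest.
import numpy as np

def quantize_group(x, bits, outlier_ratio):
    """Low-bit quantize a channel group, keeping the `outlier_ratio` fraction
    of largest-magnitude values untouched (stand-in for a high-precision
    outlier format)."""
    n_out = int(np.ceil(outlier_ratio * x.size))
    inlier_idx = np.argsort(np.abs(x))[:x.size - n_out]   # all but the top-|x|
    inliers = x[inlier_idx]
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(inliers).max(), 1e-12) / qmax
    out = x.astype(float).copy()                           # outliers kept exactly
    out[inlier_idx] = np.clip(np.round(inliers / scale), -qmax - 1, qmax) * scale
    return out

def choose_ratio(x, bits=4, candidates=(0.0, 0.01, 0.02, 0.05), tol=1e-4):
    """Smallest candidate ratio whose reconstruction error meets the tolerance."""
    for r in candidates:
        if np.mean((x - quantize_group(x, bits, r)) ** 2) <= tol:
            return r
    return candidates[-1]

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=(8, 256))   # 8 channel groups
weights[2, :4] *= 40                             # group 2 contains outliers
print("per-group outlier ratios:", [choose_ratio(g) for g in weights])
```

In this toy setting, only the group with injected outliers is assigned a nonzero outlier ratio; the paper's contribution is doing this allocation jointly across layers and channel groups, and building hardware and dataflow that exploit the resulting mixed-precision layout.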
Event Type
Research Manuscript
Time
Thursday, June 27, 10:30am - 10:45am PDT
Location
3003, 3rd Floor
Topics
AI
Design
Keywords
AI/ML Architecture Design