

Oltron: Algorithm-Hardware Co-design for Outlier-Aware Quantization of LLMs with Inter-/Intra-Layer Adaptation
Description
The recent breakthroughs in the field of large language models (LLMs) owe much of their accomplishments to the exponential growth in model size (240× every two years), creating a significant challenge in computation and memory complexity for today's hardware. Quantization has emerged as a critical technique for reducing these complexities. However, existing approaches mainly employ fixed quantization schemes, which are inefficient in that they require more bits to maintain model accuracy. In this work, we delve into the dynamics and heterogeneity present in both inter- and intra-layer distributions, focusing in particular on the highly dynamic range and composition of the extremely large values, commonly referred to as outliers.
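To make the outlier problem concrete, the following Python sketch (illustrative only, not from the paper; all values are made up) shows how a single extreme value stretches the scale of a fixed low-bit quantizer, leaving too few levels for the many small values and inflating the reconstruction error:

```python
# Minimal sketch: why a fixed low-bit scheme struggles with outliers.
# One large value stretches the symmetric INT4 range, so the remaining
# small values land on only a few quantization levels.
import numpy as np

def quantize_symmetric(x, bits):
    """Uniform symmetric quantization with a single per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for INT4
    scale = np.abs(x).max() / qmax      # scale is set by the largest magnitude
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale                    # dequantized values

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 0.1, size=1024)   # typical small-valued activations
with_outlier = normal.copy()
with_outlier[0] = 8.0                      # one extreme outlier

for name, x in [("no outlier", normal), ("with outlier", with_outlier)]:
    err = np.mean((x - quantize_symmetric(x, 4)) ** 2)
    print(f"{name:>12}: INT4 MSE = {err:.6f}")
```

Running the sketch shows the mean-squared error jumping by orders of magnitude once the outlier dominates the scale, which is the inefficiency a fixed scheme can only fix by spending more bits.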
We propose Oltron, an algorithm/hardware co-design solution for outlier-aware quantization of LLMs with inter-/intra-layer adaptation. Oltron employs a holistic quantization framework with three key innovations. First, we propose a novel quantization algorithm capable of determining the optimal composition ratio of outliers across different layers and across channel groups within a layer. Second, we propose a reconfigurable architecture that can adapt its computation fabric to inter- and intra-layer distributions. Third, we propose a tile-based dataflow optimizer that meticulously plans the complicated computation and memory access schedules for mixed-precision tensors. Oltron is demonstrated to surpass the existing outlier-aware accelerator OliVe with a 1.9× performance improvement and a 1.6× energy efficiency improvement, while achieving superior model accuracy.
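The sketch below is a simplified, hypothetical rendering of the first idea, per-group outlier-ratio selection; it is not Oltron's actual algorithm, and the function names, candidate ratios, and tolerance are assumptions. Each channel group keeps its largest-magnitude values in high precision and low-bit quantizes the rest, choosing the smallest outlier ratio that meets an error tolerance, so groups with heavy-tailed distributions receive a larger outlier budget than well-behaved ones:

```python
# Minimal sketch (assumed, simplified): choose a per-group outlier ratio by
# keeping the top-|x| values in high precision and low-bit quantizing the rest.
import numpy as np

def quantize_group(x, bits, outlier_ratio):
    """Low-bit quantize a channel group, keeping the `outlier_ratio` fraction
    of largest-magnitude values untouched (stand-in for a high-precision
    outlier format)."""
    n_out = int(np.ceil(outlier_ratio * x.size))
    inlier_idx = np.argsort(np.abs(x))[:x.size - n_out]   # all but the top-|x|
    inliers = x[inlier_idx]
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(inliers).max(), 1e-12) / qmax
    out = x.astype(float).copy()                           # outliers kept exactly
    out[inlier_idx] = np.clip(np.round(inliers / scale), -qmax - 1, qmax) * scale
    return out

def choose_ratio(x, bits=4, candidates=(0.0, 0.01, 0.02, 0.05), tol=1e-4):
    """Smallest candidate ratio whose reconstruction error meets the tolerance."""
    for r in candidates:
        if np.mean((x - quantize_group(x, bits, r)) ** 2) <= tol:
            return r
    return candidates[-1]

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=(8, 256))   # 8 channel groups
weights[2, :4] *= 40                             # group 2 contains outliers
print("per-group outlier ratios:", [choose_ratio(g) for g in weights])
```

In this toy setting, only the group with injected outliers is assigned a nonzero outlier ratio; the paper's contribution is doing this allocation jointly across layers and channel groups, and building hardware and dataflow that exploit the resulting mixed-precision layout.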
Event Type
Research Manuscript
Time
Thursday, June 27, 10:30am - 10:45am PDT
Location
3003, 3rd Floor
Topics
AI
Design
Keywords
AI/ML Architecture Design