FOTA-Quant: FPGA-Oriented Token Adaptive Quantization Framework for the Acceleration of LLMs
Description
Large Language Models (LLMs) have become popular and are widely used in creative ways because of their powerful capabilities. However, their substantial model size and complexity prevent LLMs from being deployed efficiently on resource-constrained computing devices, making it challenging to sustain their impressive task performance. Field-Programmable Gate Arrays (FPGAs), which are well suited to low-latency processing but offer only finite logic, memory, and bandwidth resources, have become an intriguing platform for implementing LLMs. In this paper, we propose FOTA-Quant, an FPGA-Oriented Token Adaptive Quantization framework for accelerating LLMs on resource-constrained FPGAs. On the algorithm level, to fit the model into the memory of a single FPGA, we minimize the model size by quantizing the weights to INT4. To further reduce model complexity while maintaining task performance, we apply a mixed-precision scheme with error-regularized pruning to the activations. On the hardware level, we propose a general-precision matrix multiplication engine that supports 8x8, 4x8, and 4x4 precision combinations, and we optimize resource utilization on chiplet-based (multi-die) FPGAs to improve overall quantized-LLM performance. Experiments show that FOTA-Quant quantizes model weights and activations simultaneously while maintaining task performance comparable to existing weight-only quantization methods. Moreover, FOTA-Quant achieves an on-FPGA speedup of up to 5.21x over its FP16 counterpart, marking a pioneering advancement in this domain.
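
As a rough illustration of the INT4 weight-quantization step described above, the NumPy sketch below shows plain symmetric per-output-channel 4-bit quantization. It is a simplified stand-in under our own assumptions, not the FOTA-Quant algorithm itself (which additionally covers token-adaptive mixed-precision activations and error-regularized pruning); the names quantize_weights_int4 and dequantize are hypothetical.

# Hypothetical sketch: symmetric per-output-channel INT4 weight quantization.
# This is a generic illustration, not the FOTA-Quant method from the poster.
import numpy as np

def quantize_weights_int4(w):
    """Map an FP32 weight matrix to signed 4-bit codes plus per-row FP scales."""
    qmax = 7                                    # signed INT4 range is [-8, 7]
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)    # guard against all-zero rows
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale                             # codes fit in 4 bits, stored here in int8

def dequantize(q, scale):
    """Reconstruct an approximate FP16 weight matrix from codes and scales."""
    return (q.astype(np.float32) * scale).astype(np.float16)

# Usage: quantize a random projection matrix and report the reconstruction error.
w = np.random.randn(1024, 1024).astype(np.float32)
q, s = quantize_weights_int4(w)
err = np.abs(dequantize(q, s).astype(np.float32) - w).mean()
print(f"mean absolute quantization error: {err:.4f}")
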
Event Type
Work-in-Progress Poster
Time
Wednesday, June 26, 5:00pm - 6:00pm PDT
Location
Level 2 Lobby
Topics
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security