FOTA-Quant: FPGA-Oriented Token Adaptive Quantization Framework for the Acceleration of LLMs
Description
Large Language Models (LLMs) have become popular and are widely used in creative ways because of their powerful capabilities. However, their substantial model size and complexity prevent LLMs from being deployed efficiently on resource-constrained computing devices, making it challenging to sustain their impressive task performance. Field-Programmable Gate Arrays (FPGAs), which are well suited to low-latency processing but offer only finite logic, memory, and bandwidth resources, have become an intriguing platform for implementing LLMs. In this paper, we propose FOTA-Quant, an FPGA-Oriented Token Adaptive Quantization framework for accelerating LLMs on resource-constrained FPGAs. On the algorithm level, to fit the model into the memory of a single FPGA, we minimize the model size by quantizing the weights to INT4. To further reduce model complexity while maintaining task performance, we apply a mixed-precision scheme with error-regularized pruning to the activations. On the hardware level, we propose a general-precision matrix multiplication engine that supports 8x8, 4x8, and 4x4 precision combinations, and we optimize resource utilization on chiplet-based (multi-die) FPGAs to improve overall quantized-LLM performance. Experiments show that FOTA-Quant quantizes model weights and activations simultaneously while maintaining task performance comparable to existing weight-only quantization methods. Moreover, FOTA-Quant achieves an on-FPGA speedup of up to 5.21x over its FP16 counterpart, marking a pioneering advancement in this domain.
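
As a rough illustration of the INT4 weight-quantization step described above, the NumPy sketch below shows plain symmetric per-output-channel 4-bit quantization. It is a simplified stand-in under our own assumptions, not the FOTA-Quant algorithm itself (which additionally covers token-adaptive mixed-precision activations and error-regularized pruning); the names quantize_weights_int4 and dequantize are hypothetical.

# Hypothetical sketch: symmetric per-output-channel INT4 weight quantization.
# This is a generic illustration, not the FOTA-Quant method from the poster.
import numpy as np

def quantize_weights_int4(w):
    """Map an FP32 weight matrix to signed 4-bit codes plus per-row FP scales."""
    qmax = 7                                    # signed INT4 range is [-8, 7]
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)    # guard against all-zero rows
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale                             # codes fit in 4 bits, stored here in int8

def dequantize(q, scale):
    """Reconstruct an approximate FP16 weight matrix from codes and scales."""
    return (q.astype(np.float32) * scale).astype(np.float16)

# Usage: quantize a random projection matrix and report the reconstruction error.
w = np.random.randn(1024, 1024).astype(np.float32)
q, s = quantize_weights_int4(w)
err = np.abs(dequantize(q, s).astype(np.float32) - w).mean()
print(f"mean absolute quantization error: {err:.4f}")
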
Event Type
Work-in-Progress Poster
Time
Wednesday, June 26, 5:00pm - 6:00pm PDT
Location
Level 2 Lobby
Topics
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security