
CIM for LLM: A Compute-In-Memory Architecture for Efficient Large Language Model Inference
Description
In large language model (LLM) inference, the high computational demand and the extensive memory required for weights and key-value (KV) cache storage present significant challenges. The issue is especially acute when relying exclusively on GPUs, which often cannot hold the entire KV cache, particularly for larger LLMs. Without direct GPU-to-GPU interconnects such as NVLink, inference typically offloads the KV cache to the CPU for storage and computation, then transfers the multi-head attention results back to the GPU for the remaining transformer computations. Because attention-score computation is expensive on the CPU and incurs substantial data movement between the KV cache and the compute units, computing the attention scores, and even the feed-forward layers, directly on Compute-in-Memory (CIM) systems emerges as a viable alternative. This paper is among the first to integrate CIM technology into LLM inference, proposing an architecture that leverages this emerging technology to improve inference efficiency. Specifically, we present a tailored CIM-based dataflow and hierarchy design to optimize the computation of attention scores and feed-forward layers on CIM arrays. The results show substantial improvements: 0.021× the inference latency and 1.23 × 10^{-4}× the energy of a CPU-based implementation.
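To make the offloaded workload concrete, the sketch below (not taken from the paper; the function name, shapes, and single-head setup are illustrative assumptions) shows attention over a KV cache held outside the GPU. The two matrix-vector products are the memory-bound steps that the baseline executes on the CPU and that the proposed architecture would instead map onto CIM arrays.

```python
import numpy as np

def attention_with_offloaded_kv(q, k_cache, v_cache):
    """Single-head attention over an offloaded KV cache (illustrative sketch).

    q:       (d,)   query for the current token (resides on the GPU)
    k_cache: (t, d) cached keys   (offloaded to CPU memory / CIM arrays)
    v_cache: (t, d) cached values (offloaded to CPU memory / CIM arrays)

    The products k_cache @ q and probs @ v_cache are the memory-bound
    operations a CIM-based design could compute in place, avoiding the
    KV-cache traffic between memory and the CPU cores.
    """
    d = q.shape[-1]
    scores = k_cache @ q / np.sqrt(d)   # (t,) attention scores
    scores -= scores.max()              # numerical stability for softmax
    probs = np.exp(scores)
    probs /= probs.sum()                # softmax over cached tokens
    return probs @ v_cache              # (d,) output sent back to the GPU

# Toy usage: 8 cached tokens, head dimension 64
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
k = rng.standard_normal((8, 64))
v = rng.standard_normal((8, 64))
out = attention_with_offloaded_kv(q, k, v)
print(out.shape)  # (64,)
```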
Event Type
Work-in-Progress Poster
Time
Wednesday, June 26, 5:00pm - 6:00pm PDT
Location
Level 2 Lobby
Topics
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security