
CIM for LLM: A Compute-In-Memory Architecture for Efficient Large Language Model Inference
Description
In large language model (LLM) inference, the high computational demand and the extensive memory required for weights and key-value (KV) cache storage present significant challenges. The issue is especially acute when relying exclusively on GPUs, which often cannot hold the entire KV cache, particularly for larger LLMs. Without direct GPU-to-GPU interconnects such as NVLink, inference typically offloads the KV cache to the CPU for storage and computation, then transfers the multi-head attention results back to the GPU for the remaining transformer computations. Because attention-score computation is expensive on the CPU and incurs substantial data movement between the KV cache and the compute units, computing the attention scores, and even the feed-forward layers, directly on Compute-in-Memory (CIM) systems emerges as a viable alternative. This paper is among the first to integrate CIM technology into LLM inference, proposing an architecture that leverages this emerging technology to improve inference efficiency. Specifically, we present a tailored CIM-based dataflow and hierarchy design to optimize the computation of attention scores and feed-forward layers on CIM arrays. The results show substantial improvements: 0.021× the inference latency and 1.23 × 10^{-4}× the energy of a CPU-based implementation.
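To make the offloaded workload concrete, the sketch below (not taken from the paper; the function name, shapes, and single-head setup are illustrative assumptions) shows attention over a KV cache held outside the GPU. The two matrix-vector products are the memory-bound steps that the baseline executes on the CPU and that the proposed architecture would instead map onto CIM arrays.

```python
import numpy as np

def attention_with_offloaded_kv(q, k_cache, v_cache):
    """Single-head attention over an offloaded KV cache (illustrative sketch).

    q:       (d,)   query for the current token (resides on the GPU)
    k_cache: (t, d) cached keys   (offloaded to CPU memory / CIM arrays)
    v_cache: (t, d) cached values (offloaded to CPU memory / CIM arrays)

    The products k_cache @ q and probs @ v_cache are the memory-bound
    operations a CIM-based design could compute in place, avoiding the
    KV-cache traffic between memory and the CPU cores.
    """
    d = q.shape[-1]
    scores = k_cache @ q / np.sqrt(d)   # (t,) attention scores
    scores -= scores.max()              # numerical stability for softmax
    probs = np.exp(scores)
    probs /= probs.sum()                # softmax over cached tokens
    return probs @ v_cache              # (d,) output sent back to the GPU

# Toy usage: 8 cached tokens, head dimension 64
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
k = rng.standard_normal((8, 64))
v = rng.standard_normal((8, 64))
out = attention_with_offloaded_kv(q, k, v)
print(out.shape)  # (64,)
```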
Event Type
Work-in-Progress Poster
Time
Wednesday, June 26, 5:00pm - 6:00pm PDT
Location
Level 2 Lobby
Topics
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security