
Distributed Inference of DL Workloads on CIM-based Heterogeneous Accelerators
Description
The remarkable advances in neural network accuracy have driven a revolution in their architectures, demanding ever-expanding memory and computational resources. As current hardware confronts limits in memory capacity and processing capability, one promising solution is to distribute neural network inference across multiple devices. Most prior efforts have focused on optimizing single-device inference or on partitioning models solely to improve inference throughput. This work proposes a framework that searches for optimal model splits and distributes the partitions across combinations of a given set of devices while accounting for both throughput and energy. Participating devices are strategically grouped into homogeneous and heterogeneous clusters consisting of general-purpose CPU and GPU architectures, as well as emerging Compute-In-Memory (CIM) accelerators. The framework jointly optimizes inference throughput and energy consumption through a weighting control parameter. Compared to a single GPU, it achieves up to a 4$\times$ speedup with approximately 4$\times$ per-device energy reduction in a heterogeneous setup. The algorithm also traces a smooth Pareto-like curve in the throughput-energy space for CIM devices.
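
The description mentions a weighting control parameter that trades off throughput against energy when scoring candidate model splits. The sketch below shows one way such a weighted objective and an exhaustive split search could look; it is not the authors' implementation, and every name (Device, throughput, energy_per_layer, alpha, best_split) and the cost model itself are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code) of a weighted
# throughput/energy objective driving a search over model splits.
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass
class Device:
    name: str                 # e.g. "GPU", "CPU", "CIM" (illustrative)
    throughput: float         # layers processed per second (assumed model)
    energy_per_layer: float   # joules per layer (assumed model)


def cost(split: Tuple[int, ...], devices: List[Device], n_layers: int,
         alpha: float) -> float:
    """Weighted cost: alpha trades off pipeline latency against energy.

    `split` holds the layer indices where each device's partition ends.
    """
    bounds = (0,) + split + (n_layers,)
    stage_latency, total_energy = 0.0, 0.0
    for dev, (lo, hi) in zip(devices, zip(bounds, bounds[1:])):
        layers = hi - lo
        # Pipeline throughput is limited by the slowest stage.
        stage_latency = max(stage_latency, layers / dev.throughput)
        total_energy += layers * dev.energy_per_layer
    return alpha * stage_latency + (1.0 - alpha) * total_energy


def best_split(devices: List[Device], n_layers: int, alpha: float):
    """Exhaustively try all cut points (feasible for small layer counts)."""
    cuts = range(1, n_layers)
    best = min(combinations(cuts, len(devices) - 1),
               key=lambda s: cost(s, devices, n_layers, alpha))
    return best, cost(best, devices, n_layers, alpha)


if __name__ == "__main__":
    devs = [Device("GPU", 120.0, 0.8),
            Device("CIM", 90.0, 0.2),
            Device("CPU", 30.0, 0.5)]
    split, c = best_split(devs, n_layers=24, alpha=0.5)
    print("cut points:", split, "weighted cost:", round(c, 3))
```

Sweeping the weighting parameter (alpha above) between its extremes in such a search is what would trace out a Pareto-like curve in the throughput-energy space of the kind the description reports for CIM devices.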
Event Type
Work-in-Progress Poster
Time
Wednesday, June 26, 5:00pm - 6:00pm PDT
Location
Level 2 Lobby
Topics
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security