Presentation

DySpMM: From Fix to Dynamic for Sparse Matrix-Matrix Multiplication Accelerators

Session

It's Not 8b Retro-Gaming, It's State-Of-The-Art Architectures Using Quantization, Sparsity, and Compression!

Description

Sparse Matrix-Matrix Multiplication (SpMM) is a key operator in many fields, and it exhibits dynamic behavior in sparsity, element distribution, and data dependency. Previous studies have proposed FPGA-based SpMM accelerators with fixed configurations, leaving three major challenges unsolved: 1) Partitioning matrices with a fixed sub-matrix size leads to performance loss, because the optimal feasible sub-matrix size for minimizing memory access varies with dynamic sparsity. 2) The fixed row-based allocation scheme of streaming architectures leads to unbalanced workloads, because the element distribution across sparse matrix rows is dynamic. 3) Data conflicts prevent the elements of one row from being processed consecutively, so architectures with a fixed execution order rely on time-consuming pre-processing to deal with dynamic data dependency.

Motivated by the observation that fixed configurations lead to performance loss, we propose DySpMM, which introduces a dynamic design methodology to SpMM architectures. A configurable data distribution data path enables a dynamic sub-matrix size, achieving up to 3.43× speed-up. An element-wise allocation unit is introduced in hardware for dynamic workload balancing, improving utilization by up to 3.74×. An interleaved reorder unit automatically reorders the sparse elements and dynamically avoids conflicts, completely eliminating the pre-processing overhead. Evaluation on FPGA shows that DySpMM achieves 1.42× the geomean throughput of the state-of-the-art accelerator Sextans and 1.78× the energy efficiency of the V100S GPU.
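
The workload-balancing argument can be illustrated with a small software model. The sketch below is not the DySpMM hardware design; it is a minimal illustration that assumes each nonzero costs one cycle on a processing element (PE), that row-based allocation assigns row i to PE i mod P, and that element-wise allocation deals nonzeros to PEs without regard to row boundaries. The function names and the example row lengths are hypothetical, and the model ignores the accumulation conflicts that the abstract says the reorder unit handles.

```python
from typing import List


def row_based_loads(row_nnz: List[int], num_pes: int) -> List[int]:
    """Assign whole rows to PEs round-robin (row i goes to PE i % num_pes)."""
    loads = [0] * num_pes
    for i, nnz in enumerate(row_nnz):
        loads[i % num_pes] += nnz
    return loads


def element_wise_loads(row_nnz: List[int], num_pes: int) -> List[int]:
    """Deal individual nonzeros to PEs round-robin, ignoring row boundaries."""
    base, extra = divmod(sum(row_nnz), num_pes)
    return [base + (1 if p < extra else 0) for p in range(num_pes)]


def utilization(loads: List[int]) -> float:
    """Mean load divided by max load: the fraction of PE-cycles doing useful work."""
    peak = max(loads)
    return sum(loads) / (len(loads) * peak) if peak else 1.0


# Hypothetical skewed matrix: one dense row and several nearly empty rows.
row_nnz = [16, 1, 1, 2, 1, 1, 1, 1]
row_loads = row_based_loads(row_nnz, 4)
elem_loads = element_wise_loads(row_nnz, 4)

print("row-based loads:   ", row_loads, "utilization:", round(utilization(row_loads), 3))
print("element-wise loads:", elem_loads, "utilization:", round(utilization(elem_loads), 3))
```

In this toy case the PE holding the dense row becomes the bottleneck under row-based allocation, while element-wise allocation spreads the same nonzeros evenly. The numbers depend entirely on the chosen example and are not related to the figures reported in the abstract.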

Event Type

Research Manuscript

Time

Thursday, June 27, 11:15am - 11:30am PDT

Location

3003, 3rd Floor

DAC 2024