
SASDynabLE: A Compact Transformer Inference Architecture with Saturation-Approximate Softmax Enabling Dynamic-Mapping Based Layer-Fusion Execution

Description
Transformer neural networks achieve high performance on a wide range of machine learning tasks, including natural language processing (NLP) and computer vision (CV); however, they rely heavily on non-linear layers such as softmax, which incur high latency and energy consumption due to limited data reuse and a pronounced pipeline bottleneck. Prior work on softmax approximation pays little attention to the high cost of memory accesses and overlooks, at the dataflow level, how the presence of softmax stalls the attention pipeline, an effect that matters more than softmax's arithmetic itself. We present SASDynabLE, a hardware/software co-design of softmax computation and subsequent layer fusion: a compact transformer inference architecture that hides the latency of the softmax denominator accumulation by postponing the normalization stage and avoiding the maximum-value search, while keeping a highly parallel computing pipeline free of congestion from memory-intensive operations. The expensive exponential function is replaced by a linear approximation for large values, reducing the energy and area cost of softmax evaluation. We show that attention dataflow with SASDynabLE achieves up to 1.84× speedup over the prior state-of-the-art ASIC design, with improved energy efficiency at large array sizes, while maintaining high model accuracy.
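As a rough illustration of the deferred-normalization idea described in the abstract, the sketch below (Python/NumPy, not the authors' hardware algorithm; `piecewise_exp`, the `knee` threshold, and `attention_deferred_norm` are hypothetical names and assumptions) accumulates unnormalized exponent scores and the denominator in the same pass, divides only once at the end, skips the max-subtraction pass, and substitutes a linear segment for the exponential outside a small central range.

```python
import numpy as np

def piecewise_exp(x, knee=2.0):
    """Hypothetical stand-in for a saturation-approximate exponential:
    exact exp only on [-knee, knee], cheap linear segments outside it."""
    x = np.asarray(x, dtype=np.float32)
    y = np.exp(np.clip(x, -knee, knee))                 # exact exp on the central range
    hi = x > knee
    y[hi] = np.exp(knee) * (1.0 + (x[hi] - knee))       # linear extension for large inputs
    lo = x < -knee
    y[lo] = np.maximum(np.exp(-knee) * (1.0 + (x[lo] + knee)), 0.0)  # linear, saturating at 0
    return y

def attention_deferred_norm(Q, K, V):
    """Attention with the softmax normalization postponed: unnormalized
    weights are multiplied into V immediately while the row sums accumulate
    alongside, so the pipeline never waits on the full denominator and no
    maximum-value search is performed."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])    # attention scores
    E = piecewise_exp(S)                  # unnormalized weights (no max subtraction)
    Y = E @ V                             # accumulate weighted values right away
    Z = E.sum(axis=-1, keepdims=True)     # denominator accumulated in parallel
    return Y / Z                          # single deferred normalization step

# Example usage on random data
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)).astype(np.float32) for _ in range(3))
out = attention_deferred_norm(Q, K, V)
```

This is only a functional sketch under those assumptions; the actual design maps these steps onto a fused, dynamically scheduled hardware pipeline.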
Event Type
Work-in-Progress Poster
Time
Tuesday, June 25, 6:00pm - 7:00pm PDT
Location
Level 2 Lobby
Topics
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security