

SASDynabLE: A Compact Transformer Inference Architecture with Saturation-Approximate Softmax Enabling Dynamic-Mapping Based Layer-Fusion Execution
Description
Transformer neural networks achieve high performance on a wide range of machine learning tasks, including natural language processing (NLP) and computer vision (CV). Compared to convolutional neural networks (CNNs), Transformers rely more heavily on non-linear layers such as softmax, which incurs greater latency and energy consumption due to limited data reuse and a pronounced pipeline bottleneck. Prior work on approximating softmax has not addressed its high memory-access cost and has overlooked, at the dataflow level, how softmax stalls the attention pipeline, an effect that outweighs the cost of the softmax computation itself. We present POEM, a hardware/software co-design for softmax computation and subsequent layer fusion. POEM hides the latency of softmax denominator accumulation by postponing the normalization stage and eliminating the maximum-value search, and it keeps the highly parallel compute pipeline free from congestion caused by memory-intensive operations. The expensive exponential function is replaced by a linear approximation for large inputs, which reduces LUT usage and improves energy efficiency. We show that attention dataflow with POEM achieves up to 1.84x speedup over a prior state-of-the-art ASIC design with minimal additional energy overhead, while maintaining high model accuracy.
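The deferred-normalization idea mentioned in the description rests on a simple algebraic identity: because softmax(S)·V = (exp(S)·V) / rowsum(exp(S)), the division by the denominator can be moved past the value matmul, so denominator accumulation overlaps with the matrix product instead of blocking it. The NumPy sketch below illustrates only this identity; the function name, shapes, and the omission of the exponential approximation are illustrative assumptions, not the POEM/SASDynabLE hardware design itself.

```python
import numpy as np

def deferred_softmax_attention(scores, values):
    """Attention with normalization deferred past the value matmul.

    Mathematically, softmax(S) @ V == (exp(S) @ V) / rowsum(exp(S)),
    so the per-row division can be postponed until after the S·V
    product, letting denominator accumulation proceed in parallel with
    the matmul rather than stalling the pipeline.

    No maximum-value search/subtraction is performed; this assumes the
    scores are already bounded (e.g. by the saturation / linear
    approximation of the exponential described in the abstract).
    """
    e = np.exp(scores)                      # unnormalized exponentials
    numer = e @ values                      # S·V computed before normalizing
    denom = e.sum(axis=-1, keepdims=True)   # denominator accumulated alongside
    return numer / denom                    # single normalization at the end


# Sanity check against the conventional (max-subtracted) formulation.
rng = np.random.default_rng(0)
S = rng.normal(size=(4, 8))
V = rng.normal(size=(8, 16))
stable = np.exp(S - S.max(axis=-1, keepdims=True))
ref = (stable / stable.sum(axis=-1, keepdims=True)) @ V
assert np.allclose(deferred_softmax_attention(S, V), ref)
```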
Event Type
Work-in-Progress Poster
Time
Tuesday, June 25, 6:00pm - 7:00pm PDT
Location
Level 2 Lobby
Topics
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security