
FNM-Trans: Efficient FPGA-based Transformer Architecture with Full N:M Sparsity
Description
Transformer models have become popular in various AI applications due to their exceptional performance. However, this impressive performance comes with significant computing and memory costs, hindering the efficient deployment of Transformer-based applications. Many solutions focus on leveraging sparsity in the weight matrices and in attention computation. However, previous studies fail to exploit a unified sparse pattern to accelerate all three modules of the Transformer (QKV generation, attention computation, and FFN). In this paper, we propose FNM-Trans, an adaptable and efficient algorithm-hardware co-design aimed at optimizing all three modules of the Transformer by fully harnessing N:M sparsity. At the algorithm level, we fully explore the interplay of dynamic pruning with static pruning under high N:M sparsity. At the hardware level, we develop a dedicated hardware architecture featuring a custom computing engine and a softmax module, tailored to support varying levels of N:M sparsity. Experimental results show that our algorithm improves accuracy by 11.03% under 2:16 attention sparsity and 4:16 weight sparsity, compared to other methods. Additionally, FNM-Trans achieves speedups of 27.13× and 21.24× over an Intel i9-9900X CPU and an NVIDIA RTX 2080 Ti GPU, respectively, and outpaces existing FPGA-based Transformer accelerators by 1.88× to 36.51×.
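For readers unfamiliar with the N:M pattern referenced in the abstract, the sketch below illustrates plain magnitude-based N:M structured pruning (keep the N largest-magnitude weights in every contiguous group of M). This is only a minimal illustrative example, not the paper's actual algorithm, which combines dynamic and static pruning; the function name `prune_n_m` and the use of NumPy are assumptions for illustration.

```python
import numpy as np

def prune_n_m(weights: np.ndarray, n: int, m: int) -> np.ndarray:
    """Keep the n largest-magnitude entries in each contiguous group of m
    along the last dimension; zero the rest (N:M structured sparsity).
    Illustrative sketch only, not the FNM-Trans pruning algorithm."""
    rows, cols = weights.shape
    assert cols % m == 0, "column count must be divisible by the group size m"
    groups = weights.reshape(rows, cols // m, m)        # split each row into groups of m
    # Indices of the (m - n) smallest-magnitude entries in each group
    drop_idx = np.argsort(np.abs(groups), axis=-1)[..., : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop_idx, False, axis=-1)   # zero the smallest entries
    return (groups * mask).reshape(rows, cols)

# Example: 4:16 weight sparsity, as reported in the abstract
w = np.random.randn(8, 64).astype(np.float32)
w_sparse = prune_n_m(w, n=4, m=16)
print(np.count_nonzero(w_sparse) / w_sparse.size)       # -> 0.25 (4 of every 16 kept)
```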
Event Type
Research Manuscript
Time
Tuesday, June 25, 2:00pm - 2:15pm PDT
Location
3003, 3rd Floor
Topics
AI
Design
Keywords
AI/ML Architecture Design