BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Los_Angeles
X-LIC-LOCATION:America/Los_Angeles
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20240626T180034Z
LOCATION:3003\, 3rd Floor
DTSTART;TZID=America/Los_Angeles:20240625T140000
DTEND;TZID=America/Los_Angeles:20240625T141500
UID:dac_DAC 2024_sess158_RESEARCH1036@linklings.com
SUMMARY:FNM-Trans: Efficient FPGA-based Transformer Architecture with Full
  N:M Sparsity
DESCRIPTION:Research Manuscript\n\nManting Zhang, Jialin Cao, Kejia Shi, K
 eqing Zhao, Genhao Zhang, Jun Yu, and Kun Wang (Fudan University)\n\nTrans
 former models have become popular in various AI applications due to their 
 exceptional performance. However, their impressive performance comes with 
 significant computing and memory costs, hindering efficient deployment of 
 Transformer-based applications. Many solutions focus on leveraging
  sparsity in weight matrices and attention computation. However, previous
  studies fail to exploit a unified sparse pattern to accelerate all three
  modules of the Transformer (QKV generation, attention computation, FFN).
  In this paper, we propose FNM-Trans, an adaptable and efficient
  algorithm-hardware co-design
  aimed at optimizing all three modules of the Transformer by fully harness
 ing N:M sparsity. At the algorithm level, we fully explore the interplay
  of dynamic pruning with static pruning under high N:M sparsity. At the
  hardware level, we develop a dedicated hardware architecture featuring a 
 custom computing engine and a softmax module, tailored to support varying 
 levels of N:M sparsity. Experimental results show that our algorithm
  improves accuracy by 11.03% under 2:16 attention sparsity and 4:16
  weight sparsity compared to other methods. Additionally, FNM-Trans
  achieves speedups
  of 27.13× and 21.24× over Intel i9-9900X and NVIDIA RTX 2080 Ti, respecti
 vely, and outpaces current FPGA-based Transformers by 1.88× to 36.51×.\n\n
 Topic: AI, Design\n\nKeyword: AI/ML Architecture Design\n\nSession Chair: 
 Hyoukjun Kwon (University of California, Irvine)
END:VEVENT
END:VCALENDAR
