BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Los_Angeles
X-LIC-LOCATION:America/Los_Angeles
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20240626T180034Z
LOCATION:3003\, 3rd Floor
DTSTART;TZID=America/Los_Angeles:20240625T134500
DTEND;TZID=America/Los_Angeles:20240625T140000
UID:dac_DAC 2024_sess158_RESEARCH1395@linklings.com
SUMMARY:FLAME: Fully Leveraging MoE Sparsity for Transformer on FPGA
DESCRIPTION:Research Manuscript\n\nXuanda Lin\, Huinan Tian\,
  Wenxiao Xue\, Lanqi Ma\, Jialin Cao\, Manting Zhang\, Jun Yu\, and Kun
  Wang (Fudan University
 )\n\nThe MoE (Mixture-of-Experts) mechanism has been widely adopted in
  transformer-based models to facilitate further expansion of model
  parameter size and enhance generalization capabilities. However\, the
  practical deployment
  of the MoE mechanism for transformers on resource-constrained
  platforms\, such as FPGAs\, remains challenging due to the heavy memory
  footprint and impractical runtime costs that the mechanism introduces.
  Examining the MoE mechanism\, we make two key observations: (1) Expert
  weights are heavy but cold\, making it ideal to leverage expert weight
  sparsity. (2) MoE layers in transformer-based models exhibit highly
  skewed expert activation paths\, making expert prediction and
  prefetching feasible. Motivated by these two observations\, we propose
  FLAME\, the first algorithm-hardware co-optimized MoE acceleration
  framework designed to fully leverage MoE sparsity for efficient
  transformer deployment on FPGA. First\, to leverage expert weight
  sparsity\, we integrate an N:M pruning algorithm\, allowing
  for the pruning of expert weights without significantly compromising
  model accuracy. Second\, to address expert activation sparsity\, we
  propose a circular expert prediction (CEPR) strategy. CEPR prefetches
  expert weights from external storage to the on-chip cache before the
  activated expert index is determined. Last\, we co-optimize both forms
  of MoE sparsity by introducing an efficient pruning-aware expert
  buffering (PA-BUF) mechanism. Experimental results demonstrate that
  FLAME achieves 84.4% expert prediction accuracy with only two on-chip
  expert caches. Compared with CPU and GPU\, FLAME achieves 4.12× and
  1.49× speedup\, respectively.\n\nTopic: AI\,
  Design\n\nKeyword: AI/ML Architecture Design\n\nSession Chair: Hyoukjun
  Kwon (University of California\, Irvine)
END:VEVENT
END:VCALENDAR
