BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Los_Angeles
X-LIC-LOCATION:America/Los_Angeles
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20240626T180034Z
LOCATION:Level 2 Lobby
DTSTART;TZID=America/Los_Angeles:20240626T180000
DTEND;TZID=America/Los_Angeles:20240626T190000
UID:dac_DAC 2024_sess237_RESEARCH832@linklings.com
SUMMARY:DTrans: A Dataflow-Transformation FPGA Accelerator with
  Nonlinear-Operator Fusion for Generative Models
DESCRIPTION:Work-in-Progress Poster\n\nXuanzheng Wang, Peng Qu, and
  Youhui Zhang (Tsinghua University)\n\nFPGA-based accelerators have
  emerged as an effective solution for GPT inference, given their
  inherent flexibility and capacity for domain-specific customization.
  Despite their potential, two primary challenges have impeded their
  efficient use: the disparate compute-to-memory access ratios of GPT's
  encoding and generation stages, and the rapid growth in hardware
  resource demands for nonlinear operations as text lengths and
  embedding dimensions increase.\n\nTo overcome these obstacles, we
  introduce DTrans, an FPGA accelerator for GPT built on dataflow
  transformation and nonlinear-operator fusion. DTrans takes a
  two-pronged approach: a two-stage dataflow transformation that
  matches the distinct computational and memory-access needs of GPT's
  stages, and a sequence-length decoupling method for nonlinear
  operators. This approach overlaps the computational delays of
  operations such as Softmax and layer normalization with matrix
  operations in tasks involving long sequences. Furthermore, DTrans
  uses a two-level alternating input pipeline that efficiently manages
  GPT's computing flow, including residual connections and variable
  inter-layer delays.\n\nOur comparative analyses show that DTrans
  outperforms a GPU (V100) in throughput and energy efficiency,
  achieving improvements of 11.99x and 11.7x, respectively. Compared
  with state-of-the-art GPT inference accelerators, DTrans delivers
  improvements of more than 5.64x and 5.22x in these metrics.\n\nTopic:
  AI, Autonomous Systems, Cloud, Design, EDA, Embedded Systems, IP,
  Security
END:VEVENT
END:VCALENDAR
