
DTrans: A Dataflow-transformation FPGA Accelerator with Nonlinear-operators fusion aiming for the Generative Model
Description
FPGA-based accelerators have emerged as an effective solution for GPT inference, given their inherent flexibility and capacity for domain-specific customization. Despite their potential, two primary challenges have impeded their efficient use: the disparate compute-to-memory-access ratios of GPT's encoding and generation stages, and the rapid increase in hardware resource demands for nonlinear operations due to longer text lengths and larger embedding dimensions.

To overcome these obstacles, we introduce DTrans, an FPGA accelerator tailored for GPT that is based on dataflow transformation and features nonlinear-operator fusion. DTrans takes a two-pronged approach: a two-stage dataflow transformation that matches the distinct computation and memory-access needs of GPT's two stages, and a sequence-length decoupling method for nonlinear operators. The latter allows the latency of operations such as Softmax and layer normalization to be overlapped with matrix operations when processing long sequences. Furthermore, DTrans uses a two-level alternating input pipeline that efficiently manages GPT's computation flow, including residual connections and variable inter-layer delays.
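The abstract does not spell out how the sequence-length decoupling is realized in hardware. As a rough software analogue only (an assumption, not DTrans's actual design), the Python sketch below uses an online, tile-wise Softmax: a running maximum and sum are updated as score tiles stream out of the matrix unit, so the nonlinear work can proceed in parallel with the matmul instead of waiting for a fully buffered sequence-length row. The function name online_softmax and the tiling choices are illustrative.

```python
import numpy as np

def online_softmax(score_tiles):
    """Streaming softmax over tiles of one attention-score row.

    Each tile is consumed as soon as it is produced, so the nonlinear
    work can overlap with the QK^T matmul instead of waiting for the
    full sequence-length row to be buffered.
    """
    running_max = -np.inf   # running maximum, for numerical stability
    running_sum = 0.0       # sum of exponentials, rescaled to running_max
    partial = []            # (exp tile, max it was computed against)

    for tile in score_tiles:
        new_max = max(running_max, float(tile.max()))
        # Rescale the sum accumulated so far to the new maximum.
        running_sum *= np.exp(running_max - new_max)
        exp_tile = np.exp(tile - new_max)
        running_sum += exp_tile.sum()
        partial.append((exp_tile, new_max))
        running_max = new_max

    # Final correction: bring every tile to the global max and normalize.
    return np.concatenate(
        [t * np.exp(m - running_max) for t, m in partial]
    ) / running_sum

# Example: a 1024-token score row processed as 8 tiles of 128 scores each.
row = np.random.randn(1024)
reference = np.exp(row - row.max()) / np.exp(row - row.max()).sum()
assert np.allclose(online_softmax(np.split(row, 8)), reference)
```

The same rescale-and-accumulate pattern applies to layer normalization, whose mean and variance can likewise be maintained as running statistics over streamed tiles.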

Our comparative analyses reveal that DTrans outperforms an NVIDIA V100 GPU in throughput and energy efficiency, achieving improvements of 11.99x and 11.7x, respectively. Compared with state-of-the-art GPT inference accelerators, DTrans delivers improvements of more than 5.64x and 5.22x in these metrics.
Event Type
Work-in-Progress Poster
Time
Wednesday, June 26, 5:00pm - 6:00pm PDT
Location
Level 2 Lobby
Topics
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security