Session

Research Manuscript: Efficient Acceleration Strategies for Transformers: From Token Similarity to Weight Sparsity
Description
Recent advancements in transformer models have led to performance improvements in language modeling and vision tasks. Transformers are equipped with an attention mechanism that extracts useful dependency information between input tokens. Due to the sequential nature of processing, running a transformer is bounded by off-chip memory bandwidth. For vision transformers, the feedforward network that follows the attention module incurs further significant runtime overhead. In this session, several unique approaches and their associated hardware architectures are discussed, including proactively skipping computations for tokens with low probability, leveraging token similarities, bit-slice compression techniques, and exploiting sparsity in transformers.
Event Type
Research Manuscript
Time
Tuesday, June 25, 1:30pm - 3:00pm PDT
Location
3003, 3rd Floor
Topics
AI
Design
Keywords
AI/ML Architecture Design