Session
Efficient Acceleration Strategies for Transformers: From Token Similarity to Weight Sparsity
Session Chair
Description: Recent advancements in transformer models have driven performance improvements in language modeling and vision tasks. Transformers are equipped with an attention mechanism that extracts useful dependency information between input tokens. Because of their sequential processing, transformer inference is bounded by off-chip memory bandwidth. In vision transformers, the feedforward network that follows the attention module incurs further significant runtime overhead. This session discusses several distinctive approaches and their associated hardware architectures, including proactively skipping computation for low-probability tokens, leveraging token similarity, a bit-slice compression technique, and exploiting sparsity in transformers.
Event Type: Research Manuscript
Time: Tuesday, June 25, 1:30pm - 3:00pm PDT
Location: 3003, 3rd Floor
AI
Design
AI/ML Architecture Design
Presentations