
Token-Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation
Description
The attention mechanism in text generation is memory-bound due to its sequential characteristics. Therefore, off-chip memory accesses should be minimized for faster execution. Although previous methods addressed this by pruning unimportant tokens, they fall short of selectively removing tokens with near-zero attention probabilities in each instance. Our method estimates the probability before the softmax function, effectively removing low-probability tokens and achieving a 12.1x pruning ratio without fine-tuning. Additionally, we present a hardware design supporting seamless on-demand off-chip access. Our approach reduces memory accesses by 2.6x, leading to an average 2.3x speedup and a 2.4x improvement in energy efficiency.
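As a rough illustration of the pruning idea only (not the paper's actual estimator or hardware design), the sketch below shows how pre-softmax scores can bound each token's attention probability so that near-zero tokens are skipped before their values are fetched; the threshold value and function names are hypothetical.

```python
import numpy as np

def prune_tokens_by_estimated_prob(q, K, threshold=1e-3):
    """Sketch of pre-softmax token pruning for one attention head.

    q: query vector, shape (d,)
    K: key matrix of cached tokens, shape (n, d)
    threshold: hypothetical cutoff on the estimated softmax probability.
    Returns indices of tokens kept for the full attention computation.
    """
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)            # pre-softmax attention scores
    # exp(score - max_score) upper-bounds the softmax probability,
    # since the softmax denominator is at least exp(max_score).
    upper_bound = np.exp(scores - scores.max())
    keep = np.nonzero(upper_bound >= threshold)[0]
    return keep

# Usage sketch: compute attention only over the surviving tokens.
q = np.random.randn(64)
K = np.random.randn(1024, 64)
kept = prune_tokens_by_estimated_prob(q, K)
```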
Event Type
Research Manuscript
Time
Tuesday, June 25, 2:45pm - 3:00pm PDT
Location
3003, 3rd Floor
Topics
AI
Design
Keywords
AI/ML Architecture Design