BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Los_Angeles
X-LIC-LOCATION:America/Los_Angeles
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20240626T180034Z
LOCATION:3003\, 3rd Floor
DTSTART;TZID=America/Los_Angeles:20240625T144500
DTEND;TZID=America/Los_Angeles:20240625T150000
UID:dac_DAC 2024_sess158_RESEARCH1279@linklings.com
SUMMARY:Token-Picker: Accelerating Attention in Text Generation with Minim
 ized Memory Transfer via Probability Estimation*
DESCRIPTION:Research Manuscript\n\nJunyoung Park, Myeonggu Kang, Yunki Han
 , Yang-Gon Kim, Jaekang Shin, and Lee-Sup Kim (Korea Advanced Institute of
  Science and Technology (KAIST))\n\nThe attention mechanism in text
  generation is memory-bound due to its sequential characteristics.
  Therefore, off-chip memory accesses should be minimized for faster
  execution. Although previous methods addressed this by pruning
  unimportant tokens, they fall short in selectively removing tokens
  with near-zero attention probabilities in each instance. Our method
  estimates the probability before the softmax function, effectively
  removing low-probability tokens and achieving a 12.1x pruning ratio
  without fine-tuning. Additionally, we present a hardware design
  supporting seamless on-demand off-chip access. Our approach shows
  2.6x fewer memory accesses, leading to an average 2.3x speedup and
  2.4x higher energy efficiency.\n\nTopic: AI, Design\n\nKeyword: AI/ML
  Architecture Design\n\nSession Chair: Hyoukjun Kwon (University of
  California, Irvine)
END:VEVENT
END:VCALENDAR
