BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Los_Angeles
X-LIC-LOCATION:America/Los_Angeles
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20240626T180035Z
LOCATION:3003\, 3rd Floor
DTSTART;TZID=America/Los_Angeles:20240626T110000
DTEND;TZID=America/Los_Angeles:20240626T111500
UID:dac_DAC 2024_sess118_RESEARCH1260@linklings.com
SUMMARY:MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models
DESCRIPTION:Research Manuscript\n\nTaehyun Kim, Kwanseok Choi, Youngmock C
 ho, Jaehoon Cho, Hyuk-Jae Lee, and Jaewoong Sim (Seoul National University
 )\n\nMixture-of-Experts (MoE) large language models (LLMs) have memory req
 uirements that often exceed the GPU memory capacity, requiring costly para
 meter movement from secondary memories to the GPU for expert computation
 . In this work, we present Mixture of Near-Data Experts (MoNDE), a near-d
 ata computing solution that efficiently enables MoE LLM inference. MoNDE r
 educes the volume of MoE parameter movement by transferring only the hot e
 xperts to the GPU, while computing the remaining cold experts inside the h
 ost memory device. By replacing the transfers of massive expert parameter
 s with those of small activations, MoNDE enables far more communication-e
 fficient MoE inference, thereby resulting in substantial speedups over exi
 sting parameter-offloading frameworks for both encoder and decoder operat
 ions.\n\nTopic: Design\n\nKeyword: In-memory and Near-memory Computing Arc
 hitectures, Applications and Systems\n\nSession Chairs: Seokhyeong Kang (P
 ohang University of Science and Technology (POSTECH)) and Giacomo Pedretti
  (Hewlett Packard Enterprise)
END:VEVENT
END:VCALENDAR
