BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Los_Angeles
X-LIC-LOCATION:America/Los_Angeles
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20240626T180034Z
LOCATION:3012\, 3rd Floor
DTSTART;TZID=America/Los_Angeles:20240625T170000
DTEND;TZID=America/Los_Angeles:20240625T171500
UID:dac_DAC 2024_sess112_RESEARCH1147@linklings.com
SUMMARY:Planaria: Full Pattern Directed Heterogeneous Hardware Prefetcher 
 with Efficient Bypass
DESCRIPTION:Research Manuscript\n\nYuhang Liu and Mingyu Chen (Institute o
 f Computing Technology, Chinese Academy of Sciences)\n\nDue to the memory 
 wall, memory system performance significantly impacts the user experience 
 of mobile phones. The system cache (SC) is located on the memory side, is 
 shared by all the central processing units (CPUs) and graphics processing 
 units (GPUs) within the mobile phone, and is the last line of defense 
 before resorting to time-consuming off-chip memory accesses. However, SC 
 is challenging to manage because of its large memory-side working set and 
 irregular access patterns. Although SC takes up a considerable on-chip 
 area, the 
 effectiveness of SC in terms of hit rate is rather low. It is observed 
 that neither state-of-the-art cache replacement policies nor a larger 
 cache size significantly benefits SC. The prefetchers designed for 
 higher-level caches cannot be used by SC because the required program 
 counter (PC) is not available on the memory side and/or the aggressive 
 prefetch traffic violates the stringent power constraints of mobile 
 phones. In 
 this study, we propose Planaria, which includes two sub-prefetchers (SLP a
 nd TLP) and a coordinator (POC) to simultaneously achieve high accuracy an
 d coverage of prefetching. The two sub-prefetchers exploit the intra- and 
 inter-page regularities via self and transfer learning, respectively. The 
 coordinator POC explicitly decouples the learning and issuing phases of 
 the sub-prefetchers. The sub-prefetchers are directed by the full 
 pattern but are enabled in an irreversible order. This working mode of 
 "parallel training and serial issuing" effectively increases useful 
 prefetches and 
 reduces useless prefetches. Experimental results show that Planaria 
 improves overall system performance in terms of instructions per cycle 
 (IPC) by 28.9%, 21.9%, and 15.3% on average over no prefetcher, BOP, and 
 SPP, respectively. Moreover, Planaria incurs only 0.5% power consumption 
 overhead, while BOP and SPP increase power consumption by 13.5% and 
 9.7%, respectively.\n\nTopic: Design\n\nKeyword: SoC, Heterogeneous, and R
 econfigurable Architectures\n\nSession Chairs: Dimitrios Soudris (National
  Technical University of Athens) and George Tzimpragos (University of Mich
 igan)
END:VEVENT
END:VCALENDAR
