BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Los_Angeles
X-LIC-LOCATION:America/Los_Angeles
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20240626T180034Z
LOCATION:3001\, 3rd Floor
DTSTART;TZID=America/Los_Angeles:20240626T161500
DTEND;TZID=America/Los_Angeles:20240626T163000
UID:dac_DAC 2024_sess161_RESEARCH1186@linklings.com
SUMMARY:Control Flow Divergence Optimization by Exploiting Tensor Cores
DESCRIPTION:Research Manuscript\n\nWeiguang Pang (Qilu University of Techn
 ology)\, Xu Jiang (University of Electronic Science and Technology of Chin
 a)\, Songran Liu (Northeastern University)\, Lei Qiao (Beijing Institu
 te of Control Engineering)\, Kexue Fu and Longxiang Gao (Qilu Univers
 ity of Technology)\, and Wang Yi (Uppsala University)\n\nKernels are sche
 duled on Graphi
 cs Processing Units (GPUs) at the granularity of warps\, groups of concur
 rently executing threads. When executing kernels with conditional branch
 es\, threads within a warp may execute different branches sequential
 ly\, resulting in considerable utilization loss and unpredictable execu
 tion time\, a problem known as control flow divergence. This paper propo
 ses a novel method to predict threads' execution paths before kernel laun
 ch by deploying a branch prediction network on the GPU's tensor cor
 es\, which can run in parallel with CUDA cores. Combined with a well-desi
 gned thread data reorganization algorithm\, this solution can mitiga
 te GPUs' control flow divergence problem.\n\nTopic: Embedded Syst
 ems\n\nKeyword: Embedded Software
END:VEVENT
END:VCALENDAR
