Presentation

MENDNet: Just-in-time Fault Detection and Mitigation in AI Systems with Uncertainty Quantification and Multi-Exit Networks
Description
Due to rapid technology scaling in recent years, computation units in AI systems have become highly susceptible to hardware malfunctions. When such malfunctions manifest in the accelerator memory, they alter the pre-trained Deep Neural Network weight parameters, generating faults that in turn reduce inference classification accuracy. To improve the reliability of the AI system, these faults need to be detected and mitigated with a just-in-time strategy. Existing fault detection and mitigation techniques are not ideal for just-in-time incorporation, as they either prevent continuous execution or add significant latency overhead. To circumvent this issue, this paper explores uncertainty quantification in deep neural networks as a means of facilitating an efficient and novel fault detection approach in AI systems. Furthermore, to mitigate the impact of such faults, we propose MENDNet, which leverages the properties of multi-exit neural networks, coupled with the proposed uncertainty quantification framework. By tuning the confidence threshold for inference at each exit and leveraging the energy-based uncertainty quantification metric, MENDNet can make accurate predictions even in the presence of faults in the computation units. When evaluated on state-of-the-art network-dataset configurations and with multiple combinations of fault rate and fault position, our proposed approach yields an improvement of up to 80.42% in inference classification accuracy over a traditional DNN implementation, thereby improving the reliability of the AI accelerator in mission mode.
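The abstract does not spell out the energy-based metric or the exit-selection rule, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes the commonly used free-energy score E(x) = -T * logsumexp(logits / T), where lower energy indicates a more confident prediction, and it assumes a simple policy that accepts the first exit whose energy falls at or below a per-exit threshold. The names `energy_score` and `predict_with_exits`, the threshold values, and the fallback rule are all hypothetical.

```python
import math


def energy_score(logits, temperature=1.0):
    """Free-energy score E(x) = -T * logsumexp(logits / T).

    Assumption: lower energy means a more confident prediction.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * lse


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


def predict_with_exits(exit_logits, thresholds):
    """Pick a prediction from a multi-exit network (hypothetical policy).

    exit_logits: one logit vector per exit, ordered from earliest to latest.
    thresholds:  one energy threshold per exit (tuned offline in practice).

    Returns (exit_index, predicted_class). The first exit whose energy is at
    or below its threshold is accepted. If no exit qualifies, the exit with
    the lowest energy is used as a fallback (an assumption of this sketch).
    """
    energies = [energy_score(logits) for logits in exit_logits]
    for i, (energy, threshold) in enumerate(zip(energies, thresholds)):
        if energy <= threshold:
            return i, argmax(exit_logits[i])
    best = argmax([-e for e in energies])  # lowest energy = most confident
    return best, argmax(exit_logits[best])
```

In this sketch, an exit with nearly flat logits has high energy and is skipped, while an exit with one dominant logit has low energy and is accepted. How such a score relates to fault presence in a given layer is precisely what the paper investigates and is not modeled here.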
Event Type
Research Manuscript
Time
Tuesday, June 25, 10:45am - 11:00am PDT
Location
3010, 3rd Floor
Topics
EDA
Keywords
Test, Validation and Silicon Lifecycle Management