A software scheduling solution to avoid corrupted units on GPUs

David Defour 1, Eric Petit 2
1 DALI - Digits, Architectures et Logiciels Informatiques
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier, UPVD - Université de Perpignan Via Domitia
Abstract: Massively parallel processors deliver high computing performance by increasing the number of concurrent execution units. At the same time, transistor technology is evolving toward higher density, higher frequency, and lower voltage. Together, these factors significantly increase the probability of hardware failures. In this paper, we present a methodology to locate and mitigate hardware failures on NVidia GPUs. Results show that intermittent errors can be precisely localized and that their impact is limited to a well-defined architecture tile. We therefore propose, and demonstrate on a software prototype, a rescheduling strategy that quarantines the defective hardware and ensures correct execution. Our approach significantly improves the fault tolerance and lifespan of GPUs at a reasonable overhead.
Document type:
Journal article
Journal of Parallel and Distributed Computing, Elsevier, 2016, 90-91, pp. 1-8. 〈10.1016/j.jpdc.2016.01.001〉

https://hal-lirmm.ccsd.cnrs.fr/lirmm-01267742
Contributor: David Defour
Submitted on: Thursday, February 4, 2016 - 19:51:27
Last modified on: Tuesday, October 10, 2017 - 10:29:20

Citation

David Defour, Eric Petit. A software scheduling solution to avoid corrupted units on GPUs. Journal of Parallel and Distributed Computing, Elsevier, 2016, 90-91, pp. 1-8. 〈10.1016/j.jpdc.2016.01.001〉. 〈lirmm-01267742〉
