To cite this document: http://ddd.uab.cat/record/157786
Fault tolerance at system level based on RADIC architecture
Castro León, Marcela (Universitat Autònoma de Barcelona. Departament d'Arquitectura de Computadors i Sistemes Operatius)
Meyer, Hugo Daniel (Universitat Autònoma de Barcelona. Departament d'Arquitectura de Computadors i Sistemes Operatius)
Rexachs del Rosario, Dolores Isabel (Universitat Autònoma de Barcelona. Departament d'Arquitectura de Computadors i Sistemes Operatius)
Luque, Emilio (Universitat Autònoma de Barcelona. Departament d'Arquitectura de Computadors i Sistemes Operatius)

Date: 2015
Abstract: The increasing failure rate in High Performance Computing encourages the investigation of fault tolerance mechanisms to guarantee the execution of an application in spite of node faults. This paper presents an automatic and scalable fault-tolerant model designed to be transparent to applications and to message passing libraries. The model consists of detecting failures in the communication socket caused by a faulty node. In those cases, the affected processes are recovered on a healthy node and the connections are reestablished without losing data. The Redundant Array of Distributed Independent Controllers (RADIC) architecture proposes a decentralized model for all the tasks required in a fault tolerance system: protection, detection, recovery and masking. Decentralized algorithms allow the application to scale, which is a key property for current HPC systems. Three different rollback recovery protocols are defined and discussed with the aim of offering alternatives to reduce overhead when multicore systems are used. A prototype has been implemented to carry out an exhaustive experimental evaluation using the Master/Worker and Single Program Multiple Data execution models. Multiple workloads and an increasing number of processes have been taken into account to compare the above-mentioned protocols. The executions take place on two multicore Linux clusters with different socket communication libraries.
Note: Grant agreement number MINECO/TIN2011-24384
Note: Grant agreement number MINETUR/TSI-020400-2010-120
Rights: This document is subject to a Creative Commons license. Total or partial reproduction and public communication of the work are permitted, provided that it is not for commercial purposes and that the authorship of the original work is acknowledged. The creation of derivative works is not permitted.
Language: English
Document: article ; research ; publishedVersion
Subject: Software fault tolerance ; Resilience ; RADIC ; Message passing ; Semi-coordinated checkpoint ; Uncoordinated checkpoint ; Socket
Published in: Journal of parallel and distributed computing, Vol. 86 (Dec. 2015), p. 98-111, ISSN 0743-7315

DOI: 10.1016/j.jpdc.2015.08.005


14 p, 2.1 MB


Record created 2016-06-08, last modified 2016-06-20


