Prediction of energy consumption by checkpoint/restart in HPC
Morán, Marina (Universidad Nacional del Comahue. Facultad de Informática)
Balladini, Javier (Universidad Nacional del Comahue. Facultad de Informática)
Rexachs del Rosario, Dolores Isabel (Universitat Autònoma de Barcelona. Departament d'Arquitectura de Computadors i Sistemes Operatius)
Luque, Emilio (Universitat Autònoma de Barcelona. Departament d'Arquitectura de Computadors i Sistemes Operatius)

Date: 2019
Abstract: The fault tolerance method most used today in high-performance computing (HPC) is coordinated checkpointing. This, like any other fault tolerance method, adds additional energy consumption to that of the execution of the application. Currently, knowing and minimizing this energy consumption is a challenge. The objective of this paper is to propose a model to estimate the energy consumption of checkpoint and restart operations and a method for its construction. These estimates allow the evaluation of different scenarios in order to minimize energy consumption. We focus on coordinated checkpoint/restart at the system level, in single-program multiple-data (SPMD) applications, on homogeneous clusters. We study the behavior of the power dissipated by the compute node during a checkpoint/restart operation, as well as its execution time, considering different parameters of the system and the application. The experimentation carried out on two platforms shows the validity of the proposal. We also evaluate the impact on power and energy consumption of the processor's C states, the configuration of the network file system (NFS), where the checkpoint files are stored, and the compression of the checkpoint files. This paper contributes to the objective of predicting energy consumption in the execution of applications that use checkpoint/restart. Not counting the outliers, we can estimate the energy consumed by checkpoint/restart operations with errors lower than 7.5%.
Funding: Ministerio de Economía y Competitividad TIN2017-84875-P
Rights: This document is subject to a Creative Commons license. Total or partial reproduction, distribution, public communication of the work, and the creation of derivative works are permitted, including for commercial purposes, provided that the authorship of the original work is acknowledged.
Language: English
Document: Article ; research ; Published version
Subject: Checkpointing ; Energy consumption ; Fault tolerance ; High performance computing
Published in: IEEE Access, Vol. 7 (2019), p. 71791-71803, ISSN 2169-3536

DOI: 10.1109/ACCESS.2019.2919970


13 p, 7.6 MB

The record appears in the collections:
Research documents > Documents of UAB research groups > Research centres and groups (scientific output) > Engineering > HPC4EAS (High Performance Computing for Efficient Applications and Simulation Research Group)
Articles > Research articles
Articles > Published articles

Record created 2019-11-18, last modified 2021-09-26


