Web of Science: 3 citations, Scopus: 5 citations
Prediction of energy consumption by checkpoint/restart in HPC
Morán, Marina (Universidad Nacional del Comahue. Facultad de Informática)
Balladini, Javier (Universidad Nacional del Comahue. Facultad de Informática)
Rexachs del Rosario, Dolores Isabel (Universitat Autònoma de Barcelona. Departament d'Arquitectura de Computadors i Sistemes Operatius)
Luque, Emilio (Universitat Autònoma de Barcelona. Departament d'Arquitectura de Computadors i Sistemes Operatius)

Date: 2019
Abstract: The fault tolerance method most used today in high-performance computing (HPC) is coordinated checkpointing. This, like any other fault tolerance method, adds additional energy consumption to that of the execution of the application. Currently, knowing and minimizing this energy consumption is a challenge. The objective of this paper is to propose a model to estimate the energy consumption of checkpoint and restart operations and a method for its construction. These estimates allow the evaluation of different scenarios in order to minimize energy consumption. We focus on coordinated checkpoint/restart at the system level, in single-program multiple-data (SPMD) applications, on homogeneous clusters. We study the behavior of the power dissipated by the compute node during a checkpoint/restart operation, as well as its execution time, considering different parameters of the system and the application. The experimentation carried out on two platforms shows the validity of the proposal. We also evaluate the impact on power and energy consumption of the processor's C states, the configuration of the network file system (NFS), where the checkpoint files are stored, and the compression of the checkpoint files. This paper contributes to the objective of predicting energy consumption in the execution of applications that use checkpoint/restart. Not counting the outliers, we can estimate the energy consumed by checkpoint/restart operations with errors lower than 7.5%.
Grants: Ministerio de Economía y Competitividad TIN2017-84875-P
Rights: This document is subject to a Creative Commons license. Total or partial reproduction, distribution, public communication of the work, and the creation of derivative works are permitted, even for commercial purposes, provided that the authorship of the original work is acknowledged.
Language: English
Document: Article ; research ; Published version
Subject: Checkpointing ; Energy consumption ; Fault tolerance ; High performance computing
Published in: IEEE Access, Vol. 7 (2019), p. 71791-71803, ISSN 2169-3536

DOI: 10.1109/ACCESS.2019.2919970


13 p, 7.6 MB

The record appears in these collections:
Research literature > UAB research groups literature > Research Centres and Groups (research output) > Engineering > HPC4EAS (High Performance Computing for Efficient Applications and Simulation Research Group)
Articles > Research articles
Articles > Published articles

 Record created 2019-11-18, last modified 2022-07-06


