Web of Science: 3 citations, Scopus: 5 citations, Google Scholar: citations
Prediction of energy consumption by checkpoint/restart in HPC
Morán, Marina (Universidad Nacional del Comahue. Facultad de Informática)
Balladini, Javier (Universidad Nacional del Comahue. Facultad de Informática)
Rexachs del Rosario, Dolores Isabel (Universitat Autònoma de Barcelona. Departament d'Arquitectura de Computadors i Sistemes Operatius)
Luque, Emilio (Universitat Autònoma de Barcelona. Departament d'Arquitectura de Computadors i Sistemes Operatius)

Date: 2019
Abstract: The fault tolerance method most used today in high-performance computing (HPC) is coordinated checkpointing. This, like any other fault tolerance method, adds energy consumption to that of the execution of the application. Currently, knowing and minimizing this energy consumption is a challenge. The objective of this paper is to propose a model to estimate the energy consumption of checkpoint and restart operations and a method for its construction. These estimates allow the evaluation of different scenarios in order to minimize energy consumption. We focus on coordinated checkpoint/restart at the system level, in single-program multiple-data (SPMD) applications, on homogeneous clusters. We study the behavior of the power dissipated by the compute node during a checkpoint/restart operation, as well as its execution time, considering different parameters of the system and the application. The experimentation carried out on two platforms shows the validity of the proposal. We also evaluate the impact on power and energy consumption of the processor's C states, the configuration of the network file system (NFS), where the checkpoint files are stored, and the compression of the checkpoint files. This paper contributes to the objective of predicting energy consumption in the execution of applications that use checkpoint/restart. Not counting the outliers, we can estimate the energy consumed by checkpoint/restart operations with errors lower than 7.5%.
Grants: Ministerio de Economía y Competitividad TIN2017-84875-P
Rights: This document is subject to a Creative Commons license. Total or partial reproduction, distribution, public communication of the work, and the creation of derivative works are permitted, including for commercial purposes, provided that the authorship of the original work is acknowledged.
Language: English
Document: Article ; research ; Published version
Subject: Checkpointing ; Energy consumption ; Fault tolerance ; High performance computing
Published in: IEEE Access, Vol. 7 (2019), p. 71791-71803, ISSN 2169-3536

DOI: 10.1109/ACCESS.2019.2919970


13 p, 7.6 MB

This record appears in the collections:
Research documents > Documents of the UAB research groups > Research centers and groups (scientific output) > Engineering > HPC4EAS (High Performance Computing for Efficient Applications and Simulation Research Group)
Articles > Research articles
Articles > Published articles

Record created 2019-11-18, last modified 2022-07-06


