Web of Science: 2 citas, Scopus: 1 citas, Google Scholar: citas
A model of checkpoint behavior for applications that have I/O
León, Betzabeth (Universitat Autònoma de Barcelona. Departament d'Arquitectura de Computadors i Sistemes Operatius)
Méndez, Sandra (Universitat Autònoma de Barcelona. Departament d'Arquitectura de Computadors i Sistemes Operatius)
Franco Puntes, Daniel (Universitat Autònoma de Barcelona. Departament d'Arquitectura de Computadors i Sistemes Operatius)
Rexachs del Rosario, Dolores Isabel (Universitat Autònoma de Barcelona. Departament d'Arquitectura de Computadors i Sistemes Operatius)
Luque, Emilio (Universitat Autònoma de Barcelona. Departament d'Arquitectura de Computadors i Sistemes Operatius)

Fecha: 2022
Resumen: Due to the increase and complexity of computer systems, reducing the overhead of fault tolerance techniques has become important in recent years. One technique in fault tolerance is checkpointing, which saves a snapshot with the information that has been computed up to a specific moment, suspending the execution of the application, consuming I/O resources and network bandwidth. Characterizing the files that are generated when performing the checkpoint of a parallel application is useful to determine the resources consumed and their impact on the I/O system. It is also important to characterize the application that performs checkpoints, and one of these characteristics is whether the application does I/O. In this paper, we present a model of checkpoint behavior for parallel applications that performs I/O; this depends on the application and on other factors such as the number of processes, the mapping of processes and the type of I/O used. These characteristics will also influence scalability, the resources consumed and their impact on the IO system. Our model describes the behavior of the checkpoint size based on the characteristics of the system and the type (or model) of I/O used, such as the number I/O aggregator processes, the buffering size utilized by the two-phase I/O optimization technique and components of collective file I/O operations. The BT benchmark and FLASH I/O are analyzed under different configurations of aggregator processes and buffer size to explain our approach. The model can be useful when selecting what type of checkpoint configuration is more appropriate according to the applications' characteristics and resources available. Thus, the user will be able to know how much storage space the checkpoint consumes and how much the application consumes, in order to establish policies that help improve the distribution of resources.
Ayudas: Agencia Estatal de Investigación PID2020-112496GB-I00
Nota: Altres ajuts: acords transformatius de la UAB
Derechos: Aquest document està subjecte a una llicència d'ús Creative Commons. Es permet la reproducció total o parcial, la distribució, la comunicació pública de l'obra i la creació d'obres derivades, fins i tot amb finalitats comercials, sempre i quan es reconegui l'autoria de l'obra original. Creative Commons
Lengua: Anglès
Documento: Article ; recerca ; Versió publicada
Materia: Checkpoint ; I/O applications ; Storage ; HPC ; Fault tolerance
Publicado en: The journal of supercomputing, (April 2022) , ISSN 1573-0484

DOI: 10.1007/s11227-022-04482-8


33 p, 3.2 MB

El registro aparece en las colecciones:
Artículos > Artículos de investigación
Artículos > Artículos publicados

 Registro creado el 2022-04-29, última modificación el 2024-02-05



   Favorit i Compartir