Deep learning data handling : exploring file formats and access strategies
Parraga Pinzon, Edixon Alexander (Universitat Autònoma de Barcelona. Departament d'Arquitectura de Computadors i Sistemes Operatius)
León, Betzabeth (Universitat Autònoma de Barcelona. Departament d'Arquitectura de Computadors i Sistemes Operatius)
Mendez, Sandra (Barcelona Supercomputing Center)
Rexachs del Rosario, Dolores Isabel (Universitat Autònoma de Barcelona. Departament d'Arquitectura de Computadors i Sistemes Operatius)
Franco Puntes, Daniel (Universitat Autònoma de Barcelona. Departament d'Arquitectura de Computadors i Sistemes Operatius)
Luque, Emilio (Universitat Autònoma de Barcelona. Departament d'Arquitectura de Computadors i Sistemes Operatius)

Date: 2025
Abstract: Accessing large volumes of data presents a significant challenge when finding the best strategies to manage the data efficiently. Deep learning applications require the processing of massive amounts of data, which implies a considerable access Input/Output (I/O) load on computer systems. During training, interaction with the I/O system intensifies as files are continuously accessed to read data sets. This persistent access could overload the file system, which, in turn, adversely impacts application performance and efficient storage system utilization. Several factors influence the I/O of these applications, and one of the most relevant is the variety of file formats in which datasets can be stored. The choice of file format depends on the use case, as each format defines how information is stored. Some file formats have features that promote efficient access to datasets during the training phase, which can improve the performance of deep learning applications. Likewise, it is also important that the format adapts to the context, in this case, to an HPC system with a parallel file system. We will propose an image preprocessing method for cases where performance improves with parallel file access. This method will transform image data sets from their original JPEG format to the more efficient HDF5 format. Thus, our research will focus on the importance of understanding the mode of data access, spatial and temporal patterns, and the level of parallelism in file access to determine whether it is advisable to change the storage format.
Funding: Agencia Estatal de Investigación PID2020-112496GB-I00
Note: Other funding: UAB transformative agreements
Rights: This document is subject to a Creative Commons license. Total or partial reproduction, distribution, public communication of the work, and the creation of derivative works are permitted, even for commercial purposes, provided that the authorship of the original work is acknowledged.
Language: English
Document: Article ; research ; Published version
Subject: File format ; Deep learning ; Scalability ; High-performance computing ; Parallel input/output
Published in: Cluster Computing, Vol. 28 (August 2025), art. 613, ISSN 1573-7543

DOI: 10.1007/s10586-025-05283-3


23 pp., 4.1 MB


Record created 2025-09-09, last modified 2025-10-30


