Audio-visual scene recognition using attention-based graph convolutional model
Wang, Ziqi (China University of Petroleum (East China). Qingdao Institute of Software)
Wu, Yikai (China University of Petroleum (East China). Qingdao Institute of Software)
Wang, Yifan (China University of Petroleum (East China). Qingdao Institute of Software)
Gong, Wenjuan 
(China University of Petroleum (East China). Qingdao Institute of Software)
Gonzàlez, Jordi 
(Universitat Autònoma de Barcelona)
| Date: |
2024 |
| Abstract: |
Scene recognition aims to automatically comprehend scenes, and is widely utilized in various fields such as autonomous driving, intelligent security, and robotics. Current research predominantly employs local audio feature extractors, so the extracted features cannot accommodate long-range contextual characteristics. Moreover, most studies assume that the features of each modality are equally important. Our work introduces a long-range audio feature extractor and employs a self-attention module to re-weight different features, addressing both the limitations of local audio features and the varying importance of different modalities. We propose a visual-audio fusion model based on a self-attention-based graph convolutional neural network (SAGCN). In this model, we introduce an attention-mechanism-based cross-modal learning module into a structured multi-modal fusion network, and integrate the extracted features from different modalities to achieve precise scene recognition. The proposed model achieves an accuracy of 93.1% on a standard multi-modal scene recognition dataset, the TAU dataset. Compared with standard early and late fusion methods, the prediction accuracy improves by 1.4% and 10%, respectively. Compared with SOTA methods, SAGCN exceeds the TAU baseline and the attentional graph convolutional network on the TAU dataset by 8.3% and 1.5%, respectively, and achieves 95.0% accuracy on the UCF101 dataset, outperforming the evolved loss method by 1.2% and the cross-modal deep clustering method by 0.8%. The code is available at https://github.com/submission1234/SAGCN. |
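The self-attention re-weighting of modality features described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the token layout (one feature vector per modality), the random projection matrices (learned in the actual model), and all names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_reweight(tokens, rng=None):
    """Scaled dot-product self-attention over modality tokens.

    tokens: (n_modalities, d) array, one feature vector per modality
    (e.g. row 0 = audio, row 1 = visual). Returns the re-weighted
    tokens (same shape) and the attention weight matrix.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = tokens.shape
    # Hypothetical random projections; in the model these are learned.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))  # (n, n): how much each modality attends to the others
    return attn @ V, attn

# Two 8-dimensional modality features: audio and visual.
tokens = np.vstack([np.ones(8), np.arange(8, dtype=float)])
out, attn = self_attention_reweight(tokens)
assert out.shape == (2, 8)
assert np.allclose(attn.sum(axis=1), 1.0)  # each row is a probability distribution
```

In the full model, the re-weighted features would then feed into the graph convolutional fusion network rather than being used directly.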
| Funding: |
Agencia Estatal de Investigación PID2020-120311RB-I00
|
| Rights: |
This material is protected by copyright and/or related rights. You may use this material to the extent permitted by the copyright and related-rights legislation applicable to your case. For other uses you must obtain permission from the rights holder(s). |
| Language: |
English |
| Document: |
Article ; research ; Submitted version |
| Subject: |
Scene recognition ;
Multi-modal fusion ;
Graph convolutional neural network ;
Attention mechanism |
| Published in: |
Multimedia tools and applications, Vol. 84, Issue 15 (May 2025) , p. 14915-14939, ISSN 1573-7721 |
DOI: 10.1007/s11042-024-19654-2
Record created 2025-09-30, last modified 2025-11-26