Audio-visual scene recognition using attention-based graph convolutional model
Wang, Ziqi (China University of Petroleum (East China). Qingdao Institute of Software)
Wu, Yikai (China University of Petroleum (East China). Qingdao Institute of Software)
Wang, Yifan (China University of Petroleum (East China). Qingdao Institute of Software)
Gong, Wenjuan 
(China University of Petroleum (East China). Qingdao Institute of Software)
Gonzàlez, Jordi 
(Universitat Autònoma de Barcelona)
| Date: |
2024 |
| Abstract: |
Scene recognition aims to automatically comprehend scenes, and is widely utilized in various fields such as autonomous driving, intelligent security, and robotics. Current research predominantly employs local audio feature extractors, so the extracted features cannot accommodate long-range contextual characteristics. Moreover, most studies assume that the features of each modality are equally important. Our work introduces a long-range audio feature extractor and employs a self-attention module to re-weight different features, addressing both the limitations of local audio features and the varying importance of different modalities. We propose a visual-audio fusion model based on a self-attention-based graph convolutional neural network (SAGCN). In this model, we introduce an attention-mechanism-based cross-modal learning module into a structured multi-modal fusion network, and integrate the extracted features from different modalities to achieve precise scene recognition. The proposed model achieves an accuracy of 93.1% on a standard multi-modal scene recognition dataset, the TAU dataset. Compared with standard early and late fusion methods, the prediction accuracy improves by 1.4% and 10%, respectively. Compared with SOTA methods, SAGCN exceeds the TAU baseline and the attentional graph convolutional network on the TAU dataset by 8.3% and 1.5%, respectively, and achieves 95.0% accuracy on the UCF101 dataset, outperforming the evolved loss method by 1.2% and the cross-modal deep clustering method by 0.8%. The code is available at https://github.com/submission1234/SAGCN. |
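The self-attention re-weighting of modality features described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the token layout (one feature vector per modality), the random projection matrices (learned in the actual model), and all names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_reweight(tokens, rng=None):
    """Scaled dot-product self-attention over modality tokens.

    tokens: (n_modalities, d) array, one feature vector per modality
    (e.g. row 0 = audio, row 1 = visual). Returns the re-weighted
    tokens (same shape) and the attention weight matrix.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = tokens.shape
    # Hypothetical random projections; in the model these are learned.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))  # (n, n): how much each modality attends to the others
    return attn @ V, attn

# Two 8-dimensional modality features: audio and visual.
tokens = np.vstack([np.ones(8), np.arange(8, dtype=float)])
out, attn = self_attention_reweight(tokens)
assert out.shape == (2, 8)
assert np.allclose(attn.sum(axis=1), 1.0)  # each row is a probability distribution
```

In the full model, the re-weighted features would then feed into the graph convolutional fusion network rather than being used directly.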
| Funding: |
Agencia Estatal de Investigación PID2020-120311RB-I00
|
| Rights: |
This material is protected by copyright and/or related rights. You may use this material to the extent permitted by the copyright and related-rights legislation applicable to your case. For other uses you must obtain permission from the rights holder(s). |
| Language: |
English |
| Document: |
Article ; research ; Submitted version |
| Subject: |
Scene recognition ;
Multi-modal fusion ;
Graph convolutional neural network ;
Attention mechanism |
| Published in: |
Multimedia tools and applications, Vol. 84, Issue 15 (May 2025) , p. 14915-14939, ISSN 1573-7721 |
DOI: 10.1007/s11042-024-19654-2
Record created 2025-09-30, last modified 2025-11-26