Deepfake Detection Using Spatiotemporal Transformer
Kaddar, Bachir (University of Ibn Khaldoun-Tiaret)
Fezza, Sid Ahmed (National Higher School of Telecommunications and ICT)
Akhtar, Zahid (State University of New York Polytechnic Institute)
Hamidouche, Wassim (University of Rennes (France))
Hadid, Abdenour (Sorbonne University Abu Dhabi, Sorbonne Center for Artificial Intelligence)
Serra Sagristà, Joan (Universitat Autònoma de Barcelona)

Date: 2024
Abstract: Recent advances in generative models and the availability of large-scale benchmarks have made deepfake video generation and manipulation easier. The number of new hyper-realistic deepfake videos used for malicious purposes is increasing dramatically, creating the need for effective deepfake detection methods. Although many existing deepfake detection approaches, particularly CNN-based methods, show promising results, they suffer from several drawbacks. In general, they generalize poorly to unseen/new deepfake generation methods. The main reason for this shortcoming is that CNN-based methods focus on local spatial artifacts, which are unique to each manipulation method. It is therefore hard to learn the general forgery traces of different manipulation methods without considering dependencies that extend beyond the local receptive field. To address this problem, this article proposes a framework that combines a Convolutional Neural Network (CNN) with a Vision Transformer (ViT) to improve detection accuracy and enhance generalizability. Our method, named HCiT, exploits the ability of CNNs to extract meaningful local features, as well as the ViT's self-attention mechanism, to explicitly learn discriminative global contextual dependencies at the frame level. In this hybrid architecture, the high-level feature maps extracted by the CNN are fed into the ViT model, which determines whether a specific video is fake or real. Experiments were performed on the FaceForensics++, DeepFake Detection Challenge preview, and Celeb-DF datasets, and the results show that the proposed method significantly outperforms state-of-the-art methods. In addition, the HCiT method shows a great capacity for generalization on datasets covering various deepfake generation techniques.
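The abstract describes a hybrid pipeline in which high-level CNN feature maps are tokenized and passed to a Vision Transformer encoder for frame-level real/fake classification. Below is a minimal PyTorch sketch of that general idea; the ResNet-18 backbone, token/encoder dimensions, and two-class head are illustrative assumptions, not the authors' published HCiT configuration.

import torch
import torch.nn as nn
from torchvision.models import resnet18

class HybridCNNViT(nn.Module):
    """Hypothetical CNN + ViT detector sketch, not the authors' implementation."""

    def __init__(self, embed_dim=256, depth=4, num_heads=8):
        super().__init__()
        backbone = resnet18(weights=None)
        # Keep the conv stages only; output is (B, 512, 7, 7) for 224x224 input.
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Conv2d(512, embed_dim, kernel_size=1)  # channels -> token dim
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 7 * 7 + 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, 2)  # real vs. fake logits

    def forward(self, x):                          # x: (B, 3, 224, 224) face crops
        feats = self.proj(self.cnn(x))             # (B, D, 7, 7) feature maps
        tokens = feats.flatten(2).transpose(1, 2)  # (B, 49, D) spatial tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)              # global self-attention over tokens
        return self.head(tokens[:, 0])             # classify from the [CLS] token

model = HybridCNNViT()
logits = model(torch.randn(2, 3, 224, 224))        # shape (2, 2)

In this sketch, the self-attention layers relate all spatial positions of the CNN feature map to one another, which is the mechanism the abstract credits for capturing forgery traces beyond the local receptive field.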
Rights: This material is protected by copyright and/or related rights. You may use this material to the extent permitted by the copyright and related rights legislation applicable to your case. For other uses, you must obtain permission from the rights holder(s).
Language: English
Document: Article ; research ; Accepted version
Subject: Deepfake video ; Detection ; Convolutional neural network ; Vision transformer
Published in: ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 20, No. 11 (November 2024), art. 345, ISSN 1551-6865

DOI: 10.1145/3643030


Postprint
15 p, 1.9 MB

The record appears in these collections:
Articles > Research articles
Articles > Published articles

Record created 2024-12-02, last modified 2025-04-12
