Enhancing Deepfake Video Verification Using Spatial-Temporal Long-Distance Attention and Weak Supervision
B. Rajesh1, Guttula Kavya Sri2, Neelam Sai Swetha3, Kodur Sravanthi4, Obul Reddy Puli5
1Assistant Professor, Dept. of Information Technology, SV College of Engineering, Tirupati, India.
2,3,4,5B.Tech, Dept. of Information Technology, SV College of Engineering, Tirupati, India.
Email: 1bondirajesh88@gmail.com, 2kavyasreegutthula@gmail.com,
3swethaneelam2805@gmail.com, 4kodurusravanthi2004@gmail.com,
5obulpuli414@gmail.com
Corresponding Author*: B. Rajesh
ABSTRACT: With the rapid advancement of deepfake technologies, detecting highly realistic forged facial videos has become increasingly critical yet challenging. Existing detection methods mainly treat this task as a binary classification problem, often relying on fragile, specific semantic or local artifacts and lacking effective global context modeling. This paper reformulates deepfake detection as a fine-grained classification problem in which subtle, localized differences between real and fake faces must be captured. To address these limitations, a novel spatial-temporal model is proposed that integrates a long-distance attention mechanism designed to assemble global spatial and temporal information. The spatial module detects generation artifacts within individual frames by recalibrating shallow texture features, while the temporal module captures inter-frame inconsistencies by guiding mid-level semantic features with motion residuals across consecutive frames. This dual-attention approach leverages non-overlapping image patches and trainable global forgery templates to highlight critical forged regions. Extensive experiments on public datasets demonstrate that the proposed method significantly outperforms state-of-the-art approaches, achieving robust accuracy even under heavy compression and in cross-dataset settings. The design's weakly supervised nature enhances adaptability and interpretability, making it a promising direction for future deepfake video detection systems.

KEYWORDS: deepfake technologies, spatial-temporal model, attention mechanism, binary classification
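The dual-attention idea summarized above can be illustrated with a minimal sketch. The code below is not the authors' implementation: all dimensions, the linear projection, and the templates are hypothetical stand-ins (random arrays in place of learned weights). It shows the three ingredients the abstract names: non-overlapping patch tokens, long-distance (all-pairs) spatial attention within a frame, and motion residuals between consecutive frames scored against trainable global forgery templates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions, assumed purely for illustration.
T, H, W, C = 4, 8, 8, 3   # frames, height, width, channels
P = 4                     # side of each non-overlapping P x P patch
D = 16                    # token embedding dimension
K = 2                     # number of trainable global forgery templates

video = rng.standard_normal((T, H, W, C))

def to_patches(frame):
    """Split one frame into non-overlapping P x P patches, flattened."""
    patches = frame.reshape(H // P, P, W // P, P, C).swapaxes(1, 2)
    return patches.reshape(-1, P * P * C)            # (num_patches, P*P*C)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical learned parameters (random stand-ins here).
W_embed = rng.standard_normal((P * P * C, D)) * 0.1  # patch -> token projection
templates = rng.standard_normal((K, D)) * 0.1        # "global forgery templates"

def spatial_attention(frame):
    """Long-distance spatial attention: every patch attends to every patch,
    so an artifact anywhere in the frame can influence each token."""
    tokens = to_patches(frame) @ W_embed             # (N, D)
    attn = softmax(tokens @ tokens.T / np.sqrt(D))   # (N, N) all-pairs weights
    return attn @ tokens                             # globally mixed features

# Temporal cue: motion residuals between consecutive frames.
residuals = np.abs(np.diff(video, axis=0))           # (T-1, H, W, C)

def temporal_guidance(residual):
    """Score residual patches against the templates; high scores flag regions
    whose inter-frame motion resembles template forgery patterns."""
    tokens = to_patches(residual) @ W_embed          # (N, D)
    scores = softmax(tokens @ templates.T, axis=0)   # (N, K), over patches
    return scores.max(axis=1)                        # per-patch saliency

spatial_feats = spatial_attention(video[0])
saliency = temporal_guidance(residuals[0])
print(spatial_feats.shape, saliency.shape)           # prints (4, 16) (4,)
```

In a trained model, `W_embed` and `templates` would be learned end-to-end, and the per-patch saliency would reweight mid-level semantic features rather than be read out directly; this sketch only shows the data flow the abstract describes.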