A Survey on Deepfake Audio Detection Techniques: Challenges, Approaches, and Prospects for Mobile Deployment
Authors:
Dr. Rekha B Venkatapur
Department of Computer Science and Engineering
K.S. Institute of Technology (Affiliated to VTU, Belagavi),
Visvesvaraya Technological University - 590018,
Bengaluru, Karnataka - 560109, India.
Reema J
Department of Computer Science and Engineering
K.S. Institute of Technology (Affiliated to VTU, Belagavi),
Visvesvaraya Technological University - 590018,
Bengaluru, Karnataka - 560109, India.
Ritika Singh
Department of Computer Science and Engineering
K.S. Institute of Technology (Affiliated to VTU, Belagavi),
Visvesvaraya Technological University - 590018,
Bengaluru, Karnataka - 560109, India.
Rashmi Soni
Department of Computer Science and Engineering
K.S. Institute of Technology (Affiliated to VTU, Belagavi),
Visvesvaraya Technological University - 590018,
Bengaluru, Karnataka - 560109, India.
Raziya Khan
Department of Computer Science and Engineering
K.S. Institute of Technology (Affiliated to VTU, Belagavi),
Visvesvaraya Technological University - 590018,
Bengaluru, Karnataka - 560109, India.
Abstract—Recent advances in artificial intelligence have sharply increased the threat posed by deepfake audio. Modern generation systems produce highly realistic human voices, making it increasingly difficult to distinguish genuine audio from manipulated audio and posing serious risks to security, privacy, and trust in digital communication. This survey presents a comprehensive overview of deepfake audio generation and detection techniques. We highlight rapid advances in generation methods such as text-to-speech synthesis, voice conversion, and real-time voice cloning, which complicate detection because they leave only subtle artifacts in the speech signal. The paper examines key feature extraction methods, namely the short-time Fourier transform (STFT), Mel spectrograms, and the constant-Q transform (CQT), and explores their complementary roles in capturing temporal, spectral, and perceptual inconsistencies. It then reviews deep learning models, CNN-based and transformer-based ones in particular, which have shown strong performance in detecting deepfake audio. We also discuss the importance of multi-feature fusion and data augmentation: combining diverse representations increases the accuracy and robustness of synthetic audio detection. Additionally, we analyze commonly used datasets and evaluation metrics such as accuracy, area under the ROC curve (AUROC), and equal error rate (EER). Despite this progress, significant challenges remain, including real-time deployment, poor generalization, dataset limitations, and ethical concerns. Overall, the survey concludes that effective deepfake audio detection requires integrated approaches that combine multiple features, adaptive models, and realistic data.
Keywords— deepfake voice, voice cloning, CNN, spectrogram, STFT, Mel spectrogram, CQT, ASVspoof, mobile application