Text vs. Speech? Detecting Audio Deepfakes on Instagram

Abstract Summary
With the growing use of AI, deepfakes are becoming an increasingly prevalent threat. At the same time, the performance of most detectors drops significantly on unseen data, while generation models keep improving and leave fewer artefacts. We examined deepfakes published on Instagram using the SocialDF dataset. In addition to analysing the deepfakes in the frequency domain with audio deepfake detectors, we transcribed the speech and analysed both the text (e.g. emotion and topics) and the audio content (e.g. emotion and music genre). We found that audio deepfake detectors struggle to identify real-world deepfakes on Instagram. Furthermore, current audio deepfake detection relies on audio artefacts only; the content itself is not used for detection. We therefore suggest using both the speech recording and its content. This approach improves results on real-world data and provides an explanation for the classification. Using content information, we outperformed frequency-based detection with an F1-score of 74.3%.
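The F1-score reported above is the harmonic mean of precision and recall. As a minimal sketch of how such a score is computed, the function below derives it from confusion-matrix counts; the counts in the example are hypothetical and not taken from the SocialDF experiments.

```python
# Hedged sketch: computing an F1-score from confusion-matrix counts.
# The example counts are made up for illustration only.

def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for a binary classifier."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 80 deepfakes caught, 25 false alarms, 30 missed.
print(round(f1_score(tp=80, fp=25, fn=30), 3))  # → 0.744
```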
Abstract ID: NKDR90
Affiliation: Fraunhofer SIT, ATHENE

Abstracts With Same Type

Abstract ID | Abstract Topic | Submission Type | Primary Author
NKDR99 | Machine learning; Search and ranking | Short papers | Mr. Amir Khosrojerdi
NKDR115 | IR applications; Large Language Models | Short papers | Omar Adjali
NKDR108 | IR evaluation; Search and ranking | Short papers | Ms. Payel Santra
NKDR112 | Machine learning; Search and ranking | Short papers | Amirabbas Afzali
NKDR82 | Generative IR; IR applications; Large Language Models; Retrieval-Augmented Generation; System aspects | Short papers | Saisab Sadhu