Training an audio model on many languages can actually make it worse at spotting deepfakes, because the model learns to ignore the high-frequency acoustic clues that betray synthetic speech.
Modern AI development typically assumes that more data and more languages lead to better performance. This research reveals a forensic trade-off where the pursuit of meaning destroys the ability to detect fine-grained acoustic anomalies. Monolingual models outperform their massive multilingual counterparts by focusing on the physical artifacts of voice synthesis rather than the words being spoken. Security teams should stop reaching for the biggest general-purpose models for fraud detection. Small, specialized models provide a far more reliable defense against sophisticated audio spoofing.
Monolingual vs. Multilingual Self-Supervised Models for Deepfake Audio Detection: A Comparative Robustness Analysis
SSRN · 6709561
Voice authentication systems are now at risk because of the rapid growth of neural voice synthesis, which makes fraud and evidence manipulation more likely. Many researchers employ self-supervised models like HuBERT and XLS-R for defense, but it remains unclear whether multilingual pretraining improves robustness against novel attacks. To this end, we compare monolingual HuBERT (English only) to the 128-language XLS-R model on the ASVspoof 2019 dataset. The monolingual model ge
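ASVspoof evaluations are conventionally scored with the equal error rate (EER): the operating point where the rate of accepted spoofs equals the rate of rejected genuine utterances. As a minimal sketch of how such a comparison between two countermeasure models might be scored, assuming each model emits a scalar score per utterance where higher means "more likely bona fide" (the score arrays and their distributions below are synthetic illustrations, not the paper's results):

```python
import numpy as np

def equal_error_rate(bonafide_scores, spoof_scores):
    """EER: sweep thresholds over all observed scores and find the point
    where false acceptance (spoof passes) equals false rejection
    (bona fide blocked)."""
    thresholds = np.sort(np.unique(np.concatenate([bonafide_scores, spoof_scores])))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])  # false acceptance
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])  # false rejection
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

# Toy usage with synthetic score distributions (illustration only):
rng = np.random.default_rng(0)
bona = rng.normal(1.0, 0.5, 1000)    # scores for genuine speech
spoof = rng.normal(-1.0, 0.5, 1000)  # scores for synthesized speech
eer = equal_error_rate(bona, spoof)
```

A lower EER from a monolingual countermeasure than from its multilingual counterpart, on both seen and unseen attack types, would be the kind of evidence the comparison described above is after.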