TITLE:
Cybersecurity and Forensic Audio Analysis: Deepfake Detection Based on MFCC, Audio-Text Disconsistency, and Prosodic Features
AUTHORS:
Nursel Yalçın, Kübra Zaptiye
KEYWORDS:
Deepfake Voice Analysis, Forensic Voice Investigation, Speech-to-Text Discrepancy, Speech-to-Text, Speech Recognition, Artificial Voice, Digital Forgery
JOURNAL NAME:
Journal of Computer and Communications, Vol.14 No.3, March 11, 2026
ABSTRACT: Advances in AI-based voice generation and conversion technologies have made it possible to create deepfake voices that closely resemble real human speech, raising new security challenges in forensic voice analysis and cybersecurity. Traditional forensic audio analysis methods rely primarily on acoustic characteristics, an approach that becomes less reliable as deepfake audio grows more realistic. This study proposes an approach to forensic audio analysis that considers the temporal structure of speech, prosodic features, and inconsistencies between audio and textual content to enhance the detection of deepfake audio. To this end, a dataset of 60 Turkish voice recordings was created, consisting of 30 real and 30 artificial recordings, each 7 to 10 seconds long. Mel-Frequency Cepstral Coefficients (MFCCs) were extracted from each recording, and the transcribed text and word-level timestamps were obtained using an automatic speech recognition method. Based on these timestamps, speech rate, pause durations, and temporal alignment features were calculated. In addition, prosodic features such as pitch and amplitude were incorporated into the model. The features were classified using the Random Forest algorithm in two stages: first without prosodic features, and then with prosodic features added. Model performance was evaluated using 5-fold cross-validation. The results indicate that incorporating prosodic features leads to higher deepfake detection accuracy than models that rely solely on non-prosodic features. These results suggest that temporal and prosodic inconsistencies provide supportive and complementary evidence for deepfake audio detection in forensic audio analysis.
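The timestamp-based features mentioned in the abstract (speech rate and pause durations) can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the `Word` record, the `temporal_features` function, the 0.15-second `min_pause` threshold, and the feature names are all hypothetical, and the study's actual definitions may differ.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Word:
    """One recognized word with its start and end time in seconds (hypothetical layout)."""
    text: str
    start: float
    end: float


def temporal_features(words, min_pause=0.15):
    """Derive simple timing features from word-level timestamps.

    min_pause is an assumed threshold: gaps between consecutive words
    shorter than this are not counted as pauses.
    """
    if not words:
        raise ValueError("at least one timed word is required")

    words = sorted(words, key=lambda w: w.start)
    duration = words[-1].end - words[0].start

    # Silence between the end of one word and the start of the next;
    # overlapping timestamps are clamped to zero.
    gaps = [max(0.0, nxt.start - cur.end) for cur, nxt in zip(words, words[1:])]
    pauses = [g for g in gaps if g >= min_pause]

    return {
        "speech_rate_wps": len(words) / duration if duration > 0 else 0.0,
        "pause_count": len(pauses),
        "mean_pause_s": mean(pauses) if pauses else 0.0,
        "max_pause_s": max(pauses, default=0.0),
        "pause_ratio": sum(pauses) / duration if duration > 0 else 0.0,
    }


if __name__ == "__main__":
    # Made-up timestamps, not taken from the study's dataset.
    sample = [
        Word("bu", 0.0, 0.5),
        Word("bir", 0.5, 1.0),
        Word("deneme", 1.5, 2.5),
        Word("kaydı", 3.0, 4.0),
    ]
    print(temporal_features(sample))
```

In a pipeline like the one the abstract describes, a feature dictionary of this kind would presumably be combined with the MFCC and prosodic features into one vector per recording before classification; how the authors actually combine them is not stated here.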