TITLE:
Multimodal Digital Phenotyping for Bipolar Disorder: Robust Mood-State Classification and Early Relapse Risk Monitoring
AUTHORS:
Rocco de Filippis, Abdullah Al Foysal
KEYWORDS:
Bipolar Disorder, Digital Phenotyping, Multimodal Learning, Face/Voice/Phone, Mood Classification, Relapse Prediction, t-SNE, Ablation
JOURNAL NAME:
Open Access Library Journal, Vol.12 No.12, December 23, 2025
ABSTRACT: Bipolar disorder (BD) is characterized by recurrent transitions between manic, depressive, and euthymic states, yet continuous symptom monitoring remains a major clinical challenge. We present a multimodal digital phenotyping framework for fine-grained BD mood-state classification and relapse-risk monitoring using naturalistic facial video, voice audio, and phone-usage metadata. The proposed architecture employs modality-specific encoders with late-fusion logits to learn disentangled representations of affective, prosodic, and behavioural signals. On a moderately imbalanced but clinically representative dataset (euthymic ≈ 1000, depressive ≈ 534, manic ≈ 468; proportions that reflect real-world prevalence patterns rather than strict balance), the model achieves near-perfect validation performance, including 100% final accuracy and a strictly diagonal confusion matrix, indicating complete separation between the euthymic, depressive, and manic classes. t-SNE visualizations show well-defined clusters at the embedding level for each individual modality and even tighter grouping in the fused representation, suggesting robust cross-modal alignment. An ablation analysis confirms that facial affect provides the strongest single-modality predictive signal (98.8% accuracy), while combining voice and facial features yields the highest bi-modal performance (99.0%), closely followed by the full multimodal system (98.5%). We further demonstrate a relapse-risk layer that transforms predicted mood probabilities into a continuous risk score and triggers an alert when a calibrated clinical threshold is crossed. Although the results are strong, we critically examine whether data leakage or overfitting underlies the “perfect” validation learning curves. To ensure realistic clinical utility, we outline subject-wise evaluation, temporal blocking, calibration strategies, and privacy-preserving deployment considerations.
Overall, our findings highlight the promise of low-burden multimodal monitoring for BD while emphasizing the methodological rigor and safeguards required for real-world translation.
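The late-fusion and relapse-risk steps described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the fusion rule, the risk formula, or the threshold, so everything below is an assumption made for illustration only: logits are fused by simple averaging, risk is taken as one minus the euthymic probability, and an alert fires when the mean risk over a short window reaches a placeholder threshold. The names `late_fusion`, `risk_score`, and `should_alert`, and the values `0.6` and `3`, are hypothetical and are not taken from the paper.

```python
import math

# Class order is an assumption for this sketch.
CLASSES = ("euthymic", "depressive", "manic")


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(per_modality_logits):
    """Fuse per-modality logits by averaging them class by class.

    Averaging is one common late-fusion rule; the paper may use another.
    """
    n = len(per_modality_logits)
    return [sum(col) / n for col in zip(*per_modality_logits)]


def risk_score(probs):
    """Assumed risk definition: probability of any non-euthymic state."""
    return 1.0 - probs[0]


def should_alert(scores, threshold=0.6, window=3):
    """Alert when the mean of the last `window` scores reaches `threshold`.

    Both parameters are illustrative placeholders, not calibrated values.
    """
    recent = scores[-window:]
    return len(recent) == window and sum(recent) / window >= threshold


if __name__ == "__main__":
    # Hypothetical logits from face, voice, and phone encoders for one time step.
    fused = late_fusion([[0.2, 1.5, 0.1], [0.0, 2.0, 0.3], [0.5, 1.0, 0.2]])
    probs = softmax(fused)
    print(dict(zip(CLASSES, probs)), "risk:", risk_score(probs))
```

Smoothing over a window rather than alerting on a single prediction is a design choice in this sketch intended to reduce spurious alerts; in practice the threshold would be set from calibrated probabilities on held-out subjects.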