Predicting Treatment Response in Bipolar Disorder Using Biomarker Profiles and Machine Learning Models
1. Introduction
Bipolar disorder (BD) affects over 1% of the global population and is marked by alternating episodes of mania and depression [1]-[3]. Despite the availability of mood stabilizers such as lithium and anticonvulsants, treatment response in BD remains highly individualized [4]-[7]. A significant proportion of patients do not respond adequately to initial therapy, often leading to prolonged disability, increased healthcare costs, and risk of recurrence [8]-[12]. This clinical variability calls for precision psychiatry tools capable of guiding treatment selection based on biological and clinical profiles [13] [14]. While prior studies have attempted to associate genetic markers, serum biomarkers, and neuroimaging measures with treatment outcomes, most have focused on single modalities or small sample sizes [15]-[18]. The heterogeneity of BD suggests that integrative modelling across biological layers may better capture the complexity of treatment response [19]-[21]. Furthermore, conventional statistical approaches often struggle to generalize across nonlinear relationships and noisy real-world data [22] [23]. With the increasing accessibility of high-dimensional data and advances in machine learning (ML) and deep learning (DL), it is now feasible to develop models that learn patterns from diverse biomarker sources [24]-[27]. However, the application of such models to stratify treatment responders in BD remains underexplored, particularly in multiclass settings [28] [29]. This study presents a data-driven framework to predict treatment response categories—non-responders, lithium responders, and anticonvulsant responders—by leveraging a comprehensive dataset including genetic polymorphisms, inflammatory and neurochemical serum levels, neuroimaging-derived brain metrics, and clinical history. We hypothesize that deep learning, owing to its capacity to capture complex feature interactions, will outperform traditional ML methods in predicting therapeutic outcomes. 
Beyond classification, this work aims to identify salient biomarkers contributing to treatment stratification, laying the groundwork for biomarker-informed decision support systems in psychiatric care.
2. Dataset Overview
The dataset was synthetically constructed from distributions and correlations reported in the clinical literature on bipolar disorder treatment response. No real patient data were used; feature distributions were modelled to reflect published frequencies, effect sizes, and inter-variable dependencies from prior biomarker and treatment response studies. Each virtual patient record follows plausible diagnostic and treatment pathways defined by DSM-5 criteria, and ethical clearance was not applicable because the dataset is fully synthetic. The dataset comprises 2000 subjects diagnosed with bipolar disorder, each characterized by 31 variables spanning genetic, biochemical, neuroimaging, and clinical domains. The primary objective is to predict the categorical treatment response label, which assigns patients to one of three clinically relevant classes: non-responders, lithium responders, and anticonvulsant responders.
Data Composition
The dataset integrates 30 predictor features across four biomedical domains, which together with the Treatment_response label make up the 31 variables, to enable modelling of treatment response in bipolar disorder:
Genetic Features (7 SNPs): Single nucleotide polymorphisms known to influence neuroplasticity and mood regulation, including: BDNF_Val66Met, COMT_Val158Met, SLC6A4_5HTTLPR, CACNA1C_rs1006737, ANK3_rs10994336, NR1D1_rs2314339, and IL6_rs1800795 [30] [31].
Serum Biomarkers (10 variables): Quantitative biochemical indicators capturing neurochemical, inflammatory, and hormonal status [32] [33]. These include: GABA, Glutamate, CRP_level, TNF_alpha, BDNF_serum, S100B, NSE, Cortisol_AM, Thyroid_TSH, and Lithium_ratio.
Neuroimaging Measures (5 variables): Brain structure and functional connectivity metrics derived from imaging modalities: Hippocampal_volume, Prefrontal_thickness, Amygdala_activity, DLPFC_connectivity, and REM_latency [34] [35].
Clinical and Demographic Features (8 variables): Clinical course variables and physiological indicators, including: Age, Gender, Illness_duration, Depression_episodes, Manic_episodes, Family_history, Heart_rate_variability, and HPA_axis_reactivity.
Data Quality
Among the 31 variables, only two contain missing values: Illness_duration and Treatment_response, each with 200 missing entries (10% of records).
This dataset offers a rich, multimodal representation of bipolar disorder phenotypes, enabling a robust foundation for developing predictive models and exploring biomarker importance across treatment response categories.
3. Methodology
To investigate the predictive potential of multimodal biomarkers in determining treatment response in bipolar disorder, we adopted a structured and reproducible machine learning pipeline. This methodology combines data preprocessing, feature exploration, model training, evaluation, and interpretation steps. The complete pipeline is visually summarized in Figure 1.
Figure 1. End-to-end methodological workflow for predicting treatment response using biomarker data.
3.1. Data Preprocessing
The process began with data cleaning and preprocessing. Two variables, Illness_duration and Treatment_response, each contained 10% missing values. Rows with a missing Treatment_response (the target variable) were removed to avoid label noise. Missing Illness_duration values were imputed using median values stratified by age group and response category to preserve distributional integrity. After removing 200 records with missing Treatment_response, the final dataset consisted of 1800 subjects distributed as follows: 466 non-responders, 484 lithium responders, and 667 anticonvulsant responders. We chose not to impute the target or apply semi-supervised learning, because imputing the primary target could introduce artificial label noise and distort performance evaluation.
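The two cleaning steps described above can be sketched in plain Python. This is a minimal illustration on toy records, not the study's code: the `age_group` cut-offs, the `preprocess` helper, and the fallback to the overall median for empty strata are all assumptions, since the text does not state them.

```python
from collections import defaultdict
from statistics import median

def age_group(age):
    # Hypothetical binning: the paper does not give its age-group cut-offs.
    if age < 30:
        return "<30"
    if age < 50:
        return "30-49"
    return "50+"

def preprocess(records):
    """Drop rows with a missing label, then impute Illness_duration with the
    median of the matching (age group, response class) stratum."""
    # Copy each kept record so the caller's data is not modified.
    labelled = [dict(r) for r in records if r["Treatment_response"] is not None]

    # Collect the observed durations per stratum.
    observed = defaultdict(list)
    for r in labelled:
        if r["Illness_duration"] is not None:
            key = (age_group(r["Age"]), r["Treatment_response"])
            observed[key].append(r["Illness_duration"])

    # Assumed fallback for strata with no observed value: the overall median.
    overall = median(v for values in observed.values() for v in values)

    for r in labelled:
        if r["Illness_duration"] is None:
            key = (age_group(r["Age"]), r["Treatment_response"])
            r["Illness_duration"] = median(observed[key]) if observed[key] else overall
    return labelled

# Toy records for illustration only (not the study data).
toy = [
    {"Age": 25, "Illness_duration": 2.0, "Treatment_response": 0},
    {"Age": 27, "Illness_duration": 4.0, "Treatment_response": 0},
    {"Age": 28, "Illness_duration": None, "Treatment_response": 0},
    {"Age": 45, "Illness_duration": 10.0, "Treatment_response": 1},
    {"Age": 60, "Illness_duration": 7.0, "Treatment_response": None},
]
clean = preprocess(toy)
print(len(clean), clean[2]["Illness_duration"])
```

Because the strata use the response category, the imputed values depend on the label; in a train/validation setting the medians would normally be computed on the training portion only.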
3.2. Exploratory Data Analysis (EDA)
We conducted extensive EDA to understand variable distributions, detect outliers, and visualize class imbalances. Treatment response was moderately imbalanced, with class 2 (anticonvulsant responders) being the largest group. Violin plots, bar charts, and distribution histograms were used to examine feature behavior across classes. Correlation analysis revealed both positively and negatively associated biomarkers with response categories.
3.3. Feature Engineering and Dimensionality Reduction
To understand feature clustering and separability, we applied Uniform Manifold Approximation and Projection (UMAP) for nonlinear dimensionality reduction [36]-[38]. This helped visualize the high-dimensional data in 2D space while retaining structural relationships among samples.
3.4. Model Development
We implemented both classical machine learning models and a deep learning neural network:
Classical models: Logistic Regression, Support Vector Machine (SVM), Random Forest, and XGBoost.
Deep learning model: A feedforward neural network with multiple hidden layers, ReLU activation, dropout regularization, and a softmax output layer.
The dataset was stratified into training (80%) and validation (20%) sets to ensure balanced representation across response classes.
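A stratified 80/20 split keeps each class at roughly the same proportion in both partitions. The sketch below shows the idea using only the standard library; the `stratified_split` helper, the seed, and the toy labels are illustrative assumptions, and a library routine would normally be used in practice.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.2, seed=42):
    """Return (train_idx, val_idx) with every class split at val_fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within the class
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy labels for illustration only (not the study's class counts).
labels = [0] * 50 + [1] * 50 + [2] * 100
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```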
3.5. Model Evaluation
All models were evaluated using metrics suitable for multiclass classification: Accuracy, Precision, Recall, F1 Score, Confusion Matrix, and ROC AUC. Learning curves and training logs were used to monitor model convergence and detect overfitting in the deep learning architecture.
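For a three-class problem, precision, recall, and F1 must be averaged over classes. The text does not state which averaging scheme was used, so the sketch below assumes macro-averaging (an unweighted mean over classes) and derives the metrics from a confusion matrix built on toy predictions.

```python
def confusion_matrix(y_true, y_pred, n_classes=3):
    """cm[t][p] counts samples of true class t predicted as class p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def macro_scores(cm):
    """Accuracy plus macro-averaged precision, recall, and F1 (assumed scheme)."""
    n = len(cm)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[r][k] for r in range(n))   # column sum
        actual_k = sum(cm[k])                           # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    total = sum(sum(row) for row in cm)
    return {
        "accuracy": sum(cm[k][k] for k in range(n)) / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Toy predictions for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred)
scores = macro_scores(cm)
print(cm, scores)
```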
3.6. Model Interpretation
Feature importance was extracted from each classical model using permutation and Gini-based importances. SHAP value analysis for the deep learning model was attempted but failed because the shape of the SHAP output did not match the model input (see Section 6.4). Nonetheless, common top features across models were identified and compared. This modular pipeline allows for comparative model benchmarking and for future extension to external datasets and additional interpretability layers.
Hyperparameters for each model were optimized using a grid search with 5-fold stratified cross-validation. The ranges explored included:
Logistic Regression: C ∈ [0.01, 0.1, 1.0]
SVM: Kernel = 'rbf', γ ∈ [0.01, 0.1, 1], C ∈ [1, 10]
Random Forest: n_estimators ∈ [100, 300], max_depth ∈ [10, 20], min_samples_split ∈ [2, 5]
XGBoost: learning_rate ∈ [0.01, 0.1], n_estimators ∈ [100, 200], max_depth ∈ [3, 6]
Deep Learning: Early stopping after 10 epochs of no improvement in validation loss; dropout ∈ [0.2, 0.5], layer sizes = [64, 32]
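To make the size of this search concrete, the grids listed above for the four classical models can be expanded into explicit candidate settings. The sketch below only enumerates combinations and counts the resulting cross-validation fits; it does not train anything. The dictionary keys follow the parameter names in the list, the `expand` helper is hypothetical, and the deep learning settings are left out because the text does not make clear whether the dropout entry is a range or two discrete values.

```python
from itertools import product

# Grids copied from the list above; the deep learning entry is omitted.
param_grids = {
    "Logistic Regression": {"C": [0.01, 0.1, 1.0]},
    "SVM": {"kernel": ["rbf"], "gamma": [0.01, 0.1, 1], "C": [1, 10]},
    "Random Forest": {
        "n_estimators": [100, 300],
        "max_depth": [10, 20],
        "min_samples_split": [2, 5],
    },
    "XGBoost": {
        "learning_rate": [0.01, 0.1],
        "n_estimators": [100, 200],
        "max_depth": [3, 6],
    },
}

def expand(grid):
    """Return every combination of the grid as a list of parameter dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

N_FOLDS = 5  # 5-fold stratified cross-validation, as stated in the text
for name, grid in param_grids.items():
    candidates = expand(grid)
    print(f"{name}: {len(candidates)} candidates, {len(candidates) * N_FOLDS} fits")
```

Each candidate is fitted once per fold, so the total cost scales with the product of the grid sizes and the number of folds.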
4. Exploratory Data Analysis
Exploratory Data Analysis (EDA) was conducted to understand the distribution, completeness, and relationships within the dataset, and to assess the feature space and the separability of the treatment response classes.
4.1. Missing Data Analysis
A preliminary check for missing values revealed that two key features—Illness_duration and Treatment_response—had missing entries, with 200 values each (Figure 2). These were handled via imputation and filtering techniques as described in the methodology section.
Figure 2. Missing values for Illness_duration and Treatment_response.
4.2. Treatment Response Distribution
The target variable Treatment_response exhibited a moderately imbalanced class distribution across three categories: non-responders (25.9%), lithium responders (26.9%), and anticonvulsant responders (37.1%) (Figure 3). This yields a three-class classification task in which anticonvulsant responders form the largest group.
Figure 3. Distribution of the Treatment Response classes.
4.3. Biomarker Distributions by Treatment Response
To investigate how specific biomarkers vary across treatment response categories—Anticonvulsant responder, Lithium responder, and non-responder—we visualized the distribution patterns of six top-ranked features (Figure 4). These biomarkers were selected based on their consistent importance across multiple models (see Section 6).
Distinctive stratification patterns were observed:
BDNF_serum and DLPFC_connectivity showed higher median levels among responders, particularly in the anticonvulsant group, indicating their potential as neural correlates of treatment efficacy.
GABA levels were more uniformly distributed but showed slightly elevated values in the Lithium responder group.
Hippocampal_volume revealed mild variations but with reduced levels in non-responders, reflecting structural neuroimaging differences.
BDNF_Val66Met and COMT_Val158Met polymorphism distributions (subfigures a and b) suggested that specific alleles were more prevalent in non-responders, hinting at a genetic basis for treatment resistance.
These observations are visually summarized in Figures 4(a)-(f):
Figure 4. Biomarker distributions by Treatment Response: (a) Distribution of BDNF_Val66Met SNP stratified by response class (bar plot). (b) Distribution of COMT_Val158Met SNP across treatment groups (bar plot). (c) Serum levels of BDNF by response class (violin plot). (d) GABA concentration across groups (violin plot). (e) Hippocampal_volume by treatment response (violin plot). (f) DLPFC_connectivity distribution (violin plot).
4.4. Correlation Analysis
Pearson correlation analysis was conducted to identify linear relationships between features and the treatment response variable [39]-[41]. Among the positively correlated biomarkers, GABA, DLPFC_connectivity, and COMT_Val158Met ranked highest. Conversely, Manic_episodes, IL6_rs1800795, and Hippocampal_volume showed negative correlations (Figure 5).
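The Pearson coefficient between a feature and the response variable can be computed directly from its definition. The sketch below uses toy values and assumes the response is integer-coded 0, 1, 2 as in Section 5.3; the `pearson_r` helper is illustrative. Note that correlating against such codes treats the classes as ordered, which is an interpretive choice and not a property of the data.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Toy values for illustration only (not the study data).
feature = [1.0, 2.0, 3.0, 4.0, 5.0]
response = [0, 0, 1, 2, 2]          # assumed integer class codes
r = pearson_r(feature, response)
print(round(r, 4))
```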
Figure 5. Top positive and negative correlations with Treatment Response.
4.5. Dimensionality Reduction (UMAP)
Uniform Manifold Approximation and Projection (UMAP) was applied [42]-[44] to visualize the high-dimensional biomarker space. The 2D projection (Figure 6) indicated an absence of clear linear separability among treatment response classes, suggesting that complex non-linear boundaries might be needed—justifying the use of deep learning and ensemble models.
Figure 6. UMAP visualization of treatment response categories in reduced feature space.
5. Model Evaluation and Results
To assess the predictive capability of our models in identifying treatment response classes (Non-responder, Lithium responder, Anticonvulsant responder), we evaluated five approaches: Logistic Regression, Random Forest, Support Vector Machine (SVM), XGBoost, and a Deep Learning neural network. The evaluation employed a comprehensive suite of performance metrics: Accuracy, Precision, Recall, F1-score, and ROC AUC.
5.1. Overall Performance Comparison
Figure 7 presents a side-by-side comparison of the models across the five metrics. Logistic Regression achieved the highest ROC AUC (0.699), indicating the best discrimination between response classes among the models tested. SVM was competitive with an AUC of 0.682, while Random Forest and XGBoost reached AUCs of 0.660 and 0.630, respectively. Contrary to our hypothesis, the Deep Learning model, despite longer training, achieved an AUC of only 0.653, with recall and F1 marginally above those of XGBoost (Table 1).
Figure 7. Model performance comparison across Accuracy, Precision, Recall, F1 Score, and ROC AUC.
5.2. Confusion Matrices and Misclassification Insights
To delve deeper into class-specific prediction behaviour beyond aggregate metrics such as accuracy or AUC, we analysed the confusion matrices of each model [45]. These visualizations reveal how well each classifier discriminates among the three response classes—Non-responders, Lithium responders, and Anticonvulsant responders—and highlight prevalent misclassification patterns. The confusion matrices are presented in Figure 8 through Figure 11, respectively.
Logistic Regression Performance: As visualized in Figure 8, the Logistic Regression model effectively classified Lithium responders, indicating sensitivity to biomarkers strongly correlated with lithium treatment response (e.g., BDNF_serum, Hippocampal_volume). However, it frequently misclassified anticonvulsant responders as either non-responders or lithium responders. This suggests that the model’s linear decision boundaries were inadequate for capturing the non-linear biomarker interactions unique to Anticonvulsant efficacy.
Figure 8. Confusion matrix for Logistic Regression classifier, showing strong recall for Lithium responders but notable confusion between Anticonvulsant and Non-responder classes.
Random Forest Performance: The Random Forest confusion matrix, depicted in Figure 9, showed improved classification of Anticonvulsant responders compared to Logistic Regression. Yet, the model struggled more with Lithium responders, misclassifying many as Anticonvulsant or Non-responders. This may result from Random Forest’s tendency to overfit localized feature splits, particularly in heterogeneous data distributions.
Figure 9. Confusion matrix for Random Forest model, highlighting improved Anticonvulsant prediction but increased misclassification of Lithium responders.
Support Vector Machine (SVM) Performance: The SVM model’s confusion matrix, shown in Figure 10, reveals a more balanced classification performance across all classes. However, a notable confusion exists between Non-responders and Anticonvulsant responders, suggesting that the kernel boundary still struggles to fully separate overlapping biomarker signatures. Despite its strength in maximizing margin, SVM exhibited moderate class entanglement in this clinical prediction setting.
Figure 10. Confusion matrix for SVM classifier, indicating balanced classification with some confusion between Non-responders and Anticonvulsant responders.
XGBoost Performance: As presented in Figure 11, the XGBoost model produced a diffuse pattern of misclassification, especially for non-responders, who were often confused with both responder classes. This may be attributed to XGBoost’s boosting-driven optimization, which tends to favor features offering incremental gains—possibly diminishing the signal from consistently weak responders. Additionally, its genetic biomarker prioritization (see Section 6) may not adequately capture clinical non-responsiveness.
Figure 11. Confusion matrix for XGBoost classifier, showing increased confusion and reduced sensitivity for non-responder identification.
These confusion matrices collectively highlight each model’s strengths and diagnostic gaps [46]-[48]. Logistic Regression is sensitive to interpretable linear features; Random Forest captures more non-linear interactions; SVM achieves balance but is sensitive to biomarker overlap; XGBoost prioritizes genetic signals but lacks class separation for Non-responders; and Deep Learning, while opaque, balances all classes moderately well through abstract representation learning.
5.3. ROC Curve Evaluation
Figure 12 displays the ROC curves for all three classes using the deep learning model. The ROC AUC values for Class 0 (Non-responder), Class 1 (Lithium responder), and Class 2 (Anticonvulsant responder) were approximately 0.60, 0.71, and 0.61, respectively. This reinforces the observation that lithium response prediction was relatively stronger.
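Per-class AUC values of this kind are usually obtained one-vs-rest: each class in turn is treated as positive and the other two as negative. The sketch below assumes that scheme (the text does not name it) and computes AUC as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The helpers and the toy probabilities are illustrative only.

```python
def binary_auc(scores, is_positive):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for s, flag in zip(scores, is_positive) if flag]
    neg = [s for s, flag in zip(scores, is_positive) if not flag]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

def one_vs_rest_auc(y_true, probas, n_classes=3):
    """One AUC per class, using that class's predicted probability as the score."""
    return [
        binary_auc([row[k] for row in probas], [y == k for y in y_true])
        for k in range(n_classes)
    ]

# Toy class probabilities for illustration only (rows sum to 1).
y_true = [0, 0, 1, 1, 2, 2]
probas = [
    [0.6, 0.3, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.7, 0.1],
    [0.1, 0.6, 0.3],
    [0.3, 0.2, 0.5],
    [0.5, 0.3, 0.2],
]
aucs = one_vs_rest_auc(y_true, probas)
macro_auc = sum(aucs) / len(aucs)
print(aucs, round(macro_auc, 3))
```

Averaging the per-class values gives a single macro AUC, which is one common way to arrive at a summary figure for a multiclass model.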
Figure 12. ROC curves for each class (deep learning model), showing differential discriminative ability.
5.4. Deep Learning Training Dynamics
Figure 13 shows the training and validation accuracy and loss across 100 epochs. The model gradually converged, with minimal overfitting. Validation loss decreased smoothly, and validation accuracy exceeded 85% in later epochs, although the generalization gap suggested some variance sensitivity.
Figure 13. Training and validation accuracy and loss progression over epochs.
5.5. Quantitative Model Performance Summary
To provide a clear comparison across modeling strategies, we summarize key evaluation metrics—accuracy, precision, recall, F1-score, and ROC AUC—in Table 1 below. This tabular representation complements the earlier visual plots and enables a compact, interpretable performance assessment of all five models.
Table 1. Performance metrics of all trained models.
Model | Accuracy | Precision | Recall | F1-Score | ROC AUC
Logistic Regression | 0.48 | 0.50 | 0.49 | 0.48 | 0.70
Random Forest | 0.46 | 0.46 | 0.45 | 0.46 | 0.66
SVM | 0.48 | 0.49 | 0.48 | 0.49 | 0.68
XGBoost | 0.45 | 0.44 | 0.44 | 0.45 | 0.63
Deep Learning | 0.46 | 0.47 | 0.46 | 0.46 | 0.65
As shown in Table 1, traditional models such as Logistic Regression and SVM performed best in terms of ROC AUC, suggesting better discrimination across treatment response classes. The Deep Learning model was modest in accuracy but converged steadily (see the training curves in Section 5.4) and may improve with more data or further tuning. XGBoost showed the lowest overall metrics but produced a distinctive, genetics-weighted feature importance profile (Section 6).
6. Biomarker Importance Analysis
To elucidate the most predictive biological markers associated with treatment response in bipolar disorder, we performed a feature importance analysis using three machine learning models: Logistic Regression, Random Forest, and XGBoost. These models offer complementary interpretability due to their distinct mathematical frameworks, ranging from linear associations to non-linear ensemble reasoning [49]-[52]. The full importance profiles are presented in Figure 14.
6.1. Logistic Regression: Linear Interpretability
Logistic Regression is inherently linear: each feature coefficient gives the change in the log-odds of a treatment class per unit change in that feature [53]-[55]. As illustrated in Figure 14(a), the most influential biomarkers include:
a) BDNF_serum
b) Hippocampal_volume
c) GABA
d) DLPFC_connectivity
e) SLC6A4_5HTTLPR
These results are neurobiologically interpretable: for example, BDNF (Brain-Derived Neurotrophic Factor) plays a role in synaptic plasticity and neurogenesis, while reduced hippocampal volume and GABAergic dysfunction have been widely associated with affective disorders. The prominence of genetic and neuroanatomical features suggests that logistic regression effectively captures global linear relationships between biomarker values and clinical response categories.
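The link between coefficients and class probabilities can be written out explicitly. The expressions below assume a multinomial (softmax) formulation with classes coded 0, 1, 2; the text does not say whether this or a one-vs-rest scheme was fitted, and the symbols (feature vector \mathbf{x}, intercepts \beta_{k0}, coefficient vectors \boldsymbol{\beta}_k) are introduced here for illustration.

```latex
\[
P(y = k \mid \mathbf{x})
  = \frac{\exp\!\left(\beta_{k0} + \boldsymbol{\beta}_k^{\top}\mathbf{x}\right)}
         {\sum_{m=0}^{2} \exp\!\left(\beta_{m0} + \boldsymbol{\beta}_m^{\top}\mathbf{x}\right)},
  \qquad k \in \{0, 1, 2\}
\]
\[
\log \frac{P(y = k \mid \mathbf{x})}{P(y = j \mid \mathbf{x})}
  = \left(\beta_{k0} - \beta_{j0}\right)
  + \left(\boldsymbol{\beta}_k - \boldsymbol{\beta}_j\right)^{\top}\mathbf{x}
\]
```

The second line shows that a coefficient difference acts linearly on the log-odds between two classes. Coefficient magnitudes are comparable across features only when the features share a common scale, for example after standardization.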
6.2. Random Forest: Ensemble-Based Insights
The Random Forest classifier aggregates predictions from multiple decision trees, enabling it to effectively capture complex interactions and non-linear patterns within the data [56]-[58]. The top-ranked features in Figure 14(b) were:
a) BDNF_serum
b) DLPFC_connectivity
c) GABA
d) Hippocampal_volume
e) Amygdala_activity
f) REM_latency
g) Prefrontal_thickness
Compared to logistic regression, the Random Forest model assigns greater importance to functional and neurophysiological variables, such as REM latency (implicated in sleep disturbances) and amygdala reactivity (linked to emotion regulation). This broader representation may reflect the model’s ability to uncover feature interactions that would remain latent in simpler linear models.
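Permutation importance, one of the two measures used for the classical models (Section 3.6), scores a feature by how much a performance metric drops when that feature's column is shuffled. The sketch below is a model-agnostic illustration on toy data, with accuracy as the assumed metric. `toy_predict` stands in for a fitted model and is purely hypothetical; it ignores the second feature by construction.

```python
import random

def accuracy(predict, X, y):
    return sum(predict(row) == target for row, target in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

def toy_predict(row):
    # Hypothetical stand-in for a fitted classifier: uses feature 0 only.
    return 1 if row[0] > 0.5 else 0

data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [toy_predict(row) for row in X]
imp = permutation_importance(toy_predict, X, y)
print([round(v, 3) for v in imp])
```

Shuffling the unused feature leaves accuracy unchanged, so its importance is exactly zero, while shuffling the informative feature produces a large drop.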
6.3. XGBoost: Gradient Boosting with Genetic Emphasis
XGBoost, a high-performance gradient-boosting algorithm, further revealed a unique pattern of biomarker prioritization [59] [60]. Figure 14(c) shows that genetic variants dominated the importance rankings:
a) COMT_Val158Met
b) BDNF_serum
c) SLC6A4_5HTTLPR
d) BDNF_Val66Met
e) ANK3_rs10994336
f) IL6_rs1800795
These markers reflect subtle cumulative effects—especially those tied to neurotransmission (dopamine via COMT, serotonin via SLC6A4), inflammation (IL-6), and neuronal growth (BDNF). XGBoost’s boosting strategy likely makes it more sensitive to these nuanced gene-level variations, positioning it as a useful tool for precision psychiatry applications.
6.4. SHAP Analysis and Deep Learning Limitation
Attempts to apply SHAP (SHapley Additive exPlanations) to the deep learning model were unsuccessful due to shape mismatches. Specifically, SHAP produced outputs of length 3 (one per output class), whereas the model input dimension was 30 (the number of biomarkers), violating the requirement for one SHAP value per feature. This reflects a broader challenge in applying interpretability tools to multi-class deep learning classifiers. Consequently, the deep learning model was excluded from the interpretability portion of this study. Nevertheless, the convergence across the interpretable models on biomarkers such as BDNF_serum, GABA, and DLPFC_connectivity strengthens their biological plausibility as treatment predictors.
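For multi-class models, attribution methods commonly return one block of values per output class, each block holding one value per sample and feature. Under that assumed layout, a single score per feature can be recovered by averaging absolute attributions over classes and samples. The sketch below illustrates only this reduction on toy numbers: it does not call the SHAP library, `per_feature_importance` is a hypothetical helper, and the actual output layout depends on the explainer and library version.

```python
def per_feature_importance(attributions):
    """attributions[c][i][j]: contribution of feature j to class c for sample i.
    Returns one score per feature: mean |attribution| over classes and samples."""
    n_classes = len(attributions)
    n_samples = len(attributions[0])
    n_features = len(attributions[0][0])
    totals = [0.0] * n_features
    for class_block in attributions:
        for sample_row in class_block:
            for j, value in enumerate(sample_row):
                totals[j] += abs(value)
    return [t / (n_classes * n_samples) for t in totals]

# Toy attributions: 3 classes x 2 samples x 4 features (illustration only).
attributions = [
    [[0.2, -0.1, 0.0, 0.1], [0.4, 0.1, 0.0, -0.1]],
    [[-0.3, 0.1, 0.0, 0.0], [-0.1, 0.0, 0.0, 0.2]],
    [[0.1, 0.0, 0.0, -0.1], [-0.3, -0.1, 0.0, -0.1]],
]
feature_scores = per_feature_importance(attributions)
print([round(s, 4) for s in feature_scores])
```

The result has one entry per feature, regardless of the number of output classes.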
Figure 14. Biomarker Importance Across Models: (a) Logistic Regression; (b) Random Forest; (c) XGBoost.
7. Discussion
This study presents a comprehensive analysis of treatment response prediction in bipolar disorder using a multimodal biomarker dataset and various machine learning models. Our findings reinforce the feasibility of integrating biological, neuroimaging, and clinical data to enhance treatment stratification and offer several noteworthy insights into both model performance and biological underpinnings.
Interpretation of Predictive Findings: Among the evaluated models, Logistic Regression achieved the highest ROC AUC (0.699), giving the best probabilistic separation of classes despite its linear structure. Deep learning showed moderate results (AUC: 0.653) but did not outperform traditional models in this context. This may reflect the relatively modest sample size (n = 2000), which, while adequate for ML models, may not fully unlock the representational capacity of deep neural networks. UMAP visualization further supported the need for non-linear classifiers, as no clear linear separation was observable in the feature space. Importantly, the interpretable models identified consistent biomarkers of predictive value, namely BDNF_serum, DLPFC_connectivity, and GABA levels. These findings align with established literature highlighting the role of neuroplasticity, prefrontal-limbic network regulation, and inhibitory neurotransmission in mood disorders and treatment efficacy. Genetic variants such as COMT_Val158Met and SLC6A4_5HTTLPR also emerged as key features in XGBoost, supporting their role in modulating dopaminergic and serotonergic pathways relevant to medication response.
Limitations: Despite promising results, the study is not without limitations. First, SHAP analysis for the deep learning model failed due to dimensional mismatch, limiting the interpretability of its predictions. Second, the class distribution was moderately skewed, particularly toward anticonvulsant responders, which may have biased the classifiers. Third, although the dataset includes multiple clinical and neurobiological indicators, important real-world confounders were not represented. Variables such as medication adherence, substance use, socioeconomic status, or comorbid psychiatric conditions (e.g., anxiety, ADHD) may substantially impact treatment response. Their absence could bias model outputs or overstate the predictiveness of the included biomarkers. Future extensions should integrate longitudinal adherence data, environmental factors, and clinical comorbidities to ensure more ecologically valid predictions.
Clinical Implications and Future Work: The consistent identification of biologically plausible biomarkers across models supports the potential integration of such predictive systems into clinical decision-making pipelines. These tools could assist psychiatrists in selecting optimal treatments early during care, reducing trial-and-error prescribing. However, further validation on external datasets and larger cohorts is essential. Future work should also explore multimodal fusion techniques, temporal modelling of treatment trajectories, and enhanced explainability frameworks (e.g., integrated gradients, SHAP for DL) to improve both performance and transparency. This study underscores the promise of machine learning-guided biomarker analysis in psychiatric precision medicine, while also highlighting practical challenges that must be addressed before clinical deployment.
8. Conclusion
This study demonstrated the potential of machine learning approaches to predict treatment response in individuals with bipolar disorder using a multimodal dataset encompassing genetic, neurochemical, neuroimaging, and clinical features. Through a structured modeling pipeline, we evaluated five algorithms (Logistic Regression, SVM, Random Forest, XGBoost, and Deep Learning) on a synthetic cohort of 2000 patient records characterized by 31 variables. Among the models tested, Logistic Regression achieved the highest ROC AUC, suggesting that even linear models can capture relevant signal in structured biological data [61] [62]. While the Deep Learning model showed moderate success, its performance was constrained by the dataset size and limitations in interpretability. Nevertheless, across the interpretable models, several biomarkers emerged as consistently important predictors of treatment response, including BDNF_serum, GABA, and DLPFC_connectivity. These markers are not only consistently highly ranked but also biologically grounded, reflecting known mechanisms of neuroplasticity, emotional regulation, and neurotransmission in bipolar pathology. Importantly, the study highlighted that no single model or data type suffices; rather, the integration of multiple data modalities and interpretability frameworks is key to improving prediction accuracy and clinical utility. The methodological pipeline, which combines preprocessing, dimensionality reduction, model benchmarking, and biomarker interpretation, offers a reproducible framework for future research. Although the findings are promising, further validation on external datasets and larger populations is warranted. Moreover, enhancing the explainability of complex models and integrating real-world clinical variables will be crucial for translating these approaches into actionable clinical tools [63]-[65].
This work lays a foundation for data-driven precision psychiatry, where treatment decisions are guided by individualized biological profiles rather than trial-and-error medication strategies.
Conflicts of Interest
The authors declare no conflicts of interest.