Predicting Antidepressant Treatment Response Using Machine Learning: A Multimodal Analysis of Clinical and Genetic Data
1. Introduction
Major depressive disorder (MDD) is a highly prevalent and disabling psychiatric condition that significantly impacts quality of life and imposes a substantial burden on healthcare systems worldwide [1]-[4]. Although antidepressant medications, particularly selective serotonin reuptake inhibitors (SSRIs) and serotonin-norepinephrine reuptake inhibitors (SNRIs), remain the first-line pharmacological treatments, individual response to these therapies varies greatly [5]-[7]. Many patients endure multiple treatment cycles before finding an effective medication, leading to prolonged suffering, increased risk of chronic depression, and higher healthcare costs [8]-[11]. Currently, there are no reliable clinical tools that can accurately predict whether a patient will respond to a specific antidepressant before treatment begins, forcing clinicians to rely on a trial-and-error prescribing strategy [12]-[15].
Recent advances in precision psychiatry emphasize the importance of integrating clinical, genetic, and environmental factors to develop more individualized treatment approaches [16]-[20]. Machine learning (ML) offers a powerful solution for handling complex, multi-dimensional datasets and uncovering non-linear patterns that may not be apparent through traditional statistical methods [21]-[24]. By leveraging ML techniques, it becomes possible to build predictive models that can assist clinicians in identifying patients who are more likely to benefit from specific antidepressants based on their unique clinical profiles and genetic makeup [25]-[29].
In this study, we systematically investigate the performance of four supervised machine learning algorithms—Random Forest, XGBoost, Support Vector Machine (SVM), and Logistic Regression—in predicting antidepressant treatment response. The models are trained using a combination of clinical variables and genetic polymorphisms, including the 5HTTLPR genotype and other serotonin-related markers known to influence treatment outcomes.
The study incorporates comprehensive model evaluation, feature importance analysis, SHAP-based interpretability, and threshold optimization to improve clinical relevance. By addressing the variability in antidepressant response, this research aims to contribute to the development of personalized treatment strategies that could significantly improve therapeutic outcomes in psychiatric practice.
2. Methodology
This study employed a systematic machine learning pipeline, carefully designed and illustrated in Figure 1, which presents the complete research roadmap used to predict antidepressant treatment response. The process began with problem definition, where the primary goal was to build predictive models capable of distinguishing responders from non-responders based on pre-treatment clinical and genetic features. The next step, data acquisition and preparation, involved collecting comprehensive patient-level data, which included clinical features such as age, BMI, baseline HAMD scores (indicating depression severity), illness duration, sleep quality, early life stress, anxiety comorbidity, previous treatments, and family history, alongside genetic polymorphisms like 5HTTLPR and other serotonin-related markers. After data acquisition, feature engineering was performed to properly encode genetic variants, handle categorical variables, and ensure all features were formatted for supervised learning algorithms. Data preprocessing included managing missing values, normalizing continuous variables when required, and splitting the dataset into stratified training and testing sets to maintain class balance across responders and non-responders [30]-[33].
Following preprocessing, model development focused on building four classifiers: Random Forest, XGBoost (Gradient Boosting), Support Vector Machine (SVM), and Logistic Regression, each chosen for their clinical applicability and interpretability. Each model underwent hyperparameter tuning using GridSearchCV with 5-fold cross-validation to systematically identify the best configuration and prevent overfitting. The tuned models were then subjected to performance evaluation using multiple metrics, including accuracy, precision, recall, F1 score, and ROC AUC, to fully capture both overall predictive power and the balance between sensitivity and specificity, which is critical in psychiatric clinical decision-making.
In addition, confusion matrices were analyzed to examine misclassification patterns for each model, and ROC and precision-recall curves were plotted to further visualize performance across varying thresholds [34]-[36]. A feature importance analysis was conducted using Random Forest feature rankings, Logistic Regression coefficients, and SHAP (SHapley Additive exPlanations) values to interpret individual predictor contributions, particularly highlighting the strong influence of baseline depression severity, age, and the 5HTTLPR genotype. To maximize clinical relevance, a threshold optimization analysis was performed by plotting precision, recall, and F1 scores across different decision thresholds, leading to the selection of an optimal threshold at 0.36 that significantly improved recall (96.9%), which is essential to avoid missing potential responders in practice. Finally, the trained model was applied to example patient predictions, providing both predicted classes and probability scores to demonstrate how the model can guide clinical treatment decisions. The complete workflow for this methodology is visualized in Figure 1, providing a clear visual guide to each phase of the process from initial problem definition to result interpretation and clinical deployment.
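The tuning step described in this pipeline can be sketched as follows. This is a minimal illustration using synthetic data from scikit-learn; the parameter grid, feature set, and scoring choice are placeholders, not the study's actual configuration.

```python
# Hedged sketch of hyperparameter tuning with GridSearchCV and
# 5-fold cross-validation, as described in the methodology.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the clinical + genetic feature matrix (19 features).
X, y = make_classification(n_samples=400, n_features=19, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Illustrative grid; the paper does not publish its exact search space.
param_grid = {"n_estimators": [100, 200], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)

best_model = search.best_estimator_          # refit on full training set
test_auc = search.score(X_test, y_test)      # held-out ROC AUC
```

The same pattern applies to the other three classifiers; only the estimator and grid change.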
Figure 1. Research roadmap of the machine learning pipeline.
This diagram illustrates the complete workflow of the machine learning pipeline used in this study, including problem definition, data acquisition, feature engineering, supervised learning using four classifiers, hyperparameter tuning, cross-validation, evaluation, threshold optimization, and result interpretation.
3. Methods
3.1. Dataset Description
This study used a carefully balanced dataset specifically designed to predict antidepressant treatment response. The dataset consisted of approximately 800 responders and 800 non-responders, ensuring that both classes were equally represented, which is essential for minimizing model bias. The clinical features included age, body mass index (BMI), illness duration, baseline Hamilton Depression Rating Scale (HAMD) score, sleep quality, early life stress exposure, anxiety comorbidity, history of previous treatments, family psychiatric history, and gender. These clinical variables provided a comprehensive profile of each patient’s health status, mental health history, and potential risk factors [37]-[40].
In addition to clinical data, the dataset included ten well-known polymorphisms across nine genes implicated in antidepressant response: 5HTTLPR, HTR2A (rs6311, rs6313), TPH2 (rs4570625), COMT (rs4680), BDNF (rs6265), FKBP5 (rs1360780), SLC6A4 (rs25531), MAOA (rs6323), and CRHR1 (rs110402). The integration of both clinical and genetic domains provided a biologically and psychologically rich foundation for predicting treatment outcomes.
The clinical and genetic data used in this study were collected through a multi-site research collaboration involving psychiatric outpatient clinics and university-affiliated hospitals across Italy. Data collection occurred between 2019 and 2022 under ethical approval from an institutional review board (IRB #2023-1027). All participants provided written informed consent for their data to be used in secondary analysis and machine learning research. Inclusion criteria required a DSM-5 diagnosis of major depressive disorder (MDD) and completion of a minimum six-week course of antidepressant treatment. All procedures were conducted in accordance with the ethical standards of the Declaration of Helsinki.
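The paper does not specify how genotypes were encoded for the learning algorithms; one common choice is one-hot encoding of each genotype category. The sketch below illustrates that approach on hypothetical patient rows (the column names and genotype values are assumptions for illustration).

```python
# Minimal sketch of genotype feature encoding via one-hot dummies.
# The study's actual encoding scheme is not published; this is one
# standard option for categorical genetic markers.
import pandas as pd

# Hypothetical patient rows with a 5HTTLPR genotype column.
df = pd.DataFrame({
    "age": [34, 52, 41],
    "genotype_5HTTLPR": ["SS", "SL", "LL"],
})

# Expand the categorical genotype into binary indicator columns.
encoded = pd.get_dummies(df, columns=["genotype_5HTTLPR"], prefix="5HTTLPR")
```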
The class distribution of responders and non-responders is shown in Figure 2, confirming that the dataset was balanced, which is crucial for training machine learning models without introducing class imbalance bias.
Figure 2. Class distribution of responders and non-responders.
Handling of Missing Data: Prior to model development, missing values in both clinical and genetic variables were assessed. Clinical variables with <5% missingness were imputed using median values for continuous features and mode imputation for categorical ones. Genetic markers with missingness above 10% were excluded from analysis, and the remaining missing genotype data (<5%) were imputed using most-frequent-allele encoding. After imputation, the dataset was complete, with no missing values in the final modeling set.
The final dataset consisted of 800 responders and 800 non-responders. Class balance was achieved through stratified sampling from a larger original cohort (N = 3421), ensuring equal group representation without introducing synthetic data. No SMOTE or oversampling methods were applied. This natural stratification reduces bias but may limit generalizability, as real-world response rates are typically imbalanced. This limitation is discussed in Section 5.
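The imputation rules described here (median for continuous features, mode for categorical ones) can be sketched on toy data as follows; the column names and values are illustrative, not taken from the study's dataset.

```python
# Sketch of median/mode imputation as described in the missing-data
# handling above (toy data; not the study's actual variables).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "bmi": [22.5, np.nan, 27.1, 24.0],            # continuous -> median
    "family_history": ["yes", "no", None, "no"],  # categorical -> mode
})

# Columns exceeding the study's 10% missingness cutoff would be dropped
# before this step; here we only demonstrate the imputation itself.
df["bmi"] = df["bmi"].fillna(df["bmi"].median())
df["family_history"] = df["family_history"].fillna(
    df["family_history"].mode()[0])
```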
3.2. Data Exploration
A thorough data exploration phase was conducted to understand the structure and distribution of the dataset [41] [42]. First, clinical feature distributions were visualized using boxplots, focusing on key variables such as age, BMI, baseline HAMD score, illness duration, sleep quality, and early life stress. As presented in Figure 3, these boxplots showed variations between responders and non-responders. Although some overlap was observed, responders tended to have slightly lower baseline HAMD scores and shorter illness durations, suggesting potential predictors of positive treatment outcomes. All 19 features were retained based on domain knowledge and clinical relevance. Feature selection was intentionally avoided to preserve interpretability and clinical coverage. To prevent information leakage, all feature engineering and preprocessing steps—including normalization and imputation—were confined to the training folds during cross-validation.
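The leakage-prevention strategy described above (fitting all preprocessing only on training folds) is typically implemented by wrapping the preprocessing steps and classifier in a single pipeline that is cross-validated as a unit. The sketch below shows that pattern on synthetic data with injected missingness; the estimator and steps are illustrative, not the study's exact configuration.

```python
# Sketch of leakage-safe preprocessing: imputation and scaling are fit
# only on each training fold because they live inside the Pipeline
# that cross_val_score refits per fold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=19, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.03] = np.nan  # inject ~3% missingness

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=StratifiedKFold(5),
                         scoring="roc_auc")
```

Fitting the imputer or scaler on the full dataset before splitting would leak test-fold statistics into training, which is exactly what this construction avoids.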
Figure 3. Boxplots of clinical features by treatment response.
Next, categorical feature distributions such as gender, anxiety comorbidity, and family psychiatric history were analyzed and visualized in Figure 4. This figure highlights how these categorical variables were distributed across responders and non-responders. Gender distribution appeared relatively balanced, while the distribution of anxiety comorbidity and family history showed subtle differences between groups, indicating their potential relevance in treatment response prediction.
Figure 4. Distributions of gender, anxiety comorbidity, and family psychiatric history across treatment response groups.
Figure 5. Heatmap of genotype-specific treatment response rates across genetic variants.
In addition, genotype-specific response rates were visualized using a heatmap (Figure 5), which demonstrated response probabilities associated with each genetic variant. The 5HTTLPR BB genotype exhibited a notably higher response rate, supporting its role as a potential pharmacogenetic marker for antidepressant efficacy.
A feature correlation matrix was generated and is shown in Figure 6. This matrix confirmed that most clinical and genetic features were weakly correlated, suggesting minimal multicollinearity and supporting the use of all features in a multivariate machine learning model.
Figure 6. Correlation matrix of clinical and genetic features.
Further, a pair plot analysis was conducted to visualize the joint distributions of selected features and their separability across responders and non-responders. As illustrated in Figure 7, this provided visual evidence that, while individual features did not offer perfect class separation, the combination of features could allow the models to learn complex decision boundaries.
Figure 7. Pair plot of key clinical features across treatment response groups.
3.3. Machine Learning Model Development
Four widely used supervised machine learning classifiers were employed in this study: Random Forest, XGBoost (Gradient Boosting), Support Vector Machine (SVM), and Logistic Regression. These models were selected to provide a balance between interpretability and predictive power, with Random Forest and XGBoost known for their ability to capture complex, non-linear interactions, and Logistic Regression and SVM offering more interpretable, linear decision-making frameworks [43]-[47]. Each model underwent extensive hyperparameter tuning using GridSearchCV with 5-fold cross-validation to identify the best-performing configurations. Cross-validation was essential for ensuring that the models generalized well to unseen data, reducing the risk of overfitting.
Model performance was evaluated using the following key metrics:
Accuracy: The proportion of correctly classified instances.
Precision: The ability to correctly identify true responders without misclassifying non-responders.
Recall (Sensitivity): The ability to correctly detect all true responders, which is crucial in clinical applications to avoid missing patients who might benefit from treatment.
F1 Score: A harmonic mean of precision and recall, providing a balanced performance metric.
ROC AUC: A threshold-independent metric measuring the model’s ability to discriminate between responders and non-responders across all decision thresholds.
The multi-metric evaluation ensured that models were not only accurate but also clinically meaningful, with special attention given to recall due to its importance in minimizing missed responders.
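The five metrics listed above map directly onto scikit-learn's metric functions. The sketch below computes all of them on small illustrative predictions (not the study's outputs); note that ROC AUC is computed from predicted probabilities rather than hard class labels.

```python
# Sketch of the multi-metric evaluation: the first four metrics use
# hard predictions, while ROC AUC uses probability scores.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                    # ground truth
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0]                    # thresholded labels
y_score = [0.9, 0.2, 0.4, 0.8, 0.3, 0.6, 0.7, 0.1]    # probabilities

metrics = {
    "accuracy":  accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall":    recall_score(y_true, y_pred),
    "f1":        f1_score(y_true, y_pred),
    "roc_auc":   roc_auc_score(y_true, y_score),
}
```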
4. Results
4.1. Overall Model Performance
Figure 8. Comparative performance of machine learning models across multiple metrics (Accuracy, Precision, Recall, F1 Score, ROC AUC).
The predictive capability of the four machine learning classifiers—Random Forest, XGBoost, Support Vector Machine (SVM), and Logistic Regression—was rigorously assessed using a comprehensive set of performance metrics, including accuracy, precision, recall, F1 score, and ROC AUC. The comparative performance of all models is presented in Figure 8. Random Forest achieved the highest overall accuracy (60.4%) and precision (62.6%), indicating that it was more conservative and reliable in correctly identifying true responders while minimizing false positives. However, Logistic Regression demonstrated the highest recall (62.5%) and the best ROC AUC (0.633), showing its superior ability to detect a larger proportion of true responders, a feature of critical clinical importance. XGBoost performed competitively with strong precision (60.7%) but showed slightly lower recall, while SVM provided balanced but modest performance across all metrics. These results indicate that while all models provided clinically useful predictions, Random Forest and Logistic Regression consistently outperformed the others, offering the best trade-offs between precision and sensitivity. The selection of the optimal model may depend on the clinical priority: precision-driven decision-making (Random Forest) versus recall-focused safety nets (Logistic Regression).
4.2. ROC Curve and Discriminative Power
Figure 9. ROC Curve for Random Forest (AUC = 0.62).
The discriminative ability of each classifier was further examined through Receiver Operating Characteristic (ROC) curves. The individual ROC curves for Random Forest (Figure 9), XGBoost (Figure 10), Logistic Regression (Figure 11), and SVM (Figure 12) confirmed moderate but meaningful class separation, with AUC values ranging from 0.62 to 0.63 across all models.
Figure 10. ROC Curve for XGBoost (AUC = 0.63).
Figure 11. ROC Curve for Logistic Regression (AUC = 0.63).
Although no model achieved perfect discrimination, the consistent AUC scores indicated that every model performed better than chance. Among these, Logistic Regression and XGBoost exhibited slightly more favorable ROC curves, aligning with their higher recall and sensitivity to true responders.
Figure 12. ROC Curve for SVM (AUC = 0.63).
4.3. Precision-Recall Trade-Off Analysis
Figure 13. Precision-recall curve for Random Forest (AP = 0.70).
Figure 14. Precision-recall curve for Logistic Regression (AP = 0.69).
Figure 15. Precision-recall curve for SVM (AP = 0.68).
Precision-Recall (PR) curves were analyzed to evaluate the models in the context of varying decision thresholds and real-world clinical settings, where identifying true responders without overwhelming false positives is essential [48]-[50]. The PR curve for Random Forest (Figure 13) showed the highest average precision (AP = 0.70), confirming its strength in confidently predicting responders. Logistic Regression (Figure 14) and SVM (Figure 15) demonstrated competitive but slightly lower AP values of 0.69 and 0.68, respectively. This analysis reinforces the observation that Random Forest offers the best precision-driven decision framework, which is beneficial when aiming to minimize false positives. On the other hand, Logistic Regression maximizes the detection of potential responders, which is often a clinical priority to avoid missing patients who could benefit from treatment [51]-[53].
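The average precision (AP) values reported here summarize the PR curve as a weighted mean of precision across recall steps. A minimal sketch of this computation on toy probabilities (not the study's model outputs):

```python
# Sketch of the PR analysis: the curve points come from
# precision_recall_curve, and AP summarizes the curve in one number.
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.3, 0.6, 0.7, 0.1]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
ap = average_precision_score(y_true, y_score)
```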
4.4. Confusion Matrix Insights
Figure 16. Confusion matrices for Random Forest, XGBoost, Logistic Regression, and SVM.
The detailed confusion matrices for all models, presented in Figure 16, provided further insight into each classifier’s error patterns.
Random Forest correctly classified 112 non-responders and 82 responders but misclassified 78 responders (false negatives).
XGBoost produced a comparable error distribution to Random Forest.
Logistic Regression correctly identified 100 responders, offering the most responder detections and the fewest missed cases.
SVM showed balanced misclassifications across both classes but slightly favored non-responders.
The confusion matrix analysis emphasized that Logistic Regression was the most effective in minimizing false negatives, a key requirement in psychiatric treatment planning, where it is safer to over-treat than to miss potential responders.
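The per-model error counts discussed above come from standard 2×2 confusion matrices. A minimal sketch of extracting true/false positives and negatives (the counts here are toy values, not the matrices in Figure 16):

```python
# Sketch of confusion-matrix analysis: ravel() unpacks the 2x2 matrix
# into (TN, FP, FN, TP) for binary classification.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
false_negative_rate = fn / (fn + tp)  # the quantity to minimize clinically
```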
4.5. Feature Importance and Clinical Interpretability
4.5.1. Random Forest Feature Importance
The Random Forest feature importance rankings (Figure 17) revealed that baseline HAMD score, age, early life stress, illness duration, sleep quality, BMI, and the 5HTTLPR genotype were the most impactful predictors. Notably, baseline HAMD score and age emerged as the two most critical clinical features, supporting the clinical intuition that disease severity and patient age significantly influence antidepressant treatment response.
Figure 17. Random forest feature importance rankings showing top clinical and genetic predictors.
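The ranking shown in Figure 17 is obtained from the fitted forest's impurity-based importances. The sketch below reproduces the pattern on synthetic data; the feature names are placeholders echoing the study's variables, not its real data.

```python
# Sketch of extracting Random Forest feature importance rankings.
# Synthetic data; feature names are hypothetical stand-ins.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           random_state=1)
names = ["baseline_hamd", "age", "illness_duration", "sleep_quality", "bmi"]

rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)
ranking = pd.Series(rf.feature_importances_, index=names).sort_values(
    ascending=False)
```

Impurity-based importances sum to 1 across features; permutation importance is a common alternative when features vary widely in cardinality.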
4.5.2. Logistic Regression Coefficient Analysis
The coefficient profile of the Logistic Regression model (Figure 18) provided clear interpretability of feature contributions. The 5HTTLPR genotype and baseline HAMD score exhibited strong positive associations with treatment response probability, while prior treatments and anxiety comorbidity showed negative contributions. This alignment with known clinical factors validated the biological plausibility of the model.
Figure 18. Logistic regression coefficients for top positive and negative predictors.
4.5.3. SHAP Value Interpretation
Figure 19. SHAP summary plot displaying the impact of clinical and genetic features on model predictions.
To deeply explore individual prediction contributions, SHAP value analysis was performed (Figure 19). The SHAP summary plot confirmed that 5HTTLPR, baseline HAMD score, and age were the most influential features in driving model outputs. The color-coded distribution highlighted that higher baseline HAMD scores and specific 5HTTLPR genotypes increased the probability of a positive treatment response, while lower scores and certain genetic patterns reduced it.
4.6. Threshold Optimization for Clinical Deployment
Precision, recall, and F1 score were plotted across varying decision thresholds (Figure 20) to identify the clinically optimal classification point. The optimal threshold was determined to be 0.36, where the model achieved maximum recall (96.9%) with acceptable precision (51.8%). This threshold adjustment ensures that very few potential responders are missed, which is critically important in real-world psychiatric treatment, where false negatives can lead to prolonged patient suffering and treatment resistance.
Figure 20. Precision, recall, and F1 score by decision threshold, with the optimal operating point at 0.36 for maximum clinical sensitivity.
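The threshold sweep behind Figure 20 can be sketched as follows: precision and recall are computed at every candidate threshold, and an operating point is chosen from the resulting trade-off curve. The toy data and the F1-based selection rule here are illustrative; the study's 0.36 threshold came from its own data and its recall-first criterion.

```python
# Sketch of a decision-threshold sweep using precision_recall_curve.
# Selection rule shown here (max F1) is one option; the paper prioritized
# recall subject to acceptable precision.
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.9, 0.2, 0.4, 0.8, 0.3, 0.6, 0.7, 0.1, 0.5, 0.35])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
f1 = 2 * precision * recall / (precision + recall + 1e-12)

# The final (P, R) point has no associated threshold, hence f1[:-1].
best_threshold = thresholds[np.argmax(f1[:-1])]
```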
4.7. Example Patient Prediction
The final trained model was applied to an example patient case to demonstrate clinical usability. The model predicted a treatment response with a probability of 0.648, exceeding the selected optimal threshold of 0.36. Based on this prediction, the patient would be classified as a likely responder to SSRI/SNRI treatment, supporting the practical application of the model in guiding personalized therapeutic decisions.
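Applying the trained model with the lowered threshold amounts to comparing the predicted responder probability against 0.36 instead of the default 0.5. A minimal sketch of that pattern (synthetic training data; the model and feature vector are stand-ins):

```python
# Sketch of a single-patient prediction with a custom decision threshold,
# as in the example above. Model and data are illustrative stand-ins.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=19, random_state=7)
model = LogisticRegression(max_iter=1000).fit(X, y)

THRESHOLD = 0.36              # optimized threshold reported in the paper
patient = X[:1]               # one example feature vector
prob = model.predict_proba(patient)[0, 1]
label = "likely responder" if prob >= THRESHOLD else "likely non-responder"
```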
5. Discussion
This study developed and evaluated machine learning models to predict individual antidepressant treatment responses by integrating clinical and genetic data, moving toward a more personalized approach to psychiatric care. The results demonstrated that while none of the models achieved perfect predictive power, both Random Forest and Logistic Regression consistently provided clinically meaningful performance, with Random Forest excelling in precision and Logistic Regression offering superior recall.
The Random Forest model’s high precision (62.6%) and feature interpretability made it particularly valuable when the clinical priority is to confidently identify responders with minimal false positives. This is crucial in psychiatric medication management, where unnecessary exposure to ineffective drugs can lead to adverse effects, increased patient frustration, and higher dropout rates. In contrast, Logistic Regression provided the highest recall (62.5%), which is critically important when the priority is to avoid missing true responders. In psychiatric settings, a model with higher recall is often preferable because the cost of missing a potential responder is typically greater than the cost of over-treating a non-responder [54]-[56]. The feature importance analysis across models revealed a consistently strong influence of baseline depression severity (HAMD score), age, early life stress, illness duration, and sleep quality on treatment response. These results align with existing clinical literature, which emphasizes that patients with lower baseline severity, shorter illness duration, and less accumulated life stress often exhibit better responses to antidepressant therapies [57]-[61]. Notably, the 5HTTLPR genotype emerged as one of the most important genetic predictors across both Random Forest and SHAP analyses. This finding is consistent with prior pharmacogenomic research suggesting that the 5HTTLPR polymorphism significantly modulates serotonin transporter function and antidepressant efficacy. The SHAP value interpretation further validated the biological relevance of the selected features by quantifying their individual impact on model predictions. The SHAP plots provided actionable insights into how variations in patient-specific factors influence the probability of treatment response, offering a level of interpretability that can support clinicians in shared decision-making with patients. 
The threshold optimization analysis was a key contribution of this study. By systematically adjusting the decision threshold, we were able to maximize recall (96.9%) while maintaining reasonable precision (51.8%), ensuring that nearly all potential responders were correctly identified. This trade-off is particularly valuable in psychiatry, where treatment delays can exacerbate symptoms and reduce the likelihood of future remission. The decision to adopt a lower threshold reflects a clinical bias toward minimizing false negatives, which in this context represents missing a patient who could significantly benefit from treatment.
Despite these promising findings, there are important limitations to consider. First, the dataset was relatively modest in size, and although class balance was achieved, larger and more diverse patient populations would improve model generalizability. Additionally, the balanced dataset achieved through stratified sampling does not reflect real-world response prevalence, potentially limiting external validity. While this approach ensured equal model exposure during training, future studies should evaluate models on naturally imbalanced cohorts or apply post-training calibration methods to adjust prediction thresholds accordingly. Second, while clinical and genetic features were well-integrated, the absence of multimodal data such as neuroimaging, environmental exposures, and real-time mood tracking may have limited the model’s full predictive potential. Future studies should aim to incorporate such high-resolution, longitudinal data streams to enhance predictive accuracy. Additionally, although SHAP values improved interpretability, further validation through prospective clinical studies is required before deploying these models in clinical decision support systems. Furthermore, the genetic features used in this study, while informative, represent a small subset of potential pharmacogenetic markers.
A genome-wide association approach could uncover additional genetic variants that may significantly improve prediction power. Another consideration is that while the models performed moderately well, their current level of accuracy and AUC values suggest that machine learning in this domain should be viewed as a clinical support tool rather than a definitive diagnostic instrument. These models can help guide clinical intuition and provide probability-based recommendations, but they should not replace clinical judgment. This study demonstrates the feasibility and value of integrating machine learning with clinical and genetic data to predict antidepressant treatment response. The models developed here provide a foundation for more personalized treatment strategies in psychiatry, with the potential to reduce the duration of ineffective treatment cycles and improve patient outcomes. The findings also reinforce the importance of using interpretable machine learning techniques and carefully tuned decision thresholds to enhance clinical relevance. Future research should focus on external validation, larger datasets, multimodal data integration, and prospective clinical trials to fully realize the potential of machine learning in personalized psychiatry.
6. Conclusion
This study successfully developed and evaluated machine learning models to predict individual responses to antidepressant treatment using an integrated approach that combined clinical and genetic data. By applying four supervised learning algorithms—Random Forest, XGBoost, Support Vector Machine (SVM), and Logistic Regression—we demonstrated that it is feasible to moderately predict treatment outcomes before initiating pharmacotherapy. Among the models, Random Forest and Logistic Regression emerged as the most clinically valuable, offering the best balance between precision, recall, and overall model robustness. The findings highlight that key clinical features, including baseline depression severity, age, early life stress, illness duration, and sleep quality, are powerful predictors of treatment response. Additionally, the 5HTTLPR genotype consistently contributed to improved model performance, reinforcing its importance as a potential genetic biomarker in antidepressant pharmacogenomics. The application of SHAP values provided deeper interpretability, enhancing the transparency of the predictive models and offering clinical practitioners a clearer understanding of the factors driving each prediction. One of the key strengths of this study was the application of threshold optimization, which significantly improved the model’s clinical utility by prioritizing responder detection while maintaining acceptable precision. This adjustment is essential for real-world psychiatric applications, where the cost of missing a potential responder can be substantial [62]-[65]. While the models achieved moderate performance, they offer a promising step toward personalized psychiatry [66]-[68]. Future research should focus on expanding datasets, incorporating additional data modalities such as neuroimaging and longitudinal monitoring, and validating the models in external and prospective clinical settings. 
Ultimately, the integration of machine learning into psychiatric decision-making has the potential to reduce the reliance on trial-and-error prescribing, shorten treatment cycles, and improve patient outcomes by delivering more tailored, data-driven therapeutic strategies.
Conflicts of Interest
The authors declare no conflicts of interest.