Machine Learning-Based Detection of Human Metapneumovirus (HMPV) Using Clinical Data
1. Introduction
Human Metapneumovirus (HMPV) is a globally prevalent respiratory pathogen, particularly impacting vulnerable groups such as young children, the elderly, and individuals with weakened immune systems [1]-[5]. The virus is a leading cause of acute respiratory infections, producing symptoms such as fever, cough, fatigue, and difficulty breathing [6]-[9]. Despite its significant clinical burden, timely and accurate diagnosis of HMPV remains a critical challenge in healthcare [10]-[12]. A key issue lies in the overlap of HMPV symptoms with those of other respiratory illnesses, such as influenza and respiratory syncytial virus (RSV), which complicates differential diagnosis [13]-[16]. Additionally, traditional diagnostic techniques, including viral culture, serology, and polymerase chain reaction (PCR), often require specialized equipment, trained personnel, and substantial time, rendering them resource-intensive and less accessible in under-resourced settings [17].
In recent years, the rapid advancement of artificial intelligence (AI) and machine learning (ML) has opened new opportunities in medical diagnostics [18]. These computational methods can analyze complex datasets, identify subtle patterns, and provide accurate predictions, making them well suited to the challenges posed by HMPV diagnosis [19] [20]. However, a major obstacle to applying ML in medicine is the imbalance of clinical datasets, in which HMPV-positive samples are often significantly outnumbered by negative samples. This imbalance can lead to biased models that fail to detect positive cases, limiting their utility in real-world clinical scenarios [21]-[23].
To address these challenges, this study proposes a robust ML pipeline for HMPV detection. Using a synthetic dataset that simulates real-world clinical conditions, the pipeline incorporates key features such as symptoms, vital signs, and comorbidities [24]-[26].
The application of the Synthetic Minority Oversampling Technique (SMOTE) ensures balanced data representation, enabling the model to detect minority class cases effectively [27]-[29]. By employing an optimized XGBoost classifier, this research aims to improve diagnostic accuracy, sensitivity, and clinical relevance [30] [31]. The results of this study demonstrate the potential of ML-based approaches to revolutionize HMPV diagnosis and pave the way for scalable, efficient solutions for respiratory illness detection.
2. Methods
2.1. Data Generation
To create a clinically relevant dataset, a synthetic dataset of 5,000 samples was generated, designed to closely simulate real-world patient presentations. The dataset aimed to capture a diverse range of clinical conditions, combining symptomatology, vital signs, and patient comorbidities. Key features include:
Symptoms: Core symptoms associated with HMPV were included:
Fever: Categorized into mild, moderate, or severe levels to reflect varying clinical intensities.
Cough: Distinguished as non-productive, productive, or absent, as cough type is a key indicator in respiratory illnesses.
Fatigue: Captured as a binary feature (present or absent), emphasizing a common but often overlooked symptom.
Symptom Duration: Measured in days to provide temporal insights into disease progression.
Vitals: Key physiological measurements included:
Oxygen Saturation: A critical indicator of respiratory distress, measured as a percentage.
Heart Rate: Measured in beats per minute to reflect cardiovascular stress.
Respiratory Rate: Measured in breaths per minute, indicating pulmonary function.
Comorbidities: Patients were categorized based on comorbidity burden, reflecting mild, moderate, or severe conditions such as asthma, diabetes, or cardiovascular diseases. These factors often influence disease severity and outcomes.
The dataset was generated using statistical distributions (e.g., Gaussian for continuous features and categorical probabilities for discrete variables) informed by domain expertise. Controlled noise was introduced to simulate real-world variability, ensuring the data reflects actual clinical complexity.
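As a concrete illustration, the generation step described above can be sketched as follows. The column names, distribution parameters, and the positive-case rate are assumptions for illustration only, not the study's exact values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 5000  # sample count from Section 2.1

# Continuous vitals drawn from Gaussians with clipping for logical
# consistency; categorical features sampled with fixed probabilities.
# All parameter values here are illustrative placeholders.
df = pd.DataFrame({
    "age": np.clip(rng.normal(45, 22, n), 0, 100).round(),
    "oxygen_saturation": np.clip(rng.normal(96, 3, n), 70, 100),
    "heart_rate": np.clip(rng.normal(85, 15, n), 40, 180).round(),
    "respiratory_rate": np.clip(rng.normal(18, 4, n), 8, 40).round(),
    "symptom_duration": rng.integers(1, 15, n),  # days
    "fever": rng.choice(["mild", "moderate", "severe"], size=n, p=[0.5, 0.3, 0.2]),
    "cough": rng.choice(["absent", "non-productive", "productive"], size=n, p=[0.25, 0.4, 0.35]),
    "fatigue": rng.integers(0, 2, n),  # binary: present/absent
    "comorbidity": rng.choice(["mild", "moderate", "severe"], size=n, p=[0.5, 0.3, 0.2]),
})

# Imbalanced target: HMPV-positive cases as the minority class (~20% here)
df["hmpv_positive"] = (rng.random(n) < 0.2).astype(int)
print(df.shape)  # (5000, 10)
```

In a faithful reproduction, the label would depend on the features (e.g., higher probability with longer symptom duration); the independent label above only demonstrates the data layout.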
2.2. Preprocessing
Preprocessing is a critical step in preparing data for machine learning. The following methods were applied:
Data Cleaning and Transformation:
Extreme outliers were removed or clipped to ensure logical consistency (e.g., ages were clipped to the range 0 to 100 years).
Features such as age and vitals were normalized to mitigate the impact of extreme values.
Feature Standardization:
Continuous features, including age, oxygen saturation, heart rate, respiratory rate, and symptom duration, were standardized using StandardScaler. This ensured all features had a mean of 0 and a standard deviation of 1, preventing scale-related biases during model training.
Addressing Class Imbalance:
The dataset was highly imbalanced, with fewer HMPV-positive cases (minority class). To address this, the Synthetic Minority Oversampling Technique (SMOTE) was applied. SMOTE generates synthetic samples for the minority class by interpolating between existing samples, ensuring balanced representation of both classes in the dataset. This step was crucial for improving the model’s sensitivity to positive cases and reducing false negatives.
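SMOTE's core interpolation idea can be sketched in a few lines of NumPy. In practice a library implementation such as imbalanced-learn's `SMOTE` would be used; this minimal version is for illustration only:

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours (a minimal sketch of the SMOTE idea)."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # exclude self-matches
    neighbours = np.argsort(d, axis=1)[:, :k]   # k nearest per sample
    base = rng.integers(0, len(X_min), n_new)   # random anchor samples
    nb = neighbours[base, rng.integers(0, k, n_new)]
    lam = rng.random((n_new, 1))                # interpolation weights in [0, 1)
    return X_min[base] + lam * (X_min[nb] - X_min[base])

# Toy minority class: 20 points in a 3-dimensional feature space
rng = np.random.default_rng(1)
X_min = rng.normal(size=(20, 3))
X_new = smote_oversample(X_min, n_new=80, rng=rng)
print(X_new.shape)  # (80, 3)
```

Because each synthetic point lies on a segment between two real minority samples, oversampling never extrapolates outside the minority class's feature range.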
2.3. Machine Learning Pipeline
A robust machine learning pipeline was designed to optimize performance, interpretability, and clinical relevance. The pipeline leveraged XGBoost, a high-performing gradient boosting algorithm widely used for tabular data analysis. The approach included careful feature selection, data balancing, and threshold tuning to enhance real-world applicability.
1) Hyperparameter Optimization
To achieve the best possible model performance, a grid search was conducted over a range of hyperparameters, including:
Number of estimators: Controlled the number of trees in the ensemble.
Maximum depth: Regulated the complexity of each tree to prevent overfitting.
Learning rate: Adjusted the step size for updates during training, balancing convergence speed and accuracy.
Class weight balancing: Ensured the model accounted for class imbalances without relying solely on SMOTE.
Overfitting Mitigation:
Cross-validation (stratified k-fold) was employed to ensure generalizability.
Regularization techniques (L1 and L2 penalties) were incorporated to reduce the risk of overfitting.
Early stopping was implemented, preventing excessive iterations that could lead to memorization rather than generalization.
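A minimal sketch of the grid search with stratified k-fold cross-validation follows. Because the XGBoost package may not be available everywhere, scikit-learn's `GradientBoostingClassifier` stands in for it here; the dataset is a synthetic placeholder and the grid values are illustrative, not the study's tuned ranges:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Placeholder imbalanced dataset (~20% positives)
X, y = make_classification(n_samples=400, n_features=9,
                           weights=[0.8, 0.2], random_state=0)

param_grid = {
    "n_estimators": [50, 100],     # number of trees in the ensemble
    "max_depth": [2, 3],           # per-tree complexity cap
    "learning_rate": [0.05, 0.1],  # shrinkage applied to each boosting step
}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# F1 as the selection metric, consistent with the imbalance focus
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, scoring="f1", cv=cv)
search.fit(X, y)
print(search.best_params_)
```

With XGBoost itself, class-weight balancing would additionally be expressed through its `scale_pos_weight` parameter and early stopping through a validation set.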
2) Training and Testing
The SMOTE-enhanced dataset was split into training (80%) and testing (20%) subsets using stratified sampling to preserve class proportions.
Threshold optimization: The classification threshold was tuned using ROC curve analysis to balance false negatives and false positives, ensuring clinical reliability.
Generalizability considerations: Additional testing on demographically varied synthetic datasets simulated population-level differences in symptom presentation.
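The threshold-optimization step can be sketched as below using Youden's J statistic, one common ROC-based criterion for balancing sensitivity against the false positive rate (the labels and scores here are simulated placeholders, not the study's outputs):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Simulated held-out labels and classifier scores (placeholders)
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)
scores = np.clip(0.3 * y_true + rng.normal(0.4, 0.2, 500), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, scores)

# Youden's J = sensitivity - FPR; its maximum picks the operating point
j = tpr - fpr
best = thresholds[np.argmax(j)]
print(f"chosen threshold: {best:.3f}")
```

In a clinical setting the criterion could instead weight false negatives more heavily than false positives, shifting the chosen threshold downward to favour sensitivity.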
3) Evaluation Metrics
The model was evaluated using multiple metrics to comprehensively assess performance and clinical reliability:
Accuracy: Measured the proportion of correctly classified samples.
F1-Score: Focused on balancing precision and recall, particularly important for imbalanced datasets.
ROC-AUC: Assessed the model’s ability to discriminate between classes at various thresholds.
Precision-Recall Curve: Highlighted performance on the minority class, emphasizing precision for positive cases.
False Negative Rate Analysis: Quantified missed positive cases to ensure minimal misclassification of high-risk patients.
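The metrics above can be computed with scikit-learn as in the following sketch; the labels and predicted probabilities are simulated placeholders standing in for the test-set outputs:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             roc_auc_score)

# Simulated test-set labels and predicted probabilities (placeholders)
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)
probs = np.clip(0.5 * y_true + rng.normal(0.3, 0.2, 1000), 0, 1)
y_pred = (probs >= 0.5).astype(int)

acc = accuracy_score(y_true, y_pred)   # overall correctness
f1 = f1_score(y_true, y_pred)          # precision/recall balance
auc = roc_auc_score(y_true, probs)     # threshold-free discrimination

# False-negative rate, the quantity emphasised for high-risk patients
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fnr = fn / (fn + tp)
print(f"accuracy={acc:.3f}  f1={f1:.3f}  roc-auc={auc:.3f}  fnr={fnr:.3f}")
```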
4) Explainability and Feature Importance
Feature Importance Analysis: XGBoost’s built-in feature importance scores were analysed to identify the most influential predictors.
SHAP Explainability Tools: Provided granular insights into how individual features influenced predictions, ensuring transparency for clinical use.
Threshold Sensitivity Analysis: Multiple threshold values were tested to optimize clinical utility, reducing false negatives without excessive false positives.
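A sketch of the importance analysis follows. Since SHAP requires the separate `shap` package, permutation importance is shown here as a model-agnostic stand-in alongside the built-in tree importances, and `GradientBoostingClassifier` again substitutes for XGBoost; the dataset is a synthetic placeholder:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

# Placeholder dataset with a handful of informative features
X, y = make_classification(n_samples=400, n_features=6,
                           n_informative=3, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Built-in (impurity-based) importances from the boosted trees
print(model.feature_importances_.round(3))

# Permutation importance: score drop when each feature is shuffled,
# a model-agnostic cross-check on the built-in ranking
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean.round(3))
```

Agreement between the two rankings increases confidence that an influential feature (e.g., symptom duration in this study) is genuinely driving predictions rather than being an artifact of the tree-splitting criterion.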
2.4. Robustness and Validation
The robustness of the pipeline was ensured through multiple validation strategies, mitigating risks of overfitting, bias, and misclassification.
Sensitivity Analysis: The impact of SMOTE was evaluated across varying oversampling ratios to confirm the model’s effectiveness without synthetic data distortions.
Comparison with Baseline Models: XGBoost’s performance was benchmarked against Logistic Regression, Random Forest, and Support Vector Machines (SVMs) to validate its superiority beyond SMOTE effects.
Synthetic Data Testing: The synthetic dataset was cross-validated against known clinical distributions to ensure alignment with real-world patient patterns.
Real-World Generalizability Assessment: Future work involves testing on multi-centre clinical datasets to confirm geographic and demographic robustness.
Comparison with Traditional Diagnostic Methods: While PCR remains the gold standard, this model provides a cost-effective AI-assisted pre-screening tool, particularly useful in low-resource settings.
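The baseline comparison described above can be sketched as follows; `GradientBoostingClassifier` once more stands in for XGBoost so the example is self-contained, and the dataset is a synthetic placeholder:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Placeholder imbalanced dataset (~20% positives)
X, y = make_classification(n_samples=500, n_features=9,
                           weights=[0.8, 0.2], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svm": SVC(),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),  # XGBoost stand-in
}

# Mean cross-validated F1 per model, the comparison metric used here
scores = {name: cross_val_score(m, X, y, cv=cv, scoring="f1").mean()
          for name, m in models.items()}
for name, s in scores.items():
    print(f"{name}: {s:.3f}")
```

F1 is used as the comparison metric because, as noted above, plain accuracy is misleading under class imbalance.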
3. Results
Model Performance
The application of Synthetic Minority Oversampling Technique (SMOTE) effectively mitigated the class imbalance in the dataset, significantly enhancing the model’s ability to accurately detect HMPV-positive cases (minority class) [32]-[34]. By generating synthetic samples for underrepresented cases, SMOTE ensured a more balanced learning process, allowing the model to recognize subtle patterns in positive cases rather than favoring the majority class (HMPV-negative cases) [35] [36].
Key performance metrics reflect this improvement: Accuracy (73.54%) indicates the model’s overall effectiveness in correctly classifying both HMPV-positive and negative cases, demonstrating a reliable prediction capability. F1-Score (0.7063) highlights the model’s strong balance between precision (how many predicted positives are correct) and recall (how well the model detects true positive cases). This is particularly critical in medical diagnostics, where false negatives (missed cases) could lead to delayed treatment and worsen patient outcomes. ROC-AUC (0.7990) reflects the model’s discriminative power in distinguishing between HMPV-positive and negative cases across various classification thresholds. A higher AUC value indicates that the model can reliably separate positive and negative cases, making it highly effective for clinical use.
These results validate the robustness of the machine learning pipeline, confirming that the integration of SMOTE and an optimized XGBoost classifier significantly improves the model’s sensitivity to detecting HMPV-positive cases. By addressing the challenges of class imbalance, this approach ensures more equitable and reliable predictions, making it highly applicable to real-world clinical settings where early and accurate HMPV detection is essential (See Figure 1).
Figure 1. Key performance metrics of the HMPV detection model.
The Receiver Operating Characteristic (ROC) curve (Figure 2) evaluates the trade-off between the true positive rate (sensitivity) and false positive rate across different thresholds. The ROC curve achieved an AUC of 0.7990, highlighting the model’s strong discriminative power. A near-perfect ROC curve would approach an AUC of 1, making this result highly promising for early-stage HMPV detection.
Figure 2. Receiver operating characteristic (ROC) curve of the HMPV detection model.
The Precision-Recall (PR) curve provides an in-depth evaluation of the model’s performance, particularly in handling the imbalanced nature of HMPV detection (See Figure 3). Precision, or positive predictive value, measures how many of the predicted positive cases are correct, while recall assesses the model’s ability to identify true positive cases. The curve demonstrates that at lower recall levels, precision remains high, indicating that when the model classifies a case as HMPV-positive, it is highly likely to be correct. However, as recall increases—meaning the model becomes more sensitive and captures more actual positive cases—precision decreases slightly. This trade-off is expected, as increasing sensitivity often leads to a higher number of false positives. Despite this decline, precision remains within an acceptable range, making the model clinically valuable for early screening and diagnostic decision-making. This behaviour is particularly crucial in medical applications, where false negatives (missed diagnoses) can have severe consequences, making a high-recall approach essential for effective disease detection. By achieving a strong balance between precision and recall, the model ensures both diagnostic reliability and practical applicability in real-world clinical settings.
Figure 3. Precision-recall curve for HMPV detection.
The feature importance analysis provides key insights into the clinical variables that most significantly influence the model’s predictions. Among these, symptom duration emerges as the most critical factor, indicating that patients experiencing prolonged symptoms are more likely to test positive for HMPV. This aligns with clinical observations, as persistent respiratory symptoms are a hallmark of viral infections like HMPV. Additionally, fever plays a crucial role, reinforcing its significance as a primary physiological response to infection. The model also identified age as an important predictor, with older individuals showing a higher likelihood of testing positive, which is consistent with epidemiological data indicating that HMPV disproportionately affects vulnerable populations such as the elderly and immunocompromised individuals. Beyond these primary factors, comorbidities and vital signs—such as respiratory rate and oxygen saturation—also contributed significantly to the model’s predictions. Patients with pre-existing conditions may experience more severe disease progression, making these features highly relevant in distinguishing high-risk cases. The alignment between the model’s decision-making and established clinical knowledge further validates its reliability. Understanding which features drive predictions can assist healthcare professionals in prioritizing key diagnostic metrics, ensuring that high-risk patients receive timely intervention. The ability to interpret the model’s decisions enhances trust and usability in clinical settings, making AI-driven diagnostic support a valuable tool for improving patient outcomes in respiratory disease detection (See Figure 4).
Figure 4. Feature importance analysis of the HMPV detection model.
4. Discussion
The study highlights the effectiveness of applying machine learning to HMPV detection, particularly in addressing the challenge of data imbalance through the Synthetic Minority Oversampling Technique (SMOTE). By leveraging SMOTE, the model significantly improved its sensitivity toward detecting HMPV-positive cases, a critical aspect in medical diagnostics where missing positive cases can have serious health implications [29] [37]. The results provide strong evidence that balancing the dataset enhances the model’s predictive capability, ensuring that both HMPV-positive and negative cases are correctly classified with high accuracy.
Beyond data balancing, additional model enhancements contributed to its robustness. Hyperparameter tuning with cross-validation helped reduce overfitting and ensured the model generalized well to unseen data. Threshold tuning was performed to minimize false negatives, prioritizing sensitivity in clinical applications where misdiagnosis can lead to delayed treatment and increased transmission risks. Additionally, baseline model comparisons with Logistic Regression, Random Forest, and Support Vector Machines (SVMs) confirmed that XGBoost outperformed these models in both sensitivity and overall predictive performance.
The model’s feature importance analysis revealed that symptom duration, fever, and age were the most influential predictors of HMPV. This finding aligns with established clinical knowledge, where longer symptom duration and elevated fever are strong indicators of respiratory viral infections. Age plays a crucial role, as older individuals and immunocompromised patients are more vulnerable to severe manifestations of HMPV [38]-[40]. The inclusion of vital signs such as oxygen saturation, heart rate, and respiratory rate further strengthened the model’s ability to capture early signs of respiratory distress, enabling more informed clinical decision-making.
The model’s performance metrics, including an F1-score of 0.7063 and ROC-AUC of 0.7990, demonstrate that it successfully addresses the limitations of traditional diagnostic tools. The confusion matrix analysis indicates that false negatives have been significantly reduced, which is critical in medical applications where missing a positive case can lead to delayed treatment and worsened patient outcomes [40]-[43]. Furthermore, the Precision-Recall (PR) curve confirms a stable balance between sensitivity and specificity, ensuring reliable screening performance. The use of SHAP (SHapley Additive Explanations) or feature importance analysis bridges the gap between AI-driven diagnostics and real-world clinical applications. Model explainability is crucial for healthcare AI adoption, and by identifying which features contribute most to predictions, this model enhances trust, transparency, and interpretability for clinicians [23] [44] [45].
While the model achieves high accuracy and sensitivity, an important consideration is how it compares to gold-standard diagnostic techniques like PCR and viral culture. PCR remains the most reliable method for HMPV detection, but it is expensive, time-consuming, and less accessible in resource-limited settings. This study suggests that ML-based detection can serve as an effective pre-screening tool, flagging high-risk cases that require confirmatory PCR testing. Future work should include direct performance comparisons with PCR-based diagnosis to assess the real-world reliability of ML-assisted screening. The methodology used in this study is not limited to HMPV detection but can be extended to other respiratory illnesses such as Influenza, Respiratory Syncytial Virus (RSV), and COVID-19. Since these diseases often share similar symptoms, a generalized AI model trained on diverse respiratory datasets could be deployed across different healthcare settings [46]-[48].
Additionally, integrating this model into clinical decision support systems (CDSS) can provide real-time assistance to physicians, reducing diagnostic errors and improving patient outcomes. In resource-limited settings, where specialized diagnostic tools such as PCR testing may not be readily available, machine learning models can serve as cost-effective pre-screening tools, identifying high-risk cases that require further medical evaluation. Furthermore, combining ML-based predictions with wearable health monitoring devices could facilitate early detection of respiratory distress before patients reach critical stages.
Despite its strong performance, this study has certain limitations that must be acknowledged. Synthetic data constraints are a primary concern, as synthetic data, while mimicking real-world conditions, lacks the variability and complexity seen in actual patient records. Future work should involve training the model on real-world clinical datasets to further validate its performance. While SMOTE improves model sensitivity, it can introduce synthetic noise, potentially affecting generalizability. Alternative techniques, such as Generative Adversarial Networks (GANs) for data augmentation, could be explored to enhance data realism. Additionally, threshold sensitivity and clinical calibration need further refinement, as optimizing classification thresholds for different patient populations is necessary to minimize false positives and false negatives in real-world settings. Transitioning from a research-based model to a real-world diagnostic tool requires additional considerations, including computational efficiency optimizations for real-time predictions, seamless integration with electronic health records (EHR), and regulatory approvals to validate performance in hospital settings. The integration of machine learning in healthcare must also adhere to ethical guidelines, ensuring fairness, transparency, and patient privacy.
Bias detection mechanisms should be incorporated to prevent discrimination against specific demographic groups. Additionally, physician oversight remains crucial, as AI should complement rather than replace medical expertise. Machine learning models must be designed to support clinical decision-making rather than act as standalone diagnostic systems, ensuring that healthcare professionals remain in control of patient care.
5. Conclusion
This study highlights the potential of machine learning in diagnosing Human Metapneumovirus (HMPV) with high accuracy, particularly by addressing data imbalance using SMOTE. The model achieved an F1-score of 0.7063 and an ROC-AUC of 0.7990, demonstrating its reliability in detecting HMPV-positive cases while minimizing false negatives—a crucial factor in clinical diagnostics. By leveraging clinically relevant features such as symptom duration, fever, oxygen saturation, and respiratory rate, the model aligns with real-world medical insights, making it suitable for clinical decision support systems (CDSS). Beyond HMPV, the approach is scalable to other respiratory diseases like Influenza, RSV, and COVID-19, enabling AI-driven early detection in both clinical and telemedicine settings. The model also presents opportunities for low-resource environments, where access to laboratory diagnostics is limited, offering a cost-effective pre-screening tool. Despite promising results, further validation using real-world clinical datasets is needed to ensure generalizability across diverse populations [49]-[51]. Future research should explore ensemble learning techniques to enhance predictive performance further. In conclusion, this study demonstrates the power of AI in respiratory disease diagnostics, providing a scalable and interpretable solution for early HMPV detection. With continued refinement, machine learning can revolutionize respiratory illness screening, improving patient outcomes and optimizing healthcare resources.
Conflicts of Interest
The authors declare no conflicts of interest.