1. Introduction
Tuberculosis (TB) remains one of the world’s deadliest infectious diseases, particularly in low- and middle-income countries, where social, economic, and educational barriers hinder prevention and early detection efforts. According to the World Health Organization, despite global progress, TB continues to claim over 1.3 million lives annually, with the highest burden concentrated in resource-limited settings [1]. The incidence of tuberculosis has fallen by an average of 1.5% per year since 2000, but this decline is much slower in certain countries in Africa, Asia and Eastern Europe because of poverty, HIV infection and the exodus of vulnerable populations [2].
In this context, understanding and enhancing public knowledge about TB symptoms, transmission, and prevention strategies is crucial for breaking the chain of infection and achieving Sustainable Development Goal 3 (SDG 3) on good health and well-being. However, traditional methods of assessing public awareness, such as manual surveys and aggregated statistics, often fail to capture granular insights and cannot easily scale for real-time or large-scale analysis. At the same time, the growing availability of survey data and the advancement of machine learning (ML) techniques provide an unprecedented opportunity to model, predict, and understand patterns in health-related knowledge including for tuberculosis.
This study contributes to this growing field by exploring how ML models can be used to predict the level of tuberculosis knowledge from structured survey data. Specifically, we:
Compare the performance of several regression algorithms (Linear Regression, SVM, Random Forest, and Neural Networks).
Identify the most influential variables affecting knowledge levels.
Assess the interpretability of models using SHAP values.
By doing so, this work aims to support data-informed public health strategies, allowing for more targeted education campaigns and adaptive interventions that respond to real knowledge gaps in the population. The approach also illustrates how artificial intelligence can strengthen precision health policies in the fight against tuberculosis, especially in underserved communities.
Many studies use qualitative methods or categorical questionnaires to assess TB knowledge. A numerical score, by contrast, provides a quantitative, continuous measurement of knowledge with finer granularity and greater precision. It allows interventions to be targeted according to the specific level of knowledge of individuals or groups, which should improve the effectiveness of TB awareness campaigns. A numerical score also lends itself to machine learning algorithms that analyze responses and calculate scores, and it can measure the impact of awareness campaigns more precisely by quantifying changes in knowledge before and after an intervention. By using a numerical score, the present study differs from existing publications: it allows for more detailed analysis, easier comparisons, and more targeted interventions, and thereby contributes to a better understanding of, and a more effective fight against, tuberculosis.
To optimize control strategies, being able to predict TB knowledge in different populations can play a crucial role in public health policy. A better understanding of the disease among populations is essential for improving prevention and treatment efforts. Artificial intelligence (AI) makes it possible to analyze complex data and predict knowledge of tuberculosis, enabling more targeted public health interventions.
TB knowledge can be predicted through various methods and approaches depending on the context and specific objectives. Questionnaires and surveys can be designed to assess individuals’ knowledge about tuberculosis, including transmission, symptoms, treatment and prevention. Studying the correlation between the level of knowledge about tuberculosis and demographic (age, gender, education, etc.) and socioeconomic (income, employment, social status) factors can help predict knowledge of the disease.
The data collected as part of this work come from Niger. In Niger in 2015, the National Tuberculosis Control Program (PNLT) estimated 7115 new cases of microscopy-positive tuberculosis [3]. The capital of Niger (Niamey) had an estimated population of 1,057,347 inhabitants in 2015 according to the National Statistics Institute (INS) [4].
The general objective of this work is to study the knowledge levels of the population on tuberculosis in Niamey, and the specific objectives are as follows:
Evaluate knowledge on the mode of transmission of tuberculosis.
Measure the degree of knowledge about the germ causing the disease.
Determine knowledge of symptoms suggestive of tuberculosis.
Assess knowledge of the means of treating tuberculosis.
To achieve these objectives, we followed the plan below:
The first part presents the problem.
The second part is related to the literature review.
The third part addresses methods for predicting knowledge of tuberculosis.
The fourth part concerns the framework for collecting the data used, the results obtained and the interpretation of these results.
The fifth part presents the results obtained.
The sixth part presents a discussion of the results in light of the literature review.
The seventh section is devoted to the conclusion and suggestions.
2. Problem Statement
Tuberculosis is a public health problem because of its severity, its worldwide extent and its economic burden. It is one of the main causes of death from infectious diseases worldwide according to estimates from the World Health Organization (WHO) in 2014 [1]. The incidence of tuberculosis has fallen by an average of 1.5% per year since 2000, but this decline is much slower in certain countries in Africa, Asia and Eastern Europe because of poverty, HIV infection and the exodus of vulnerable populations [2].
Knowledge of tuberculosis varies around the world depending on several factors. Among young adults and adolescents, knowledge of TB is complicated by pervasive, unverified information, which adds to the challenges of early TB detection [5] [6]. The authors of [7]-[9] sought to explore the epidemiology of tuberculosis and, in particular, the role of socioeconomic factors in determining how disadvantaged populations experience tuberculosis. Assessing knowledge of tuberculosis among disadvantaged populations requires taking into account the level of education, access to information and economic barriers that limit access to care and prevention. Reference [10] examines, through the lens of anthropology, how people worldwide respond to, understand, and treat infectious diseases. Macdonald and Harper [11] take an in-depth look at tuberculosis and show that barriers related to geographic isolation make it difficult to identify gaps in knowledge of the disease. Precarious living conditions and limited access to health care are particularly favorable for the spread of tuberculosis, especially in low- and middle-income countries [12] [13]. Ignorance of symptoms, modes of transmission and prevention strategies, especially among vulnerable populations, aggravates the tuberculosis situation in India and contributes to the persistence of the disease and delays in seeking care [14] [15].
Geographic isolation in certain regions of Niger makes access to awareness campaigns and tuberculosis screening services difficult. This hampers the collection of data on disease, especially in remote areas [16]. Popular beliefs and social stigma related to tuberculosis in the Enugu region of Nigeria may hinder recognition of symptoms of the disease and deter people from seeking treatment [17]. The assessment tools used to measure knowledge of tuberculosis are not always adapted to the specificities of the populations in eastern Algeria, particularly due to cultural differences, linguistic diversity and literacy levels [18]. The lack of health infrastructure and trained personnel in rural areas leads to low tuberculosis screening capacity, making it difficult to assess the rate of awareness of the disease among these populations [19].
The assessment of knowledge of tuberculosis is essential and complex at several levels, particularly the following:
design educational strategies that counter misinformation and promote positive health behavior;
help identify appropriate strategies to overcome social disparities and improve access to health;
develop adapted interventions aimed at improving prevention, screening and adherence to treatment;
understand not only factual knowledge but also social and cultural perceptions and health behaviors influenced by external factors such as stigma and discrimination;
measure the effectiveness of awareness campaigns, adapt prevention and treatment strategies, and target groups at risk.
Knowledge of tuberculosis (TB) is crucial in the fight against this infectious disease, especially in areas with poor access to health care. A good understanding of tuberculosis will make it possible to identify the levers necessary to improve the situation and fight more effectively against this preventable and treatable disease.
Machine learning algorithms can be used to predict TB knowledge by analyzing structured and unstructured data. To do this, we use specific measures and indicators that make it possible to quantify the relationships between the characteristics of individuals (demographic factors, health behaviors, etc.) and their level of knowledge about tuberculosis.
3. Literature Review
Artificial intelligence (AI) represents a significant advancement in the analysis and prediction of tuberculosis (TB) knowledge, particularly in areas with high prevalence rates. By analyzing complex data and using machine learning models, AI can identify gaps in people’s understanding, predict their level of knowledge and adapt public health interventions.
This review explores recent approaches that use AI to predict and analyze TB knowledge, highlighting supervised learning models.
Machine learning (ML) is used to analyze complex data from different sources to predict TB knowledge among different populations. ML algorithms, such as decision trees, random forests, artificial neural networks, and support vector machines (SVMs), are used to identify factors influencing the understanding of TB and to predict knowledge levels in specific individuals or groups [20]-[23].
Supervised models such as random forests or SVMs make it possible to classify individuals according to their level of knowledge about tuberculosis via survey data or medical histories [20]. Neural networks can be applied to process nonlinear data and discover complex relationships between variables influencing TB knowledge [21].
The results of Harris Miriam [22] showed that deep learning techniques can achieve higher diagnostic accuracy than human radiologists in the interpretation of chest radiographs for pulmonary tuberculosis. Mohidem et al. [23] emphasized that neural networks can predict multifactorial phenomena such as tuberculosis, which makes them more suitable for improving public health policies.
Deep neural networks are a subcategory of machine learning algorithms capable of processing unstructured and complex data, such as images, text or behavioral signals. These models have been used to predict TB knowledge by processing large datasets, including demographic information, survey responses, and even medical histories [24]-[27].
Reference [24] is a review of recent research based on the application of artificial intelligence for the management of infectious diseases. Previous studies [25] have shown that artificial intelligence and machine learning play increasingly central roles in the diagnosis and management of diseases. Reference [26] highlights the application of deep learning techniques for tuberculosis diagnosis in India. Reference [27] uses a convolutional neural network (CNN) for image feature extraction to improve pneumonia detection. Reference [28] provides an in-depth analysis of data mining for predicting TB contagion risk.
Understanding the relationships between environmental factors, including weather and air quality, and the incidence of tuberculosis can help predict epidemics and implement targeted prevention measures [29]. Linear regression is used to analyze how factors such as age, gender, socioeconomic status, and education affect knowledge of TB in a given population [30]-[32]. Vimala Balakrishnan et al. used a support vector machine regression-based approach to improve prediction accuracy by accounting for various clinical and demographic factors [33].
An ANN model was used to predict knowledge of tuberculosis from variables such as age, gender, clinical symptoms, laboratory test results, and medical history [34].
An artificial neural network model was designed and implemented to classify patients as having TB from data that include patient information such as age, gender, clinical symptoms, laboratory test results, and medical history [35].
Random forests showed the best predictive performance, with high accuracy and a good ability to identify patients at risk of nonadherence using variables such as age, gender, education level, distance to health center, treatment side effects, and cultural beliefs from data collected from TB patients in the Mukono district, Uganda [36].
Several machine learning algorithms (logistic regression, random forests, gradient boosting, and neural networks) have been used to predict treatment success via data from pulmonary tuberculosis patients, including variables such as age, gender, comorbidities, microbiological test results, treatment regimens, and radiological data [37].
The study in [38] aims to develop a machine learning model to predict TB detection via data from trained African giant rats to improve diagnostic efficiency and accuracy.
4. Methodology
4.1. Data Collection Methodology
This work uses data collected during a study that took place in Niamey, capital of Niger, during the month of July 2016 to assess the population’s knowledge levels of tuberculosis. The sample size of this study was limited to 507 individuals distributed across the five municipal districts of the city of Niamey (see Table 1).
Individuals were targeted for data collection according to the following characteristics:
At least 15 years old, having agreed to freely answer the questionnaires;
Having all their mental capacity.
Individuals with the following characteristics were not taken into account in the study:
A semi-directive questionnaire was used in which the subject had to answer YES, NO or DON’T KNOW, or choose a single answer from several proposed answers.
A free questionnaire was used for group interviews, in which subjects offered several answers. For general knowledge of tuberculosis, each correct answer was given one point for each of the ten questions asked.
The respondents were classified into knowledge categories, namely, good knowledge (8 - 10/10 points), average knowledge (5 - 7/10 points), and insufficient knowledge (0 - 4/10 points), on the basis of their responses to these questions.
For the ten questions, the correct answers were as follows:
For the definition of tuberculosis (“What is tuberculosis?”), the respondent had to say “YES” and mention at least chronic cough, fever, and weight loss among the symptoms.
For contagiousness, it was necessary to say “YES” and suggest that transmission takes place by air.
For the causative agent, it was necessary to name the “Koch bacillus”.
Whatever the location, it was taken into account.
Whatever the source of information, it was taken into account.
For curability, it was necessary to say: “YES, it is curable”.
Whatever the risk factors, they were taken into account.
For diagnosis, it was necessary to offer sputum examination or X-ray.
For treatment, it was necessary to say: “anti-tuberculosis drugs”.
For the duration of treatment, it was necessary to propose: “6 months, more than 6 months or 8 months”.
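The scoring rule and the three knowledge categories described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names and the representation of a questionnaire as a list of ten booleans are assumptions made for the example.

```python
def knowledge_score(correct_flags):
    """Give one point per correct answer over the ten questions."""
    if len(correct_flags) != 10:
        raise ValueError("expected exactly ten answers")
    return sum(1 for flag in correct_flags if flag)


def knowledge_category(score):
    """Map a 0-10 score to the three categories used in the survey."""
    if not 0 <= score <= 10:
        raise ValueError("score must lie between 0 and 10")
    if score >= 8:
        return "good"
    if score >= 5:
        return "average"
    return "insufficient"


# Hypothetical respondent (not survey data): True marks a correct answer.
answers = [True, True, False, True, True, False, True, True, False, True]
score = knowledge_score(answers)
category = knowledge_category(score)
print(score, category)
```

Keeping the score and the category as separate functions mirrors the text: the score is the quantity that is modeled later, while the category is only a reporting convenience.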
On the basis of the scores obtained on knowledge, the respondents are classified into three levels of knowledge, and this distribution makes it possible to look for relationships between these respondents. The choice of respondents was made on the basis of neighborhoods and age groups. The group interviews took place within households with the subjects, in public places or on the street. Emphasis was placed on the fact that participation is free and voluntary.
The different variables studied are sociodemographic characteristics (gender, age, marital status, level of education, city of residence) and the general theoretical level of knowledge about tuberculosis (signs, transmission, sources of information, causal agent, locations of tuberculosis, risk factors, diagnosis, treatment).
This study takes into account the dignity, privacy and freedom of those interviewed by ensuring, in particular, the confidentiality of the identity of the respondents through an anonymous questionnaire.
4.2. Sampling Method and Distribution of Respondents
The study is based on a stratified random sampling approach grounded in the five municipal districts of Niamey. Each district was considered as a stratum, and respondents were randomly selected within each stratum, ensuring a geographic representativeness of opinions across the city of Niamey. This method combines scientific rigor with territorial balance.
The 507 survey participants were distributed across the five districts as shown in Table 1.
Table 1. Distribution of 507 respondents.
Municipal District | Number of Respondents | Approximate Percentage
Niamey 1 | 99 | 19.5%
Niamey 2 | 102 | 20.2%
Niamey 3 | 100 | 19.7%
Niamey 4 | 103 | 20.3%
Niamey 5 | 103 | 20.3%
Total | 507 | 100%
This distribution reveals an almost equal number of respondents across all districts, as detailed in Table 1, ensuring a well-balanced dataset for comparative analysis and enhancing the external validity of the study.
The chosen method guarantees homogeneous coverage of the urban territory of Niamey, minimizes selection bias, and ensures statistical reliability for extrapolating the results to the city’s entire population.
4.3. Defining Predictive Variables
The table of predictive variables and their coding system is presented below in Table 2.
Table 2. Predictive variables and their coding system.
Predictive Variable | Definition/Description | Type of Coding
Gender | Respondent’s gender | One-hot (male, female)
Age | Respondent’s age group | Ordinal (15 - 25 = 1, 26 - 44 = 2, etc.)
Marital Status | Marital status (single, married, etc.) | One-hot (single, married, divorced, widower, etc.)
Town | Place of residence | One-hot (Niamey 1, Niamey 2, etc.)
Heard about tuberculosis | Has ever heard about TB | Binary: 0 = No, 1 = Yes
Perception of tuberculosis form | Personal assessment of the seriousness of the disease | One-hot (fatal disease, mild illness, etc.)
Knowledge about the causal agent of tuberculosis | Knowledge specifically related to the causal agent | Binary: 0 = No, 1 = Yes
Localization of the causal agent | Knowledge of the different organs affected by the causal agent | One-hot (pulmonary, pleural, bone, don’t know, etc.)
Information sources used | Media or communication channels used | One-hot (television, radio, social networks, etc.)
Confidence in the curability of tuberculosis | Declared level of confidence in the curability of the disease | Binary: 0 = No, 1 = Yes
Knowledge about tuberculosis risk factors | History of illness or exposure | Numeric (aggregated score)
Confidence in tuberculosis diagnosis (curability) | Declared level of confidence in the effectiveness of the diagnosis | One-hot (sputum test, X-ray, etc.)
Knowledge about tuberculosis treatment | Dosage | Numeric (aggregated score)
Duration of tuberculosis treatment | Knowledge about different treatment durations | One-hot (6 months, more than 6 months, 8 months, etc.)
The dependent variable is the level of knowledge about tuberculosis, and the knowledge score results from the weighted sum of correct answers.
The coding system may vary depending on the software used for data analysis, but the underlying logic remains consistent.
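To make the three coding types in Table 2 concrete, the sketch below applies ordinal, binary, and one-hot coding to a tiny made-up table with pandas. The column names, category labels, and rows are hypothetical, and only the two age groups spelled out in Table 2 are mapped.

```python
import pandas as pd

# Hypothetical respondents; column names and labels are illustrative only.
raw = pd.DataFrame({
    "gender": ["male", "female", "female"],
    "age_group": ["15-25", "26-44", "15-25"],
    "heard_about_tb": ["yes", "no", "yes"],
})

# Ordinal coding for age groups (only the groups listed in Table 2).
age_order = {"15-25": 1, "26-44": 2}
# Binary coding: 0 = No, 1 = Yes.
yes_no = {"no": 0, "yes": 1}

encoded = pd.DataFrame({
    "age_group": raw["age_group"].map(age_order),
    "heard_about_tb": raw["heard_about_tb"].map(yes_no),
})

# One-hot coding: one 0/1 indicator column per gender category.
gender_dummies = pd.get_dummies(raw["gender"], prefix="gender", dtype=int)
encoded = pd.concat([encoded, gender_dummies], axis=1)
print(encoded)
```

One-hot coding avoids imposing an artificial order on nominal variables such as town or marital status, whereas the ordinal map keeps the natural order of the age groups.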
4.4. Tuberculosis Knowledge Score and Modeling Approach
The tuberculosis knowledge score is calculated by summing the correct answers to a series of closed-ended questions.
This score can therefore take multiple integer values between 0 and 10, depending on the number of correct responses provided by each participant.
This score represents a graduated quantitative measure: each additional point reflects an actual gain in knowledge. It is based on a fixed-interval scale and may vary considerably between individuals, without natural thresholds or predefined categories.
This type of score is suitable for modeling as a continuous variable, as it meets the conditions of an interval scale and can be treated accordingly in a linear regression analysis.
Some might consider grouping the scores into three categories (low, medium, high), but such classification would result in a loss of information. The regression approach is therefore preferable for several reasons:
1) Preserving Data Granularity
a) Keeping the variable as numeric preserves the full richness of the data.
b) Categorizing into three levels would imply arbitrary thresholds, introducing methodological noise.
2) Analytical Objective: Predicting the Effect of Explanatory Variables on the Score
a) The goal is to measure the impact of predictive factors (e.g., gender, age, occupation, perception) on the knowledge level.
b) Regression provides estimated marginal effects (e.g., men score on average 0.8 points higher than women), which classification methods cannot deliver as precisely.
3) Statistical Power Gain
a) Treating the scores as continuous variables increases the sensitivity of statistical tests.
b) Less information loss = better ability to detect real effects.
4) Reducing Interpretation Bias
a) Class-based methods impose arbitrary thresholds that are difficult to justify empirically.
b) Regression yields coefficients directly interpretable on the original scale.
A classification approach would only be suitable if:
The objective was to categorize respondents according to policy or decision thresholds (e.g., urgent need for tuberculosis training if score < 4).
The model focused on probabilities of group membership rather than score progression.
But that is not the case here, where the objective is to measure the effect of predictors on the knowledge level measured by a score.
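The notion of a marginal effect on the original score scale can be illustrated with a small least-squares fit. The sketch below uses synthetic numbers chosen for the example (they are not survey data) and a single binary predictor, for which the fitted slope coincides with the difference between the two group means.

```python
import numpy as np

# Synthetic scores for illustration only; these are not survey data.
is_male = np.array([1, 1, 1, 0, 0, 0], dtype=float)
score = np.array([7.0, 8.0, 6.0, 6.0, 7.0, 5.0])

# Design matrix: an intercept column plus the binary predictor.
X = np.column_stack([np.ones_like(is_male), is_male])
coef, *_ = np.linalg.lstsq(X, score, rcond=None)
intercept, male_effect = coef

# With one binary predictor, the slope equals the gap between group means.
mean_gap = score[is_male == 1].mean() - score[is_male == 0].mean()
print(round(float(male_effect), 3), round(float(mean_gap), 3))
```

The coefficient is read directly in score points, which is the interpretability argument made above; a three-class model would not return an effect in these units.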
4.5. Proposed Methodology for Predicting Tuberculosis Knowledge Levels
1) Input
a) A dataset of 507 observations collected via a survey on knowledge of tuberculosis.
2) Output
a) Trained regression models capable of predicting knowledge levels.
b) Interpretability of the predictions, identifying the most influential features.
3) Step-by-Step Methodological Workflow
a) Import Libraries and Packages
Load Python libraries essential for data science (e.g., pandas, numpy, scikit-learn, TensorFlow, SHAP).
b) Load Dataset
c) Data Cleaning
d) Feature Engineering
Encode categorical variables using one-hot or label encoding.
Standardize numerical features to normalize their ranges (e.g., z-score).
e) Dataset Splitting
f) Model Selection
Define and prepare five candidate models for comparison:
Dummy Regressor (baseline)
Linear Regression
Random Forest Regressor
SVM Regressor (Support Vector Machine)
Artificial Neural Network (ANN)
g) Hyperparameter Tuning
h) Training and Validation
i) Performance Evaluation
Evaluate all models using five regression metrics:
R2 (explained variance)
RMSE (root mean squared error)
MAE (mean absolute error)
MSE (mean squared error)
MAPE (mean absolute percentage error)
j) Model Interpretability with SHAP
Use SHAP (SHapley Additive exPlanations) to interpret the best-performing model.
Identify the most influential variables.
Visualize their impact (positive or negative) on prediction outcomes.
This robust, multi-model regression pipeline ensures predictive accuracy and interpretability, making it suitable for public health decision-making related to tuberculosis knowledge dissemination. The inclusion of SHAP enhances transparency, allowing stakeholders to understand which factors most influence knowledge levels across the population.
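The workflow above can be sketched with scikit-learn as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the data are synthetic, the column names are hypothetical, the model settings are arbitrary defaults, and `MLPRegressor` stands in for the ANN.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the survey table (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 120
data = pd.DataFrame({
    "town": rng.choice(["Niamey 1", "Niamey 2", "Niamey 3"], size=n),
    "gender": rng.choice(["male", "female"], size=n),
    "age_group": rng.integers(1, 5, size=n),
})
# Synthetic 0-10 target loosely tied to age group, for demonstration only.
target = np.clip(4 + data["age_group"] + rng.normal(0, 1, size=n), 0, 10)

# Feature engineering: one-hot for nominal columns, z-score for numeric ones.
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["town", "gender"]),
    ("scale", StandardScaler(), ["age_group"]),
])

# The five candidate models listed in step f).
models = {
    "Dummy Regressor": DummyRegressor(strategy="mean"),
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "SVM Regressor": SVR(kernel="rbf"),
    "Artificial Neural Network": MLPRegressor(
        hidden_layer_sizes=(50,), max_iter=500, random_state=0
    ),
}

cv = KFold(n_splits=10, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    pipe = Pipeline([("prep", preprocess), ("model", model)])
    # scikit-learn reports errors as negated scores, so the sign is flipped.
    neg_rmse = cross_val_score(
        pipe, data, target, cv=cv, scoring="neg_root_mean_squared_error"
    )
    results[name] = -neg_rmse.mean()

for name, rmse in results.items():
    print(f"{name}: RMSE = {rmse:.3f}")
```

Wrapping the preprocessing and the model in one `Pipeline` means the encoder and scaler are refitted inside each fold, which avoids leaking information from the held-out fold into training.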
5. Results
5.1. Presentation of Indicators
The mean absolute error (MAE), mean squared error (MSE) and root mean squared error (RMSE) metrics are specific indicators that help quantify how accurate and reliable the model is in predicting TB knowledge levels, especially when working with continuous variables or regression tasks. The MAE is the average of the absolute differences between the actual values and the values predicted by the model. This gives an idea of the average error in absolute terms without taking into account the direction (positive or negative) of the error.
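$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \quad (1)$$
where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations.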
For the prediction of a knowledge score between 0 and 10 to assess knowledge of tuberculosis, the MAE measures the average difference between the actual knowledge score and the score predicted by the model. According to Table 4, the SVM regressor, the artificial neural network, and the random forest regressor record the three lowest MAEs, indicating that these models make small errors in predicting the knowledge level.
The MSE is the average of the squares of the errors, which penalizes large errors (those that are far from the true value) more because the errors are squared.
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \quad (2)$$
where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value.
For predicting a knowledge score, the MSE gives an idea of the spread of the errors. A low MSE means that the errors are generally small, whereas a high MSE indicates that some errors are large and that the predictions are unreliable. According to Table 4, the SVM regressor, the random forest regressor, and the artificial neural network yield the three smallest MSEs in predicting the knowledge level.
The RMSE is the square root of the MSE. It represents the average error in a more intuitive form by putting the error scale in the same unit as the target variable, which, in our case, is the knowledge score.
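$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \quad (3)$$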
The RMSE is a measure that is often more understandable than the MSE because it gives the average error in the same units as the target variable, which here is a knowledge score between 0 and 10. A low RMSE means that the model predictions are close to the actual values, and a high RMSE indicates that the model predictions deviate greatly from the actual values.
These indicators measure different aspects of model performance; the MAE, MSE, and RMSE all measure the accuracy of predictions (the lower the error, the better). For the tuberculosis knowledge score, a good prediction model must keep these errors small so that it neither overestimates nor underestimates respondents’ level of knowledge.
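As a worked example of the three error measures, the sketch below computes the MAE, MSE, and RMSE for four made-up pairs of actual and predicted scores. The numbers and function names are assumptions for illustration only.

```python
import math


def mae(actual, predicted):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean of the squared differences; large errors weigh more."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the MSE, expressed in score points."""
    return math.sqrt(mse(actual, predicted))


# Hypothetical knowledge scores (not survey data).
actual = [8, 6, 9, 5]
predicted = [7.5, 6.5, 8.0, 6.0]

print(mae(actual, predicted), mse(actual, predicted), rmse(actual, predicted))
```

Here the absolute errors are 0.5, 0.5, 1, and 1, so the MAE is 0.75, while squaring gives an MSE of 0.625: the two one-point errors account for most of the MSE, which is the penalizing effect described above.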
The Mean Absolute Percentage Error (MAPE) measures the average percentage difference between predicted values and actual (true) values. It is commonly used in regression analysis to assess forecast accuracy or model prediction error in percentage terms.
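$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \quad (4)$$
where:
$n$ = total number of observations,
$y_i$ = actual value,
$\hat{y}_i$ = predicted value.
MAPE expresses prediction errors as a percentage of the actual values, which makes it easy to interpret across datasets and domains.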
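A short worked example of the MAPE is given below, again on made-up scores. Because each error is divided by the actual value, the measure is undefined when an actual score is 0; the guard in the sketch is an assumption about how one might handle that case, not a description of the authors' procedure.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if any(a == 0 for a in actual):
        # Division by a zero actual value is undefined for this metric.
        raise ValueError("MAPE is undefined when an actual value is 0")
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical knowledge scores (not survey data).
actual = [8, 5, 10, 4]
predicted = [7, 5.5, 9, 5]

print(mape(actual, predicted))
```

The same one-point error counts for 12.5% on a score of 8 but 25% on a score of 4, so the MAPE weighs errors on low scores more heavily than the MAE does.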
5.2. Hyperparameter Optimization Results, Training Durations and Cross-Validation
Hyperparameter tuning was carried out using the RandomizedSearchCV method, as stated in step g) Hyperparameter Tuning. This random search process over the possible hyperparameter combinations was applied to all models, except the Dummy Regressor, for which only the baseline strategy was adjusted. Table 3 summarizes the training times and the best hyperparameter settings for each model.
Table 3. Hyperparameters and training times.
Model | Training Time (s) | Best Hyperparameters
Linear Regression | 0.0018 | {'fit_intercept': False}
Random Forest | 0.248 | {'max_depth': 10, 'min_samples_split': 10, 'n_estimators': 200}
SVM Regressor | 0.011 | {'C': 1, 'gamma': 'auto', 'kernel': 'rbf'}
Artificial Neural Network | 0.407 | {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (50,)}
ANN and Random Forest provide higher complexity and accuracy but require more training time. Linear and SVM models are more time-efficient but may perform worse in complex prediction tasks. This hyperparameter tuning confirms that model performance gains often come with a training time cost, especially for ensemble and neural network models.
1) Fastest Training
a) Linear Regression trained the fastest (0.0018 s), reflecting its simplicity and low computational cost.
b) SVM also trained quickly (0.011 s), despite using a non-linear kernel (rbf), due to its optimization efficiency on smaller datasets.
2) Longest Training
a) ANN had the longest training time (0.407 s), which is typical due to the iterative learning and complexity of neural networks.
b) Random Forest took 0.248 s, which is relatively efficient given the number of estimators (200) and tree depth.
3) Model Complexity & Parameters
a) The Random Forest model benefited from fine-tuned parameters that control overfitting (max_depth, min_samples_split) and model stability (n_estimators).
b) The SVM Regressor used a radial basis function kernel with gamma = ‘auto’, suitable for capturing non-linear relationships.
c) The ANN model was configured with a single hidden layer of 50 neurons, ReLU activation, and L2 regularization (alpha = 0.0001).
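A random search of this kind can be sketched with scikit-learn's `RandomizedSearchCV`. The example below tunes a random forest on synthetic data; the candidate values, the number of sampled combinations (`n_iter`), and the inner fold count are assumptions chosen for the illustration, since the search space actually used is not reported here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 5 + 2 * X[:, 0] - X[:, 1] + rng.normal(0, 0.5, size=100)

# Hypothetical search space covering the hyperparameters named in Table 3.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                       # number of sampled combinations
    cv=5,                           # inner cross-validation folds
    scoring="neg_mean_squared_error",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(f"refit time: {search.refit_time_:.3f} s")
```

`refit_time_` reports how long the final refit with the best settings took, which is one plausible way to obtain training times comparable across models; whether Table 3 was produced this way is not stated.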
5.3. Results and Interpretation
Quantitative approaches, such as machine learning models, are used to predict TB knowledge on the basis of multiple variables. These models can identify groups at risk of lack of knowledge and guide interventions more precisely.
Cross-validation was performed using the K-fold method, as indicated in step h) Training and Validation of the methodology. The number of folds (k) was 10, a standard choice that balances bias and variance while keeping each training and test subsample sufficiently representative. This is consistent with the cross-validation results presented in Table 4, which reflect the average performance across the different folds.
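To show what 10-fold splitting means for a sample of 507 observations, the sketch below builds the folds by hand. The helper name, the shuffling, and the seed are assumptions made for the example; library implementations may assign observations to folds differently.

```python
import random


def kfold_indices(n_samples, n_splits, seed=0):
    """Shuffle the indices and yield (train, test) index lists per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds receive one additional observation.
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        start += size
        yield train, test


folds = list(kfold_indices(507, 10))
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)
```

With 507 observations, seven folds hold 51 respondents and three hold 50, so each model is trained on roughly 456 observations and evaluated on the remaining fold, and every respondent is used for testing exactly once.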
Table 4. Performance with cross-validation.
Model | Best_R2 | Best_RMSE | Best_MAE | Best_MSE | Best_MAPE
Dummy Regressor | −0.000003 | 0.694385 | 0.554172 | 0.482171 | 9.973556
Linear Regression | 0.560736 | 0.701756 | 0.555878 | 0.492462 | 8.964243
Random Forest | 0.794882 | 0.597130 | 0.439553 | 0.356565 | 7.151338
SVM Regressor | 0.574384 | 0.537905 | 0.394127 | 0.289342 | 6.929802
Artificial Neural Network | 0.706541 | 0.606632 | 0.411178 | 0.368002 | 6.890633
The evaluation compared five models: Dummy Regressor, Linear Regression, Random Forest, Support Vector Machine (SVM) Regressor, and Artificial Neural Network (ANN), based on five performance metrics (R2, RMSE, MAE, MSE, and MAPE).
Random Forest achieved the highest explanatory power with an R2 of 0.795, showing strong ability to model the variance in knowledge levels.
SVM Regressor recorded the lowest prediction errors across RMSE (0.538), MAE (0.394), and MSE (0.289), indicating superior accuracy in absolute terms.
ANN provided a strong balance between explanation and precision, with an R2 of 0.707 and the best MAPE (6.89%), suggesting high reliability in percentage-based error.
Linear Regression performed poorly, with higher error rates and low explanatory power compared to non-linear models.
The Dummy Regressor, as expected, served as a weak baseline.
The best-performing models in this context are SVM Regressor and ANN, due to their lower error margins and robust predictive ability. Random Forest remains the top choice for interpretation and variance explanation. These results support a hybrid or ensemble strategy for enhanced accuracy and interpretability in public health knowledge modeling.
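The per-metric comparison can be reproduced directly from the values in Table 4. The short sketch below picks the best model for each metric (highest R2, lowest error otherwise) and checks that each reported MSE matches the square of the reported RMSE; the dictionary layout and variable names are choices made for this example.

```python
# Values transcribed from Table 4.
table4 = {
    "Dummy Regressor": {"R2": -0.000003, "RMSE": 0.694385, "MAE": 0.554172,
                        "MSE": 0.482171, "MAPE": 9.973556},
    "Linear Regression": {"R2": 0.560736, "RMSE": 0.701756, "MAE": 0.555878,
                          "MSE": 0.492462, "MAPE": 8.964243},
    "Random Forest": {"R2": 0.794882, "RMSE": 0.597130, "MAE": 0.439553,
                      "MSE": 0.356565, "MAPE": 7.151338},
    "SVM Regressor": {"R2": 0.574384, "RMSE": 0.537905, "MAE": 0.394127,
                      "MSE": 0.289342, "MAPE": 6.929802},
    "Artificial Neural Network": {"R2": 0.706541, "RMSE": 0.606632,
                                  "MAE": 0.411178, "MSE": 0.368002,
                                  "MAPE": 6.890633},
}

# R2 is better when higher; the four error metrics are better when lower.
higher_is_better = {"R2"}
best = {}
for metric in ["R2", "RMSE", "MAE", "MSE", "MAPE"]:
    pick = max if metric in higher_is_better else min
    best[metric] = pick(table4, key=lambda name: table4[name][metric])

# Consistency check: the MSE should equal the RMSE squared, up to rounding.
consistent = all(
    abs(row["RMSE"] ** 2 - row["MSE"]) < 1e-5 for row in table4.values()
)

print(best)
print(consistent)
```

No single model wins on every metric, which is why the choice between Random Forest, SVM, and ANN depends on whether explained variance or absolute error matters more for the intended use.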
5.4. Feature Importance and Interpretability
To understand what drives the knowledge score, we combined two complementary perspectives. The Random Forest’s global importance summarizes which variables contribute most to the model’s predictive power, while SHAP values reveal not only how important each variable is overall but also the direction of its effect for individual observations. Both views are based on the same final model and the same preprocessing pipeline, ensuring consistency.
Figure 1 shows that Treatment Duration is by far the dominant driver of the model’s predictions, with Town clearly in second position. Beyond these two, Age, Mild Illness, and Fasting Treatment contribute meaningfully, whereas Marital Status, Gender, Fatal Disease, Form of TB, and Heard about TB play smaller roles. This ranking reflects how much each feature helps the trees separate higher from lower knowledge scores; it does not, however, indicate whether a higher value pushes the prediction up or down.
Figure 1. Random forest global feature importance.
Figure 2. SHAP summary for the final model.
Figure 2 complements this picture by adding direction. Longer perceived Treatment Duration is associated with higher predicted knowledge, suggesting that respondents who recognize that treatment takes time tend to score better. Geographic context also matters: the Town variable shows heterogeneous effects, consistent with local differences in awareness and access to information. Two misconceptions, considering tuberculosis a Mild Illness and believing treatment should be taken on an empty stomach, are linked to lower predicted knowledge, while Age exerts only a modest, non-monotonic influence. Taken together, the SHAP summary reinforces the Random Forest ranking and clarifies how these predictors shape the outcome.
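To make the idea behind SHAP values concrete, the sketch below computes exact Shapley values for a small hypothetical model by enumerating all feature coalitions. The model `toy_predict`, its coefficients, and the background rows are invented for illustration; the study's figures were presumably produced with a SHAP library on the fitted model, which is not reproduced here.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, background):
    """Exact Shapley values of predict(x) relative to a background sample.

    Features outside a coalition take their values from background rows,
    and the prediction is averaged over those rows.
    """
    n = len(x)

    def value(coalition):
        total = 0.0
        for row in background:
            mixed = [x[j] if j in coalition else row[j] for j in range(n)]
            total += predict(mixed)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (value(s | {i}) - value(s))
        phi.append(contribution)
    return phi


def toy_predict(features):
    """Hypothetical score model: [treatment_duration, mild_illness, age]."""
    duration, mild, age = features
    return 4.0 + 0.3 * duration - 0.8 * mild + 0.1 * duration * (1 - mild) + 0.01 * age


# Invented respondents used as the reference population.
background = [[2, 1, 20], [6, 0, 35], [4, 1, 50], [6, 0, 28]]
x = [6, 0, 30]  # respondent to explain

phi = shapley_values(toy_predict, x, background)
for name, p in zip(["treatment_duration", "mild_illness", "age"], phi):
    print(f"{name}: {p:+.3f}")
```

The contributions sum to the difference between the prediction for `x` and the average prediction over the background rows, which is the additivity property that lets a SHAP summary show both magnitude and direction.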
5.5. Discussion of Error Distribution
Figure 3 presents histograms of the residuals (prediction errors = predicted value − actual value) for each model used to estimate knowledge levels on tuberculosis.
1) Dummy Regressor
a) The residuals are widely and irregularly spread around −1, with no clear central tendency.
b) This reflects the model’s inability to learn from the data, acting as a constant, non-learning baseline.
c) Distribution is non-normal and asymmetric, with high variance.
2) Linear Regression
a) The residuals are centered around 0, showing a symmetric, bell-shaped distribution.
b) However, the spread is relatively large, indicating higher error variance.
c) Some outliers are visible on both ends, suggesting sensitivity to extreme values.
3) Random Forest
a) The residuals show a narrow and steep distribution around 0.
b) The shape is more peaked (leptokurtic), suggesting high precision and low bias.
c) The curve is well-centered and exhibits fewer extreme residuals compared to others.
4) SVM Regressor
a) The error distribution is tightly centered around zero and almost perfectly symmetrical.
b) It demonstrates the least spread and smoothest distribution, confirming the model’s excellent generalization capacity.
c) Indicates minimal bias and consistent error performance across instances.
5) Artificial Neural Network (ANN)
a) The residuals are centered at 0 with a smooth, symmetrical distribution.
b) The spread is slightly broader than SVM but tighter than Linear Regression.
c) The ANN shows a balanced trade-off between accuracy and generalization, with relatively few extreme errors.
SVM Regressor and Random Forest display the most desirable residual patterns: tight, centered, and normally distributed, indicating high model reliability. ANN follows closely with a smooth and symmetrical distribution, slightly more dispersed. Linear Regression and Dummy Regressor show less optimal residual behaviors, with wider spreads and signs of underfitting or oversimplification.
The residual distribution analysis reinforces the numerical evaluation:
SVM and Random Forest offer the most consistent and accurate predictions.
ANN is also strong, with minimal bias and decent error variance.
Linear Regression fails to capture non-linearities, while Dummy Regressor confirms its role as a weak benchmark [39]-[41].
These visual diagnostics suggest that non-linear models with regularization and ensemble learning (SVM, Random Forest) are better suited for predicting complex patterns such as knowledge levels in public health datasets.
Figure 3. Error distribution.
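The visual reading of the histograms can be complemented with simple summary statistics. The sketch below computes the mean, spread, skewness, and excess kurtosis of residuals defined as predicted minus actual, as in the text; the function name `residual_summary` and the toy values are assumptions for illustration.

```python
from statistics import mean, pstdev


def residual_summary(y_true, y_pred):
    """Summarize residuals (predicted - actual) with moment-based statistics."""
    residuals = [p - t for t, p in zip(y_true, y_pred)]
    mu = mean(residuals)
    sigma = pstdev(residuals)
    # Standardized third and fourth moments; excess kurtosis is 0 for a normal curve.
    skewness = mean(((r - mu) / sigma) ** 3 for r in residuals)
    excess_kurtosis = mean(((r - mu) / sigma) ** 4 for r in residuals) - 3.0
    return {
        "mean": mu,
        "std": sigma,
        "skewness": skewness,
        "excess_kurtosis": excess_kurtosis,
    }


# Toy values: most predictions are exact, two are off by one point.
y_true = [5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 6.0, 6.5, 7.0, 8.0]
y_pred = [4.0, 5.5, 6.0, 6.5, 7.0, 7.5, 6.0, 6.5, 7.0, 9.0]
summary = residual_summary(y_true, y_pred)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A mean near zero indicates little systematic bias, a skewness near zero indicates symmetry, and a positive excess kurtosis corresponds to the peaked (leptokurtic) shape noted for Random Forest.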
5.6. Performance Visualization Results Based on the Metrics
Charts, Heatmaps, Bar Plots, and Radar Charts
As shown in Figure 4, Random Forest attains the highest R2 (0.795), indicating that it explains the largest share of variance, with ANN next (R2 = 0.707). SVM Regressor and Linear Regression perform at moderate levels (≈0.57 and 0.56), while the Dummy model is near 0, as expected. For error magnitude, Figure 4 also shows SVM achieving the lowest RMSE (0.538), with Random Forest and ANN remaining competitive (0.597 and 0.607). On absolute and percentage error, SVM leads on MAE (0.394, against 0.411 for ANN), while ANN has a marginally lower MAPE than SVM (6.89% against 6.93%). Random Forest is moderate (MAE = 0.44; MAPE = 7.15%), whereas Linear Regression and the Dummy model exhibit the largest errors (MAE > 0.55; MAPE ≈ 9% to 10%).
Figure 4. Performance visualization results: part I.
The radar chart in Figure 4 highlights SVM’s broad, balanced profile across normalized axes; ANN and Random Forest also perform strongly, while the Dummy and Linear models lag on error-related metrics. The heatmap and ranking panel in Figure 5 summarize these patterns: SVM ranks first on two of the four metrics (MAE and RMSE) and second on MAPE, ANN ranks first on MAPE and second on two metrics (R2 and MAE), and Random Forest leads on R2, suggesting strong explanatory power but not necessarily the lowest errors. The trade-off plot (R2 vs MAPE) in Figure 5 further illustrates that ANN offers the best balance between high R2 and low MAPE; Random Forest provides excellent variance explanation with slightly higher MAPE, and SVM minimizes errors while accepting a lower R2 than ANN.
Figure 5. Performance visualization results: part II.
SVM Regressor delivers the most accurate predictions in terms of absolute error reduction, whereas ANN offers the best overall balance across metrics [42]. Random Forest is preferable when explanatory power and variance explanation are prioritized. Linear Regression and the Dummy model are not adequate for reliable prediction in this context. Depending on priorities (lowest errors vs. best balance), SVM or ANN is recommended; a stacking ensemble combining SVM and ANN could further enhance robustness.
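The per-metric rankings discussed above can be reproduced directly from the values in Table 4. The sketch below ranks the models on each metric and then averages the ranks; the mean-rank summary is an illustrative choice made here, not the procedure used to build the ranking panel in Figure 5.

```python
# Cross-validated metrics transcribed from Table 4.
table4 = {
    "Dummy Regressor": {"R2": -0.000003, "RMSE": 0.694385, "MAE": 0.554172, "MAPE": 9.973556},
    "Linear Regression": {"R2": 0.560736, "RMSE": 0.701756, "MAE": 0.555878, "MAPE": 8.964243},
    "Random Forest": {"R2": 0.794882, "RMSE": 0.597130, "MAE": 0.439553, "MAPE": 7.151338},
    "SVM Regressor": {"R2": 0.574384, "RMSE": 0.537905, "MAE": 0.394127, "MAPE": 6.929802},
    "Artificial Neural Network": {"R2": 0.706541, "RMSE": 0.606632, "MAE": 0.411178, "MAPE": 6.890633},
}

# R2 is better when higher; the error metrics are better when lower.
HIGHER_IS_BETTER = {"R2"}


def rank_models(table, metric):
    """Return model names ordered from best to worst on one metric."""
    return sorted(table, key=lambda m: table[m][metric], reverse=metric in HIGHER_IS_BETTER)


rankings = {metric: rank_models(table4, metric) for metric in ["R2", "RMSE", "MAE", "MAPE"]}

# Average position of each model across the four metrics (1 = best).
mean_rank = {
    model: sum(order.index(model) + 1 for order in rankings.values()) / len(rankings)
    for model in table4
}

for metric, order in rankings.items():
    print(f"{metric}: best = {order[0]}")
for model, rank in sorted(mean_rank.items(), key=lambda item: item[1]):
    print(f"{model}: mean rank {rank:.2f}")
```

A different weighting of the metrics would change the aggregate order, which is why the recommendation depends on whether the lowest errors or the best overall balance is the priority.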
5.7. Actual vs. Predicted Results
Figure 6 places the observed values against the model predictions to show how well each method tracks reality. The Dummy Regressor (R2 = −0.0032) collapses into a flat band, which simply confirms that it does not learn from the data and performs worse than predicting the mean.
With Linear Regression (R2 = 0.2358), Figure 6 shows a gentle upward trend toward the 45° line, but the cloud of points spreads widely, signaling limited variance capture and sensitivity to outliers and heteroscedasticity. The Random Forest panel is noticeably tighter around the perfect-prediction line; its R2 of 0.4104 reflects the strongest alignment between actual and predicted values among all candidates.
Figure 6. Actual vs. predicted results.
The SVM Regressor (R2 = 0.3461) appears more structured than Linear Regression in Figure 6, yet a visible spread remains, indicating good generalization with slightly less fidelity than Random Forest. The Artificial Neural Network (R2 = 0.3438) also pulls predictions closer to the reference line than the linear baseline, landing roughly on par with SVM and just behind Random Forest.
Overall, Figure 6 supports a clear hierarchy: Random Forest offers the best balance of explanatory power and reliability, followed by SVM and ANN. Linear Regression and the Dummy model lag behind, consistent with their limited ability to capture the non-linear patterns present in this task.
6. Discussion
This study introduced a numerical score to assess the level of knowledge of tuberculosis and used machine learning algorithms to validate and optimize this score. The results show that the numerical score allows a more accurate and granular assessment of knowledge than traditional methods based on categorical responses do. The machine learning algorithms predicted the level of knowledge from questionnaire responses with good accuracy: under cross-validation, Random Forest reached an R2 of 0.795, and the SVM Regressor and ANN kept MAPE below 7%.
Our study provides a methodological innovation by combining a numerical score with machine learning techniques to assess knowledge of tuberculosis. Unlike traditional methods, which are limited to qualitative or categorical assessments, our approach allows a quantitative and aggregated measurement of the level of knowledge. In addition, the use of machine learning algorithms made it possible to predict the level of knowledge on the basis of questionnaire responses via different indicators. This method can be adapted to other areas of public health where the overall assessment of knowledge is crucial.
Our results are consistent with those of previous studies that identified gaps in TB knowledge, particularly in populations with low education levels or those living in rural areas [43]. However, unlike existing studies that rely on qualitative methods or categorical questionnaires, our numerical score offers a more precise and reproducible measure. For example, a recent study reported that only 50% of 1,200 randomly surveyed Hainan University students knew specific prevention methods and that 60% believed that TB could be completely cured [44]. Our score, in contrast, allows for more nuanced distinctions between knowledge levels, identifying specific subgroups in need of tailored interventions.
Several limitations should be considered. First, the numerical score relies on a self-report questionnaire, which may introduce social desirability bias. Second, although machine learning algorithms have shown high performance, their application requires quality data and technical expertise, which may limit their use in some contexts. Third, our study was conducted in a specific population, and the generalizability of the results to other contexts requires further validation.
To overcome these limitations, future research could explore more advanced machine learning techniques, such as deep neural networks, to enable a more fine-grained analysis of the factors influencing the level of knowledge. In addition, longitudinal studies are needed to assess how TB awareness efforts affect the numerical score over time.
7. Conclusions
Prediction plays a crucial role in the management of tuberculosis, facilitating early detection, reducing transmission, improving treatments, and contributing to a more effective health response to this disease. Predictive approaches could change how the disease is managed globally. For these changes to occur, it is necessary to invest in data collection and analysis technologies, strengthen cooperation between public and private health institutions, and build resilient health systems capable of responding quickly to the challenges posed by the disease. This change, although complex, could have a major impact on reducing the incidence of tuberculosis by increasing the rate of early detection, improving treatments, and limiting the spread of the disease.
This study made it possible to assess the level of knowledge about tuberculosis among the population of Niamey. It provides a baseline on which subsequent studies can build in order to improve a population’s knowledge of tuberculosis.
This study successfully demonstrates the potential of supervised machine learning algorithms to predict individuals’ level of knowledge about tuberculosis based on survey data. By comparing multiple regression models including Linear Regression, Support Vector Machine (SVM), Random Forest, and Artificial Neural Networks (ANN), the analysis highlights the effectiveness of non-linear approaches in capturing complex patterns within the data.
Among all tested models, Random Forest achieved the best performance in terms of variance explanation (R2 = 0.795), while the SVM Regressor and ANN stood out for their low prediction errors, with MAPE values below 7%. These results confirm the added value of advanced algorithms in enhancing the predictive accuracy of public health assessments. A value of k = 10 was used for stratified cross-validation.
Hyperparameter tuning was conducted via RandomizedSearchCV, only on the training set. No data leakage occurred: the data split, step sequencing, and final evaluation all adhered to best practices in predictive modeling.
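A leakage-free tuning setup of the kind described above can be sketched as follows with scikit-learn. The synthetic data, the choice of an SVR pipeline, and the search ranges are assumptions made for illustration; the study's actual search space and preprocessing steps are not given in this section.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the survey features and the knowledge score (assumption).
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# Hold out the test set before any tuning takes place.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scaling sits inside the pipeline, so it is refit on each training fold
# and never sees the validation fold or the test set.
pipeline = Pipeline([("scale", StandardScaler()), ("svr", SVR())])

search = RandomizedSearchCV(
    pipeline,
    param_distributions={
        "svr__C": loguniform(1e-1, 1e3),        # hypothetical search range
        "svr__epsilon": loguniform(1e-3, 1e0),  # hypothetical search range
    },
    n_iter=10,
    cv=KFold(n_splits=10, shuffle=True, random_state=0),
    scoring="r2",
    random_state=0,
)
search.fit(X_train, y_train)  # tuning uses the training set only

# The held-out test set is scored once, after tuning is finished.
test_r2 = search.score(X_test, y_test)
print("best parameters:", search.best_params_)
print(f"held-out R2: {test_r2:.3f}")
```

Keeping every fitted preprocessing step inside the pipeline is what prevents information from validation folds or the test set from influencing the tuned model.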
Moreover, the use of SHAP interpretability techniques made it possible to identify the most influential features contributing to knowledge prediction, ensuring transparency and facilitating actionable insights for public health stakeholders.
In light of these findings, this work paves the way for data-driven decision-making in tuberculosis education and awareness campaigns; for personalized public health strategies targeting populations with the lowest predicted knowledge levels; and for scalable AI-based screening tools in epidemiological surveys.
Further research could enrich the model by integrating behavioral, spatial, or clinical data, and by deploying real-time applications in community-based health programs. Ultimately, this study underscores the promise of explainable artificial intelligence in promoting precision public health, especially in resource-limited settings where tuberculosis remains a major concern.