1. Introduction
Thalassemia, a hereditary chronic hemolytic disease, is caused by the deficiency or mutation of globin genes that impede the synthesis of hemoglobin [1] . It was first discovered and named by Thomas Cooley and Pear Lee, Italian researchers, along the coast of the Mediterranean Sea in 1925. According to the statistical data of the World Health Organization (WHO) in 2008, about 300,000 to 400,000 thalassemia patients are born worldwide each year, accounting for 17% of the global population as carriers of thalassemia genes [2] . Approximately 18.7% of beta-thalassemia major neonates require regular blood transfusions to sustain life, and about 10% of affected children die in the neonatal period. The mortality rate of children under five years old is as high as 3.4%, posing a significant threat to people’s health [3] .
As a monogenic hereditary disease, thalassemia is widely distributed in parts of Africa, the Middle East, and Asia. However, there are significant variations in the screening programs for thalassemia due to differences in the level of medical development in different countries and regions [4] . Common screening methods include single-factor analysis and combined screening. Additionally, even with the same screening protocol, each country may set different thresholds for blood parameters based on regional influences [5] . For example, the common hematological parameter, Mean Corpuscular Volume (MCV), has a threshold of 80 FL in the Yunnan region of China, while it is set at 82 FL in other regions, resulting in regional variations in screening outcomes [6] .
Research on screening for thalassemia patients can be divided into three stages. In the first stage, due to the underdevelopment of the medical field, screening methods mainly relied on medical tests or post-onset blood parameter screening, lacking systematic mathematical data collection and analysis methods [7] . The second stage introduced the application of statistical methods, where screening methods for thalassemia were primarily based on the statistical results of certain indicators [8] , such as MCV, MCH, and HbA2. However, this stage still remained at the stage of manual screening and statistical analysis, thus increasing the possibility of misdiagnosis and risk index to some extent. In the third stage, screening methods based on machine learning models gradually emerged. For example, Yi-Kai used Support Vector Machine (SVM) to differentiate between beta-thalassemia and non-beta-thalassemia microcytic anemia [6] . However, the data studied by Yi-Kai did not start from the perspective of gene detection, but from the perspective of blood, resulting in a lower algorithm accuracy rate.
Thalassemia is a prevalent and debilitating genetic disorder in the local population, with the severity of symptoms increasing with the accumulation of gene deletions [9] . Individuals with severe thalassemia have a short lifespan, and if identified and addressed during early pregnancy, measures can be taken to control the birth of children with severe thalassemia, reducing unnecessary suffering and loss. Therefore, considering the characteristics of existing machine learning algorithms and techniques, this study proposes the construction of a warning model for thalassemia screening based on machine learning algorithms, as well as further research on risk factors. This has significant academic and practical implications.
2. Materials and Methods
The data used in this study were sourced from real clinical records at a hospital in the Guangxi Zhuang Autonomous Region, China. The dataset consists of a total of 60 individuals’ genetic samples, with each sample containing 110 different genes, resulting in a total of 110 observed indicators or variables. Prior to conducting data analysis, strict privacy protection measures were taken. All personally identifiable information that could identify patient identities was removed, and the data underwent de-identification procedures.
Due to the relatively small sample size and high dimensionality of the data used in this study, issues such as data sparsity and distance calculation pose significant challenges for all machine learning methods [10] . This is commonly referred to as the “curse of dimensionality” [11] .
In this context, dimension reduction is considered an important approach [12] . It involves a mathematical transformation that converts the original high-dimensional attribute space into a lower-dimensional “subspace” to identify more suitable observed variables for modeling. Dimension reduction effectively reduces the data’s dimensionality, improves model training efficiency, and better addresses the curse of dimensionality. Next, we will provide a detailed introduction to the dimension reduction method and machine learning model used in this article.
2.1. Principal Component Analysis
Principal Component Analysis (abbreviated as PCA) is a widely used data dimensionality reduction algorithm [13] . Its main idea is to map the original n-dimensional features onto a new k-dimensional space, which is composed of entirely new orthogonal features, also known as principal components.
The process of PCA involves sequentially searching for a set of mutually orthogonal axes, where the selection of these new axes is closely related to the data itself [14] . The first new axis chosen is the direction of maximum variance in the original data. The second new axis is then selected as the direction of maximum variance in the plane orthogonal to the first axis. The third axis is selected as the direction of maximum variance in the plane orthogonal to the first two axes, and so on, until we obtain n such axes.
By following this approach, most of the variance is captured by the first k axes, while the remaining axes contain almost no variance. Therefore, we can ignore the remaining axes and only retain the first k axes that contain the majority of the variance [14] [15] . In practice, this means keeping the feature dimensions that capture the significant variance and disregarding the ones with negligible variance, thus achieving dimensionality reduction of the data features.
The algorithmic steps of PCA are shown in Algorithm 1.
Algorithm 1. PCA.
2.2. Partial Least Squares Regression
Partial Least Squares Regression (abbreviated as PLS) is a commonly used statistical analysis method for finding the relationship between independent and dependent variables [16] . It combines the characteristics of principal component analysis and canonical correlation analysis, as well as linear regression analysis. PLS Regression can effectively handle problems such as multicollinearity and small sample size among independent variables.
In the process of PLS, a new space is created by projecting the independent and dependent variables onto it. This new space is characterized by principal components, which are new variables obtained through linear transformations of the original independent variables. The method is effective at handling issues such as multicollinearity and small sample size among independent variables. The parameters of PLS are estimated by minimizing the sum of squared residuals, resulting in the establishment of a linear regression model.
In comparison to traditional multiple linear regression models, PLS exhibits the following distinctive features [17] :
1) When there is severe multicollinearity among independent variables, traditional regression models may encounter issues. However, PLS can handle regression modeling in such cases and reduce the impact of collinearity among independent variables on the results.
2) In situations where the number of data points is fewer than the number of variables, traditional regression analysis methods may suffer from overfitting problems. On the other hand, PLS can perform regression modeling under such conditions, improving the stability and reliability of the model.
3) The regression coefficients in PLS are more interpretable for each independent variable, facilitating a better understanding of the relationship between the independent and dependent variables.
In summary, PLS performs well in addressing challenges such as multicollinearity and small sample size, while also providing more interpretable regression coefficients [18] .
2.3. Logistic Regression
Logistic Regression (abbreviated as LR) is a classical statistical learning method commonly used to solve binary classification problems [19] . It predicts the probability of a sample belonging to a certain category by establishing a LR model [20] . For example, it can be used to predict the likelihood of a user purchasing a certain product, a patient having a particular disease, or a user clicking on a certain advertisement.
The LR model is based on the concept of linear regression [20] , but with a special function transformation known as the logistic or sigmoid function. This conversion maps the output to probability values between 0 and 1. The function that LR aims to fit is as follows:
(1)
The cost function of LR is represented by the following equation:
(2)
The dependent variable y can only take values of 0 or 1, while it is difficult for the independent variable x to truly reach positive or negative infinity. Therefore, the range of
being (0, 1), without truly reaching the two endpoint values, suggests that the cost function can be considered as a conditional function:
(3)
To find the minimum points of the cost function, we need to equate the partial derivatives to zero. Here, we take the partial derivative of
:
(4)
In classification problems, overfitting can easily occur when the sample size is too small. To achieve reasonable fitting results, there are two methods: The first method is reducing the number of parameters or limiting the coefficient values within a certain range [21] . Regularization is an example of the second method. At this point, the cost function can be rewritten as:
(5)
This can limit the size of θ. It should be noted that when λ is too large, θ can become too small, resulting in underfitting. When λ is too small or even zero, θ can become too large, resulting in overfitting. Therefore, it is important to adjust the appropriate value of λ.
3. Modeling and Result
3.1. Experimental Environment
This study was conducted on the Windows 11 operating system using MATLAB version 2023a. The computer was equipped with an Intel® Core TM i7-12700H processor and 24GB of RAM.
To prevent overfitting, this study randomly divided the data into training and testing sets in a 7:3 ratio. The training set was used to train the model using leave-one-out cross validation [22] , and the performance of the model was evaluated on the testing set to validate its predictive effectiveness. Next, we will provide a detailed introduction to the modeling process.
3.2. Modeling
3.2.1. PCA-LR
1) PCA Dimension
Due to the high dimensionality (110 dimensions) and relatively small size of the data used in this study, it is necessary to perform dimensionality reduction before modeling. By calculating the contribution rate of each principal component, we can determine how many principal components to retain while minimizing information loss.
As shown in Figure 1, as the number of variables increases, the cumulative contribution rate of the first n principal components gradually increases. When the number of variables reaches 14, the cumulative contribution rate of the first 14 principal components has already reached 81.25%. Generally, when the cumulative contribution rate exceeds 80%, a sufficient amount of sample information has been extracted. Therefore, we use the first 14 principal components for modeling analysis.
Visualizing the distribution of different types of patients on the first two principal components, as shown in Figure 2. By observing the distribution in the figure, it can be seen that the majority of patients with thalassemia are concentrated on the right side of the plot, while normal patients are mainly distributed on the left side. This indicates that there is a certain difference in spatial distribution between normal patients and thalassemia patients on the first two principal components.
2) PCA-LR Modeling
The reduced-dimensional data from the previous section is used as the new independent variable, and the patient’s condition is input as the response variable into a logistic regression model for modeling.
In the process of logistic regression modeling, two important hyper parameters need to be considered: the highest degree of interaction term (degree) and
Figure 2. Distribution map of different patients.
the regularization coefficient λ. This section aims to optimize these two parameters and evaluate the model’s accuracy on the test set.
We will observe the performance of the model in terms of accuracy based on different values of the regularization coefficient λ and the highest degree of interaction term (degree) ranging from 1 to 4.
Figure 3 shows the test accuracy curve of the LR model with degree = 1 and λ ranging from 1 to 105. It can be seen from the figure that the model’s prediction accuracy remains unchanged at 80% as λ increases, and the accuracy is stable. Figure 4 shows the boundary curve (red line in the figure) of the model trained under the first two principal components when the degree value is 1 and the regularization coefficient λ is 50. After verification, it was found that the model’s boundary remained unchanged regardless of the value of λ, which is also the reason why the accuracy remained unchanged.
Figure 5 shows the test accuracy curve for different values of λ ranging from 1 to 2000 when degree is set to 2. As λ increases, the accuracy first decreases and then stabilizes. When
, the test accuracy reaches its highest point at 87.5%. At λ around 400, the accuracy is 82.5%. When
, the accuracy is 75%.
Figures 6-8 respectively show the decision boundaries for λ values of 40, 400, and 1200 when degree is set to 2. As λ increases, the decision boundary becomes more and more “elliptical” as can be seen from the graphs, resulting in a batch of misclassified samples and decreasing classification accuracy towards the end.
Figure 9 shows the test accuracy curve for λ values ranging from 1 to 3000
Figure 4. Boundary visualization plot (degree = 1, lambda = 50).
Figure 6. Boundary visualization plot (degree = 2, lambda = 40).
Figure 7. Boundary visualization plot (degree = 2, lambda = 400).
Figure 8. Boundary visualization plot (degree = 2, lambda = 1200).
when degree is set to 3. From the graph, it can be observed that as λ increases, the accuracy initially increases and then decreases. The highest test sample accuracy of 87.5% is achieved at around
. At approximately
, the accuracy is 82.5%. For
, the accuracy is 80%.
Figures 10-12 respectively illustrate the decision boundaries for λ values of 50, 1200, and 1700 when degree is set to 3. From the graphs, it can be seen that as λ increases, the decision boundary becomes more and more insensitive to the points on the boundary. This leads to a batch of misclassified samples and decreasing classification accuracy towards the end.
Figure 13 shows the test accuracy curve for λ values ranging from 1 to 10000 when degree is set to 4. From the graph, it can be observed that as λ increases, the accuracy initially increases and then decreases. The highest test sample accuracy of 87.5% is achieved at around
. At approximately
, the accuracy is 85%. For
, the accuracy is 85%.
Figures 14-16 respectively illustrate the decision boundaries for λ values of 100, 3000, and 6000 when degree is set to 4. From the graphs, it can be seen that as λ increases, the decision boundary becomes more and more curved, resulting in a poorer classification of the points on the boundary and decreasing classification accuracy.
3.2.2. PLS Modeling
The plsregress function in MATLAB can be used to implement PLS, with the following syntax:
[Xloadings, Yloadings, betaPLS, PCTVAR] = plsregress(X, y, dims)
Figure 10. Boundary visualization plot (degree = 3, lambda = 50).
Figure 11. Boundary visualization plot (degree = 3, lambda = 1200).
Figure 12. Boundary visualization plot (degree = 3, lambda = 1700).
Figure 14. Boundary visualization plot (degree = 4, lambda = 100).
Figure 15. Boundary visualization plot (degree = 4, lambda = 3000).
Figure 16. Boundary visualization plot (degree = 4, lambda = 6000).
In this syntax, dims represent the number of components selected for analysis. PCTVAR can be used to calculate the proportion of dependent variable y explained by the first i components.
Figure 17 illustrates the proportions of the dependent variable explained by the first 30 components. From the graph, it can be seen that using 10 or more principal components explains over 90% of the variation in the dependent variable.
Due to the large number of variables and severe multicollinearity among them in the data used in this study, not all variables are suitable for modeling. Therefore, we need to select the most important variables for modeling and prediction. Some studies have suggested that Variable Importance in Projection (abbreviated as VIP) can be used to select predictive variables [23] , where variables with a VIP score greater than 1 are considered important for predicting the PLS regression model.
The VIP value distributions of different variables are shown in Figure 18, where the red crosses correspond to variables with VIP values greater than 1 and the blue dots correspond to variables with VIP values less than 1. The 30 variables selected by VIP value screening are listed in Table 1.
Using these 30 variables as independent variables and whether a patient has Mediterranean disease as the dependent variable, they are inputted into the PLS model. By gradually increasing the dimensionality (“dims” parameter), the performance of the PLS model can be observed to change, as shown in Figure 19. It can be seen that the model performs best when 4 components are selected, with a corresponding cross-validation accuracy of 92.5%.
3.3. Results Comparison
The accuracy of PCA-LR under different degree conditions and the accuracy of the PLS model are presented in Table 2. It can be observed that the Partial Least Squares Regression (PLS) model demonstrates the highest accuracy among the
Figure 19. Performance curve of the PLS model.
Table 2. Accuracy comparison of different models.
compared models, reaching 92.5%. This indicates that for the dataset under discussion, the PLS model is more effective in capturing the underlying patterns and relationships.
The PCA-LR model shows an interesting trend with changes in the polynomial degree used. When moving from a first-degree polynomial to a second-degree polynomial, there is a significant increase in accuracy. This suggests that introducing non-linear transformations (by increasing the polynomial degree) can significantly enhance the model’s ability to better fit the data. However, when the polynomial degree increases from 2 to 3 and 4, there is no further improvement in accuracy (remaining at 87.5%). This plateau effect indicates that beyond a certain level of complexity (in this case, degree = 2), increasing model complexity does not necessarily equate to better performance. This may be because the model has already captured most of the variance in the data with a second-degree polynomial, and additional degrees only add complexity without improving the model’s predictive capability.
4. Conclusions
PLS is an excellent modeling algorithm that is suitable for small samples and high-dimensional data, and it can also handle multicollinearity issues. In this study, we utilized the PLS model for the discrimination and diagnosis of Mediterranean anemia in the Guangxi region. To ensure the reliability of the model, we employed leave-one-out cross-validation and split validation set methods for modeling analysis. The results show that the model established has a high accuracy rate, demonstrating the effectiveness of this method.
Due to the limitations of the sample data, this paper did not explore its application in the diagnosis of different subtypes and clinical stages of Mediterranean anemia, which is a direction for future research. Additionally, research on how to integrate this algorithm into existing medical systems or mobile health applications to enhance its practicality and convenience can also be considered.