A Hybrid Ensemble Learning Approach Utilizing Light Gradient Boosting Machine and Category Boosting Model for Lifestyle-Based Prediction of Type-II Diabetes Mellitus

Addressing classification and prediction challenges, tree ensemble models have gained significant importance. Boosting ensemble techniques are commonly employed for forecasting Type-II diabetes mellitus. Light Gradient Boosting Machine (LightGBM) is a widely used algorithm known for its leaf-wise growth strategy, loss reduction, and enhanced training precision. However, LightGBM is prone to overfitting. In contrast, CatBoost utilizes balanced base predictors known as decision tables, which mitigate overfitting risks and significantly improve testing-time efficiency. CatBoost’s algorithm structure counteracts gradient boosting biases and incorporates an overfitting detector to stop training early. This study focuses on developing a hybrid model that combines LightGBM and CatBoost to minimize overfitting and improve accuracy by reducing variance. Bayesian hyperparameter optimization is used to find the best hyperparameters for the underlying learners. By fine-tuning the regularization parameter values, the hybrid model effectively reduces variance (overfitting). Comparative evaluation against the LightGBM, CatBoost, XGBoost, Decision Tree, Random Forest, AdaBoost, and GBM algorithms demonstrates that the hybrid model achieves the best F1-score (99.37%).


Introduction
Type-II diabetes mellitus (T2DM) represents a formidable global health challenge. This chronic metabolic disorder is characterized by high blood glucose levels, resulting from a combination of insulin resistance and inadequate insulin production. The condition is escalating at an alarming rate worldwide, presenting a severe public health crisis due to its long-term complications, including cardiovascular disease, kidney damage, and vision loss, among others. The increasing prevalence and consequential impact of T2DM on global health underscore the urgency for accurate and early prediction models, which are pivotal for preventive measures, timely interventions, and resource allocation in healthcare systems [1].
Identifying individuals at high risk for T2DM has traditionally relied on the assessment of various lifestyle and physiological indicators. Factors such as dietary habits, physical activity levels, and anthropometric measurements play a significant role in determining an individual's risk profile. However, the predictive challenges of T2DM are multifaceted, owing to the complex interplay of these risk factors, necessitating more sophisticated analytical methods capable of capturing the nuanced relationships inherent in patient data.
In this context, machine learning (ML) techniques have emerged as a revolutionary tool in predictive healthcare, offering nuanced insights drawn from large-scale datasets. Tree ensemble models, particularly boosting algorithms, have garnered considerable interest for their superior performance in classification tasks. These algorithms work by iteratively refining weak learners, thereby establishing robust models that can navigate the intricate patterns associated with T2DM risk factors [2]. Specifically, LightGBM and CatBoost, two state-of-the-art boosting algorithms, have marked a significant advancement in this domain. LightGBM optimizes the traditional gradient boosting framework by employing a unique leaf-wise growth strategy, offering an efficient and highly precise model [3]. Despite its benefits, LightGBM can succumb to overfitting, particularly with complex datasets, limiting its practical applicability. Conversely, CatBoost addresses some of these limitations by integrating an advanced system of balanced decision tables and an intrinsic overfitting detector, enhancing model reliability and execution efficiency [4]. Nevertheless, CatBoost requires careful hyperparameter tuning to ensure optimal performance, presenting challenges in model optimization.
Given the individual strengths and limitations of LightGBM and CatBoost, this study introduces a novel hybrid model that synergizes the capabilities of both algorithms.

Methods of Prediction
This section presents previous research conducted on the prediction and detection of Type II Diabetes Mellitus (TIIDM) using machine learning and ensemble learning techniques. The researchers discussed the algorithms, datasets, and methodologies employed in their studies. Experimental methods used in recent scientific studies have shown how important lifestyle, demographic, psycho-social, and genetic risk factors are in the early detection, prevention, and management of diabetes, especially type 2 diabetes [6]-[11].
Zhang L et al. [12] developed a framework for TIIDM utilizing machine learning and ensemble learning methods, including Logistic Regression (LR), Classification and Regression Tree (CART), Artificial Neural Network (ANN), Support Vector Machine (SVM), Random Forest (RF), and Gradient Boosting Machine (GBM). They analyzed 36,652 cases and 10 different lifestyle factors from a rural Henan cohort in China. When compared to the other classifiers, GBM performed the best.
Ganie SM et al. [13] proposed a TIIDM prediction model using machine learning techniques. Their dataset consisted of 1939 records with 11 biological and lifestyle parameters. Various machine learning algorithms, such as Bagged Decision Trees, Random Forest, Extra Trees, AdaBoost, Stochastic Gradient Boosting, and Voting (Logistic Regression, Decision Trees, Support Vector Machine), were employed. The greatest accuracy among these classifiers, 99.14%, was achieved by Bagged Decision Trees. Kopitar L et al. [8] implemented a machine learning system for Type I and Type II Diabetes Mellitus that employs an ensemble learning technique to track glucose levels based on independent features. They used data from 27,050 cases and 111 attributes gathered from patients at 10 different Slovenian healthcare facilities that focused on preventative medicine. For this framework, 59 variables were selected after preprocessing and feature engineering. When compared to the other classifiers, LightGBM achieved better results across the board, including better accuracy, precision, recall, AUC, AUPRC, and RMSE.
Ahmed S et al. [9] proposed a machine learning model for the prediction of cardiovascular disease using self-augmented datasets of heart patients and various machine learning models. CatBoost outperformed the other models, achieving an accuracy of 87.93%, followed by LightGBM (86.21%), HGBC (84.48%), and XGBoost (83.78%).
Using a variety of machine-learning classifiers such as k-nearest neighbors, decision trees, AdaBoost, naive Bayes, XGBoost, and multi-layer perceptrons, Hasan MK et al. [14] created a solid framework for TIIDM. They used EDA for tasks including outlier detection, missing-value completion, data standardization, feature selection, and result validation. With a sensitivity of 0.789, a specificity of 0.934, a false omission rate of 0.092, a diagnostic odds ratio of 66.234, and an AUC of 0.950, the ensemble classifiers AdaBoost and XGBoost performed the best.
Rawat V et al. [11] used five machine learning methods for predicting and analyzing patients with diabetes mellitus: AdaBoost, Logic Boost, Robust Boost, Naive Bayes, and Bagging.The PIMA Indian Diabetes Dataset was used, which was found in the UCI machine learning library.Bagging and AdaBoost techniques yielded 79.69 and 81.77 percent accuracy in classification, respectively.
As can be seen from the aforementioned body of work, investigating lifestyle and biological data can aid in the early detection of Type II Diabetes Mellitus.
With this method, doctors will be able to make more informed judgments about diabetes treatment in real time, which could decrease the need for hospital readmissions, clinical laboratory visits, and the overall cost of health checks. Moreover, such a system can benefit patients and individuals at risk by enabling early prediction and delaying the onset of the disease. Unawareness and under-resourced healthcare systems have resulted in a considerable number of individuals, approximately 232 million [15], being unaware of their diabetes status.
Providing technological assistance to the general population can significantly address this issue.

Boosting Ensemble Learning
Ensemble learning is an efficient approach that uses various base learners to boost prediction and classification accuracy [16]. Each base learner, which produces a model from a collection of labeled inputs, contributes to the overall prediction by considering different training sets and feature sets. The key concept behind ensemble learning is that errors made by individual base learners can be compensated for by the collective knowledge of the ensemble [17]. The overall objective of the learning process is to improve the effectiveness of the weak learners [18]. This is accomplished by aggregating the predictions of the individual models, either by combining the results or through voting. In addition, every model in the ensemble is an improved and adapted version of its predecessor, assigning more weight to misclassified samples in subsequent estimations [19]. Notably, a number of boosting approaches have been developed over the years to improve performance, including AdaBoost, Gradient Boosting Machine (GBM), Light Gradient Boosting Machine (LightGBM), Extreme Gradient Boosting (XGBoost), and Category Boosting (CatBoost), the last being particularly suited for handling categorical data.
Freund and Schapire first presented AdaBoost in 1996, and since then, it has garnered substantial reputation in data mining and machine learning fields [20].
In AdaBoost, the base learner is trained using a training set, and the sample distribution is adjusted based on the performance of the base learner [21]. The algorithm assigns more attention to incorrectly predicted samples during subsequent training iterations. However, AdaBoost is prone to overfitting and underfitting, leading to poor performance on unseen data [22]. To address these limitations, researchers have proposed variations of AdaBoost, such as the AdaBoost-support vector regression model, which has shown improved performance in various prediction tasks [23].
GBM is an optimization technique that minimizes the loss function by iteratively adding weak learners or decision trees [24]. The objective is to create base learners that correlate most effectively with the negative gradient of the loss function when combined with the full ensemble [17]. Setting the number of trees in GBM is crucial, as choosing too many may lead to overfitting and too few may result in underfitting. To mitigate overfitting, stochastic gradient boosting techniques have been introduced, where trees are trained using small subsets of the original dataset [17]. The effectiveness of GBM has been demonstrated in various applications, such as protein solubility prediction [25].
LightGBM is a fast, decision-tree-based gradient boosting approach that offers improved computational efficiency and accuracy [26] [27]. Using exclusive feature bundling and histogram-based techniques, it excludes instances with small gradients and concentrates on those with large gradients to calculate information gain and reduce feature dimensionality [28]. The adoption of a leaf-wise tree strategy, with a maximum depth limit, further improves LightGBM's effectiveness [28] [29]. These strategies, along with others, contribute to LightGBM's superior computational efficiency and accuracy compared to other algorithms. Cheng W et al. [30] introduced the use of LightGBM in combination with a gated recurrent unit to predict weekday traffic congestion. The objective of their study was to build a model that could effectively capture and express features limited by traditional approaches. When compared to previous algorithms, the suggested model performed admirably and accurately predicted traffic congestion patterns.
In another application, Hao X et al. [31] used time series data to accurately forecast the amount of free calcium oxide present in cement clinker by utilizing LightGBM in conjunction with Bayesian optimization. The researchers used Bayesian optimization to search for optimal hyperparameter values and fine-tune the model, resulting in improved performance accuracy. Hyperparameter optimization methods, such as Bayesian optimization, are particularly valuable for algorithms that require extensive tuning to achieve optimal results. These studies not only highlight the versatility of LightGBM across domains but also emphasize its effectiveness in improving prediction accuracy and performance compared to alternative algorithms.
CatBoost, as highlighted by Shahriar SA et al. [32], is a powerful gradient boosting package specifically designed to handle categorical data. It makes use of a refined version of the gradient boosting decision tree (GBDT) method, which is able to successfully deal with issues including noisy data, diverse feature sets, and complex dependencies. This algorithm has proven to be adept at handling categorical features [4]. Traditionally, categorical features are replaced by corresponding average label values when using the standard GBDT technique. However, CatBoost takes a different approach by utilizing oblivious trees as base predictors. In oblivious trees, the same splitting criterion is applied across an entire level of the tree [33] [34]. This results in balanced trees that are less prone to overfitting.
Gradient boosted oblivious trees have demonstrated their effectiveness in various learning tasks, as demonstrated by Gulin A et al. [35] [36]. Each leaf index in CatBoost can be represented as a binary vector whose length equals the depth of the tree. Model predictions in CatBoost are computed using binary features, which are generated by first binarizing all float features, statistics, and one-hot encoded features [37]. Figure 2 provides a visual representation of the symmetric or oblivious tree strategy employed by CatBoost. In a comparative study by Dorogush AV et al. [38], CatBoost, XGBoost, and LightGBM were evaluated. The results showed that CatBoost outperformed the other models in terms of computational efficiency, scoring around 25 times faster than XGBoost and approximately 60 times faster than LightGBM. Furthermore, among various models such as M5Tree, Random Forest (RF), XGBoost, CatBoost, and Support Vector Machines (SVM), CatBoost demonstrated satisfactory generalization capability and high computational efficiency [33].
Patel et al. [39] employed CatBoost, XGBoost, and LightGBM for predicting suicidal ideation in post-stroke patients. The objective of their study was to evaluate the efficiency of these boosting methods in predicting suicidal ideation based on clinical and psychological features. The results indicated that LightGBM had the least favorable performance, while XGBoost showed the best performance in terms of specificity, positive predictive value (PPV), and accuracy. On the other hand, CatBoost exhibited the best performance in terms of sensitivity, negative predictive value (NPV), and area under the curve (AUC).

Baseline Methods
The efficacy of the proposed hybrid LightGBM and CatBoost model was validated by implementing various boosting and tree-based techniques, including the AdaBoost, GBM, XGBoost, Decision Tree, Random Forest, LightGBM, and CatBoost models. This section provides an overview of the techniques employed.

Adaptive Boosting
AdaBoost is a technique that takes multiple weak classifiers and combines them into one robust classifier. Input: a set of weak classifiers h(x; μ_j). Output: a strong classifier F(x) = sign(Σ_j λ_j h(x; μ_j)), where {μ_j, λ_j} are parameters that need to be trained. A weak classifier h(x; μ_j) is effectively not chosen when its weight λ_j = 0, and we prefer that most λ_j = 0. The AdaBoost method prioritizes the next iteration on the basis of the most inaccurate predictions, which are given larger weights.
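To make the re-weighting step concrete, the following is a minimal Python sketch of a single AdaBoost round (the function name and toy data are ours, not part of the study): it computes the weak learner's weighted error, its vote weight λ, and the updated, re-normalized sample distribution.

```python
import math

def adaboost_round(weights, predictions, labels):
    """One AdaBoost round: weighted error of the weak classifier,
    its vote weight lambda, and the re-normalized sample weights."""
    # Weighted error on the current sample distribution (assumed 0 < err < 1).
    err = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    lam = 0.5 * math.log((1 - err) / err)  # vote weight of this weak learner
    # Misclassified samples are up-weighted, correct ones down-weighted.
    new_w = [w * math.exp(-lam if p == y else lam)
             for w, p, y in zip(weights, predictions, labels)]
    total = sum(new_w)
    return lam, [w / total for w in new_w]

# Four samples, uniform initial weights; the weak learner errs on sample 3.
lam, w = adaboost_round([0.25] * 4, [1, 1, -1, -1], [1, 1, -1, 1])
```

After the update, the misclassified sample carries half of the total weight, so the next weak learner focuses on it.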

Gradient Boosting Machine
The goal of the Gradient Boosting Machine algorithm is to integrate multiple base learners into a single robust learner. Given a dataset {(x_i, y_i)}, i = 1, …, n, with n observations, we want to obtain an estimate f*(x) of the function mapping inputs x to target values y, using the additive form f(x) = Σ_t ρ_t h_t(x), where ρ_t denotes the weight of the t-th weak learner h_t(x). To do so, we minimize the expectation of the loss function, E[L(y, f(x))], where L(y, f(x)) represents a loss function that can be differentiated. The weak learners seek to minimize (ρ_t, h_t) = arg min_{ρ,h} Σ_{i=1}^{n} L(y_i, f_{t−1}(x_i) + ρ h(x_i)).
In the process of gradient-descent optimization for f*, each weak learner h_t can be thought of as a greedy step. As a result, each model is trained on a new dataset {(x_i, r_{ti})}, where the pseudo-residuals r_{ti} are obtained using the following formula: r_{ti} = −[∂L(y_i, f(x_i)) / ∂f(x_i)] evaluated at f = f_{t−1}. To determine the value of the weight ρ_t, one must solve a line-search optimization problem. The Gradient Boosting Machine is primarily concerned with enhancing the accuracy of the model by decreasing the error, or residuals, that it generates.
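For squared-error loss L = ½(y − f)², the pseudo-residuals above reduce to y_i − f(x_i). A minimal sketch of one boosting step (function names and the fixed learning rate are ours):

```python
def pseudo_residuals(y, f):
    """Negative gradient of squared-error loss L = 1/2 (y - f)^2
    with respect to f: r_i = y_i - f(x_i)."""
    return [yi - fi for yi, fi in zip(y, f)]

def gbm_step(f, h, learning_rate=0.1):
    """Update the ensemble: f <- f + rho * h, where h was fit to the residuals
    and rho is approximated here by a fixed learning rate."""
    return [fi + learning_rate * hi for fi, hi in zip(f, h)]

r = pseudo_residuals([3.0, 1.0], [2.5, 2.0])  # residuals [0.5, -1.0]
f = gbm_step([2.5, 2.0], r)                   # predictions move toward the targets
```

Each iteration fits a new tree to these residuals, so the ensemble moves down the gradient of the loss.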

Extreme Gradient Boosting
The XGBoost algorithm is implemented as follows. In XGBoost, gradient boosting is used to fine-tune the trees. Consider the output of a tree, f(x) = w_{q(x)}, where x is the vector of input values, q maps x to a leaf, and w_{q(x)} is the score of the related leaf. A collection of K trees provides the prediction ŷ = Σ_{k=1}^{K} f_k(x). At each iteration t, the XGBoost algorithm seeks to optimize a certain objective function, J = Σ_{i=1}^{n} L(y_i, ŷ_i) + Σ_k Ω(f_k), where the second term represents the regularization term that controls the complexity of the model and prevents overfitting. The first term is the training loss L (such as mean squared error) between the real class y and the output ŷ for the n samples.
In XGBoost, the complexity is defined as Ω(f) = γT + (1/2)λ‖w‖², where T is the total number of leaves, γ is the hyperparameter used to achieve pseudo-regularization (which varies between datasets), and λ penalizes the L2 norm of the leaf weights. Approximating the loss function to second order and finding the optimal weights w_j* = −G_j / (H_j + λ) using the gradients, the optimal value of the objective function is J* = −(1/2) Σ_{j=1}^{T} G_j² / (H_j + λ) + γT, where G_j and H_j are the sums of first- and second-order gradients over the samples in leaf j.
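These closed-form expressions can be sketched directly in Python (function names and the toy gradient sums are ours):

```python
def leaf_weight(G, H, lam):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -G / (H + lam)

def objective_value(leaves, lam, gamma):
    """Optimal objective J* = -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T,
    where leaves is a list of (G_j, H_j) pairs and T = len(leaves)."""
    return -0.5 * sum(G * G / (H + lam) for G, H in leaves) + gamma * len(leaves)

w = leaf_weight(4.0, 3.0, 1.0)                                  # -> -1.0
J = objective_value([(4.0, 3.0), (-2.0, 1.0)], lam=1.0, gamma=0.5)
```

A larger λ shrinks each leaf weight toward zero, and a larger γ makes every additional leaf more costly, which is how both hyperparameters curb overfitting.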

Decision Trees
Given training vectors x_i, i = 1, …, n, and a label vector y, a candidate split θ of node m partitions the node's data Q_m into subsets Q_m^left(θ) and Q_m^right(θ). Then, depending on the problem being solved (classification or regression), an impurity or loss function H(·) is chosen and used to calculate the quality of a potential split of node m: G(Q_m, θ) = (n_m^left / n_m) H(Q_m^left(θ)) + (n_m^right / n_m) H(Q_m^right(θ)). Choose the parameters that minimize the impurity, θ* = arg min_θ G(Q_m, θ), and recurse on the subsets Q_m^left(θ*) and Q_m^right(θ*). If a target is a classification outcome taking on values 0, 1, …, K−1, then for node m let p_mk be the fraction of node m's observations that belong to class k. If m is a terminal node, the predicted probability for this region is set to p_mk. Common measures of impurity are the following. Gini: H(Q_m) = Σ_k p_mk (1 − p_mk). Log loss or entropy: H(Q_m) = −Σ_k p_mk log(p_mk).
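The two impurity measures can be sketched in a few lines of Python (function names and toy labels are ours):

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2, equivalent to sum_k p_k (1 - p_k)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy (log loss) impurity: -sum_k p_k log2(p_k)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

pure, mixed = [1, 1, 1, 1], [0, 0, 1, 1]
```

A pure node has zero impurity under both measures, while a 50/50 binary split maximizes them (Gini 0.5, entropy 1 bit); the split search picks the θ that drives these values down fastest.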

Random Forest
This is how the Random Forest algorithm is put into practice. Assume that the training set D = {(X_1, y_1), …, (X_n, y_n)} is drawn from the distribution of a random vector (X, Y). The objective is to create a classifier that uses D as a data set to make predictions about y given X. Given a collection of classifiers h_1(X), …, h_K(X), some of which may be less accurate than others, the ensemble is a random forest if and only if each h_k(X) is a decision tree. For the classifier h_k(X), we define the tree's parameters as Θ_k = (θ_{k1}, θ_{k2}, …); the tree structure, the variables partitioned among the nodes, etc., are all examples of such factors. We occasionally write h_k(X) = h(X | Θ_k); as a result, decision tree k leads to classifier h(X | Θ_k). How do we choose the characteristics to consider in each branch of the k-th tree? Based on a random selection of the parameters Θ_k from the model random vector Θ.
A random forest is thus a classifier constructed from a family of classifiers h(X | Θ_1), …, h(X | Θ_K), each a classification tree with parameters Θ_k selected at random from a model random vector Θ. Each tree contributes one vote to the final classification f(X), which combines the classifiers h_k(X), and the category that receives the greatest number of votes is deemed to be the most appropriate. Specifically, given data D, we train a family of classifiers h_k(X), each of which is in our case a predictor of the outcome y = ±1 associated with input X.
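The majority-vote step can be sketched as follows (the stump "trees" and data are toy examples of ours; a real forest grows each tree on a bootstrap sample with random feature selection):

```python
from collections import Counter

def forest_predict(trees, x):
    """Each tree casts one vote; the class with the most votes wins."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Three stump-like "trees", each splitting on a different feature or combination.
trees = [
    lambda x: 1 if x[0] > 0.5 else -1,          # split on feature 0
    lambda x: 1 if x[1] > 0.5 else -1,          # split on feature 1
    lambda x: 1 if x[0] + x[1] > 1.0 else -1,   # split on a feature sum
]
pred = forest_predict(trees, [0.9, 0.3])  # votes: +1, -1, +1 -> majority +1
```

Even though one tree disagrees, the aggregated vote recovers the majority decision, which is the mechanism by which the forest averages away individual trees' errors.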

LightGBM
Introduction to LightGBM: LightGBM operates as a gradient boosting framework that uses a histogram-based algorithm, enhancing speed and efficiency.It stands out due to its leaf-wise tree growth strategy and specific mathematical optimizations that address overfitting, a common issue in predictive modeling.
Tree Growth Strategy and Mathematical Foundation: In the gradient boosting landscape, LightGBM introduces an innovative leaf-wise tree growth strategy, contrasting with traditional level-wise methods. This strategy minimizes the loss more efficiently by optimizing the objective L^(t) = Σ_i l(y_i, ŷ_i^(t−1) + f_t(x_i)). In each iteration t, the model computes the gradients g_i = ∂l(y_i, ŷ_i^(t−1)) / ∂ŷ_i^(t−1). Using these gradients, LightGBM applies a leaf-wise strategy, selecting the leaf with the highest delta loss to grow. This method is governed by the gain calculation Gain = (1/2)[G_L²/(H_L + λ) + G_R²/(H_R + λ) − (G_L + G_R)²/(H_L + H_R + λ)], which integrates a regularization term λ to prevent overfitting. Here, λ acts as a complexity penalty, ensuring the model doesn't overly adapt to training-data nuances, a principle that is crucial for generalization and predictive accuracy on unseen data.
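The gain formula and the leaf-wise selection rule can be sketched together (function names and the candidate-split tuples are ours; a real implementation evaluates histogram-based split candidates per feature):

```python
def split_gain(GL, HL, GR, HR, lam):
    """Regularized gain of splitting a leaf into left/right children:
    1/2 * [GL^2/(HL+lam) + GR^2/(HR+lam) - (GL+GR)^2/(HL+HR+lam)]."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR))

def best_leaf(candidates, lam=1.0):
    """Leaf-wise growth: among each leaf's best candidate split
    (GL, HL, GR, HR), grow the single leaf with the largest gain."""
    return max(candidates, key=lambda s: split_gain(*s, lam))

# Best candidate splits for two current leaves.
candidates = [(4.0, 2.0, -3.0, 2.0), (1.0, 1.0, -1.0, 1.0)]
chosen = best_leaf(candidates)
```

Level-wise growth would expand both leaves; the leaf-wise rule spends the same tree budget only where the loss reduction is largest, which is what makes LightGBM fast but also more prone to overfitting without the λ penalty and a depth limit.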

CatBoost
Introduction to CatBoost: CatBoost, known for its effectiveness with categorical features, takes gradient boosting further by addressing overfitting through algorithmic enhancements and sophisticated mathematical underpinnings.
Ordered Boosting and Mathematical Insights: CatBoost employs a unique permutation-driven scheme within its boosting approach, ensuring error correction in each sequential tree while avoiding repetitive learning from the same instances. The mathematical foundation for this involves computing gradients and Hessians for loss minimization: gradients g_i = ∂L(y_i, ŷ_i)/∂ŷ_i and Hessians h_i = ∂²L(y_i, ŷ_i)/∂ŷ_i². These values contribute to the construction of each tree, with the optimal leaf value computed as w* = −Σ_{i∈leaf} g_i / (Σ_{i∈leaf} h_i + λ). In this formula, λ is a regularization parameter, adding a level of penalty against complexity, thereby safeguarding against overfitting. This regularization ensures that the model remains robust and maintains high accuracy by not mirroring the training data too closely.
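The permutation-driven idea also underlies CatBoost's ordered target statistics for categorical features: each sample is encoded using only the targets of samples that precede it in the permutation, which avoids target leakage. A simplified sketch (function name and prior value are ours; this illustrates the general idea rather than CatBoost's exact implementation):

```python
def ordered_target_statistics(categories, targets, prior=0.5):
    """Encode each categorical value using only the target history of
    earlier samples in the permutation, plus a smoothing prior."""
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        encoded.append((s + prior) / (c + 1))  # history only, plus prior
        sums[cat] = s + y                      # reveal this target afterwards
        counts[cat] = c + 1
    return encoded

enc = ordered_target_statistics(["a", "a", "b", "a"], [1, 0, 1, 1])
```

Note that the first occurrence of each category falls back to the prior, and no sample ever "sees" its own target, unlike a naive mean-target encoding.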

The Proposed Hybrid LightGBM and CatBoost Model
The hybrid model was developed by constructing a super-learner ensemble that sequentially integrates the individual models, with LightGBM and CatBoost as the foundational learning algorithms and data. • Weight Optimization: define the loss function ℒ as the log loss of the weighted combination of the base learners' predicted probabilities, ℒ(w) = −(1/N) Σ_i [y_i log(p̂_i) + (1 − y_i) log(1 − p̂_i)], where p̂_i = w · p_{LGB,i} + (1 − w) · p_{CB,i}. The diagram for the hybrid model is depicted in Figure 3. The dataset is subjected to a thorough analytical process, with a comprehensive representation of the descriptive statistics for each attribute. This meticulous analysis illuminates the fundamental statistical characteristics and dynamics of the data, providing essential insights that are pivotal for further research and exploration in the field of diabetes (Figure 5). Unique Characteristics: The geographical specificity and encompassing medical data bestow the dataset with a unique standpoint in diabetes studies.
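The weight-optimization step can be sketched with a simple grid search over the blend weight (function names, toy probabilities, and the grid-search strategy are ours; the study tunes the base learners with Bayesian optimization):

```python
import math

def log_loss(y_true, p):
    """Binary cross-entropy over predicted probabilities p."""
    return -sum(y * math.log(pi) + (1 - y) * math.log(1 - pi)
                for y, pi in zip(y_true, p)) / len(y_true)

def blend(p_lgb, p_cb, w):
    """Weighted average of the two base learners' predicted probabilities."""
    return [w * a + (1 - w) * b for a, b in zip(p_lgb, p_cb)]

def best_weight(y, p_lgb, p_cb, steps=100):
    """Pick the blend weight w in [0, 1] that minimizes the log loss."""
    return min((i / steps for i in range(steps + 1)),
               key=lambda w: log_loss(y, blend(p_lgb, p_cb, w)))

y = [1, 0, 1, 0]
p_lgb = [0.9, 0.2, 0.6, 0.4]  # hypothetical LightGBM probabilities
p_cb = [0.8, 0.1, 0.9, 0.2]   # hypothetical CatBoost probabilities
w = best_weight(y, p_lgb, p_cb)
```

Because the grid includes w = 0 and w = 1, the blended model can never do worse on this criterion than the better of the two base learners.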

Data Pre-Processing
In order to identify the hyperparameters that produce the best results for the objective function, the base learners used a Bayesian hyperparameter optimization strategy. This approach, introduced in 2019, effectively tunes the trial-and-error computing process to determine the most suitable hyperparameters [40]. To gain deeper insights into the hybrid model's behavior, Shapley Additive Explanation (SHAP) values were utilized. Originally used in cooperative game theory within the economics sector, Shapley values assess individual contributions in a predictive setting [41]. By quantifying the impact of each variable, an importance value is assigned to calculate the overall explanation. The proposed methodology was implemented using Python 3.9.13. The Python algorithms were executed in Jupyter Notebook, an open-source web tool. For building the hybrid model, various ML modules, including scikit-learn, Optuna, SHAP, LightGBM, and CatBoost, were employed. The results were visually analyzed using Matplotlib and Optuna's visualization modules. The computational resources utilized for this implementation were an HP Omen gaming laptop equipped with an Intel Core i7 processor (2.60 GHz), an NVIDIA GeForce GTX GPU, and 16 GB of RAM.

Random Over-Sampling for Data Balancing
The Random Over-Sampling (ROS) technique is a widely employed method for addressing class imbalance in high-dimensional datasets, aiming to tackle various real-life scenarios [42]. Dealing with imbalanced datasets can be challenging, as it often leads to poor model performance across multiple statistical metrics. In this study, prior to constructing the ML/EL models, the random over-sampling approach was employed to balance the classes and optimize the predictive capability of the framework. The random over-sampling technique involves oversampling and augmenting the minority class present in the dataset by replicating existing minority samples, thereby increasing the size of the minority class. Figure 6 illustrates the count of outcomes (class variable) before and after applying the ROS technique, as depicted in a previous study by [43].
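The replication step can be sketched in plain Python (function name and toy data are ours; the study presumably used a library implementation such as imbalanced-learn's):

```python
import random
from collections import Counter

def random_over_sample(X, y, seed=42):
    """Replicate randomly chosen minority-class samples until both
    classes have the same count."""
    rng = random.Random(seed)
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    majority_n = max(counts.values())
    pool = [(xi, yi) for xi, yi in zip(X, y) if yi == minority]
    X_out, y_out = list(X), list(y)
    while Counter(y_out)[minority] < majority_n:
        xi, yi = rng.choice(pool)  # duplicate an existing minority sample
        X_out.append(xi)
        y_out.append(yi)
    return X_out, y_out

X = [[1], [2], [3], [4], [5]]
y = [0, 0, 0, 0, 1]  # class 1 is the minority
X_b, y_b = random_over_sample(X, y)
```

Since ROS only duplicates existing rows, it must be applied to the training folds only; oversampling before splitting would leak copies of the same sample into the test set.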

Dataset Distribution
The distribution of the predictor variables Age, Insulin, Skin Thickness, Blood Pressure, Pregnancies, Glucose, BMI, and Diabetes Pedigree Function with respect to the target variable Outcome has been plotted using the FacetGrid method (Seaborn package). In this technique, the distribution of the dataset's observations was graphically represented using the Kernel Density Estimate (KDE) plot function, which uses a continuous probability curve to represent data samples in one or more dimensions. The range of samples is given along the horizontal (x) axis, while the probability density function of the random variable is displayed along the vertical (y) axis. The probability of a value falling between x_1 and x_2 is the area of the shaded region of the curve between x_1 and x_2, where K is the kernel function assigned to each data point x_i. We can estimate the kernel density as P(x) = (1 / (N h)) Σ_{i=1}^{N} K((x − x_i) / h), where: • P(x) = density at location x; • K represents a non-negative kernel function; • N represents the number of data samples; • h denotes the smoothing parameter (bandwidth); • x_i denotes the i-th data point.
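This estimator can be sketched directly from the formula (function names and the Gaussian kernel choice are ours; Seaborn's KDE uses a Gaussian kernel by default):

```python
import math

def gaussian_kernel(u):
    """Standard normal density, a common non-negative kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, samples, h):
    """Kernel density estimate: P(x) = (1 / (N h)) * sum_i K((x - x_i) / h)."""
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (len(samples) * h)

# A toy sample with a small cluster near 1 and a larger cluster near 3.
samples = [1.0, 1.2, 3.0, 3.1, 3.2]
```

Evaluating `kde` on a grid of x values traces out the smooth curve that the KDE plot draws; the bandwidth h controls how much the individual kernel bumps are smoothed together.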

K-Fold Cross-Validation and Splitting of Datasets
To mitigate dataset bias, researchers and professionals often employ the K-fold cross-validation technique [15]. As depicted in Figure 15, this study utilized 10-fold cross-validation, visually demonstrating the data-splitting process. The dataset was divided randomly into 10 equal-sized partitions. During each iteration, one partition was designated as the validation (testing) set, while the remaining nine partitions were used for training the model. This method guaranteed that each partition performed the validation role exactly once. The results from each iteration were then aggregated. By utilizing this approach, the issues of overfitting and underfitting were effectively addressed, thereby minimizing bias and producing realistic results in the machine learning models. Notably, the training and testing datasets together encompassed all data samples, ensuring comprehensive evaluation.
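The splitting procedure can be sketched as follows (a simplified version of ours; the study presumably used a library routine such as scikit-learn's `KFold`):

```python
import random

def k_fold_indices(n, k=10, seed=42):
    """Shuffle indices and split them into k nearly equal folds; each fold
    serves as the validation set exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        val = folds[i]
        train = [j for f in folds if f is not folds[i] for j in f]
        splits.append((train, val))
    return splits

splits = k_fold_indices(50, k=10)
```

Across the 10 splits, every sample appears in exactly one validation fold and in nine training folds, which is what makes the aggregated score an unbiased use of the whole dataset.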

Feature Engineering
Feature engineering is crucial to the process of constructing ML/EL models. A model's performance can be negatively impacted by irrelevant or unsuitable features [44]. Careful feature selection cuts down training time and increases accuracy. Machine learning paradigms make use of a variety of feature selection methodologies, such as filter, wrapper, embedded, and hybrid approaches [45]. Feature selection in this research was accomplished using the Information Gain and Correlation techniques. All of the features used by the Category Boosting (CatBoost) classifier for TIIDM prediction are shown in Figure 16. Glucose, Body Mass Index, Diabetes Pedigree Function, Age, Blood Pressure, Insulin, Pregnancies, and Skin Thickness are ranked from most to least important with respect to the outcome.

Description of Results
The primary aim of the weak learners was to maximize accuracy while minimizing squared error. Over the course of 50 trials, the optimal set of hyperparameters was found. The hyperparameters were fine-tuned across a range of 100-300 iterations. Trial 50 had the optimal combination of hyperparameters for the weak learners and yielded the highest accuracy. The optimal trial is depicted in Figure 17; Figure 18 and Figure 19 serve as further illustration. The model provided the optimal set of hyperparameters to achieve the lowest possible error while boosting accuracy.

Hyperparameter Importance
The models' results were affected in different ways by the set of hyperparameters that brought about the minimum in the objective function.Figure 20 shows that the LightGBM model's min child samples contributed 68% to the learning     process by minimizing error, whereas bagging fraction contributed 16%.The other hyperparameters had the smallest influence, enhancing model performance by less than 14%.Bagging temperature, which defines the settings of the Bayesian bootstrap, had the most influence on the CatBoost model's objective function optimization, contributing 30% to the process, preceded by Iterations and Learning Rate contributing 26% and 22% respectively as shown in Figure 21.Less than 10% of the remaining hyperparameters affected the model's performance.
The base learners use various hyperparameters. Although several of the hyperparameters had little effect on the model, this does not imply that they played no role in enhancing performance; even their small contributions were required to maximize the objective function alongside the more important hyperparameters. Each possible value of one hyperparameter is tested against a range of other hyperparameters in order to find the optimal settings for the base learners.

Overview of Base Learners Hyperparameters
The optimal values of the identical hyperparameters used to optimize the objective function for the weak learners and achieve the highest accuracy are shown in Table 1. The minimum number of child samples, bagging frequency, and total number of leaves for the LightGBM model were 3, 2, and 172, respectively. Iterations and depth were set to 968 and 47, respectively, in the CatBoost model.

Hybrid LightGBM and CatBoost Model Interpretation
The SHAP method was used to determine how much each of the weak learners contributed to the final results of the hybrid model. Figure 24 displays the distribution and impact of the features on TIIDM prediction, as well as their relative importance, in descending order. Low glucose values resulted in a low probability of TIIDM, whilst high glucose values resulted in a high probability of being TIIDM positive.
For both skin thickness and blood pressure, the vast majority of samples had a SHAP value of zero, having negligible impact on the model's predictions. Figure 25 shows how each weak learner contributed to the final output of the hybrid model based on the accuracy with which its predictions matched the target values. As can be seen from the length of the bars, CatBoost had a greater effect on the hybrid model's performance than LightGBM. The LightGBM and CatBoost models contributed 40% and 60%, respectively, to the hybrid model, a difference of twenty percentage points. The high performance accuracy can be attributed to the fact that both models contributed significantly to the hybrid model's output.
Figure 26 and Figure 27 show how the weak learners shifted the initial prediction of the hybrid model from the base value (the average output of the training set) toward the target value. The initial estimate of −0.99 for the negative class was improved to 0.04, and the initial forecast of 0.5 was improved to 0.93, both of which are very close to the target values. Both models made substantial contributions to this prediction, as shown by the lengths of the base learners' bars.

Hybrid Model Summary
The hybrid model was superior at optimizing the objective function, which aims to reduce error while increasing performance. Table 2 displays the Log Loss of the hybrid model before and after Bayesian hyperparameter tuning of the base learners: the Log Loss fell from 0.699 with the default hyperparameters to 0.262 with the optimal set of hyperparameters.
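For reference, the Log Loss quoted here is the standard binary cross-entropy. A minimal pure-Python sketch, with probability clipping so that log(0) is never evaluated:

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy averaged over samples. Probabilities are
    clipped to [eps, 1 - eps] before taking logarithms."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)
```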

Hybrid Model Performance Evaluation
The performance of the hybrid model was assessed against the baseline techniques: the weak learners (the LightGBM and CatBoost models) and the tree-based AdaBoost, XGBoost, decision tree, random forest, and GBM models. According to Table 3, the hybrid LightGBM and CatBoost model outperformed the other algorithms, achieving the lowest log loss (0.25) and the highest accuracy (99.37%). The hybrid model outperformed its predecessors with minimal error, allowing for improved Type-II diabetes mellitus prediction.

Comparative Analysis with Existing Work
In Table 5, we compare our proposed framework with other studies that have addressed similar problems in terms of technique, dataset, and analysis to determine how effective it is. Most of these studies utilized similar lifestyle markers, which enables comparison. Our system demonstrated favorable results, particularly in terms of accuracy, for predicting Type-II Diabetes Mellitus (TIIDM). To ensure the validity of our results, we employed techniques such as hyperparameter tuning and K-fold cross-validation during the development of the proposed framework, aiming to achieve more robust and reliable outcomes than other related studies.
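The K-fold cross-validation mentioned above partitions the data into K disjoint folds, each used once as the validation set. A minimal index-splitting sketch (the helper name `k_fold_indices` is ours, not a library function):

```python
def k_fold_indices(n, k=5):
    """Split range(n) into k contiguous, near-equal folds. Each fold
    serves once as the validation set; the rest form the training set."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)  # spread the remainder
        folds.append(list(range(start, start + size)))
        start += size
    return folds

splits = k_fold_indices(10, k=3)  # [[0,1,2,3], [4,5,6], [7,8,9]]
```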

Discussion on Hybrid Model Performance
The remarkable efficacy of the hybrid model, which integrates the LightGBM and CatBoost algorithms, is evident from the performance metrics presented in Table 3 and Table 4. This superiority is not coincidental but is attributable to several strategic and technical advantages, as discussed below: 1) Precision in Learning from Data: The synergy between LightGBM's efficiency in processing large datasets and CatBoost's adept handling of categorical features results in a model with enhanced learning precision. This precision significantly contributes to the model's high accuracy and F1-score, ensuring a balance between precision and recall.
2) Reduced Overfitting: Both constituent models, LightGBM and CatBoost, have inherent features designed to combat overfitting. LightGBM's leaf-wise growth strategy, which is curtailed at a certain depth, and CatBoost's ordered boosting jointly yield a model that generalizes well to unseen data. This is empirically confirmed by the model's superior performance metrics, including its minimal log loss.
3) Efficiency in Handling Various Data Types: The hybrid model stands out in its ability to seamlessly process a diverse array of data types. This characteristic, coupled with the lack of a need for extensive data pre-processing, establishes the model's robustness, especially in real-world applications where data diversity is a given. 4) Optimized Ensemble Learning: The ensemble approach of the hybrid model leverages the individual strengths of both LightGBM and CatBoost. This method not only averages out individual biases and reduces variance but also enhances the model's resistance to overfitting, thereby optimizing performance. This is reflected in the model's higher accuracy and other metrics compared to those of the standalone models.
5) Superiority in Complex Predictive Tasks: The task of predicting Type-II diabetes is intricate, given the disease's multifactorial nature. The hybrid model is well-equipped for such complexity: its amalgamation of two potent algorithms enables more adaptive, accurate predictive analysis amidst the convoluted interaction of numerous risk factors.
In essence, the hybrid model's architectural innovation, advanced anti-overfitting approach, and capacity for handling diverse data types collectively contribute to its standout performance in predicting Type-II diabetes. The exemplary scores across all metrics underline the model's reliability and efficacy, promising substantial applicability in facilitating early diagnosis and intervention.
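The anti-overfitting behavior discussed in point 2 includes CatBoost's overfitting detector, which halts boosting when the validation loss stops improving. A simplified patience-based sketch (the function `early_stop` is our illustration, not CatBoost's internal code):

```python
def early_stop(val_losses, patience=20):
    """Return the boosting round at which training would stop: the first
    round that is `patience` rounds past the best validation loss seen."""
    best, best_round = float("inf"), 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_round = loss, i      # new best: reset the counter
        elif i - best_round >= patience:
            return i                        # no improvement for `patience` rounds
    return len(val_losses)                  # detector never triggered
```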

Potential Limitations of the Hybrid Model
The proposed hybrid model, despite its promising performance, is not without limitations in practical applications: 1) Hyperparameter Sensitivity: The model relies significantly on fine-tuned hyperparameters, creating a dependency whereby slight alterations in the data may necessitate a new round of exhaustive optimization.
2) Complexity and Interpretability: The integration of outputs from LightGBM and CatBoost contributes to a more complex model, potentially impeding straightforward interpretability, a crucial factor in healthcare settings.
3) Data Quality Dependence: The performance efficacy is tightly coupled with the input data quality, indicating that inadequate features or noisy, inconsistent data could undermine predictive capabilities.

Future Directions and Improvements
Considering the aforementioned limitations, future research and model refinement could explore the following avenues: 1) Automated Feature Engineering: Introduction of automated mechanisms for feature selection and engineering to fortify the model's adaptability to various datasets without necessitating manual intervention.
2) Enhanced Interpretability: Integration of model interpretability and explanation tools, offering clearer insights into prediction determinants and fostering trust among healthcare practitioners.
3) Optimized Resource Allocation: Refinement of computational resource usage through streamlined algorithms or parallel computing solutions, catering to the need for scalability and potentially enabling real-time application.
4) Extensive Real-World Validation: Prior to clinical deployment, conducting comprehensive testing in real-world environments using multi-center, diverse datasets to ascertain model reliability and effectiveness across different scenarios.
5) Dynamic Learning Integration: Adoption of a continuous learning framework allowing the model to evolve with new data, maintaining its relevancy and accuracy in the ever-changing clinical landscape.

Conclusions and Suggestions
In this work, we developed a hybrid model that combines the advantages of the Light Gradient Boosting Machine (LGBM) and CatBoost algorithms to forecast Type-II diabetes mellitus from lifestyle factors. By minimizing overfitting and reducing variance, our hybrid model demonstrates improved accuracy compared to other classification techniques. Through Bayesian hyperparameter optimization, we identified the optimal set of hyperparameters for the base learners, resulting in exceptional performance metrics, including accuracy, precision, recall, F1-score, and log loss. The proposed hybrid model achieved a high accuracy of 99.37%, making it a promising tool for early diabetes prediction in the healthcare industry. Furthermore, the framework shows potential for application to other datasets that share common characteristics with the diabetes data.
Our findings highlight the effectiveness of combining the LGBM and CatBoost algorithms and underscore the importance of addressing overfitting concerns in prediction models. Further research can explore the implementation of the hybrid model in real-world healthcare settings and investigate its applicability to other medical conditions. Overall, our study contributes to the advancement of predictive modeling for Type-II diabetes mellitus and offers valuable insights for future research in this field.

Figure 1
Figure 1 displays an overview of the structure of the leaf-wise strategy in LightGBM.
Through iterative construction, a constant estimate of f*(x) can be obtained by formulating it in terms of the gradient statistics on the loss function, where I denotes the set of leaves.
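In standard gradient-boosting derivations (and assuming that is the derivation this passage follows), the constant estimate for a leaf is obtained from the first- and second-order gradient statistics of the loss over the instance set of that leaf, with regularization term lambda:

```latex
w_j^{*} \;=\; -\,\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}
```

Here g_i and h_i are the first and second derivatives of the loss for instance i, and I_j is the set of instances falling into leaf j.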

Given target labels or values y, a decision tree is a recursive partition of the feature space that groups training samples with the same labels or comparable target values. Let n_m denote the number of samples at node m, which is split on a feature j and a threshold t_m.
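The node split described here (feature j against threshold t_m) can be sketched as a simple partition; this is an illustrative helper (`split_node` is our name), not the paper's implementation:

```python
def split_node(samples, j, t):
    """Partition the samples at a node by comparing feature j against
    threshold t, mirroring one step of a decision tree's recursive partition."""
    left = [x for x in samples if x[j] <= t]   # samples routed to the left child
    right = [x for x in samples if x[j] > t]   # samples routed to the right child
    return left, right

left, right = split_node([[1.0], [3.0], [2.0]], j=0, t=2.0)
```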

Figure 3 .
Figure 3. Diagram depicting the steps required to put into action the proposed LightGBM/CatBoost hybrid model.

Figure 4 .
Figure 4. Pseudo-code of the weight averaging method in the hybrid model development.

Figure 5 .
Figure 5. Descriptive statistics of the dataset.

Figures 4 - 14 .
Figures 4-14 depict the frequency distribution of all lifestyle characteristics, with light green representing the non-diabetic class and dark green representing the diabetes class.

Figure 6 .
Figure 6. Class balancing of the dataset using the ROS technique.

Figure 7 .
Figure 7. Age with respect to target.

Figure 8 .
Figure 8. BMI with respect to target.

Figure 9 .
Figure 9. Blood Pressure with respect to target.

Figure 10 .
Figure 10. Skin Thickness with respect to target.

Figure 12 .
Figure 12. Glucose with respect to target.

Figure 13 .
Figure 13. Insulin with respect to target.

Figure 14 .
Figure 14. Diabetes Pedigree Function with respect to target.

Figure 17 .
Figure 17. Optimization trials of hyperparameters to determine the optimal hyperparameter settings for the weak learners.

Figure 18 .
Figure 18. CatBoost model's minimization of objective function: optimization history over 300 tests.

Figure 19 .
Figure 19. LightGBM model's minimization of objective function: optimization history over 300 tests.

Figure 20 .
Figure 20. LightGBM model's minimization of the objective function: an optimization history of 300 attempts.

Figure 21 .
Figure 21. CatBoost model's minimization of the objective function: an optimization history of 300 attempts.

Figure 22
Figure 22 depicts the correlation between hyperparameter tuning and LightGBM model objective function value optimization.High bagging fraction, low bagging frequency, medium feature fraction, low lambda, high alpha, low minimum child sample, and high minimum data in leaf were all associated with high objective values for various trials.

Figure 23
Figure 23 demonstrates that in the CatBoost model, high objective values were related to low bagging temperature, medium depth, high iterations, low l2_leaf_reg, high learning rate, low od_wait, and high random strength.

Figure 22 .
Figure 22. Parallel coordinates for LightGBM model hyperparameters and objective function values.

Figure 23 .
Figure 23. Parallel coordinates for CatBoost model hyperparameters and objective function values.

Figure 24 .
Figure 24. Feature importance of the hybrid model: the features' distribution and impact in TIIDM prediction.

Figure 25 .
Figure 25. Predictive performance of the TIIDM hybrid model and the relative importance of its weak learners.

Figure 26 .
Figure 26. Individual hybrid model prediction explanation: the negative class prediction of each weak learner.

Figure 27 .
Figure 27. Individual hybrid model prediction explanation: the positive class prediction of each weak learner.

Table 1 .
Hyperparameter settings for the weak learners.
Table 2 displays the Log Loss for the hybrid model before and after hyperparameter tuning of the base learners.

Table 2 .
Log Loss analysis of the combined efficiency of the LightGBM and CatBoost hybrid.

Table 5 .
Comparison with existing systems.