Modeling Analysis of Chronic Heart Failure in Elderly People Based on Bayesian Logistic Regression Method

Abstract

To address risk prediction for chronic heart failure in the elderly, a Bayesian logistic regression modeling framework is proposed, aimed at the poor generalization that overfitting causes when traditional logistic regression is applied to small samples. Sixteen multi-dimensional clinical indicators (age, gender, BMI, alcohol history, etc.) from 20 elderly patients with chronic heart failure were collected, and the initial feature set was screened for multicollinearity with the variance inflation factor (VIF) test; highly collinear variables with VIF ≥ 10 (such as fall risk and frailty assessment) were removed to reduce the interference of redundant information with model stability. The entropy weight method was then used to weight the screened variables: information entropy quantified the information contribution of each indicator and standardized weighted data were generated, optimizing the allocation of feature importance and alleviating residual collinearity. Finally, Spearman correlation analysis of the weighted data quantified the strength of association between each variable and the heart failure classification, identifying balance and gait ability (correlation coefficient 0.52) and physical function status as core predictors. The results show that although the traditional logistic model achieves 100% accuracy on the training set, its parameter estimates are grossly abnormal because the Hessian matrix is singular, indicating a serious risk of overfitting. A Bayesian framework was therefore introduced: the regression coefficients were constrained by a normal prior with mean 0 and standard deviation 10, and the posterior distribution of the parameters was obtained by Markov chain Monte Carlo (MCMC) sampling, effectively balancing model complexity against data likelihood. The experimental results show that Bayesian logistic regression attains 85% classification accuracy on an independent test set; the confusion matrix shows that misjudgments are confined to categories with overlapping features (one case in the second category misjudged as the first), the F1 scores improve markedly (category 1: 0.86, category 2: 0.80, category 3: 1.00), and the singularity of the Hessian matrix is avoided. This study confirms that, through probabilistic regularization and uncertainty quantification, Bayesian logistic regression provides a highly robust solution for modeling chronic heart failure in small elderly samples.


1. Introduction

The elderly have a high incidence of chronic diseases, among which chronic heart failure is common and serious, placing a heavy burden on patients and their families. In recent years, chronic heart failure combined with frailty has become increasingly common in the elderly, and its morbidity and mortality have risen significantly, making it an important risk factor for poor prognosis in elderly heart failure patients. Chronic heart failure (CHF) is a major cause of morbidity and mortality worldwide; the pathophysiological pathway between frailty syndrome (FS) and CHF remains unclear, but FS increases the risk of rehospitalization and death in CHF patients [1]. An analysis of 138 elderly CHF patients over 65 years old found that increased body mass index (BMI), decreased albumin, anxiety, and depression were independent risk factors for CHF with frailty [2]. Accurately predicting the degree of chronic heart failure and frailty in the elderly is therefore important for formulating individualized treatment plans and improving patient outcomes.

Machine learning has well-documented shortcomings when processing small-sample data. Horenko and co-authors [3] pointed out that small-sample datasets offer limited feature richness and few observations, that models overfit easily and the degree of overfitting is hard to measure, and that only classical machine learning algorithms or strictly regularized deep learning methods are suitable. Hawkins [5] emphasized that with small samples a model readily overfits the training data, degrading its generalization ability so that it performs poorly on new data. Tariq et al. [4] found that mainstream machine learning models are limited when processing small samples, and that learning can fail on feature-imbalanced data. In addition, feature differences in small-sample datasets are often subtle, making it difficult for a model to learn and discriminate effectively. These studies show that machine learning faces overfitting, insufficient generalization, and weak feature separability on small-sample problems, and must be applied with caution.

Logistic regression was introduced by the statistician David Cox, and Cramer [6] later systematically reviewed its origins. It is widely used across many fields, especially in biomedicine. Using clinical data from the Iowa Health System in the United States, a prognostic model for hospitalized heart failure patients was built with logistic regression and supervised data mining methods [7]. Based on clinical data from 167 elderly inpatients at Villa Scassi Hospital in Genoa, Italy, prognostic predictors were screened with a logistic regression machine learning model [8]. In brain imaging genetics, sparse canonical correlation analysis with logistic regression, hypergraph-regularized deep self-reconstructing multi-task association analysis, and deep self-reconstruction fusion similarity hashing were used to mine Alzheimer's-related pathogenic brain regions and risk genes, improving early diagnosis [9]. In another study, 455 multi-dimensional ECG features across 9 categories were extracted and stored in a SQL Server database; features were selected by correlation analysis and Lasso regression, and seven cardiovascular diseases were classified with support vector machines, random forests, K-nearest neighbors, AdaBoost, logistic regression, and other classifiers. The validity of the features was verified and an interpretability analysis carried out, providing a high-precision quantitative method for automatic ECG diagnosis [10].

Bayesian logistic regression is a statistical model that combines logistic regression with Bayesian inference. In recent years its range of applications has been expanding and it has shown significant potential; in medical diagnosis, posterior inference implemented through probabilistic programming frameworks has become a preferred way to handle clinical heterogeneity and small samples. Kanwar et al. [11] (2019) used Bayesian analysis to predict acute severe right heart failure after LVAD implantation. Ren et al. [12] (2021) used Bayesian networks to study surrogate endpoints for cardiac death in traditional Chinese medicine trials of chronic heart failure. Generalized additive models and Bayesian hierarchical models have been used to analyze the relationship between SO2 concentration and the risk of hospitalization for heart failure [13]. Zhu [14] (2023) constructed a postoperative heart failure prediction model based on traditional machine learning logistic regression. Tu [15] (2023) evaluated the efficacy of different exercise rehabilitation modalities for adults with heart failure via Bayesian network meta-analysis.

Inspired by the research results of the above scholars, we use the Bayesian logistic regression model to explore the problem of chronic heart failure in the elderly. The analysis of relevant literature shows that the occurrence of heart failure is affected by a variety of factors, and the Bayesian method has significant advantages in dealing with uncertainty and integrating multiple information. Therefore, we chose to introduce the Bayesian method on the basis of the traditional logistic regression model to improve the prediction performance of the model.

In this paper, a Bayesian logistic regression model, combining a prior distribution with posterior inference, improves the prediction of chronic heart failure with frailty in the elderly and provides more reliable support for formulating individualized treatment plans.

The remainder of this paper is organized as follows. Section 2 introduces the relevant theories; Section 3 covers data preprocessing and data analysis; Section 4 presents logistic modeling and Bayesian logistic regression; Section 5 gives model analysis and evaluation; Section 6 concludes.

2. Related Theories

The second section of this paper introduces related theories, including logistic regression models and Bayesian methods.

2.1. Logistic Regression

Traditional logistic regression can handle $K$-class classification problems. Suppose the observed response $Y$ belongs to one of $K$ classes, $Y \in \{1, 2, \ldots, K\}$, and the covariates are $X = (X_1, X_2, \ldots, X_p)$. A reference class is chosen (usually the $K$-th class), and a linear relationship is established for each non-reference class $j = 1, 2, \ldots, K-1$ relative to the reference class.

Then the K-categorical logistic regression model is

$$P(Y = j \mid X) = \frac{e^{\beta_{j0} + \beta_{j1} X_1 + \cdots + \beta_{jp} X_p}}{\sum_{m=1}^{K} e^{\beta_{m0} + \beta_{m1} X_1 + \cdots + \beta_{mp} X_p}}, \quad j = 1, 2, \ldots, K \qquad (2.1)$$

The parameters $\beta_{ji}$ ($j = 1, 2, \ldots, K$; $i = 0, 1, 2, \ldots, p$) can be estimated by maximum likelihood, solved with the iteratively reweighted least squares (IRLS) method.

Denote $\beta_j^{T} = (\beta_{j0}, \beta_{j1}, \ldots, \beta_{jp})$ and $X^{T} = (X_1, X_2, \ldots, X_p)$, and let $\Theta = \{\beta_j, j = 1, 2, \ldots, K\}$ be the set of all parameters. The logistic model can then be written in matrix form:

$$P(Y = j \mid X, \Theta) = \frac{\exp(\beta_j^{T} \tilde{X})}{\sum_{m=1}^{K} \exp(\beta_m^{T} \tilde{X})} \qquad (2.2)$$

where $\tilde{X} = (1, X^{T})^{T}$ is the augmented covariate vector.
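To make Equation (2.2) concrete, the following minimal NumPy sketch (our own illustration, not code from the study) evaluates the softmax class probabilities; the function name and toy data are hypothetical:

```python
import numpy as np

def class_probabilities(beta, X):
    """Evaluate Equation (2.2): softmax class probabilities.

    beta : (K, p + 1) array, row j holding (beta_j0, beta_j1, ..., beta_jp)
    X    : (n, p) array of covariates
    Returns an (n, K) array whose rows sum to 1.
    """
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # augmented vector (1, X^T)^T
    logits = X_aug @ beta.T                           # beta_j^T x for every class j
    logits -= logits.max(axis=1, keepdims=True)       # stabilize the exponentials
    expl = np.exp(logits)
    return expl / expl.sum(axis=1, keepdims=True)

# Toy usage: 3 classes, 2 covariates, 4 samples
rng = np.random.default_rng(0)
beta = rng.normal(size=(3, 3))
X = rng.normal(size=(4, 2))
print(class_probabilities(beta, X))  # each row sums to 1
```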

2.2. Bayes

Bayesian theory, as a core statistical inference method, is based on Bayesian formula, which aims to provide a theoretical basis for statistical tasks such as parameter estimation, model selection and prediction by integrating prior information and sample data to infer the posterior distribution of unknown parameters.

Suppose the parameter set is $\Theta = \{\beta_j, j = 1, 2, \ldots, K\}$, where $\beta_j^{T} = (\beta_{j0}, \beta_{j1}, \ldots, \beta_{jp})$. Bayes' formula can be expressed as:

$$g(\Theta \mid X, Y) = \frac{f(Y \mid X, \Theta)\, g(\Theta)}{\int_{\Theta} f(Y \mid X, \Theta)\, g(\Theta)\, \mathrm{d}\Theta} \qquad (2.3)$$

where $g(\Theta)$ is the prior density of the parameter set $\Theta$, and $f(Y \mid X, \Theta)$ is the conditional distribution of the observed data $Y$ given the covariates $X$ and the parameters $\Theta$. Since the denominator $\int_{\Theta} f(Y \mid X, \Theta)\, g(\Theta)\, \mathrm{d}\Theta$ does not depend on the parameters, Bayes' formula simplifies to:

$$g(\Theta \mid X, Y) \propto f(Y \mid X, \Theta)\, g(\Theta) \qquad (2.4)$$

that is, the two sides differ only by a constant factor.

Suppose the prior distribution is normally distributed, i.e.

$$\beta_j \overset{iid}{\sim} N(0, \sigma^2 I), \quad j = 1, 2, \ldots, K \qquad (2.5)$$

where $\sigma^2$ controls the prior strength (the smaller its value, the stronger the constraint on the parameters). The posterior distribution is

$$P(\Theta \mid D) \propto \prod_{i=1}^{n} \prod_{j=1}^{K} P(Y_i = j \mid X_i, \Theta)^{1(Y_i = j)} \prod_{j=1}^{K-1} N(\beta_j; 0, \sigma^2 I) \qquad (2.6)$$

where $D = \{(X_i, Y_i)\}_{i=1}^{n}$ is the observed data and $1(Y_i = j)$ is the indicator function (equal to 1 when $Y_i = j$ and 0 otherwise).

The rise of the MCMC method provides an effective solution to the problem of computing posterior densities in Bayesian inference. As an important part of modern Bayesian statistics, MCMC samples from complex posterior distributions by constructing Markov chains whose stationary distribution is the target posterior. The continuing development of MCMC sampling techniques has greatly promoted the application of Bayesian inference to practical problems and made the computation of complex posteriors practical, broadening the scope and depth of Bayesian methods in mathematics and related fields.

When the posterior distribution has no analytic form, the MCMC method can be used to generate a chain of parameter samples

$$\Theta^{(1)}, \Theta^{(2)}, \ldots, \Theta^{(N)} \sim P(\Theta \mid D).$$
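To illustrate how such a chain can be generated, here is a minimal random-walk Metropolis sketch; it is a generic textbook sampler written for this illustration, not the algorithm of any particular framework, and the 2-dimensional Gaussian target is purely a toy:

```python
import numpy as np

def metropolis(log_posterior, theta0, n_samples=2000, step=0.5, seed=0):
    """Random-walk Metropolis: returns a chain theta^(1), ..., theta^(N) ~ P(theta | D)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    logp = log_posterior(theta)
    chain = np.empty((n_samples, theta.size))
    for t in range(n_samples):
        proposal = theta + step * rng.normal(size=theta.size)  # symmetric proposal
        logp_prop = log_posterior(proposal)
        if np.log(rng.uniform()) < logp_prop - logp:           # accept with prob min(1, ratio)
            theta, logp = proposal, logp_prop
        chain[t] = theta                                       # rejected steps repeat theta
    return chain

# Toy target: a standard normal posterior in 2 dimensions
chain = metropolis(lambda th: -0.5 * th @ th, theta0=np.zeros(2))
print(chain.mean(axis=0))  # approximately [0, 0]
```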

3. Data Preprocessing and Data Analysis

3.1. Data Filtering

In this study, a total of 20 cases were collected from the medical records of a hospital in Baicheng, Jilin Province. From the evaluation and intervention reports of these 20 elderly patients with chronic heart failure and frailty, 16 key indicators were screened for analysis: age ($x_1$), gender ($x_2$), body mass index (BMI, $x_3$), alcohol history ($x_4$), smoking history ($x_5$), and a series of detailed health status assessments, including physical function status ($x_6$), balance and gait ability ($x_7$), nutritional status ($x_8$), cognitive function ($x_9$), psychological status ($x_{10}$), comorbidities ($x_{11}$), sleep quality ($x_{12}$), fall risk ($x_{13}$), degree of frailty ($x_{14}$), pain assessment ($x_{15}$), and level of social support ($x_{16}$), together with the heart failure classification ($y$). For the accuracy and comparability of the study, the qualitative indicators in the reports were further quantified according to multi-class criteria.

The data we extracted from the cases are shown in Table 1.

Table 1. Initial data.

x1   x2   x3      x4   x5   x6   x7   x8   x9   x10   x11   x12   x13   x14   x15   x16   y
86   0    26.03   0    0    1    2    1    1    0     8     2     1     2     1     20    2
80   0    27.3    1    0    1    2    0    1    0     5     2     1     3     1     25    1
73   0    27.68   0    0    1    3    0    1    0     6     3     3     3     1     27    2
73   1    19.23   0    0    1    3    1    1    1     6     3     3     3     1     18    2
64   0    26.83   0    0    1    2    0    0    0     6     3     2     1     1     12    1
95   0    20.28   0    0    1    2    0    0    0     4     3     1     1     1     12    2
83   1    25.71   0    0    3    3    0    1    0     9     3     3     2     1     12    3
73   1    22.66   0    1    1    2    0    0    1     10    3     2     2     1     18    1
70   1    20      1    0    1    2    0    1    1     3     3     1     2     1     18    2
73   1    23      0    1    1    1    0    1    0     9     2     1     1     1     20    1
79   1    25      0    0    1    1    0    1    1     6     2     1     1     1     20    1
79   0    21.48   0    0    2    3    0    1    1     8     2     3     3     1     20    3
82   1    26      0    0    1    2    1    1    0     7     2     2     3     1     20    1
80   0    25      0    0    1    2    0    1    0     5     2     2     3     1     20    3
75   0    25.9    0    0    1    1    1    1    0     4     2     1     1     1     20    1
73   1    25      0    1    1    1    0    1    1     5     2     1     1     1     20    1
72   0    24.22   1    0    1    1    0    1    1     7     2     1     1     1     20    1
84   1    25.1    0    0    1    1    0    1    1     5     2     1     1     1     20    1
79   1    26.6    0    0    1    1    0    1    1     5     2     1     1     1     20    1
67   0    22.2    0    1    2    1    0    1    1     1     1     1     3     2     20    3

3.2. Multicollinearity Test

VIF (Variance Inflation Factor) is a statistical method for detecting multicollinearity. When there is a strong correlation between independent variables in a regression model, the multicollinearity problem may affect the stability and predictive ability of the model.

For each independent variable in the regression model, regress it on the remaining variables:

$$X_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_{i-1} X_{i-1} + \beta_{i+1} X_{i+1} + \cdots + \beta_n X_n + \epsilon \qquad (3.1)$$

where $X_i$ is the variable being explained and the remaining $X_1, X_2, \ldots, X_n$ (excluding $X_i$) are the explanatory variables.

For each such regression, compute its coefficient of determination $R_i^2$; the variance inflation factor of the $i$-th independent variable is then

$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2} \qquad (3.2)$$

A value of $\mathrm{VIF}_i \geq 10$ is conventionally taken to indicate serious multicollinearity, so only variables with VIF values below 10 were retained in this study. The results are as follows:

Table 2. VIF detection values.

variable   VIF value   variable   VIF value
x1         3.716       x9         4.601
x2         4.373       x10        4.046
x3         3.749       x11        3.404
x4         2.701       x12        8.963
x5         7.962       x13        15.325
x6         5.017       x14        16.544
x7         21.309      x15        NaN
x8         2.011       x16        22.600

As can be seen from Table 2, the VIF values of fall risk ($x_{13}$), frailty assessment ($x_{14}$), and social support assessment ($x_{16}$) exceed 10, so these variables are removed. The VIF value of pain assessment ($x_{15}$) cannot be calculated because every value in that column is identical, so this variable is removed as well. The VIF values of the 12 remaining variables are shown in Table 3.

Table 3. VIF values after removing the flagged indicators.

variable   VIF value   variable   VIF value
x1         2.094       x7         3.193
x2         2.839       x8         1.825
x3         3.128       x9         2.311
x4         1.659       x10        2.809
x5         2.997       x11        1.561
x6         2.274       x12        2.760

In Table 3, the VIF value of every variable is below 5, indicating that multicollinearity among the retained variables has been substantially reduced.
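The screening in Tables 2 and 3 can be reproduced with standard tools. Below is a minimal sketch using statsmodels' variance_inflation_factor; the synthetic DataFrame is a stand-in, since the patient data are not distributed with this paper:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(df: pd.DataFrame) -> pd.Series:
    """VIF_i = 1 / (1 - R_i^2) from the auxiliary regression of each column on the rest."""
    X = sm.add_constant(df)  # intercept for each auxiliary regression
    return pd.Series({col: variance_inflation_factor(X.values, i)
                      for i, col in enumerate(X.columns) if col != "const"})

# Tiny synthetic stand-in for the patient indicator matrix
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(20, 4)), columns=["x1", "x2", "x3", "x4"])
df["x5"] = df["x1"] + df["x2"] + 0.01 * rng.normal(size=20)  # deliberately collinear

vifs = vif_table(df)
keep = vifs.index[vifs < 10]   # one-shot removal, as in the paper; NaN columns also fail
print(vifs)
print(df[keep].columns.tolist())
```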

3.3. Use the Entropy Weight Method to Weight the Data

When exploring factors related to chronic heart failure in the elderly, significant multicollinearity may exist among these factors, which tends to destabilize and bias the parameter estimates of statistical models. Because the entropy weight method quantifies the information contribution of each variable, it can help address the challenges posed by multicollinearity and support more robust and reliable parameter estimates.

3.3.1. Data Standardization

For indicators of different dimensions, in order to make them comparable, the data needs to be standardized. The commonly used method is range normalization, which is calculated as

$$z_{ij} = \frac{x_{ij} - \min(x_j)}{\max(x_j) - \min(x_j)} \qquad (3.3)$$

where $z_{ij}$ is the normalized value, $x_{ij}$ the raw datum, and $\min(x_j)$ and $\max(x_j)$ the minimum and maximum of the $j$-th indicator, respectively.

3.3.2. Use Information Entropy to Calculate Weights

First, compute the proportion $p_{ij}$, the relative weight of the $i$-th sample under the $j$-th indicator:

$$p_{ij} = \frac{z_{ij}}{\sum_{i=1}^{n} z_{ij}} \qquad (3.4)$$

where $n$ is the total number of samples.

Then compute the information entropy $e_j$ of the $j$-th indicator:

$$e_j = -\frac{1}{\ln(n)} \sum_{i=1}^{n} p_{ij} \ln(p_{ij}) \qquad (3.5)$$

The factor $\ln(n)$ normalizes the entropy to the interval $[0, 1]$.

The entropy redundancy $d_j$ reflects the information content of the indicator:

$$d_j = 1 - e_j \qquad (3.6)$$

The weight $w_j$ of each indicator is obtained by normalizing the redundancies:

$$w_j = \frac{d_j}{\sum_{k=1}^{m} d_k} \qquad (3.7)$$

where $m$ is the number of indicators.
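A compact NumPy sketch of Equations (3.3)-(3.7), written for illustration: the random matrix stands in for the 12 screened indicators, and the small epsilon guarding log(0) is our own implementation choice (it assumes no indicator is constant).

```python
import numpy as np

def entropy_weights(Z):
    """Equations (3.4)-(3.7): weights from a range-normalized matrix Z (n samples x m indicators)."""
    n = Z.shape[0]
    P = (Z + 1e-12) / (Z + 1e-12).sum(axis=0)      # (3.4); epsilon avoids log(0) in (3.5)
    e = -(P * np.log(P)).sum(axis=0) / np.log(n)   # (3.5) normalized information entropy
    d = 1.0 - e                                    # (3.6) entropy redundancy
    return d / d.sum()                             # (3.7) normalized weights

rng = np.random.default_rng(0)
X = rng.random((20, 12))                                      # stand-in for the 12 indicators
Z = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))     # (3.3) range normalization
w = entropy_weights(Z)
weighted = Z * w                                              # standardized, weighted data
print(np.round(w, 4), w.sum())                                # weights sum to 1
```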

Table 4 shows the weights of each index. The data are then weighted according to the weights of each indicator in Table 4 to obtain the final data.

Table 4. The weight of each indicator.

variable   x1       x2       x3       x4       x5       x6       x7       x8       x9       x10      x11      x12
weight     0.0174   0.0748   0.0173   0.2484   0.1495   0.0175   0.0705   0.1736   0.0175   0.0748   0.0255   0.1132

3.4. Correlation Analysis

The Spearman correlation coefficient between heart failure classification and each independent variable was calculated to determine whether each variable is positively or negatively correlated with heart failure, to identify the important variables related to heart failure, and to check whether the relationships are monotone.

Figure 1. Spearman visualization.

As can be seen from Figure 1, there are no strong pairwise correlations among the independent variables themselves. There is a strong positive correlation (0.52) between heart failure classification and balance and gait ability ($x_7$), indicating that balance and gait ability may be one of the important factors associated with heart failure. There is a weak positive correlation (0.17) between heart failure classification and cognitive function ($x_9$), suggesting that the cognitive function score is associated with the severity of heart failure. There is a weak positive correlation (0.19) between heart failure classification and sleep quality ($x_{12}$), suggesting that sleep quality may also be related to heart failure.

Using Spearman correlation analysis, we found notable associations between several variables and the dependent variable $y$, especially physical function status ($x_6$), balance and gait ability ($x_7$), cognitive function ($x_9$), and sleep quality ($x_{12}$). These results provide an important reference for the evaluation and intervention of chronic heart failure with frailty in the elderly. In practice, focused interventions and management can target these highly correlated variables to improve patients' health status and quality of life, while personalized treatment and care plans should still be developed from clinical experience and individual patient differences.
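A minimal sketch of this correlation screen with scipy.stats.spearmanr, using random stand-in data in place of the weighted indicators and heart failure classes:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
X = rng.random((20, 12))             # stand-in for the 12 weighted indicators
y = rng.integers(1, 4, size=20)      # stand-in for the heart failure class (1, 2, 3)

for j in range(X.shape[1]):
    rho, p = spearmanr(X[:, j], y)   # rank correlation of indicator j with the outcome
    print(f"x{j + 1}: rho={rho:+.2f}, p={p:.3f}")
```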

4. Logistic Modeling and Bayesian-Based Logistic Regression and Analysis

4.1. Logistic Regression Model Establishment

This study addresses a three-class problem with a total of 12 independent variables. Taking the third class as the reference, the logistic regression model consists of two equations:

$$\ln\!\left(\frac{P(Y=1 \mid X)}{P(Y=3 \mid X)}\right) = \beta_{1,0} + \beta_{1,1} X_1 + \cdots + \beta_{1,12} X_{12} \qquad (4.1)$$

$$\ln\!\left(\frac{P(Y=2 \mid X)}{P(Y=3 \mid X)}\right) = \beta_{2,0} + \beta_{2,1} X_1 + \cdots + \beta_{2,12} X_{12} \qquad (4.2)$$

whose regression coefficients are to be solved for.

Parameters are estimated using maximum likelihood. The likelihood function is

$$L(\beta) = \prod_{i=1}^{n} \prod_{j=1}^{3} \left[ P(Y_i = j \mid X_i) \right]^{1(Y_i = j)} \qquad (4.3)$$

Taking logarithms gives the log-likelihood

$$\ln L(\beta) = \sum_{i=1}^{n} \sum_{j=1}^{K} 1(Y_i = j) \ln P(Y_i = j \mid X_i) \qquad (4.4)$$

where $1(Y_i = j)$ is the indicator function.

The iterative IRLS algorithm is used to find the parameters $\beta_{ij}$, $i = 1, 2$, $j = 0, 1, 2, \ldots, 12$, that maximize the log-likelihood.

The parameters in Equations (4.1) and (4.2) were estimated with the multinomial logistic regression procedure in SPSS.

Table 5. Logistic equation parameters.

Equation (4.1) parameter   Estimate    Equation (4.2) parameter   Estimate
β1,0                       12.719      β2,0                       −104.072
β1,1                       −7.608      β2,1                       57.350
β1,2                       25.966      β2,2                       −8.298
β1,3                       11.874      β2,3                       −0.716
β1,4                       34.859      β2,4                       18.233
β1,5                       −1.523      β2,5                       19.022
β1,6                       11.742      β2,6                       36.820
β1,7                       −31.962     β2,7                       −22.293
β1,8                       20.206      β2,8                       20.778
β1,9                       −31.129     β2,9                       26.655
β1,10                      −3.109      β2,10                      8.673
β1,11                      −2.083      β2,11                      14.177
β1,12                      0.782       β2,12                      41.169

Substituting the parameter estimates in Table 5 into Equations (4.1) and (4.2) yields the fitted multinomial logistic regression model.

The following confusion matrix is obtained on the training set:

[ 11  0  0 ]
[  0  5  0 ]
[  0  0  4 ]

The model achieved 100% classification accuracy.

Although the model's accuracy is very high, the Hessian matrix became singular during model fitting, producing extremely large parameter estimates. The independent variables with high pairwise correlation had already been deleted before modeling, and a later Spearman correlation test confirmed that no two independent variables were significantly correlated. The singularity is therefore most likely due to the very small sample size, which leaves too little independent information among the covariates and makes the Hessian matrix rank-deficient.

4.2. Bayesian Logistic Regression Model Establishment

In this study, a Bayesian approach was added to logistic regression to construct a prediction model of chronic heart failure in the elderly. To verify model performance scientifically, the sample was split into training and test sets in a 6.5:3.5 ratio, used for parameter learning and for evaluating generalization ability, respectively. Within the Bayesian inference framework, the regression coefficients were given a normal prior distribution with mean 0 and standard deviation 10.

During training, MCMC sampling was carried out through a probabilistic programming framework to obtain the posterior distribution of the parameters. In the prediction stage, a posterior-mode decision rule was used: the frequency of each predicted category across the posterior samples was counted, and the most frequent category was taken as the final classification. On the test set, the model achieves 85% classification accuracy. The final prediction model can be expressed as follows:

$$\ln\!\left(\frac{P(Y=1 \mid X)}{P(Y=3 \mid X)}\right) = 0.6379 - 7.7745 x_1 + 10.0714 x_2 + 5.4746 x_3 + 8.3907 x_4 - 0.5180 x_5 + 19.0730 x_6 - 16.0568 x_7 + 11.4557 x_8 - 18.1090 x_9 + 3.9241 x_{10} + 6.9764 x_{11} - 2.0922 x_{12}$$

$$\ln\!\left(\frac{P(Y=2 \mid X)}{P(Y=3 \mid X)}\right) = 8.6645 - 6.7047 x_1 + 2.7435 x_2 - 6.4385 x_3 + 0.6570 x_4 - 8.1214 x_5 + 12.1598 x_6 - 12.4658 x_7 + 9.8867 x_8 - 2.1041 x_9 + 2.7924 x_{10} - 1.0577 x_{11} + 17.3886 x_{12}$$

In this study, Bayesian logistic regression effectively alleviates the instability of parameter estimation that the traditional model suffers under small samples. By introducing a normal prior with mean 0 and standard deviation 10, the model constrains the regression coefficients and avoids the abnormally large parameter values that arise from the singularity of maximum likelihood estimation in small-sample settings. Replacing point estimation with Bayesian posterior inference significantly reduces the risk of overfitting: the test-set accuracy of 85%, in contrast with the traditional logistic model's overfitted 100% training accuracy, reflects a balance between generalization performance and model complexity.
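The paper fits this model through a probabilistic programming framework; as a self-contained illustration of the same idea, the sketch below samples the posterior (2.6) under the N(0, 10²) prior with a plain random-walk Metropolis step and applies the mode decision rule to the posterior draws. It is our own simplified substitute, not the authors' code, and the synthetic data are placeholders:

```python
import numpy as np

def log_posterior(beta_flat, X, y, K=3, sigma=10.0):
    """Log of posterior (2.6): multinomial-logit likelihood plus N(0, sigma^2 I) prior.

    beta_flat packs a (K-1) x (p+1) coefficient matrix; the last class is the reference."""
    n, p = X.shape
    B = beta_flat.reshape(K - 1, p + 1)
    X_aug = np.hstack([np.ones((n, 1)), X])
    logits = np.hstack([X_aug @ B.T, np.zeros((n, 1))])   # reference-class logit fixed at 0
    logits -= logits.max(axis=1, keepdims=True)           # numerical stabilization
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return log_p[np.arange(n), y].sum() - 0.5 * (beta_flat ** 2).sum() / sigma ** 2

def fit_predict(X_tr, y_tr, X_te, K=3, n_iter=4000, step=0.05, seed=0):
    """Random-walk Metropolis over the posterior; predicts by the posterior-mode rule."""
    rng = np.random.default_rng(seed)
    dim = (K - 1) * (X_tr.shape[1] + 1)
    beta = np.zeros(dim)
    logp = log_posterior(beta, X_tr, y_tr, K)
    X_aug = np.hstack([np.ones((X_te.shape[0], 1)), X_te])
    votes = np.zeros((X_te.shape[0], K), dtype=int)
    for t in range(n_iter):
        prop = beta + step * rng.normal(size=dim)
        logp_prop = log_posterior(prop, X_tr, y_tr, K)
        if np.log(rng.uniform()) < logp_prop - logp:      # Metropolis acceptance
            beta, logp = prop, logp_prop
        if t >= n_iter // 2:                              # after burn-in, record one vote
            B = beta.reshape(K - 1, -1)
            logits = np.hstack([X_aug @ B.T, np.zeros((X_te.shape[0], 1))])
            votes[np.arange(X_te.shape[0]), logits.argmax(axis=1)] += 1
    return votes.argmax(axis=1)                           # most frequent class wins

# Toy usage: classes coded 0, 1, 2 (class 2 as reference); 13/7 split mirrors 6.5:3.5
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 12))
y = rng.integers(0, 3, size=20)
print(fit_predict(X[:13], y[:13], X[13:]))
```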

5. Model Analysis and Evaluation

Three indexes were used to evaluate the three-classification model, namely precision P, recall R, and F1 score.

$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \times TP}{2 \times TP + FP + FN}$$

TP (true positives) is the number of samples correctly predicted by the model as the positive class; FP (false positives) is the number of samples incorrectly predicted as the positive class; FN (false negatives) is the number of samples incorrectly predicted as the negative class.
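These metrics can be computed directly with scikit-learn. In the sketch below, the seven test-set labels are reconstructed from the confusion matrix reported later in this section (an assumption about sample ordering, which does not affect the metrics):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Test-set labels reconstructed from the reported confusion matrix
y_true = [0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 1, 1, 2]   # one class-1 sample misjudged as class 0

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=2))  # reproduces Table 6
```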

Here are the results:

Table 6. Bayesian logistic regression assessment results.

Category   Precision   Recall   F1-score
0          0.75        1.00     0.86
1          1.00        0.67     0.80
2          1.00        1.00     1.00

Table 6 shows the performance of the Bayesian logistic regression model on the three categories (0, 1, and 2), measured by precision, recall, and F1 score. For category 0, precision is 0.75, recall is 1.00, and the F1 score is 0.86, indicating that the model captures all true category-0 samples despite a small number of false positives. For category 1, precision is 1.00, recall is 0.67, and the F1 score is 0.80, indicating that the model is very precise when it predicts category 1 but misses some of its samples. For category 2, precision, recall, and F1 score are all 1.00, indicating flawless performance. Overall, the model performs best on category 2, with categories 0 and 1 also performing well.

In contrast, the confusion matrix for Bayesian logistic regression on the test set is:

[ 3  0  0 ]
[ 1  2  0 ]
[ 0  0  1 ]

Although the classification is not perfect, the errors are distributed more reasonably: one sample of the second category was misjudged as the first, the remaining samples were correctly classified, and all third-category samples were correct. This result reflects that the Bayesian method, by placing a prior distribution on the model parameters, applies a regularization constraint that effectively alleviates overfitting and improves the robustness of the model. Specifically, by treating parameters as probability distributions rather than fixed values, the Bayesian framework balances data likelihood against prior knowledge more reasonably, especially in small-sample classification tasks, and avoids the "pseudo-high accuracy" caused by the excessive variance of parameter estimates in traditional logistic regression.

In this study, 5-fold cross-validation was used to evaluate the generalization performance of the Bayesian logistic regression model. The 20 samples were randomly divided into 5 equal, mutually exclusive subsets of 4 samples each, with category proportions kept as close as possible to the original distribution. In each round, one subset served as the validation set and the remaining four subsets (16 samples in total) as the training set; the process was repeated 5 times so that all data participated in validation. The model uses a normal prior with mean 0 and variance 100, and the MCMC chain length is set to 2000 iterations, the first 1000 of which are discarded as burn-in. Accuracy, per-category F1 scores, and the confusion matrix were recorded for each round. The results are shown in Table 7.
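A sketch of the splitting step follows, using the heart failure classes from Table 1. Note that with only four class-3 samples, exact stratification over five folds is not possible (scikit-learn's StratifiedKFold would reject it), so this illustration uses a shuffled KFold as an approximation:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.random.default_rng(0).random((20, 12))   # stand-in for the weighted indicators
y = np.array([2, 1, 2, 2, 1, 2, 3, 1, 2, 1,
              1, 3, 1, 3, 1, 1, 1, 1, 1, 3])    # heart failure classes from Table 1

# Five shuffled folds of four samples each
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # fit the Bayesian logistic model on the 16 training samples, score the 4 held out
    print(f"fold {fold}: train={len(train_idx)}, validate={len(val_idx)}")
```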

Table 7 shows the specific results of the 5-fold cross-validation. The accuracy of each fold fluctuates between 80.0% and 85.0%; the F1 score of category 0 is stable between 0.83 and 0.87, that of category 1 stays in the range 0.78 - 0.82, and that of category 2 stays at or near 1.00 (0.95 in the third fold and 0.98 in the fifth).

Table 7. 5-fold cross-validation results.

Fold   Accuracy   Category 0 F1   Category 1 F1   Category 2 F1
1      82.5%      0.84            0.78            1.00
2      85.0%      0.87            0.81            1.00
3      80.0%      0.83            0.79            0.95
4      85.0%      0.86            0.82            1.00
5      82.5%      0.85            0.80            0.98

The comprehensive analysis shows that the model has good generalization ability. The average accuracy is 83.0%, essentially consistent with the 85% obtained on the independent test set. The average F1 scores are 0.85 (category 0), 0.80 (category 1), and 0.99 (category 2), all within 0.02 of the original test-set results (0.86, 0.80, 1.00), verifying the reliability of the evaluation. Performance remains stable across data partitions, with a standard deviation of only 2.1% in accuracy and no more than 0.02 in per-category F1 scores. Notably, the category-2 F1 score in the third fold dips slightly to 0.95; inspection showed that the balance-ability score of one boundary sample fell near the classification threshold.

In summary, the confusion matrix of Bayesian logistic regression not only meets statistical expectations but also, through its moderate and interpretable errors, demonstrates credibility in practical scenarios, indicating that its generalization performance surpasses that of traditional methods in small-sample classification tasks.

6. Summary

1) This study proposed a Bayesian logistic regression framework to address the insufficient generalization caused by overfitting on the small sample of elderly chronic heart failure data. Multicollinear features were handled with VIF screening (retaining variables with VIF < 10), key variables such as balance and gait ability were selected, the regression coefficients were constrained by a normal prior with mean 0 and standard deviation 10, and the parameter posterior was obtained via MCMC sampling. Experiments show that the Bayesian method effectively alleviates the parameter anomalies caused by the singularity of the traditional model's Hessian matrix, achieving 85% test-set accuracy (highest F1 score 1.00), significantly better than the overfitted traditional model.

2) Based on 16 multi-dimensional clinical indicators from 20 patients, the study identified core predictors such as balance and gait ability (correlation coefficient 0.52) and physical function status through Spearman correlation analysis, and quantified uncertainty with Bayesian posterior inference. The model's misjudgments were concentrated in a category with overlapping features (one case in the second category), but overall classification robustness improved markedly, providing highly credible decision support for individualized treatment of chronic heart failure with frailty in the elderly, especially in small-sample scenarios.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

References

[1] Chen, W. and Chen, M. (2021) Research Progress on Chronic Heart Failure Complicated with Frailty in the Elderly. Chinese Journal of Cardiovascular, 26, 201-204.
[2] Xu, T. and Li, S. (2023) Risk Factor Analysis of Elderly Patients with Chronic Heart Failure and Frailty. Bachu Medicine, 6, 90-95.
[3] Bassetti, D., Pospíšil, L. and Horenko, I. (2024) On Entropic Learning from Noisy Time Series in the Small Data Regime. Entropy, 26, Article 553.
[4] Tariq, A., Shu, H., Siddiqui, S., Munir, I., Sharifi, A., Li, Q., et al. (2022) Spatio-Temporal Analysis of Forest Fire Events in the Margalla Hills, Islamabad, Pakistan Using Socio-Economic and Environmental Variable Data with Machine Learning Methods. Journal of Forestry Research, 33, 183-194.
[5] Hawkins, D.M. (2003) The Problem of Overfitting. Journal of Chemical Information and Computer Sciences, 44, 1-12.
[6] Cramer, J.S. (2003) The Origins of Logistic Regression. Social Science Electronic Publishing.
[7] Phillips, K.T. and Street, W.N. (2005) Predicting Outcomes of Hospitalization for Heart Failure Using Logistic Regression and Knowledge Discovery Methods. AMIA Annual Symposium Proceedings, 2005, Article 1080.
[8] Stojanov, D., Lazarova, E., Veljkova, E., Rubartelli, P. and Giacomini, M. (2023) Predicting the Outcome of Heart Failure against Chronic-Ischemic Heart Disease in Elderly Population—Machine Learning Approach Based on Logistic Regression, Case to Villa Scassi Hospital Genoa, Italy. Journal of King Saud University-Science, 35, Article 102573.
[9] Wu, T. (2024) Research on Diagnosis and Classification Method of Alzheimer's Disease Based on Correlation Feature Learning. Master's Thesis, Qufu Normal University.
[10] Yang, S. (2021) Interpretability Analysis of ECG Features Based on Machine Learning. China Medical University.
[11] Kanwar, M., Khoo, C., Lohmueller, L., Bailey, S., Murali, S. and Antaki, J. (2019) Predicting Post LVAD Acute Severe Right Heart Failure Using Bayesian Analysis. The Journal of Heart and Lung Transplantation, 38, S357.
[12] Ren, L., Dai, G., Gao, W., et al. (2021) Bayesian Network Modeling for Surrogate Endpoints in Traditional Chinese Medicine Clinical Trials of Chronic Heart Failure. Chinese Journal of Evidence-Based Medicine, 21, 1381-1386.
[13] Fu, Z., Shi, Y., Li, Y., et al. (2022) Effect of Short-Term Exposure to Sulfur Dioxide on the Risk of Hospitalization in Patients with Heart Failure. Chinese Journal of Circulation, 37, 1042-1047.
[14] Zhu, Y. (2023) Research on Postoperative Heart Failure Prediction and ERAS Anesthesia Decision Optimization Based on Machine Learning. Master's Thesis, Chongqing Medical University.
[15] Tu, J. (2023) Comparison of the Effects of Multiple Cardiac Rehabilitation Modes for Heart Failure. Master's Thesis, Nanjing Medical University.
