This study explored and reviewed the logistic regression (LR) model, a multivariable method for modeling the relationship between multiple independent variables and a categorical dependent variable, with emphasis on medical research. Thirty-seven research articles published between 2000 and 2018 that employed logistic regression as the main statistical tool, as well as six textbooks on logistic regression, were reviewed. Logistic regression concepts such as odds, odds ratio, logit transformation, logistic curve, assumptions, selection of dependent and independent variables, model fitting, reporting and interpretation were presented. The review of the literature revealed considerable deficiencies in both the use and reporting of LR. For many studies, the ratio of the number of outcome events to predictor variables (events per variable) was small enough to call into question the accuracy of the regression model. Also, most studies did not report validation analysis, regression diagnostics or goodness-of-fit measures, which authenticate the robustness of the LR model. Here, we demonstrate the application of the LR model using data obtained on a cohort of pregnant women and the factors that influence their decision to opt for caesarean delivery or vaginal birth. It is recommended that researchers be more rigorous and pay greater attention to guidelines concerning the use and reporting of LR models.

Logistic regression (LR) analysis has become an increasingly employed statistical tool in medical research, especially over the last two decades [

The logistic function was invented in the 19th century by Pierre François Verhulst, a French mathematician, to describe the growth of human populations and the course of autocatalytic chemical reactions [

LR is used when the research question concerns whether or not an event occurred, rather than when it occurred (time course information is not used). It is particularly appropriate for models involving disease state (diseased or healthy) and decision making (yes or no), and is therefore widely used in studies in the health sciences. More complex forms handle situations where the predicted variable takes more than two categories; these are referred to as polychotomous or multinomial logistic regression [

As in all models, certain assumptions are made in order to fit the model to the data. LR does not assume a linear relationship between the dependent and independent variables, but between the logit of the outcome and the predictor values [

Studies with small to moderate sample sizes employing LR overestimate the effect they measure [

In addition to its many uses for developing models that will predict events in the physical sciences [

To examine if commonly recommended assumptions for multivariable LR are addressed, Ottenbacher et al. [

Bagley et al. [

In a review of four multivariate methods appearing in the literature from 1985 to 1989, Concato et al. [

Multivariable LR is a sophisticated statistical technique and concern has been expressed regarding its use and interpretation [

The LR gives each predictor a coefficient which measures its independent contribution to variation in the dependent variable. The dependent variable Y takes the value 1 if the response is “Yes” and takes a value 0 if the response is “No”.

The model for the predicted probabilities is expressed in terms of the natural logarithm (ln) of the odds:

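$$\ln\left[\frac{P(Y)}{1 - P(Y)}\right] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k \tag{1}$$

and,

$$\frac{P(Y)}{1 - P(Y)} = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k} \tag{2}$$

$$P(Y) = e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k} - P(Y)\, e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k} \tag{3}$$

$$P(Y) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k}} \tag{4}$$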

where $\ln\left[\frac{P(Y)}{1 - P(Y)}\right]$ is the log odds of the outcome, Y is the dichotomous outcome, $X_1, X_2, \cdots, X_k$ are the predictor variables, and $\beta_0, \beta_1, \beta_2, \cdots, \beta_k$ are the regression (model) coefficients, with β_{0} the intercept.
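As a minimal sketch of Equation (4), the Python function below turns a set of coefficients and predictor values into a predicted probability, and a second function maps the probability back to the log odds of Equation (1). The function names and the coefficient values are hypothetical and are used only for illustration; they are not taken from any of the reviewed studies.

```python
import math


def predicted_probability(intercept, coefficients, predictors):
    """Return P(Y) from Equation (4) for a single observation."""
    linear_predictor = intercept + sum(
        b * x for b, x in zip(coefficients, predictors)
    )
    # Numerically stable evaluation of e^z / (1 + e^z).
    if linear_predictor >= 0:
        return 1.0 / (1.0 + math.exp(-linear_predictor))
    e = math.exp(linear_predictor)
    return e / (1.0 + e)


def log_odds(probability):
    """Return ln[P / (1 - P)], the left-hand side of Equation (1)."""
    return math.log(probability / (1.0 - probability))


# Hypothetical model: beta_0 = -1.0, beta_1 = 0.5, beta_2 = 2.0.
p = predicted_probability(-1.0, [0.5, 2.0], [2.0, 0.25])
print(round(p, 4), round(log_odds(p), 4))
```

Here the linear predictor is −1.0 + 0.5 × 2.0 + 2.0 × 0.25 = 0.5, so taking the log odds of the returned probability recovers 0.5.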

In Equation (4), the logistic regression model directly relates the probability of Y to the predictor variables. The goal of LR is to estimate the k + 1 unknown parameters β in Equation (4). This is done with maximum likelihood estimation, which entails finding the set of parameters for which the probability of the observed data is greatest. The regression coefficients indicate the degree of association between each independent variable and the outcome. Each coefficient represents the change we would expect in the log odds of the response variable for a one-unit change in the predictor variable, with the other predictors held constant. The objective of LR is to correctly predict the category of outcome for individual cases using the best model. To accomplish this goal, a model is created that includes all predictor variables that are useful in predicting the response variable. LR models the odds, that is, the probability of success over the probability of failure, and the results of the analysis are reported in the form of odds ratios.
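To make the idea of maximum likelihood estimation concrete, the following sketch fits a model with a single predictor by Newton-Raphson iteration. This is an illustration under stated assumptions, not the routine used by any particular statistical package: the function name `fit_logistic` and the small data set (10 subjects with x = 0, of whom 2 have the event, and 10 with x = 1, of whom 6 have the event) are invented for the example.

```python
import math


def fit_logistic(x, y, iterations=25, tolerance=1e-10):
    """Maximum likelihood fit of logit P(Y) = b0 + b1 * x by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0          # gradient of the log likelihood
        h00 = h01 = h11 = 0.0  # information matrix (negative Hessian)
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Solve the 2-by-2 system (information matrix) * step = gradient.
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tolerance:
            break
    return b0, b1


# Hypothetical data: 2 of 10 events when x = 0, 6 of 10 events when x = 1.
x = [0] * 10 + [1] * 10
y = [1] * 2 + [0] * 8 + [1] * 6 + [0] * 4
b0, b1 = fit_logistic(x, y)
print(round(b0, 4), round(b1, 4), round(math.exp(b1), 4))
```

With a single binary predictor the maximum likelihood estimates have a closed form, which gives a check on the iteration: the intercept should equal ln(2/8) and exp(b1) should equal the sample odds ratio (6/4)/(2/8) = 6.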

The binary dependent variable has the values of 0 and 1 and the predicted value (probability) must be bounded to fall within the same range. To define a relationship bounded by 0 and 1, LR uses the logistic curve to represent the relationship between the independent and dependent variable. At very low levels of the independent variable, the probability approaches 0 but never reaches it. Likewise, as the independent variable increases, the predicted values increase up the curve and approach 1 but never equal 1.

The logistic transformation ensures that estimated values do not fall outside the range of 0 and 1. This is achieved in two steps. First, the probability is restated as odds, defined as the ratio of the probability of the event occurring to the probability of it not occurring. For example, if a horse has a probability of 0.8 of winning a race, the odds of it winning are 0.8/(1 − 0.8) = 4, or 4:1. To constrain the predicted values to within 0 and 1, the odds value can be converted back into a probability; thus,

$$\text{Probability(event)} = \frac{\text{odds(event)}}{1 + \text{odds(event)}} \tag{5}$$

It can therefore be shown that the corresponding probability is 4/(1 + 4) = 0.8. Second, to keep the odds values from going below 0, which is the lower limit (there is no upper limit), the logit value, calculated by taking the logarithm of the odds, is computed. Odds less than 1 have a negative logit value, odds greater than 1 have a positive logit value, and odds of 1 (corresponding to a probability of 0.5) have a logit value of 0.
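The two steps can be traced numerically. The short sketch below reproduces the horse-racing example (probability 0.8, odds 4) and shows the sign of the logit on either side of a probability of 0.5; the function names are chosen for this illustration only.

```python
import math


def odds_from_probability(p):
    """Odds = probability of the event / probability of no event."""
    return p / (1.0 - p)


def probability_from_odds(odds):
    """Equation (5): convert odds back into a probability."""
    return odds / (1.0 + odds)


def logit(p):
    """Natural logarithm of the odds."""
    return math.log(odds_from_probability(p))


odds = odds_from_probability(0.8)
print(round(odds, 6))                         # odds of winning
print(round(probability_from_odds(odds), 6))  # back to the probability
print(round(logit(0.2), 4), logit(0.5), round(logit(0.8), 4))
```

The logits of 0.2 and 0.8 are equal in size and opposite in sign, which illustrates the symmetry of the logit scale around a probability of 0.5.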

When an independent variable X_{i} increases by one unit (from X_{i} to X_{i} + 1), with all other factors remaining constant, the odds of the dependent variable are multiplied by a factor exp(β_{i}), which is called the odds ratio (OR) and ranges from zero (0) to positive infinity. It indicates the relative amount by which the odds of the dependent variable increase (OR > 1) or decrease (OR < 1) when the value of the corresponding independent variable increases by one (1) unit.
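As a brief illustration, the sketch below converts regression coefficients into odds ratios and into the corresponding percentage change in the odds. The two coefficients are the values reported for birth weight 2.5 - 3.5 kg and for parity above three in the worked example later in this article; the function names are assumptions made for the example.

```python
import math


def odds_ratio(beta):
    """Odds ratio associated with a one-unit increase in the predictor."""
    return math.exp(beta)


def percent_change_in_odds(beta):
    """Percentage change in the odds; negative values are decreases."""
    return (math.exp(beta) - 1.0) * 100.0


for label, beta in [("birth weight 2.5 - 3.5 kg", -1.5381),
                    ("parity above three", 1.6898)]:
    print(label, round(odds_ratio(beta), 4),
          round(percent_change_in_odds(beta), 2))
```

A negative coefficient gives an OR below 1 (here about 0.2148, a decrease of roughly 78.5% in the odds), while a positive coefficient gives an OR above 1 (here about 5.4184).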

In many cases, the outcome event is easily categorized as having occurred or not having occurred. For example, the occurrence of a heart attack, or delivery by caesarean section, is relatively easily discerned and coded as either having happened or not having happened. Once this categorization has been achieved, the predictors of that outcome can be studied [

Another aspect to consider in the development of a LR study concerns the selection of which variables to analyse as potential predictors of the outcome. This can only be achieved by a careful study of the literature in relation to the outcome, in order to ensure that the full range of potential predictors is included [

Further, if interaction terms between the variables are to be considered, then the omission of some variables could have a major impact on the results. Unfortunately, the solution is not simply to include as many variables as possible, because the inclusion of variables that are unrelated to the outcome tends to inflate the apparent predictive validity of the final model [

There may well be constraints acting on any particular study that lead to bias in the selection of the data used for the analysis. One potential constraint is the sample size, which limits the number of variables that can be studied. There is some debate as to the number of participants per variable that are needed; however, Agresti [

Another source of selection bias in the variables that are studied is missing data. Missing data can drive down the sample size if participants with missing data are excluded, or can lead to the exclusion of certain variables from the analysis if large amounts of data are missing. Unfortunately, both outcomes can bias the variables selected, and this bias may be highly important because it leaves the sample self-selected; that is, comprising only those individuals who chose to supply certain data, or only those data that are readily supplied by the sample, in addition to other reasons why data are missing [

Finally, in addition to selection bias effects from these sources, the selection of variables is also constrained by the properties of the data that are collected. For example, predictor variables that are related to one another (that show collinearity or multicollinearity) or predictor variables that have excessively influential observations (outliers) will adversely affect the results of a LR. In small or moderate samples in particular, collinearity can produce overall significance for the LR model even when individual predictors are not in themselves predictive of the outcome, or can cause the degree of relationship between a predictor and the outcome to be incorrectly established [

The goodness-of-fit of the LR model can be assessed in several ways. First, the overall model (the relationship between all of the independent variables and the dependent variable) is assessed. Second, the significance of each of the independent variables is assessed. Third, the predictive accuracy or discriminating ability of the model is evaluated. Finally, the model is validated.

1) The likelihood ratio test

The overall fit of a model shows how strong the relationship is between all of the independent variables, taken together, and the dependent variable. It can be assessed by comparing the fit of two models, with and without the independent variables. A LR model with the k independent variables is said to provide a better fit to the data if it demonstrates an improvement over the model with no independent variables (the null model). The overall fit of the model with k coefficients can be examined through a likelihood ratio test, which tests the null hypothesis:

$$H_0 : \beta_1 = \beta_2 = \cdots = \beta_k = 0 \tag{6}$$

To do this, the deviance with just the intercept (−2 log likelihood of the null model) is compared with the deviance when the k independent variables have been added (−2 log likelihood of the given model). The difference between the two yields a goodness-of-fit index G, a χ^{2} statistic with k degrees of freedom (DoF) [

$$G = \chi^2 = (-2 \log \text{likelihood of null model}) - (-2 \log \text{likelihood of given model}) \tag{7}$$

An equivalent formula sometimes presented in the literature is,

$$G = \chi^2 = -2 \log\left(\frac{\text{likelihood of the null model}}{\text{likelihood of the given model}}\right) \tag{8}$$

where the ratio of the maximized likelihoods is calculated before taking the natural logarithm (ln) and multiplying by −2. The term “likelihood ratio test” is used to describe this test. If the p-value for the overall model fit statistic is less than the significance level of the test, conventionally 0.05 (P < 0.05), then H_{0} is rejected, with the conclusion that there is evidence that at least one of the independent variables contributes to the prediction of the outcome.
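The test can be sketched in a few lines of Python. The helper `chi2_sf` computes the upper-tail probability of a χ^{2} distribution for integer degrees of freedom using the standard closed-form series, so that no external library is needed. The two deviance values are hypothetical, chosen so that their difference equals the G = 12.02 with 2 DoF reported in the worked example later in this article; the function names are also assumptions.

```python
import math


def chi2_sf(x, dof):
    """Upper-tail probability of a chi-square variable (integer dof >= 1)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if dof % 2 == 0:
        # Even dof: exp(-x/2) * sum_{i < dof/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, dof // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd dof: normal tail plus a finite series in odd powers of sqrt(x).
    term, total = math.sqrt(x), 0.0
    for i in range(1, (dof - 1) // 2 + 1):
        if i > 1:
            term *= x / (2 * i - 1)
        total += term
    return (math.erfc(math.sqrt(half))
            + math.sqrt(2.0 / math.pi) * math.exp(-half) * total)


def likelihood_ratio_test(deviance_null, deviance_model, k):
    """Equation (7): G is the drop in -2 log likelihood, tested on k DoF."""
    g = deviance_null - deviance_model
    return g, chi2_sf(g, k)


# Hypothetical deviances whose difference is 12.02, tested on 2 DoF.
g_statistic, p_value = likelihood_ratio_test(100.0, 87.98, 2)
print(round(g_statistic, 2), round(p_value, 3))
```

Because the p-value is below 0.05, H_{0} in Equation (6) would be rejected for these inputs.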

2) Hosmer-Lemeshow test

The Hosmer-Lemeshow test is used to examine whether the observed proportions of events are similar to the predicted probabilities of occurrence in subgroups of the model population. The test is performed by dividing the predicted probabilities into deciles (10 groups based on percentile ranks) and then computing a Pearson’s Chi-square (χ^{2}) that compares the predicted to the observed frequencies in a 2-by-10 table. The value of the test statistic is expressed as,

$$H = \sum_{g=1}^{10} \frac{(O_g - E_g)^2}{E_g} \tag{9}$$

where O_{g} and E_{g} denote the observed and expected events for the g^{th} risk decile group. The test statistic asymptotically follows a χ^{2} distribution with 8 (number of groups minus 2) DoF. Small values (with large P-value closer to 1) indicate a good fit to the data and therefore good overall model fit. Large values (with P < 0.05) indicate a poor fit to the data. Hosmer and Lemeshow [
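The grouping and summation can be sketched as follows. This sketch assumes the usual 2-by-g Pearson form, in which each group contributes the terms for both events and non-events; these combine to (O − E)^2 / [n p̄ (1 − p̄)], where n is the group size and p̄ the mean predicted probability in the group. The function name and the tiny data set are hypothetical, and two groups are used instead of ten only to keep the arithmetic easy to follow.

```python
def hosmer_lemeshow(probabilities, outcomes, groups=10):
    """Hosmer-Lemeshow statistic over groups of sorted predicted risk."""
    pairs = sorted(zip(probabilities, outcomes))
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        observed = sum(y for _, y in chunk)   # observed events
        expected = sum(p for p, _ in chunk)   # expected events
        # Event and non-event cells combined:
        # (O - E)^2 / E + (O - E)^2 / (n - E) = (O - E)^2 / (E * (1 - E / n))
        statistic += (observed - expected) ** 2 / (
            expected * (1.0 - expected / size)
        )
    return statistic


# Hypothetical predictions: five subjects at risk 0.2 and five at risk 0.8.
probabilities = [0.2] * 5 + [0.8] * 5
outcomes = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]
h = hosmer_lemeshow(probabilities, outcomes, groups=2)
print(round(h, 4))
```

In the low-risk group 2 events are observed against 1 expected, contributing 1/(5 × 0.2 × 0.8) = 1.25, while the high-risk group matches its expectation of 4 events and contributes nothing. With the default of ten groups the statistic would be compared with a χ^{2} distribution on 8 DoF.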

If the overall model works well, the next question is how important each of the independent variables is. The LR coefficient for the i^{th} independent variable shows the change in the predicted log odds of having an outcome for a one-unit change in the i^{th} independent variable, all other things being equal. That is, if the i^{th} independent variable, with regression coefficient b, is changed by 1 unit while all of the other predictors are held constant, the log odds of the outcome is expected to change by b units. Several tests are designed to assess the significance of an independent variable in logistic regression, including the likelihood ratio test and the Wald statistic [

1) Wald statistic

Statistical tests of significance can be applied to each variable’s coefficient. For each coefficient, the null hypothesis that the coefficient is zero is tested against the alternative that the coefficient is not zero using a Wald test, W_{j}. A Wald test can also be used to compare a full model containing all the predictor variables with a reduced model with some coefficients set to zero. The Wald statistic can be used to assess the contribution of individual predictors or the significance of individual coefficients in a given model [. It is the ratio of the squared regression coefficient to the squared standard error of that coefficient and asymptotically follows a χ^{2} distribution:

$$W_j = \frac{\beta_j^2}{SE_{\beta_j}^2} \tag{10}$$

Each Wald statistic is compared with a χ^{2} critical value with 1 DoF.
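A minimal sketch of this calculation is given below, using the coefficient and standard error reported for parity one in the worked example later in this article (β = 1.1588, SE = 0.5700). For 1 DoF the χ^{2} upper-tail probability equals erfc(√(W/2)), which is available in the Python standard library. The function name is an assumption; note also that the results table in the worked example appears to report the signed ratio β/SE, whose square is W_{j}.

```python
import math


def wald_test(beta, standard_error):
    """Return the signed ratio z, the Wald statistic W = z^2 and its p-value."""
    z = beta / standard_error
    w = z * z
    # Upper tail of a chi-square distribution with 1 DoF.
    p_value = math.erfc(math.sqrt(w / 2.0))
    return z, w, p_value


z, w, p_value = wald_test(1.1588, 0.5700)
print(round(z, 3), round(w, 3), round(p_value, 4))
```

The resulting ratio of about 2.033 and p-value of about 0.042 agree with the values tabulated for that predictor.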

2) Likelihood ratio test

The likelihood-ratio test used to assess overall model fit can also be used to assess the contribution of individual predictors to a given model. The likelihood ratio test for a particular parameter compares the likelihood of obtaining the data when the parameter is zero, L_{0}, with the likelihood, L_{1}, of obtaining the data evaluated at the maximum likelihood estimate of the parameter.

The test statistic is calculated as follows:

$$G = -2 \ln\left(\frac{L_0}{L_1}\right) = -2\left(\ln L_0 - \ln L_1\right) \tag{11}$$

This statistic is compared with a χ^{2} distribution with 1 DoF. To assess the contribution of individual predictors one can enter the predictors hierarchically, then compare each new model with the previous model to determine the contribution of each predictor.

The classification table cross-classifies dichotomous observed outcomes and dichotomous predicted outcomes. The table has the following form.

Observed | Predicted: 1 | Predicted: 0 |
---|---|---|
1 | a | b |
0 | c | d |

where a, b, c and d are the number of observations in the corresponding cells.

If the logistic regression model has a good fit, we expect to see many counts in the a and d cells, and few in the b and c cells. In an analogy with medical diagnostic testing, we can consider,

$$\text{Sensitivity} = \frac{a}{a + b} \quad \text{and} \quad \text{Specificity} = \frac{d}{c + d} \tag{12}$$

where, higher sensitivity and specificity indicate a better fit of the model.
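Equation (12) can be applied directly to the cell counts. The sketch below uses the counts from the classification table of the worked example later in this article (a = 158, b = 31, c = 32, d = 148); the function name and the added overall accuracy, (a + d) divided by the total, are assumptions made for the illustration.

```python
def classification_summary(a, b, c, d):
    """Sensitivity, specificity and overall accuracy from a 2-by-2 table.

    a: observed 1, predicted 1    b: observed 1, predicted 0
    c: observed 0, predicted 1    d: observed 0, predicted 0
    """
    return {
        "sensitivity": a / (a + b),
        "specificity": d / (c + d),
        "accuracy": (a + d) / (a + b + c + d),
    }


summary = classification_summary(158, 31, 32, 148)
for name, value in summary.items():
    print(name, round(100 * value, 1))
```

These inputs give 83.6%, 82.2% and 82.9%, matching the percentages of correct classification reported in that table.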

Extending the above two-by-two idea (

Validation is an important test of the regression’s internal validity, a crucial step in the argument that the regression model is not an idiosyncratic artifact but instead that it has captured essential relationships in the domain of study. An important question is whether results of the LR analysis on the sample can be extended to the population the sample has been chosen from. This question is referred to as model validation. In practice, a model can be validated by deriving the model and estimating its coefficients in one data set, then using this model to predict the outcome variable in a second data set and checking the residuals, and so on. When a model is validated using the data on which it was developed, its performance is likely to be over-estimated. Thus, the validity of the model should be assessed by carrying out tests of goodness of fit and discrimination on a different data set [

If the model is developed with a sub-sample of observations and validated with the remaining sample, it is called internal validation. The most widely used methods for obtaining a good internal validation are data-splitting, repeated data-splitting, jackknife technique and bootstrapping [
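Two of these internal validation schemes, data-splitting and bootstrapping, differ mainly in how the development sample is drawn. The sketch below shows only that sampling step; the function names, the 70/30 split and the fixed random seed are assumptions made for illustration, and 369 is used as the sample size because it is the total count in the classification table of the worked example.

```python
import random


def split_indices(n, validation_fraction=0.3, seed=0):
    """Data-splitting: disjoint development and validation index sets."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    cut = int(round(n * (1.0 - validation_fraction)))
    return sorted(indices[:cut]), sorted(indices[cut:])


def bootstrap_indices(n, seed=0):
    """Bootstrapping: n indices drawn with replacement from the sample."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(n)]


development, validation = split_indices(369)
resample = bootstrap_indices(369)
print(len(development), len(validation), len(set(resample)))
```

In a data-splitting design the model would be fitted on the development indices and its goodness of fit and discrimination assessed on the validation indices; in a bootstrap design the fit is repeated over many resamples and evaluated on the original sample.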

If, however, the model does not fit the data set exactly, some indication of how well it does fit should be given. Summary goodness-of-fit measures describe how well the entire model matches the observed values; in addition, regression diagnostics (including residual, leverage, and influence measures) are important in revealing the effect of individual subjects on the estimated model. A perfect fit has a −2 LL value of 0 and an R^{2}_{LOGIT} of 1. The Cox and Snell R^{2} measure and the Nagelkerke R^{2} measure are common in most statistical software packages [
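For readers who wish to see how these summary measures relate to the log likelihood, the sketch below computes them from the log likelihoods of the null and fitted models using the standard definitions: Cox and Snell R^{2} = 1 − exp[2(LL_null − LL_model)/n], and Nagelkerke R^{2} as the Cox and Snell value divided by its maximum attainable value, 1 − exp(2 LL_null/n). R^{2}_{LOGIT} is assumed here to be the proportional reduction in −2 LL, 1 − LL_model/LL_null. The function name and the input values are hypothetical and are not taken from the worked example.

```python
import math


def pseudo_r_squared(ll_null, ll_model, n):
    """Pseudo R-squared measures from log likelihoods and sample size n."""
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return {
        "r2_logit": 1.0 - ll_model / ll_null,
        "cox_snell": cox_snell,
        "nagelkerke": cox_snell / max_cox_snell,
    }


# Hypothetical log likelihoods for a sample of 150 observations.
measures = pseudo_r_squared(ll_null=-100.0, ll_model=-60.0, n=150)
print({name: round(value, 4) for name, value in measures.items()})

# A perfect fit (log likelihood of 0, so -2 LL = 0) gives values of 1.
perfect = pseudo_r_squared(ll_null=-100.0, ll_model=0.0, n=150)
print(perfect["r2_logit"], perfect["nagelkerke"])
```

The Cox and Snell measure cannot reach 1 even for a perfect fit, which is the reason the Nagelkerke rescaling is usually reported alongside it.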

Since the estimates of the included variable may be sensitive to changes in the variable(s) omitted, some researchers have chosen to retain all the variables representing the same factor if at least one of them is statistically significant. They refer to such a model as the full model [

The following four types of information should be included when presenting LR results: 1) an overall evaluation of the logistic model; 2) statistical tests of individual predictors; 3) goodness-of-fit statistics; and 4) an assessment of the predicted probabilities. We demonstrate this using recent work on the variables that inform expectant mothers’ choice between caesarean delivery and vaginal birth [^{2} statistic.

From

Explanatory Variable | Coefficient β | Standard Error | P-Value | Wald Test W_{j} | Odds Ratio OR |
---|---|---|---|---|---|
Baby’s Birth Weight (3.5 kg and above as Reference) | | | | | |
2.5 - 3.5 kg | −1.5381 | 0.3988 | 0.00012 | −3.857 | 0.2148 |
Less than 2.5 kg | −1.6042 | 0.5148 | 0.00183 | −3.116 | 0.2010 |
Parity (None as Reference) | | | | | |
One | 1.1588 | 0.5700 | 0.04205 | 2.033 | 3.1861 |
Two | 1.0248 | 0.5063 | 0.04296 | 2.024 | 2.7865 |
Three | 1.1322 | 0.5273 | 0.03178 | 2.147 | 3.1025 |
Above Three | 1.6898 | 0.6047 | 0.0052 | 2.794 | 5.4184 |

Figures in italics are significant (P < 0.05). See Reference [

Test | Categories | χ^{2} | DoF | P-value |
---|---|---|---|---|
Overall Model Evaluation | Likelihood Ratio Test | 12.02 | 2 | 0.002 |
Overall Model Evaluation | Wald Test | 11.06 | 2 | 0.004 |
Goodness of Fit Test | Hosmer and Lemeshow Test | 5.975 | 8 | 0.65 |

Step | −2 Log Likelihood | Cox & Snell R^{2} | Nagelkerke R^{2} |
---|---|---|---|
1 | 273.175 | 0.576 | 0.723 |

Step | Observed Caesarean Delivery | Predicted Caesarean Delivery: Yes | Predicted Caesarean Delivery: No | Percentage Correct (%) |
---|---|---|---|---|
1 | Yes | 158 | 31 | 83.6 |
1 | No | 32 | 148 | 82.2 |
1 | Overall Percentage | | | 82.9 |

^{a} The cut-off value is 0.500.

weight from 2.5 kg to 3.5 kg and (β = −1.6042, P = 0.002 and OR = 0.2010) for babies with birth weight below 2.5 kg show that the odds of caesarean delivery are higher for babies with birth weight above 3.5 kg than for babies with birth weights below 3.5 kg. That is, relative to the reference group, the odds of caesarean delivery decrease by 78.52% for babies with birth weight from 2.5 kg to 3.5 kg and by 79.9% for babies with birth weight below 2.5 kg.

It could also be seen from

The Hosmer and Lemeshow test evaluates the null hypothesis H_{0}, that there is no difference between the predicted and actual values, against H_{1}, that there is a difference between the predicted and actual values. At a p-value of 0.650, the null hypothesis is not rejected and we conclude that no significant differences remain between the actual and expected values, suggesting that the model fitted the data well.

The model summary of the logistic model gives an R^{2} of 0.723 for the Nagelkerke R^{2} and 0.576 for the Cox and Snell R^{2}. That is, by the Nagelkerke measure, the fitted model can explain or account for 72.3% of the variation in the dependent variable. This is an indication of a good model.

This study explored the components of the LR model, a type of multivariable method used frequently for modeling the relationship between multiple independent variables and a categorical dependent variable, with emphasis on medical research. Six textbooks on logistic regression and 37 research articles published between 2000 and 2018 that employed logistic regression as the main statistical tool were reviewed. Logistic regression concepts such as odds, odds ratio, logit transformation, logistic curve, assumptions, selection of dependent and independent variables, fitting, reporting and interpretation were presented. The review of the literature revealed considerable deficiencies in both the use and reporting of LR. For many studies, the ratio of the number of outcome events to predictor variables (events per variable) was small enough to call into question the accuracy of the regression model. Also, most studies did not report validation analysis, regression diagnostics or goodness-of-fit measures. Proper use of this powerful and sophisticated modeling technique requires considerable care both in the specification of the form of the model and in the calculation and interpretation of the model’s coefficients. We presented an example of how LR should be applied. It is recommended that researchers be more thorough and pay greater attention to these guidelines concerning the use and reporting of LR models. In future, researchers could compare LR with other emerging classification algorithms to enable better or more rigorous evaluations of such data.

The idea was developed by EYB. Literature was reviewed by both authors. Both authors contributed to manuscript writing and approved the final manuscript.

We thank the anonymous reviewers whose comments made this manuscript more robust.

This study attracted no funding.

The authors declare no conflicts of interest regarding the publication of this manuscript.

Boateng, E.Y. and Abaye, D.A. (2019) A Review of the Logistic Regression Model with Emphasis on Medical Research. Journal of Data Analysis and Information Processing, 7, 190-207. https://doi.org/10.4236/jdaip.2019.74012