A Review of the Logistic Regression Model with Emphasis on Medical Research

This study explored and reviewed the logistic regression (LR) model, a multivariable method for modeling the relationship between multiple independent variables and a categorical dependent variable, with emphasis on medical research. Thirty-seven research articles published between 2000 and 2018 which employed logistic regression as the main statistical tool, as well as six textbooks on logistic regression, were reviewed. Logistic regression concepts such as odds, odds ratio, logit transformation, logistic curve, assumptions, selecting dependent and independent variables, model fitting, reporting and interpreting were presented. Upon perusing the literature, considerable deficiencies were found in both the use and reporting of LR. For many studies, the ratio of the number of outcome events to predictor variables (events per variable) was sufficiently small to call into question the accuracy of the regression model. Also, most studies did not report on validation analysis, regression diagnostics or goodness-of-fit measures; measures which authenticate the robustness of the LR model. Here, we demonstrate a good example of the application of the LR model using data obtained on a cohort of pregnant women and the factors that influence their decision to opt for caesarean delivery or vaginal birth. It is recommended that researchers should be more rigorous and pay greater attention to guidelines concerning the use and reporting of LR models.


Introduction
Logistic regression (LR) analysis has become an increasingly employed statistical tool in medical research. The logistic model agreed very well with the actual course of the populations of France, Belgium, Essex (UK), and Russia for the periods up to the early 1830s.
The logistic function was discovered anew in 1920 by Pearl and Reed in a study of the population growth of the USA [8].
LR is used when the research question is focused on whether or not an event occurred, rather than when it occurred (time-course information is not used). It is particularly appropriate for models involving disease state (diseased or healthy) and decision making (yes or no), and is therefore widely used in studies in the health sciences. There are more complex forms which can deal with situations where the predicted variable takes more than two categories; it is then referred to as polychotomous or multinomial logistic regression [9].
As in all models, certain assumptions are made in order to fit the model to the data. LR does not assume a linear relationship between the dependent and independent variables, but between the logit of the outcome and the predictor values [10]. The dependent variable must be categorical; the independent variables need not be interval, normally distributed, linearly related, or of equal variance within each group; and, lastly, the categories (groups) must be mutually exclusive and exhaustive: a case can only be in one group and every case must be a member of one of the groups. LR has the power to accommodate both categorical and continuous independent variables, although the power of the analysis is increased if the independent variables are normally distributed and have a linear relationship with the dependent variable [11]. Inspection of these assumptions shows that this technique can be employed somewhat more flexibly than traditional regression techniques, making it suitable for many clinically relevant situations. For any given case, LR computes the probability that a case with a particular set of values for the independent variables is a member of the modeled category. Larger samples are needed than for linear regression because maximum likelihood coefficients are large-sample estimates [12].
Studies with small to moderate sample sizes employing LR overestimate the effect they measure [4] [13]. Thus, large sample sizes are required for LR to provide sufficient numbers in both categories of the outcome variable. Also, the more independent variables are included, the larger the sample size required.
With small sample sizes, the Hosmer-Lemeshow test has low power and is unlikely to detect subtle deviations from the logistic model. Hosmer and Lemeshow recommend sample sizes greater than 400 and a minimum of ten cases per independent variable [4] [13].
In addition to its many uses for developing models that will predict events in the physical sciences [14], economics [15] [16] and political sciences [17], LR is increasingly being applied in medical research [18] [19] [20]. Examples of the use of logistic regression in medicine include a study of the factors that predict whether an improvement or no improvement will occur after an intervention [21] [22], the presence or absence of a disease in relation to a variety of factors [23], to explore the effects of and relationships between multiple predictors [24] [25], to determine which of a range of potential predictors actually are important [23] [26] and, to determine whether newly explored variables add to the predictive validity of already established models [27]. The other applications of LR are to develop novel statistical methods based on ranked-data [28].
To examine if commonly recommended assumptions for multivariable LR are addressed, Ottenbacher et al. [29] surveyed 99 articles from two journals; the Journal of Clinical Epidemiology and the American Journal of Epidemiology, under 10 criteria, six dealing with computation and four with reporting multivariable LR results. Their study revealed that three of the 10 criteria were addressed in 50% or more of the articles. Statistical significance testing or confidence intervals were reported in all articles. Methods for selecting independent variables were described in 82% and specific procedures used to generate the models were discussed in 65%. Fewer than 50% of the articles indicated if interactions were tested or met the recommended events per independent variable ratio of 10:1. Fewer than 20% of the articles described conformity to a linear gradient, examined collinearity, reported information on validation procedures, goodness-of-fit, discrimination statistics, or provided complete information on variable coding.
There was no significant difference (P > 0.05) in the proportion of articles meeting the criteria across the two journals. They concluded that articles reviewed frequently did not report commonly recommended assumptions for using multivariable LR.
Bagley et al. [30] also identified 15 peer-reviewed articles and reported on substantial shortcomings in the use and reporting of LR results. Their study revealed that none of the articles reported any goodness-of-fit measures or regression diagnostics. The majority of the studies had events-per-variable ratios near or below 10, suggesting that those regression results themselves may be particularly unreliable, and finally, none of the studies reported any validation analysis.
In a review of four multivariate methods appearing in the literature from 1985 to 1989, Concato et al. [31] reported that LR was the most frequently used procedure. The study revealed significant increases in the use of LR, proportional hazards regression and methods for the analysis of data from complex sample surveys. Multivariable LR is a sophisticated statistical technique and concern has been expressed regarding its use and interpretation [29] [34] [35] [36]. The concerns have focused on assumptions associated with the appropriate use, correct interpretation and complete reporting of multivariable LR. The quality of the LR analysis depends heavily on researchers understanding the assumptions inherent in the method and following principles developed to ensure their sound application. Explicitness in modeling is also necessary for reporting the results to other researchers for verification and replication. It is against this backdrop that this article aims to re-examine the components of and reporting requirements for the LR model as applied in medical research, and places emphasis on more thorough and rigorous reporting for a wider audience.

The Logistic Regression Model
The LR gives each predictor a coefficient which measures its independent contribution to variation in the dependent variable. The dependent variable Y takes the value 1 if the response is "Yes" and takes a value 0 if the response is "No".
The model for the predicted probabilities is expressed through the natural logarithm (ln) of the odds:

ln[P(Y = 1)/(1 − P(Y = 1))] = β0 + β1X1 + β2X2 + … + βkXk

where the left-hand side is the log (odds) of the outcome, Y is the dichotomous outcome, X1, X2, …, Xk are the predictor variables, β1, β2, …, βk are the regression (model) coefficients and β0 is the intercept.
The logistic regression model directly relates the probability of Y to the predictor variables. The goal of LR is to estimate the k + 1 unknown parameters β. This is done with maximum likelihood estimation, which entails finding the set of parameter values for which the probability of the observed data is greatest.
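Under the assumption of a single predictor and an illustrative toy data set (not the study's data), the maximum likelihood estimation just described can be sketched in pure Python by gradient ascent on the log-likelihood:

```python
import math

# Toy data (hypothetical): x is a single predictor, y the binary outcome.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [0,   0,   0,   1,   0,   1,   1,   1]

def predict(b0, b1, xi):
    """Logistic model: P(Y = 1 | x) = 1 / (1 + exp(-(b0 + b1*x)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))

# Maximize the log-likelihood by gradient ascent; the learning rate and
# iteration count are illustrative assumptions.
b0, b1, lr = 0.0, 0.0, 0.1
for _ in range(20000):
    g0 = sum(yi - predict(b0, b1, xi) for xi, yi in zip(x, y))
    g1 = sum((yi - predict(b0, b1, xi)) * xi for xi, yi in zip(x, y))
    b0 += lr * g0
    b1 += lr * g1

print(b0, b1)                 # fitted intercept and slope
print(predict(b0, b1, 4.0))   # predicted probability for a large x
```

In practice the estimation is done by statistical software (iteratively reweighted least squares or Newton-Raphson); this sketch only makes the "most likely parameters" idea concrete.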

The Logistic Curve
The binary dependent variable has the values of 0 and 1 and the predicted value (probability) must be bounded to fall within the same range. To define a relationship bounded by 0 and 1, LR uses the logistic curve to represent the relationship between the independent and dependent variable. At very low levels of the independent variable, the probability approaches 0, but never reaches 0.
Likewise, as the independent variable increases, the predicted values increase up the curve and approach 1, but never equal 1.
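This bounded, S-shaped behaviour can be checked numerically with a minimal sketch of the logistic function; the input values below are illustrative:

```python
import math

def logistic(z):
    # The logistic curve: strictly between 0 and 1 for any finite z
    return 1.0 / (1.0 + math.exp(-z))

for z in (-10, -2, 0, 2, 10):
    print(z, round(logistic(z), 4))
```

At z = 0 the curve passes through exactly 0.5; at the extremes it approaches, but never reaches, 0 and 1.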

Transforming a Probability into Odds and Logit Values
The logistic transformation ensures that estimated values do not fall outside the range of 0 and 1. This is achieved in two steps. First, the probability is restated as odds, defined as the ratio of the probability of the event occurring to the probability of it not occurring. For example, if a horse has a probability of 0.8 of winning a race, the odds of it winning are 0.8/(1 − 0.8) = 4:1. An odds value can be converted back into a probability; here the corresponding probability is 4/(1 + 4) = 0.8. Second, to keep the odds value from going below 0, which is its lower limit (there is no upper limit), the logit value is computed by taking the logarithm of the odds. Odds less than 1 have a negative logit value, odds greater than 1 have a positive logit value, and odds of 1.0 (corresponding to a probability of 0.5) have a logit value of 0.
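The two-step probability-to-odds-to-logit conversion, using the horse-racing example above, can be verified in a few lines of Python:

```python
import math

p = 0.8                        # probability of the horse winning
odds = p / (1 - p)             # 4.0, i.e. odds of 4:1
back_to_p = odds / (1 + odds)  # converts the odds back to the probability, 0.8
logit = math.log(odds)         # logit value; positive because odds > 1

print(odds, back_to_p, logit)
print(math.log(0.5 / (1 - 0.5)))  # probability 0.5 -> odds 1.0 -> logit 0
```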

Interpreting the Odds Ratio (OR)
When an independent variable X i increases by one unit (X i+1 ), with all other factors remaining constant, the odds of the dependent variable increase by a factor exp(β i ) which is called the odds ratio (OR) and ranges from zero (0) to positive infinity. It indicates the relative amount by which the odds of the dependent variable increase (OR > 1) or decrease (OR < 1) when the value of the corresponding independent variable increases by one (1) unit.
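As a quick numerical check, exponentiating a coefficient gives the OR. The coefficient −1.6042 is taken from the worked caesarean-delivery example later in this review; 0.693 is an illustrative assumption:

```python
import math

beta_up = 0.693      # hypothetical coefficient: exp(0.693) ≈ 2, odds double per unit
beta_down = -1.6042  # coefficient from the worked example: OR ≈ 0.2010

print(math.exp(beta_up))
print(math.exp(beta_down))
```

An OR of about 2 means the odds double with each one-unit increase; an OR of about 0.20 means the odds fall by roughly 80%.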

Selecting the Dependent Variables
In many cases, the outcome event is easily categorized into classes of having occurred or not having occurred. For example, the occurrence of a heart attack or not, or delivering through caesarean or not, are relatively easily discerned and coded as either having happened or not having happened. Once this categorization has been achieved, the predictors of that outcome can be studied [37]. In other cases, the outcome may be treated as dichotomous but, in fact, derives from the censoring of continuous data; that is, a cut-off criterion has been produced and the data recoded from continuous to categorical at the cut-off point. In these cases, the situation in choosing the outcome variable may be more complicated [16]. In some instances, continuous outcomes translate relatively easily into a dichotomous event. These cases are most often concerned with measures for which well-established cut-off points for the presence of an event have been developed. The presence or absence of high blood pressure is one such example, where a systolic pressure of greater than 140 mmHg is considered to be high [32]. It is worth noting that many multi-category or even continuous variables can be reduced to dichotomous ones. For example, if the health condition of patients is expressed on, say, a seven-category scale, from "completely healthy" to "terminal condition", this could be reduced to two categories such as "healthy" and "unhealthy" [9].

Selecting Potential Predictors
Another aspect to consider in the development of a LR study concerns the selection of which variables to analyse as potential predictors of the outcome. This can only be achieved by a careful study of the literature in relation to the outcome, in order to ensure that the full range of potential predictors is included [20]. However, there are a number of pitfalls in selecting predictor variables that can lead to the presented logistic model appearing to explain greater or lesser amounts of variance than it actually may explain in reality [38]. The results of any LR will depend on the variables selected as potential predictors: put simply, if a variable is not selected for analysis, then it cannot feature in the final model. Hence, the choice of which factors to include in the initial data set can itself affect the results [37].
Further, if interaction terms between the variables are to be considered, then the omission of some variables could potentially have major impacts on the results. Unfortunately, the solution is not simply to include as many variables as possible, as the inclusion of variables that are unrelated to the outcome in question tends to inflate the apparent predictive validity of the final model [33]. There is no single best way to confirm that the set of predictors chosen is appropriate. There may well be constraints acting on any particular study that lead to bias in the selection of the data used for the analysis. One potential constraint is the sample size, which limits the number of variables that can be studied. There is some debate as to the number of participants per variable that are needed; however, Agresti [39] suggests that a minimum of 10 participants are needed for every variable studied, a suggestion based on statistical evidence concerning the reliability of logistic regressions performed with different numbers of events per variable [40]. This obviously places some constraints on the number of variables that can be employed in a study, although it should be noted that most studies of medical outcome using LR do follow this rule.
Another source of selection bias in the variables studied is missing data: the presence of missing data can drive down the sample size if participants with missing data are excluded, or can lead to the exclusion of certain variables from the analysis if large amounts of data are missing. Unfortunately, both of these outcomes can introduce bias in the variables selected, which may be highly important, as it leaves the sample self-selected; that is, it comprises only those individuals who chose to supply certain data, or only that data which is readily supplied by the sample, among other reasons why the remaining data are missing [19].
Finally, in addition to selection bias effects from these sources, the selection of variables is also constrained by the properties of the data that are collected. For example, predictor variables that are related to one another (that show collinearity or multicollinearity), or predictor variables that have excessively influential observations (outliers), will adversely affect the results of a LR. Particularly in small or moderate samples, collinearity can produce an overall significant LR model even when no individual predictor is in itself predictive of the outcome, or can cause the degree of relationship between a predictor and the outcome to be incorrectly estimated [22]. Although LR is particularly useful in providing a parsimonious combination of the best predictor variables, such a procedure has the tendency to capitalize on chance sample characteristics [17].
The set of predictors yielded by one sample may not hold for another sample. It is therefore considered desirable when employing this procedure to correct for capitalizing on chance by cross-replicating to a new sample.

Evaluation of the LR Model
The goodness-of-fit of the LR model can be assessed in several ways. First, the overall model (the relationship between all of the independent variables and the dependent variable) is assessed. Second, the significance of each of the independent variables is assessed. Third, the predictive accuracy or discriminating ability of the model is evaluated. Finally, the model is validated.

Overall Model Evaluation 1) The likelihood ratio test
The overall fit of a model shows how strong the relationship between all of the independent variables, taken together, and the dependent variable is. It can be assessed with the likelihood ratio test, which compares the likelihood of the fitted model with that of the null (intercept-only) model:

G = −2 ln(likelihood of the null model / likelihood of the given model)

where the ratio of the maximized likelihoods is calculated before taking the natural logarithm (ln) and multiplying by −2; hence the name "likelihood ratio test". If the p-value for the overall model fit statistic is less than the significance level of the test, conventionally 0.05 (P < 0.05), then H0 is rejected, with the conclusion that there is evidence that at least one of the independent variables contributes to the prediction of the outcome.
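A small sketch of the computation, with hypothetical (assumed, not reported) log-likelihood values and an assumed three predictors:

```python
# Hypothetical log-likelihoods from fitted models (illustrative values)
ll_null = -120.5    # null model (intercept only)
ll_model = -101.2   # model including the predictors

G = -2 * (ll_null - ll_model)   # likelihood ratio statistic

print(G)
# With, say, 3 predictors, G is compared with the chi-square critical
# value at 3 DoF (7.815 at the 0.05 level).
print(G > 7.815)   # evidence that at least one predictor contributes
```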

2) Hosmer-Lemeshow test
The Hosmer-Lemeshow test is used to examine whether the observed proportions of events are similar to the predicted probabilities of occurrence in subgroups of the model population. The test is performed by dividing the predicted probabilities into deciles (10 groups based on percentile ranks) and then computing a Pearson chi-square (χ²) that compares the predicted to the observed frequencies in a 2-by-10 table. The test statistic is

χ² = Σ (from g = 1 to 10) (O_g − E_g)² / E_g

summed over the event and non-event cells, where O_g and E_g denote the observed and expected events for the g-th risk decile group. The test statistic asymptotically follows a χ² distribution with 8 (number of groups minus 2) degrees of freedom (DoF). Small values (with large P-values, closer to 1) indicate a good fit to the data, and therefore good overall model fit; large values (with P < 0.05) indicate a poor fit. Hosmer and Lemeshow [4] do not recommend the use of this test when n is small (i.e. n < 400).
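A minimal pure-Python sketch of the decile-based Pearson form of this test, using synthetic, well-calibrated data (an illustrative assumption, not real study data):

```python
import random

random.seed(0)

# Hypothetical predicted probabilities and outcomes generated from the model
probs = [random.random() for _ in range(200)]
ys = [1 if random.random() < p else 0 for p in probs]

def hosmer_lemeshow(p, y, groups=10):
    """Pearson chi-square comparing observed and expected events across
    decile groups of predicted probability (2-by-10 table form)."""
    pairs = sorted(zip(p, y))
    n = len(pairs)
    chi2 = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        obs1 = sum(yi for _, yi in chunk)   # observed events
        exp1 = sum(pi for pi, _ in chunk)   # expected events
        obs0 = len(chunk) - obs1            # observed non-events
        exp0 = len(chunk) - exp1            # expected non-events
        chi2 += (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
    return chi2

stat = hosmer_lemeshow(probs, ys)
print(stat)  # compared with chi-square at 8 DoF (15.507 at the 0.05 level)
```

Because the outcomes here were generated from the predicted probabilities themselves, the statistic should typically be small, indicating good fit.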

Statistical Significance of Individual Regression Coefficients
If the overall model works well, the next question is how important each of the independent variables is. The LR coefficient for the i-th independent variable shows the change in the predicted log odds of having an outcome for a one-unit change in the i-th independent variable, all other things being equal. That is, if the i-th independent variable, with regression coefficient b, is changed by 1 unit while all of the other predictors are held constant, the log odds of the outcome is expected to change by b units. There are a couple of different tests designed to assess the significance of an independent variable in logistic regression, including the likelihood ratio test and the Wald statistic [40].

1) Wald statistic
Statistical tests of significance can be applied to each variable's coefficients.
For each coefficient, the null hypothesis that the coefficient is zero is tested against the alternative that it is not zero using a Wald test. The Wald test can also be used to compare a full model containing all the predictor variables with a reduced model in which some coefficients are set to zero. The Wald statistic can be used to assess the contribution of individual predictors or the significance of individual coefficients in a given model [41]. It is the ratio of the square of the regression coefficient to the square of the standard error of the coefficient,

W_i = β_i² / SE(β_i)²

and is asymptotically distributed as χ²: each Wald statistic is compared with a χ² critical value with 1 DoF.
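A one-line numerical sketch: the coefficient is taken from the caesarean-delivery example in this review, while the standard error of 0.35 is an illustrative assumption:

```python
beta = -1.6042   # coefficient from the worked example
se = 0.35        # hypothetical standard error (assumed for illustration)

wald = (beta / se) ** 2   # squared coefficient over squared standard error

print(wald)
print(wald > 3.841)   # compared with the chi-square critical value, 1 DoF, 0.05 level
```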

2) Likelihood ratio test
The likelihood-ratio test used to assess overall model fit can also be used to assess the contribution of individual predictors to a given model. The likelihood ratio test for a particular parameter compares the likelihood of obtaining the data when the parameter is zero, 0 L with the likelihood 1 L of obtaining the data evaluated at the maximum likelihood estimation of the parameter.
The test statistic is calculated as follows: This statistic is compared with a χ 2 distribution with 1 DoF. To assess the contribution of individual predictors one can enter the predictors hierarchically, then compare each new model with the previous model to determine the contribution of each predictor. Table  The classification table (Table 1) is a method to evaluate the predictive accuracy of the logistic regression model [42]. In this If the logistic regression model has a good fit, we expect to see many counts in the a and d cells, and few in the b and c cells. In an analogy with medical diagnostic testing, we can consider,

Sensitivity = a/(a + b) and Specificity = d/(c + d)

where higher sensitivity and specificity indicate a better fit of the model.
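With hypothetical cell counts a, b, c and d (assumed values, not from the study), the computation is:

```python
# Hypothetical classification-table counts (cut-off 0.5):
# a = events correctly predicted, b = events predicted as non-events,
# c = non-events predicted as events, d = non-events correctly predicted
a, b, c, d = 42, 8, 10, 40

sensitivity = a / (a + b)             # proportion of observed events identified
specificity = d / (c + d)             # proportion of observed non-events identified
accuracy = (a + d) / (a + b + c + d)  # overall proportion correctly classified

print(sensitivity, specificity, accuracy)
```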

Discrimination with Receiver Operating Characteristic Curves
Extending the above two-by-two idea (Table 1), the receiver operating characteristic (ROC) curve plots sensitivity against 1 − specificity over all possible cut-off values. The area under the ROC curve (AUC) summarizes the model's ability to discriminate between cases that experience the outcome and those that do not: an area of 0.5 indicates no discrimination beyond chance, while values closer to 1 indicate better discrimination.
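Because the AUC equals the probability that a randomly chosen event receives a higher predicted probability than a randomly chosen non-event, it can be computed directly from a small set of hypothetical predictions (illustrative values only):

```python
# Hypothetical predicted probabilities and observed outcomes
probs = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
ys    = [1,   1,   0,   1,   0,    0,   1,   0]

def auc(p, y):
    """Rank-based AUC: fraction of event/non-event pairs ranked correctly,
    counting ties as half."""
    pos = [pi for pi, yi in zip(p, y) if yi == 1]
    neg = [pi for pi, yi in zip(p, y) if yi == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

print(auc(probs, ys))
```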

Validation of the LR Model
Validation is an important test of the regression's internal validity, a crucial step in the argument that the regression model is not an idiosyncratic artifact but has instead captured essential relationships in the domain of study. An important question is whether the results of the LR analysis on the sample can be extended to the population from which the sample has been chosen. This question is referred to as model validation. In practice, a model can be validated by deriving the model and estimating its coefficients in one data set, then using this model to predict the outcome variable in a second data set, checking the residuals, and so on. When a model is validated using the data on which it was developed, its performance is likely to be over-estimated. Thus, the validity of the model should be assessed by carrying out tests of goodness of fit and discrimination on a different data set [44].
If the model is developed with a sub-sample of observations and validated with the remaining sample, this is called internal validation. The most widely used methods for obtaining a good internal validation are data-splitting, repeated data-splitting, the jackknife technique and bootstrapping [45]. If the validity is tested with a new independent data set from the same population or from a similar population, this is called external validation. Obtaining a new data set allows us to check the model in a different context. If the first model fits the second data set, there is some assurance of the generalizability of the model. However, if the model does not fit the second data set, the lack of fit can be due either to the different contexts of the two data sets or to a true lack of fit of the first model [25].
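The data-splitting form of internal validation can be sketched with a synthetic sample (assumed values; only the splitting and scoring mechanics are illustrated, not an actual model fit):

```python
import random

random.seed(42)

# Hypothetical (predicted_probability, observed_outcome) pairs
sample = [(random.random(), random.randint(0, 1)) for _ in range(100)]

random.shuffle(sample)
cut = int(0.7 * len(sample))
development, validation = sample[:cut], sample[cut:]   # 70/30 data split

# The model would be fitted on the development set; here we score the
# held-out set with a 0.5 cut-off to illustrate out-of-sample checking.
accuracy = sum((p > 0.5) == (y == 1) for p, y in validation) / len(validation)
print(len(development), len(validation), accuracy)
```

Repeating the split (or bootstrapping it) gives a distribution of out-of-sample performance rather than a single, possibly optimistic, in-sample figure.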

Determining the Number of Significant Variables to Retain
Since the estimates of the included variables may be sensitive to changes in the variable(s) omitted, some researchers have chosen to retain all the variables representing the same factor if at least one of them is statistically significant. They refer to such a model as the full model [46] [47], while others choose to eliminate all insignificant variables from the model to increase the efficiency of estimation and refer to such a model as the reduced model [48]. To increase efficiency in medical research, the reduced model, with only the statistically significant variables retained, is mostly used. In the reduced model, variables with a P-value less than or equal to the α-value are treated as statistically significant [49].

Reporting and Interpreting LR Results
The following four types of information should be included when presenting the LR results; 1) An overall evaluation of the logistic model; 2) statistical tests of individual predictors; 3) goodness-of-fit statistics; and 4) an assessment of the predicted probabilities. We demonstrate this from recent work on variables informing expectant mothers to opt for caesarean delivery or vaginal birth [49]. Tables 2-5 are examples to illustrate the presentation of these four types of information. Table 2 presents the logistic regression model with statistical significance of individual regression coefficients (β) tested using the Wald χ 2 statistic.
From Table 2, baby's birth weight has a significant effect on the event (P < 0.05). Compared with babies with birth weight above 3.5 kg, babies with birth weight less than 3.5 kg were found to have a decreased probability of the event.
The negative signs of the estimated coefficients and the odds ratios below 1 for babies with birth weight from 2.5 kg to 3.5 kg and for babies with birth weight below 2.5 kg (β = −1.6042, P < 0.001, OR = 0.2010) show that the probability of caesarean delivery is higher for babies with birth weight above 3.5 kg than for babies with birth weights below 3.5 kg. That is, the relative probability of caesarean delivery decreases by 78.52% for babies with birth weight from 2.5 kg to 3.5 kg and by 79.9% for babies with birth weight below 2.5 kg.
It could also be seen from Table 2 that parity was estimated to be a significant predictor. Table 3 shows that the logistic model with the independent variables was more effective than the null model. A model summary of the logistic model is presented in Table 4: the model has a relatively large pseudo-R² of 0.723 for the Nagelkerke R² and 0.576 for the Cox and Snell R². That is, the fitted model can explain or account for 72.3% of the variation in the dependent variable. This is an indication of a good model.

Conclusion
This study explored the components of the LR model, a multivariable method used frequently for modeling the relationship between multiple independent variables and a categorical dependent variable, with emphasis on medical research. Six textbooks on logistic regression and 37 research articles published between 2000 and 2018 which employed logistic regression as the main statistical tool were reviewed. Logistic regression concepts such as odds, odds ratio, logit transformation, logistic curve, assumptions, selecting dependent and independent variables, fitting, reporting and interpreting were presented. Upon perusing the literature, considerable deficiencies were found in both the use and reporting of LR. For many studies, the ratio of the number of outcome events to predictor variables (events per variable) was sufficiently small to call into question the accuracy of the regression model. Also, most studies did not report validation analysis, regression diagnostics or goodness-of-fit measures. Proper use of this powerful and sophisticated modeling technique requires considerable care both in the specification of the form of the model and in the calculation and interpretation of the model's coefficients. We presented an example of how LR should be applied. It is recommended that researchers be more thorough and pay greater attention to these guidelines concerning the use and reporting of LR models. In future, researchers could compare LR with other emerging classification algorithms to enable better or more rigorous evaluations of such data.