As the population ages, the prevalence of Alzheimer’s disease is rapidly increasing, and its diagnosis is still poorly understood. Whereas 90% of cancer patients become aware of their diagnosis, only 45% of people with Alzheimer’s are aware of theirs. There is therefore a tremendous need for biomarkers that support reliable diagnosis and help in finding a treatment for this serious disease. The main aim of this paper is to use information from baseline measurements to develop a statistical prediction model, based on multiple logistic regression, that distinguishes Alzheimer’s disease patients from cognitively normal individuals. Our optimal predictive model includes six risk factors and two interaction terms and has been evaluated using classification accuracy, sensitivity, specificity, and area under the curve.

Alzheimer’s disease causes memory loss, and it is not a normal part of aging. Among the leading causes of death in the United States, it is the only one that cannot be prevented, treated, or even slowed. A 2018 report from the Alzheimer’s Association shows that deaths from Alzheimer’s disease have increased significantly, while deaths from other major causes in the United States have decreased significantly. The bar chart in

Whereas 90% of cancer patients become aware of their diagnosis, only 45% of people with Alzheimer’s are aware of theirs [

Brain imaging is used to detect some of the brain changes caused by Alzheimer’s disease, in particular the levels of plaques and tangles, the two types of abnormality in the brain associated with the presence of Alzheimer’s. Plaques form between the dying cells in the brain from the buildup of a protein called beta-amyloid, and tangles are twisted fibers within the dying cells formed from another protein called tau. Beta-amyloid and tau are protein fragments that the body normally produces, but in Alzheimer’s the proteins are abnormal.

Cerebrospinal fluid (CSF) analysis involves collecting the clear fluid that surrounds and protects the brain and spinal cord in order to determine the levels of beta-amyloid, total tau (T-tau), and phosphorylated tau (P-tau) proteins. Since CSF is in direct contact with the brain and spine, a sample of the fluid can be a useful diagnostic tool for this neurodegenerative disease.

The primary goal of the present study is to develop the best statistical model for correctly predicting Alzheimer’s patients from their demographic, CSF, laboratory, and brain imaging factors using logistic regression. This model will allow us to estimate the probability that a patient is diagnosed with Alzheimer’s disease. Moreover, we can rank the significantly contributing risk factors by their relative importance to the response. Hence, medical doctors can use our proposed data-driven model as a decision-support tool before starting any treatment.

In the present study, we used data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. The primary goal of ADNI is to detect and track the progression of Alzheimer’s disease by combining clinical, imaging, genetic, and biological markers of participants to help researchers and doctors develop new treatments. For more information about ADNI, visit http://adni.loni.usc.edu.

Our data consist of 169 subjects aged 58 to 94 years. We have information on their demographic characteristics, neuropsychological tests, laboratory data, cerebrospinal fluid analysis, and brain imaging data.

From the cerebrospinal fluid analysis, we have the concentrations of P-tau and beta-amyloid in picograms per milliliter (pg/mL). The laboratory data consist of the levels of vitamin B12 in nanograms per milliliter (ng/mL), thyroid stimulating hormone in milliunits per liter (mU/L), hemoglobin in grams per deciliter (g/dL), and cholesterol in milligrams per deciliter (mg/dL), as these have been linked to Alzheimer’s disease.

The MRI scan data include measures of total brain volume, whole brain gray matter volume, whole brain white matter volume, and intracranial volume.

Our response in this analysis is the status of the participants as cognitively normal (CN) or as having Alzheimer’s disease (AD), based on the SPARE-AD score (Spatial Pattern of Abnormalities for Recognition of Early AD). SPARE-AD is an imaging-based index of the spatial patterns of brain atrophy that distinguishes individuals with AD from CN individuals. Positive values indicate the presence of Alzheimer’s disease, and negative values indicate a normal pattern of brain structure [

Several studies have reported that women are more likely than men to be diagnosed with Alzheimer’s disease [

● Are males and females equally diagnosed with Alzheimer’s disease?

To answer this question, we used a hypothesis test for two proportions to determine whether the difference between them is significant. That is, we test $H_0: P_1 = P_2$ vs. $H_1: P_1 \neq P_2$, where $P_1 = 57/101 = 0.5643$ is the proportion of males with AD and $P_2 = 37/68 = 0.5441$ is the proportion of females with AD. A p-value of 0.7951 indicates that, at the 5% level of significance, there is no statistically significant difference between the proportions of males and females diagnosed with Alzheimer’s disease.
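The comparison of the two proportions can be reproduced with a short calculation. The sketch below assumes a pooled two-sided z-test without continuity correction, which the text does not state explicitly; the function name is ours. With the counts given above, it yields a p-value of roughly 0.795.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided pooled z-test of H0: P1 = P2 (no continuity correction)."""
    p1, p2 = x1 / n1, x2 / n2
    # Pooled proportion under the null hypothesis
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # P(|Z| > |z|) for a standard normal Z equals erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Counts from the text: 57 of 101 males and 37 of 68 females with AD
z, p_value = two_proportion_z_test(57, 101, 37, 68)
print(f"z = {z:.4f}, p-value = {p_value:.4f}")
```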

For our analysis, we used multiple logistic regression to predict the status of the patients as CN or AD. Logistic regression is a method used to describe and explain the relationship between a binary response and the statistically significant risk factors. It can answer questions such as: do age, body weight, vitamin B12, cholesterol level, tau, and beta-amyloid proteins influence the probability of having Alzheimer’s disease?

Mathematically, let Y be the binary response, with possible outcomes 1 (“AD”) and 0 (“CN”). The distribution of Y is specified by the probabilities $P(Y = 1) = \pi$ of AD and $P(Y = 0) = 1 - \pi$ of CN, where $E(Y) = \pi$ is the mean of Y. Let $\pi(x)$ denote the probability of selecting an AD patient given the risk factors x. The logistic regression model has a linear form for the logit of this probability, defined as [

$$\operatorname{logit}[\pi(x)] = \log\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \sum_{j=1}^{p} \beta_j x_{ij}, \tag{1}$$

where $\beta_j$ is the coefficient of the $j^{th}$ risk factor $(j = 1, \cdots, p)$, $x_{ij}$ is the $i^{th}$ observed value of risk factor j $(i = 1, \cdots, n)$, and $\frac{\pi(x)}{1 - \pi(x)}$ is the odds, which expresses the ratio of the probability of predicting an AD patient to the probability of CN.

The logistic regression model implies the following analytic form for the probability of selecting an AD patient given the risk factors:

$$\pi(x) = \frac{\exp\left(\sum_{j=1}^{p} \beta_j x_{ij}\right)}{1 + \exp\left(\sum_{j=1}^{p} \beta_j x_{ij}\right)}. \tag{2}$$
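The relationship between Equations (1) and (2) can be checked numerically: the logit maps a probability to a linear predictor, and the logistic function maps it back. The following is a minimal sketch; the function names and the coefficient values in the usage lines are illustrative assumptions, not quantities from the study.

```python
import math


def logit(p):
    """Log odds of a probability p in (0, 1), as in Equation (1)."""
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    """Probability for a linear predictor eta, as in Equation (2).

    1 / (1 + exp(-eta)) is algebraically equal to exp(eta) / (1 + exp(eta)).
    """
    return 1.0 / (1.0 + math.exp(-eta))


def linear_predictor(betas, x):
    """Sum of beta_j * x_ij over the risk factors for one subject."""
    return sum(b * v for b, v in zip(betas, x))


# Illustrative (made-up) coefficients and risk-factor values
eta = linear_predictor([0.5, -0.2], [1.0, 3.0])
print(inv_logit(eta), logit(inv_logit(eta)))
```

Applying `logit` to the output of `inv_logit` recovers the linear predictor, which is the sense in which the two equations are equivalent.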

We partitioned our data set into two parts, training and testing, with 75% and 25% of the data, respectively. We started with the full logistic regression model, which includes all predictors and their possible interactions. This full model for predicting whether a patient has Alzheimer’s disease is given by:

$$\log\left[\frac{P}{1 - P}\right] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_j X_j, \tag{3}$$

where P denotes the probability of selecting an AD patient, the $\beta_j$’s denote the coefficients, and the X’s are the risk factors and their possible interactions. We used the backward elimination algorithm, which repeatedly removes the term with the largest p-value from the current model and stops when any further elimination leads to a poorer fit. In addition, we used the Akaike information criterion (AIC), which judges the quality of a model by how close its fitted values are to the true expected values; that is, we select the statistical predictive model that minimizes

$$\text{AIC} = -2\ln(L) + 2k,$$

where L is the value of the likelihood and k is the number of parameters in the model. Thus, our optimal data-driven statistical logistic model that predicts the patient’s condition with minimum AIC is given by:
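The selection procedure described above can be sketched generically. The code below is an illustration only: it implements an AIC-driven variant of backward elimination, in which the term to drop is chosen by AIC rather than by the largest p-value, and it takes a user-supplied `score` function (hypothetical here) that would refit the logistic model on a given set of terms and return its AIC. The toy scoring function in the usage lines is made up purely to exercise the loop.

```python
def aic(log_likelihood, n_params):
    """AIC = -2 ln(L) + 2k, given ln(L) and the number of parameters k."""
    return -2.0 * log_likelihood + 2.0 * n_params


def backward_eliminate(terms, score):
    """Drop one term at a time while doing so lowers the score (e.g. AIC).

    `score` maps a frozenset of term names to a number; lower is better.
    Returns the retained terms and their score.
    """
    current = frozenset(terms)
    best = score(current)
    while current:
        # Score every model obtained by removing a single term
        candidates = [(score(current - {t}), t) for t in sorted(current)]
        cand_score, drop = min(candidates)
        if cand_score >= best:
            break  # no removal improves the fit criterion
        current, best = current - {drop}, cand_score
    return current, best


# Toy stand-in for "fit the model and return its AIC" (made-up numbers):
# terms "a" and "b" matter, "c" and "d" only add the 2k penalty.
useful = {"a", "b"}
toy_score = lambda s: 2 * len(s) + 10 * len(useful - s)
print(backward_eliminate({"a", "b", "c", "d"}, toy_score))
```

In practice, `score` would wrap whatever statistical package is used to fit the logistic regression on the training data.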

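$$\log\left[\frac{P}{1 - P}\right] = 7.55 - 0.003\,\text{Abeta} + 0.170\,\text{PTau} + 10.18\,\text{Thyroid} + 0.002\,\text{VB12} - 0.14\,\text{Cholest} - 0.44\,\text{Hemog} + 0.01\,(\text{Cholest} \cap \text{Hemog}) - 0.87\,(\text{Thyroid} \cap \text{Hemog}) \tag{4}$$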

The symbol ( ∩ ) denotes interaction. As we can see from our proposed model, six risk factors and only two interaction terms contribute statistically significantly to the prediction of the patient’s condition, namely, phosphorylated tau protein (P-tau), beta-amyloid protein, thyroid stimulating hormone, vitamin B12, cholesterol, hemoglobin, and the interactions (cholesterol ∩ hemoglobin) and (thyroid stimulating hormone ∩ hemoglobin). Furthermore, age is not one of the significant risk factors in our optimal predictive model, which is consistent with the view that Alzheimer’s disease is not a part of normal aging.
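To illustrate how the fitted model in Equation (4) turns measurements into a predicted probability, the sketch below evaluates the log odds and applies the logistic function. Several things here are assumptions: each interaction term is taken to be the product of its two variables, the patient values are hypothetical, and the coefficients are the rounded values printed in Equation (4), so the output is sensitive to rounding and should not be read as a clinical result.

```python
import math

# Rounded coefficients as printed in Equation (4)
COEF = {
    "intercept": 7.55,
    "abeta": -0.003,
    "ptau": 0.170,
    "thyroid": 10.18,
    "vb12": 0.002,
    "cholesterol": -0.14,
    "hemoglobin": -0.44,
    "cholesterol_x_hemoglobin": 0.01,
    "thyroid_x_hemoglobin": -0.87,
}


def log_odds(abeta, ptau, thyroid, vb12, cholesterol, hemoglobin):
    """Log odds of AD; interactions are assumed to be simple products."""
    return (
        COEF["intercept"]
        + COEF["abeta"] * abeta
        + COEF["ptau"] * ptau
        + COEF["thyroid"] * thyroid
        + COEF["vb12"] * vb12
        + COEF["cholesterol"] * cholesterol
        + COEF["hemoglobin"] * hemoglobin
        + COEF["cholesterol_x_hemoglobin"] * cholesterol * hemoglobin
        + COEF["thyroid_x_hemoglobin"] * thyroid * hemoglobin
    )


def probability_ad(**measurements):
    """Predicted probability of AD from the log odds."""
    return 1.0 / (1.0 + math.exp(-log_odds(**measurements)))


# Hypothetical patient, not taken from the ADNI data
patient = dict(abeta=800.0, ptau=30.0, thyroid=2.0,
               vb12=500.0, cholesterol=200.0, hemoglobin=14.0)
print(f"log odds = {log_odds(**patient):.2f}, "
      f"P(AD) = {probability_ad(**patient):.3f}")
```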

The coefficients in the logistic regression indicate the change in the expected log odds for a one-unit change in $X_j$, holding all other predictors constant [

Similarly, the coefficient (−0.003) of beta-amyloid protein means that as the beta-amyloid protein level decreases, the odds of the participant being diagnosed with AD increase, holding all other variables constant. In terms of the odds ratio, $\exp(-0.003) = 0.997$: with all other predictors unchanged, every one-unit increase in beta-amyloid multiplies the odds of being an Alzheimer’s patient by 0.997, and, equivalently, every one-unit decrease multiplies the odds by $1/0.997 \approx 1.003$.
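The odds-ratio arithmetic for the beta-amyloid coefficient can be verified directly. This is a small check using the rounded coefficient from Equation (4); the variable names are ours.

```python
import math

beta_abeta = -0.003  # rounded beta-amyloid coefficient from Equation (4)

# Multiplicative change in the odds of AD per one-unit increase in beta-amyloid
odds_ratio_increase = math.exp(beta_abeta)
# Multiplicative change in the odds per one-unit decrease (the reciprocal)
odds_ratio_decrease = math.exp(-beta_abeta)

print(round(odds_ratio_increase, 3), round(odds_ratio_decrease, 3))
```

An odds ratio below 1 for an increase and above 1 for a decrease both express the same negative association.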

Model Evaluation

To evaluate our optimal predictive model, we used classification accuracy, sensitivity, specificity, and area under the curve (AUC) on the testing data. The proportion of AD and CN participants correctly identified by the multiple logistic model is called the “accuracy”. The proportion of actual Alzheimer’s patients who are correctly identified by our predictive model as having the disease is known as the “sensitivity”, and the proportion of actual cognitively normal individuals who are correctly identified by the model is known as the “specificity”. A perfect predictive model would be 100% sensitive (that is, it would predict all individuals in the Alzheimer’s disease group as Alzheimer’s) and 100% specific (that is, it would predict all normal individuals as cognitively normal). For any test, however, there is usually a trade-off between these two measures, which can be explored graphically with the receiver operating characteristic (ROC) curve.

We used the confusion matrix of the testing data to obtain the values needed to assess the model. The confusion matrix is a classification table that describes how well our multiple logistic regression model distinguishes Alzheimer’s patients from cognitively normal individuals.

Predicted class | Actual CN | Actual AD | Total |
---|---|---|---|
CN | TN = 10 | FN = 5 | 15 |
AD | FP = 2 | TP = 18 | 20 |
Total | N = 12 | P = 23 | 35 |

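Here, TP is the number of Alzheimer’s cases correctly predicted by our model, TN is the number of cognitively normal individuals correctly predicted as healthy, FP is the number of cognitively normal individuals incorrectly identified as sick, and FN is the number of Alzheimer’s cases incorrectly predicted by our model as healthy individuals.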

Using the confusion matrix, we found that our model’s accuracy is $(TP + TN)/(N + P) = 80\%$ and that it correctly predicts 78.26% of all Alzheimer’s disease cases (the sensitivity, $TP/P$). It also correctly identifies 83.33% of those who do not have Alzheimer’s disease (the specificity, $TN/N$). A summary of our classification results is given in
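The three measures follow directly from the four counts in the confusion matrix. The short sketch below recomputes them from the reported values TP = 18, TN = 10, FP = 2, and FN = 5; the function name is ours.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    positives = tp + fn  # actual AD cases (P)
    negatives = tn + fp  # actual CN cases (N)
    return {
        "accuracy": (tp + tn) / (positives + negatives),
        "sensitivity": tp / positives,
        "specificity": tn / negatives,
    }


# Counts from the confusion matrix of the testing data
metrics = classification_metrics(tp=18, tn=10, fp=2, fn=5)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```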

Another method to evaluate our model graphically is the receiver operating characteristic (ROC). Each point on the ROC curve represents a (sensitivity, 1-specificity) pair corresponding to a different decision cut-off point. The area under the ROC curve (AUC) is a measure of how well the model can distinguish between two diagnostic groups. For our proposed model, the AUC value is 87.68% which implies that our model does well in discriminating between the two classes of the patient’s condition.
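One way to compute the AUC without drawing the curve uses its rank interpretation: the AUC equals the probability that a randomly chosen AD case receives a higher predicted score than a randomly chosen CN case, with ties counted as one half. The sketch below assumes that interpretation; the labels and scores in the usage lines are made up and are not the study’s predictions.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for AD, 0 for CN; scores: predicted probabilities of AD.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case from each class")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example: three of the four (AD, CN) pairs are ordered correctly
print(auc_from_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

An AUC of 0.5 corresponds to random ordering and 1.0 to perfect separation of the two groups.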

Evaluation value | Percentage |
---|---|
Accuracy | 80% |
Sensitivity | 78.26% |
Specificity | 83.33% |

After validating our proposed model, we need to rank the risk factors in terms of their importance to the diagnosis of Alzheimer’s disease. We identified the relative importance of the risk factors by the absolute value of their standardized coefficients (weights) and by the pseudo partial correlation. For the standardized coefficients, a higher absolute value indicates a stronger association with the Alzheimer’s diagnosis [

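$$\text{Standardized weight} = \frac{\beta_i}{s / sd_i}, \tag{5}$$

where $\beta_i$ is the estimated coefficient (weight) for predictor i, $sd_i$ is the sample standard deviation of predictor i, and $s = \pi/\sqrt{3}$ is the standard deviation of the standard logistic distribution.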
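As a small illustration of the standardized weight, the sketch below rescales a coefficient by the ratio of the predictor’s sample standard deviation to the standard deviation of the standard logistic distribution, which is $\pi/\sqrt{3}$. The function name and the data in the usage line are hypothetical.

```python
import math
import statistics

# Standard deviation of the standard logistic distribution
LOGISTIC_SD = math.pi / math.sqrt(3)


def standardized_weight(beta, predictor_values):
    """beta / (s / sd), with sd the sample standard deviation of the predictor."""
    sd = statistics.stdev(predictor_values)
    return beta * sd / LOGISTIC_SD


# Hypothetical coefficient and predictor values, for illustration only
print(standardized_weight(0.17, [22.0, 35.0, 41.0, 18.0, 60.0]))
```

Because the rescaling removes the measurement units, the absolute values of these weights are comparable across predictors.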

The pseudo partial correlation is given by:

$$r = \pm\sqrt{\frac{W_i - 2K}{-2LL_0}}, \tag{6}$$

where $W_i$ is the Wald chi-square statistic for predictor i, K is the degrees of freedom of predictor i, and $-2LL_0$ is −2 times the log-likelihood of the model with only the intercept term. The sign of r is the sign of the corresponding coefficient. The closer the value is to 1 or −1, the stronger the association between a predictor and the outcome [
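The pseudo partial correlation can be sketched as follows. Two conventions in this code are assumptions rather than statements from the text: the sign is taken from the coefficient of the predictor, and the value is set to 0 when the Wald statistic is smaller than twice its degrees of freedom, so that the square root stays real. The inputs in the usage line are made up.

```python
import math


def pseudo_partial_correlation(beta, wald_chi2, df, neg2_ll0):
    """Signed square root of (W - 2K) / (-2LL0).

    beta:      coefficient of the predictor (supplies the sign)
    wald_chi2: Wald chi-square statistic W of the predictor
    df:        degrees of freedom K of the predictor
    neg2_ll0:  -2 times the log-likelihood of the intercept-only model
    """
    ratio = (wald_chi2 - 2 * df) / neg2_ll0
    if ratio <= 0:
        return 0.0  # assumed convention when W is smaller than 2K
    return math.copysign(math.sqrt(ratio), beta)


# Made-up inputs for illustration
print(pseudo_partial_correlation(beta=-0.5, wald_chi2=18.0, df=1, neg2_ll0=100.0))
```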

Thus, the relative importance of the significantly contributing risk factors in our predictive model is presented in

Rank | Risk Factor | Standardized Weight | Pseudo Partial Correlation |
---|---|---|---|
1 | P-Tau protein | 4.384 | 0.542 |
2 | Beta-amyloid | 3.568 | −0.410 |
3 | Thyroid ∩ Hemoglobin | 2.514 | −0.243 |
4 | Thyroid | 2.171 | 0.212 |
5 | Vitamin B12 | 1.665 | 0.196 |
6 | Cholesterol | 1.554 | −0.154 |
7 | Cholesterol ∩ Hemoglobin | 1.496 | 0.147 |
8 | Hemoglobin | 0.349 | −0.019 |

Knowing the causes of a disease helps in finding the best way to cure it. While several top causes of death are decreasing, Alzheimer’s deaths are on the rise. Thus, in the present study, we developed a statistical predictive model using multiple logistic regression to predict Alzheimer’s disease patients, selecting the relevant risk factors by backward elimination. We found that six risk factors and only two interaction terms, namely, phosphorylated tau protein (P-tau), beta-amyloid protein, thyroid stimulating hormone, vitamin B12, cholesterol, hemoglobin, and the interactions (cholesterol ∩ hemoglobin) and (thyroid stimulating hormone ∩ hemoglobin), contribute significantly to Alzheimer’s disease.

We evaluated the quality of the proposed model by classification accuracy, sensitivity, specificity, and area under the curve, the results of which attest to the effectiveness of the model. We then examined the relationship between the response and the significantly contributing predictors and ranked them by their standardized coefficients. The identified and ranked risk factors can be useful as a screening tool to discriminate Alzheimer’s disease patients from cognitively normal individuals.

Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.

The authors declare no conflicts of interest regarding the publication of this paper.

Habadi, M. and Tsokos, C.P. (2020) Alzheimer’s Disease: The Relative Importance Diagnostic. Advances in Alzheimer’s Disease, 9, 77-86. https://doi.org/10.4236/aad.2020.94006