TITLE:
Statistical Analysis of a Diabetes Dataset and the Impact of Principal Component Analysis on Prediction Accuracy
AUTHORS:
Elizabeth Diamond, Faith Idoko, Michael Olowe
KEYWORDS:
Logistic Regression, Discriminant Analysis, Principal Component Analysis
JOURNAL NAME:
Open Journal of Nursing, Vol.15 No.8, August 22, 2025
ABSTRACT: This paper investigates the effectiveness of logistic regression (LR) and discriminant analysis (DA) in predicting diabetes from a patient dataset, and examines the impact of principal component analysis (PCA) on the prediction accuracy of both methods. The dataset contains clinical and demographic information for patients with and without diabetes. LR and DA were used to build predictive models, which were then evaluated using performance metrics such as sensitivity, specificity, and accuracy. Each patient is assigned to one of two hypotheses by statistical analysis: D0 (the patient does NOT have diabetes) or D1 (the patient HAS diabetes). Results show that both logistic regression and discriminant analysis can accurately predict diabetes in patients. Performing PCA did not improve the prediction accuracy of either technique on this dataset. The analysis dataset contained 390 patient records with 14 clinical variables. While the dataset provides valuable insights, the relatively small sample size may limit the generalizability of the results to broader populations. Our findings suggest that logistic regression and discriminant analysis can each be a powerful tool for predicting diabetes, aiding early detection and effective prevention or management of the disease.
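The performance metrics named in the abstract (sensitivity, specificity, and accuracy) can be illustrated with a minimal sketch. It assumes binary labels coded 1 for D1 (has diabetes) and 0 for D0 (does not have diabetes). The function names and the toy labels are hypothetical, chosen only for illustration; they are not the authors' code or data.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = D1, 0 = D0)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # D1 predicted as D1
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # D0 predicted as D0
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # D0 predicted as D1
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # D1 predicted as D0
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Return sensitivity, specificity, and accuracy as a dict."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    return {
        # Share of true D1 cases that the model flags as D1.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Share of true D0 cases that the model flags as D0.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        # Share of all cases classified correctly.
        "accuracy": (tp + tn) / total if total else float("nan"),
    }


# Toy labels for illustration only (not taken from the study's dataset).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

The same function can be applied to the predictions of any classifier, which makes it one way to compare LR and DA, with or without a preceding PCA step, on a held-out portion of the data.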