Comparative Analysis of Machine Learning Algorithms for Early Diabetes Prediction Using the PIMA Indians Diabetes Dataset
Authors:
Jirtus Sanasam
Department of CSE, Sharda School of Engineering & Technology (SSET),
(Sharda University)
Greater Noida, Uttar Pradesh, India.
jirtussanasam@gmail.com
Dr. V Sathiyasuntharam
Department of CSE, Sharda School of Engineering & Technology (SSET),
(Sharda University)
Greater Noida, Uttar Pradesh, India.
velayudham.sathiyasuntharam@sharda.ac.in
Dr. Gauri Shankar Mishra
Department of CSE, Sharda School of Engineering & Technology (SSET),
(Sharda University)
Greater Noida, Uttar Pradesh, India.
gourisankar.mishra@sharda.ac.in
Abstract—Diabetes mellitus is a chronic metabolic disease that affects more than 537 million people worldwide, a number projected to reach 700 million by 2045. Timely and accurate diagnosis is important for preventing the fatal outcomes of this condition. This paper compares supervised machine learning techniques for diabetes prediction using the Pima Indians Diabetes Database, available in the UCI Machine Learning Repository. The dataset contains 768 samples with eight clinical attributes. Preprocessing consisted of imputing missing values with multivariate iterative imputation and oversampling the minority class with the Synthetic Minority Over-sampling Technique (SMOTE). Five classification algorithms were assessed: Logistic Regression, Support Vector Machine (SVM), Random Forest, eXtreme Gradient Boosting (XGBoost), and a neural network. Each was evaluated in terms of accuracy, precision, recall, F1-score, and AUC-ROC. XGBoost performed best, achieving the highest accuracy (75.32%), F1-score (0.6724), and AUC-ROC (0.8280). Random Forest ranked second, with an accuracy of 74.03% and an AUC-ROC of 0.8235.
Keywords—Diabetes Mellitus, Machine Learning, Deep Learning, Healthcare Analytics
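As an illustration of the threshold-based metrics named in the abstract (accuracy, precision, recall, and F1-score), the following minimal sketch computes them from binary confusion-matrix counts. The function name classification_metrics and the counts in the usage example are hypothetical and chosen for illustration only; they are not results from this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    tp, fp, fn, tn are the counts of true positives, false positives,
    false negatives, and true negatives for the positive (diabetic) class.
    """
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")

    accuracy = (tp + tn) / total
    # Precision: fraction of predicted positives that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    # Recall (sensitivity): fraction of true positives that were detected.
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for illustration only (not taken from the paper).
example = classification_metrics(tp=40, fp=10, fn=20, tn=80)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

AUC-ROC is not covered by this sketch because it depends on the ranking of predicted probabilities rather than on a single set of confusion-matrix counts.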