Comparative Analysis of Machine Learning Models for Diabetes Prediction Using PIMA and Synthetic Datasets

Published 17 April 2026

Updated 17 April 2026

Authors:

MUTHUKURU REDDY MOHAMMAD, K MOHAN VAMSI, NAKKALA SEETHARAMA RAJU, Dr. REHKHA K.K, Dr. M. NISHA, Dr. T. KUMANAN

Abstract— Diabetes mellitus is a chronic metabolic disease of growing prevalence, affecting about 422 million people worldwide and 77 million in India. Early and accurate prediction is important for limiting the complications of the disease and the burden it places on healthcare systems. This paper presents a systematic comparative evaluation of four supervised classifiers (logistic regression, support vector machine (SVM), random forest, and decision tree) on two datasets. The first is the PIMA Indian Dataset, a long-established benchmark consisting of 768 records of Pima Indian females from Arizona, USA. The second is a synthetic dataset of 1,500 records, generated with the make_classification function in scikit-learn and given statistical characteristics that differ from those of the PIMA dataset (e.g., higher mean BMI and older mean age), so that the classifiers can be compared under different distributional assumptions. The synthetic dataset has no clinical significance; it was used strictly to evaluate the algorithms and the simulation design in a controlled statistical environment.

Keywords— Machine Learning, Diabetes Prediction, Random Forest, Support Vector Machine, PIMA Indian Dataset, Confusion Matrix, ROC Curve, Cross-Validation.
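
The comparison described in the abstract can be illustrated with a minimal sketch, which is not the authors' actual pipeline. It generates 1,500 synthetic records with scikit-learn's make_classification and scores the four classifiers with stratified cross-validation. The abstract does not state the feature count, the number of folds, the hyperparameters, or the random seed, so the values used here (8 features to mirror the PIMA predictors, 5 folds, default-like model settings, seed 42) are assumptions, as are the helper names make_synthetic, build_models, and compare. The sketch also does not reproduce the shifted BMI and age distributions mentioned in the abstract.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

SEED = 42  # assumption: the paper does not report a seed


def make_synthetic(n_samples=1500, n_features=8, seed=SEED):
    """Generate a binary-classification dataset of the size given in the abstract.

    n_features=8 mirrors the eight PIMA predictors; the informative/redundant
    split is an illustrative assumption.
    """
    return make_classification(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=5,
        n_redundant=2,
        random_state=seed,
    )


def build_models(seed=SEED):
    """Return the four classifiers named in the abstract (hyperparameters assumed)."""
    return {
        # Scaling is applied to the two models that are sensitive to feature scale.
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=seed)),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "decision_tree": DecisionTreeClassifier(random_state=seed),
    }


def compare(X, y, seed=SEED, n_splits=5):
    """Mean stratified k-fold accuracy per model (k=5 is an assumption)."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        for name, model in build_models(seed).items()
    }


X, y = make_synthetic()
scores = compare(X, y)
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} mean CV accuracy = {score:.3f}")
```

The same compare function could be applied to the PIMA data once its feature matrix and labels are loaded, which would give a like-for-like comparison across the two datasets under identical cross-validation splits.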