Comparative Analysis of Machine Learning Models for Diabetes Prediction Using PIMA and Synthetic Datasets
Authors:
MUTHUKURU REDDY MOHAMMAD, K MOHAN VAMSI, NAKKALA SEETHARAMA RAJU, Dr. REHKHA K.K, Dr. M. NISHA, Dr. T. KUMANAN
Abstract— Diabetes mellitus is a chronic metabolic disease of growing prevalence, affecting about 422 million people worldwide and 77 million in India. Early and accurate prediction is important for limiting the complications of the disease and the burden it places on healthcare systems. This paper presents a systematic comparative evaluation of four supervised classifiers (logistic regression, support vector machine (SVM), decision tree, and random forest) on two datasets: the PIMA Indian Dataset, an established benchmark of 768 records of Pima Indian females from Arizona, USA, and a synthetic dataset of 1,500 records. The synthetic dataset was generated with the make_classification function in scikit-learn and was given statistical characteristics distinct from those of the PIMA dataset (e.g., higher mean BMI and older mean age), so that the classifiers could be compared under different distributional assumptions. The synthetic dataset has no clinical significance; it was used strictly as a simulation for evaluating the algorithms in a controlled statistical setting.
Keywords— Machine Learning, Diabetes Prediction, Random Forest, Support Vector Machine, PIMA Indian Dataset, Confusion Matrix, ROC Curve, Cross-Validation.
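The comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline: the feature count (8, chosen to mirror the PIMA feature count), the informative/redundant split, the random seed, the 5-fold stratified cross-validation, and the default hyperparameters are all assumptions. The sketch also does not rescale the synthetic features to clinical ranges such as BMI or age, which the paper's synthetic dataset is said to reflect.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset of 1,500 records, as stated in the abstract.
# Assumption: n_features, n_informative, n_redundant and the seed are illustrative.
X, y = make_classification(
    n_samples=1500,
    n_features=8,
    n_informative=5,
    n_redundant=1,
    random_state=42,
)

# The four classifiers named in the abstract, with default-style settings (assumed).
# Scaling is applied only to the models that are sensitive to feature scale.
models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "svm": make_pipeline(StandardScaler(), SVC()),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Assumption: 5-fold stratified cross-validation; the fold count is not given in the abstract.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)


def evaluate(models, X, y, cv):
    """Return the mean cross-validated accuracy for each model."""
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        for name, model in models.items()
    }


scores = evaluate(models, X, y, cv)
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

The same evaluate helper (a hypothetical name) could be called on the PIMA feature matrix and labels once loaded, so that both datasets are scored under an identical protocol.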