Interpretable Loan Default Probability Prediction using Machine Learning
Author: Priti Vankar
Department of Computer Engineering
Parul University, Vadodara, India
Abstract: Accurate and interpretable credit risk assessment is a fundamental requirement for modern lending institutions operating under increasingly stringent regulatory frameworks. This paper presents a comparative study of Random Forest and Logistic Regression for loan default prediction, evaluated on a dataset of 15,200 anonymized loan records sourced from the GoMask financial platform. The preprocessing pipeline incorporates median/mode imputation, one-hot encoding, StandardScaler normalization, and SMOTE-based class-imbalance correction. Both classifiers are assessed using stratified 5-fold cross-validation and a held-out test set (n = 3,040) on accuracy, precision, recall, F1-score, ROC-AUC, and confusion matrix decomposition. Logistic Regression achieves 91% accuracy, an F1-score of 0.90, and a ROC-AUC of 0.94, outperforming Random Forest on class discrimination (ROC-AUC = 0.85). It also provides full model transparency through auditable sigmoid coefficients, which help satisfy the explainability requirements of GDPR Article 22 and the FCRA. The preferred model is deployed within a Flask-based interactive dashboard supporting real-time single-applicant scoring, bulk CSV inference, and a 25-chart exploratory analytics suite. This work demonstrates that an inherently interpretable classifier can match ensemble accuracy in credit scoring while meeting compliance requirements, and provides a fully reproducible, open-source blueprint for regulation-ready financial AI.
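To make the notion of auditable sigmoid coefficients concrete, the following is a minimal Python sketch of how a logistic model turns one applicant's features into a default probability and a per-feature breakdown of the log-odds. The feature names, coefficient values, and intercept are hypothetical placeholders chosen for illustration, not the fitted values from this study, and the inputs are assumed to be already standardized.

```python
import math

# Hypothetical coefficients for illustration only; NOT the fitted model from this study.
INTERCEPT = -1.2
COEFFICIENTS = {
    "debt_to_income": 0.9,
    "credit_score": -1.4,
    "loan_amount": 0.3,
}


def sigmoid(z: float) -> float:
    """Map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def explain_score(features: dict[str, float]) -> tuple[float, dict[str, float]]:
    """Return the default probability and each feature's additive log-odds contribution."""
    contributions = {
        name: COEFFICIENTS[name] * features[name] for name in COEFFICIENTS
    }
    log_odds = INTERCEPT + sum(contributions.values())
    return sigmoid(log_odds), contributions


# Hypothetical applicant with standardized (z-scored) feature values.
applicant = {"debt_to_income": 1.5, "credit_score": -0.8, "loan_amount": 0.2}
probability, contributions = explain_score(applicant)

print(f"P(default) = {probability:.3f}")
# List features from largest to smallest absolute effect on the log-odds.
for name, value in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>15}: {value:+.2f} log-odds")
```

Because each contribution is simply a coefficient multiplied by a feature value, a reviewer can trace any individual score back to its inputs, which is the property the abstract refers to as model transparency.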