Stratified Sampling in Cohort-Based Data for Machine Learning Model Development
Vaibhav Tummalapalli
Atlanta, USA
vaibhav.tummalapalli21@gmail.com
Abstract—Cohort-based data is common in many industries, where it supports longitudinal analysis and the tracking of customer behavior over time. However, sampling such data for model development presents unique challenges, especially when events (e.g., responses, purchases) are unevenly distributed across cohorts. Random sampling can distort the event mix within individual cohorts, introducing biases that lead to models that fail to generalize. This paper presents a stratified sampling framework designed to maintain the proportional representation of events and non-events within each cohort, even when oversampling or undersampling is applied. The approach is intended to yield stable, unbiased models, and the paper discusses practical implementation and evaluation metrics.
Keywords—Random, Stratified Sampling, Machine Learning, Cohort Analysis, Class Imbalance, Sampling Bias.
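
As a rough illustration of the idea summarized in the abstract, the following minimal Python sketch samples each (cohort, event) stratum at a fixed rate, so that undersampling non-events changes the event mix by the same factor in every cohort. It is not the paper's implementation: the function name stratified_sample, the record keys "cohort" and "event", the parameters event_frac and non_event_frac, and the toy data are all assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_sample(rows, event_frac, non_event_frac, seed=0):
    """Sample every (cohort, event) stratum at a fixed rate.

    rows: list of dicts with the hypothetical keys "cohort" and "event" (0/1).
    event_frac / non_event_frac: share of events / non-events to keep,
    applied identically in every cohort (assumed to lie in (0, 1]).
    """
    rng = random.Random(seed)

    # Group records by stratum, i.e. by (cohort, event flag).
    strata = defaultdict(list)
    for row in rows:
        strata[(row["cohort"], row["event"])].append(row)

    sample = []
    for key in sorted(strata):
        members = strata[key]
        frac = event_frac if key[1] else non_event_frac
        # Keep at least one record per stratum, never more than exist.
        k = min(len(members), max(1, round(frac * len(members))))
        sample.extend(rng.sample(members, k))
    return sample


# Toy data (illustrative only): two cohorts with different event rates.
rows = (
    [{"cohort": "2023-01", "event": 1}] * 20
    + [{"cohort": "2023-01", "event": 0}] * 180
    + [{"cohort": "2023-02", "event": 1}] * 10
    + [{"cohort": "2023-02", "event": 0}] * 390
)

# Keep all events and undersample non-events by half in every cohort.
sample = stratified_sample(rows, event_frac=1.0, non_event_frac=0.5)
```

Because the same two rates are used in every cohort, no cohort loses its events by chance, and the ratio of events to non-events is scaled by the same factor everywhere, which keeps cohorts comparable after resampling.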