I wrote this post on Cardinal Path’s blog. There is a lot to consider when building a model, and one of those things is data leakage.

Data leakage occurs when the data used to train a machine learning algorithm unexpectedly includes information about the very thing you are trying to predict, allowing the model to make unrealistically good predictions. In other words, because the prediction is already hidden in one of the input variables, the model’s results may not actually be useful.

For example, let’s say we’re trying to predict which customers made a purchase, and the dataset used to make the predictions contains a customer ID that we assume to be random, but it turns out the ID starts with a 1 whenever the customer made a purchase. We can now build a model with impeccable predictions (just apply the rule: first digit = 1). However, those predictions are useless; they tell us nothing about whether a new customer will purchase, because the model never learned the factors that actually matter. Of course, this example is a bit ridiculous, since using a customer ID as a predictor would be naive (a quick sketch of it in code follows below), but there are more concrete examples of where this happens in practice.
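To see the effect in miniature, here is a hedged sketch of the customer-ID leak using a synthetic dataset and scikit-learn; the column names, sample sizes, and the "spend" feature are all made up for illustration, not taken from any real data.

```python
# Minimal sketch of the leaky-customer-ID example (synthetic data, assumed names).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
n = 5000
purchased = rng.integers(0, 2, size=n)                # the label we want to predict
spend = rng.normal(50, 15, size=n) + 5 * purchased    # a genuine, weakly predictive feature

# The leak: purchasers happen to get IDs whose first digit is 1.
customer_id = np.where(purchased == 1,
                       rng.integers(10000, 20000, size=n),   # starts with 1
                       rng.integers(20000, 90000, size=n))   # starts with 2-8
first_digit = customer_id // 10000

X_leaky = np.column_stack([first_digit, spend])
X_clean = spend.reshape(-1, 1)

for name, X in [("with leaked ID digit", X_leaky), ("without it", X_clean)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, purchased, random_state=0)
    model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, model.predict(X_te)))

# The leaky model scores essentially 100% on held-out data, while the honest one
# does far worse -- yet only the honest one says anything about new customers.
```

The point of the sketch is that an ordinary train/test split will not catch this kind of leak, because the leaked variable is present on both sides of the split; the near-perfect score is the warning sign, not a reassurance.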

Read more here!