Overfitting is a well-known issue in machine learning and statistics. It occurs when we build models that closely explain a training data set but fail to generalize when applied to other data sets. Overfitting is a part of life as a data scientist; we all do it to some degree or another. In the case of forecasting in data science competitions, it might actually be advantageous to overfit to Kaggle’s public leaderboard. However, if you have an independent and identically distributed (iid) split between train and test data sets, then it’s probably better to come up with a leak-free cross-validation (CV) scheme.
Preventing leakage (the creation of unexpected additional information in the training data set, resulting in a machine learning algorithm making unrealistically good predictions) is actually a difficult task, and a typical k-fold or random split may not always prevent it. You may need to use some kind of stratified sampling scheme. Given Kaggle’s typical allowance of two final entries, you may want to select one entry that maximizes your local CV score and one that maximizes your public leaderboard score.
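As a concrete sketch of leak-aware splitting (synthetic data; scikit-learn is my assumed toolkit, not something the author names): stratified folds preserve the class balance in each fold, and group folds keep rows from the same source out of both sides of a split, which is one common way leakage sneaks past a plain k-fold.

```python
# A minimal sketch of leak-aware CV on synthetic data (scikit-learn assumed;
# the group structure here is hypothetical).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GroupKFold, StratifiedKFold

X, y = make_classification(n_samples=100, random_state=0)
groups = np.arange(100) // 5  # pretend every 5 rows share a source (e.g. one user)

# Stratified folds keep the class balance nearly identical across folds.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    assert abs(y[train_idx].mean() - y[test_idx].mean()) < 0.1

# Group folds never let the same source appear in both train and test.
gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```

Which scheme applies depends on how your rows are related; if rows sharing an entity could leak label information across folds, split by that entity rather than at random.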
When thinking about overfitting, here are some points about the Data Science Bowl to keep in mind:
Am I correcting for multiple comparisons on my score, given a large number of submissions? As a general rule, you should expect a drop from your public to your private ranking roughly proportional to your number of submissions, unless you follow a very rigid CV scheme.
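The multiple-comparisons point can be illustrated with a toy simulation (entirely hypothetical numbers): if many submissions share the same true skill and differ only by public-leaderboard noise, the one with the best public score is simply the lucky one, and its private score regresses back toward the truth.

```python
# Toy simulation of public-leaderboard selection bias (all numbers hypothetical).
import numpy as np

rng = np.random.default_rng(0)
n_submissions = 200
true_skill = 0.80  # every submission has the same underlying quality

# Public and private scores are independent noisy measurements of true skill
# (the noise stands in for finite test-set size).
public = true_skill + rng.normal(0.0, 0.01, n_submissions)
private = true_skill + rng.normal(0.0, 0.01, n_submissions)

best_public_idx = int(np.argmax(public))
print(f"best public score      : {public[best_public_idx]:.4f}")
print(f"its private score      : {private[best_public_idx]:.4f}")
print(f"true underlying skill  : {true_skill:.4f}")
```

The more submissions you make, the larger the gap between the best public score and the true skill, which is exactly the public-to-private drop described above.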
Do I want to overfit on purpose for ensembling gains? Occasionally an overfit model can work well within a large data set; this is sometimes seen in neural net ensembles. The best individual scores do not always make for the best marginal ensemble gains. Note: this does not mean you want the final ensemble model to overfit. Diversity via surrogate loss functions can also add value through underfitting; for example, you could try hinge loss instead of logistic loss. Ensembling many diverse models can help mitigate overfitting in some cases. Strategies for ensembling can involve bagging rows of the data, using different algorithms, or trying different groups of features.
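Here is a minimal ensembling sketch under assumptions of my own (synthetic data, scikit-learn models): diversity comes from a surrogate hinge loss, bagged rows, and a plain logistic model, with hard votes averaged since hinge loss yields no probabilities.

```python
# A hedged ensembling sketch (hypothetical data and model choices).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = [
    LogisticRegression(max_iter=1000),                    # logistic loss
    SGDClassifier(loss="hinge", random_state=0),          # surrogate hinge loss
    BaggingClassifier(LogisticRegression(max_iter=1000),  # bagging rows of the data
                      n_estimators=10, random_state=0),
]
for m in models:
    m.fit(X_tr, y_tr)

# Average hard votes; >= 0.5 means a majority of the three models agree.
votes = np.mean([m.predict(X_te) for m in models], axis=0)
ensemble_pred = (votes >= 0.5).astype(int)
print("ensemble accuracy:", accuracy_score(y_te, ensemble_pred))
```

Feature-group diversity (the third strategy above) would follow the same pattern: fit each base model on a different column subset, then vote.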
If the Data Science Bowl allows semi-supervised learning, in theory leveraging more data is always better given infinite AWS computational resources. However, in practice, these methods have not always produced champions. Consider it an open challenge.
Given iid train/public test/private test splits and fixed sample sizes, you can bound the generalization error. So overfitting in this case is not a bad idea when the number of test set rows (observations) is very large (in the billions) and the number of columns (features) is less than the number of rows. Small train and test sets require the data scientist to be more cautious and more vigilant with CV.
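One way to make the "bound the error" remark concrete (my choice of bound, not necessarily the author's) is Hoeffding's inequality: with n iid test rows and a per-row loss bounded in [0, 1], the observed score strays from the true error by more than eps with probability at most 2*exp(-2*n*eps**2), so billions of rows pin the score down very tightly.

```python
# Back-of-the-envelope Hoeffding bound (an illustrative assumption, not
# a bound the author specifies).
import math

def hoeffding_eps(n, delta=0.05):
    """Deviation eps such that P(|observed - true| > eps) <= delta,
    from 2 * exp(-2 * n * eps**2) = delta solved for eps."""
    return math.sqrt(math.log(2 / delta) / (2 * n))

for n in (10_000, 1_000_000, 1_000_000_000):
    print(f"n = {n:>13,}   eps = {hoeffding_eps(n):.6f}")
```

With a billion test rows the 95% deviation is on the order of 1e-5, which is why huge test sets make leaderboard scores trustworthy, while small ones demand the extra CV vigilance mentioned above.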
It is good to know exactly how the public-to-private data split was made, and exactly how the train-to-test data split was made. However, this information is not always available for every data science competition. The best way to avoid overfitting in data science is to make only a single Kaggle entry based on local CV; Tim Salimans obtained second place in “Don’t Overfit!” with a single submission. However, it’s often more fun to grind your way into a stochastic (public) leaderboard descent. I suggest you explore and find the options that work best for you. The leaderboard is temporary (even in the Data Science Bowl), but the data science knowledge you gain can last a lifetime.
—Written by Mike Kim