Suppose that you are using a machine learning model to fit a feature set and predict outcomes. Training the model on your full dataset is a bad idea, as it could result in overfitting: the model will adjust its parameters and hyper-parameters until it finds an optimal combination, molding itself to your data points and losing the generality needed to predict new results.

To prevent overfitting, a standard approach is to divide your dataset into two pieces: a training set and a test set. The model is first trained on the training set (comprising perhaps 70% of the data) and its performance is then measured on the test set (the remaining 30%).
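Concretely, the split might look like the following sketch. It assumes scikit-learn and a synthetic dataset; the logistic regression model, the 70/30 ratio, and the variable names are illustrative choices, not anything prescribed above.

```python
# A minimal sketch of a 70/30 train/test split, assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for your feature set and outcomes.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out 30% of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)  # fit on the training set only

print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```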

Now suppose you've gone through this process and your model performs poorly on the test set. You cannot simply modify the model parameters so that they better fit the out-of-sample data, as this would again result in overfitting.

Nor can you train multiple models and keep whichever performs best on the test set. By doing so, you will have implicitly made a modelling decision based on out-of-sample performance. This also constitutes overfitting, albeit at a higher level.
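To see why this is a problem, consider a small simulation, sketched below under the assumption that NumPy is available. On pure-noise data every candidate "model" has a true accuracy of 50%, yet the best score among many candidates on a single fixed test set will typically look noticeably better than that, purely because the winner was chosen on the same data used to report its score.

```python
# Sketch: selection bias from picking the best of many models on one test set.
import numpy as np

rng = np.random.default_rng(0)

n_test = 200     # size of the fixed test set
n_models = 50    # number of candidate "models" compared on it

y_test = rng.integers(0, 2, size=n_test)      # true labels (coin flips)

best_accuracy = 0.0
for _ in range(n_models):
    preds = rng.integers(0, 2, size=n_test)   # each model guesses at random
    accuracy = np.mean(preds == y_test)
    best_accuracy = max(best_accuracy, accuracy)

print("True skill of every model: 0.50")
print(f"Best observed test accuracy: {best_accuracy:.2f}")  # typically well above 0.50
```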

So the question is: how do you improve your model if it performs poorly on the test set?

A part of the answer is to divide your full dataset into more pieces. You could divide the training set into its own constituent training and test sets, iterating on the model until you are ready to measure performance on the held-out out-of-sample data. Moreover, you could set aside a part of the data as a validation set and use it to pick between models. In this case, you would train multiple models on the training set, compare their performance on the validation set and pick the best one, and finally measure that model's performance on the test set.
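Put together, that workflow might look like the following sketch, again assuming scikit-learn; the two candidate models and the 60/20/20 split are illustrative assumptions, not requirements.

```python
# A minimal sketch of a train/validation/test workflow, assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# First carve off the final test set, then split the rest into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Train every candidate on the training set and compare them on the validation set.
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = model.score(X_val, y_val)

best_name = max(val_scores, key=val_scores.get)
best_model = candidates[best_name]

# Only the chosen model is evaluated on the untouched test set.
print(f"Chose {best_name} (validation accuracy {val_scores[best_name]:.3f})")
print(f"Final test accuracy: {best_model.score(X_test, y_test):.3f}")
```

The key point is that the test set is touched exactly once, after every modelling decision has already been made on the training and validation sets.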

After all this, what if the model still performs poorly on the out-of-sample data? Are you doomed now that you've used all your data?

Most would say that if the out-of-sample failure gives you a new idea, you are not at fault for designing a new model to test that hypothesis. Or that if you change the input features or something similarly fundamental, that's ok too.

The problem is that if you go through this process enough times, you are bound to find a model that fits the test set. Creating a feedback loop between out-of-sample performance and model design, even if the design changes are fundamental, could constitute overfitting over time. Understanding this boundary and knowing when you're at fault is perhaps more of an art than a science.