Designing a great validation set (fastai)
GridSearchCV
The most basic train-test split
Probabilistic evaluation methods
Adversarial validation
The correct data transformation process
- Split the dataset first and set your test set aside
- Fit the transformation on the train set and transform the train set
- Transform the rest of the data with the transformation fitted on the train set
<aside>
💡 Bottom line: Never transform your data before splitting it.
</aside>
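The three steps above can be sketched in plain NumPy. This is a minimal illustration, not a reference implementation: the synthetic data, the 80/20 split, and every variable name are assumptions made for the example, and standardization stands in for whatever transformation you actually use.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; the data is synthetic and illustrative
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# 1. Split first (hypothetical 80/20 split on shuffled row indices).
idx = rng.permutation(len(X))
train_idx, test_idx = idx[:80], idx[80:]
X_train, X_test = X[train_idx], X[test_idx]

# 2. Fit the transformation on the train set only.
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)

# 3. Transform every split with the train statistics.
X_train_scaled = (X_train - train_mean) / train_std
X_test_scaled = (X_test - train_mean) / train_std

# Wrong order, for contrast: statistics computed on all rows
# include the test rows, so test information leaks into training.
leaky_mean = X.mean(axis=0)
```

The scaled train set has zero mean and unit variance by construction. The scaled test set generally does not, and that is expected: it is measured against the train statistics, just as new data would be in production.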
In more detail:
- Split the dataset before you do anything else. Set aside the data you'll use to test the model.
- To learn about your data, explore only the split you'll use to train the model. Leave the test data alone. Don't look at it.
- When it's time to preprocess the data, process each split separately, but fit the transformation only on the train data. Reuse the statistics learned from the train data (for example, its mean and standard deviation) to transform the rest of the dataset.
- Evaluate your model on the test data as few times as possible. Ideally, use your test data only once.
- If you reuse your test data, don't use the results of your model as feedback to improve it. The more strictly you stick to this, the longer you'll delay the inevitable overfitting to the test set.
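One way to follow this checklist with scikit-learn is to wrap the preprocessing and the model in a pipeline, so the scaler is refit on the training portion of each cross-validation fold. The sketch below assumes a synthetic classification dataset, a `StandardScaler`, and a `LogisticRegression`; all three are placeholders chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to make the example runnable.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Split before anything else; the test set is not touched until the end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The pipeline refits the scaler on the training portion of each fold,
# so validation folds never influence the scaling statistics.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Iterate on the model using cv_scores only. Once done, fit on the
# full train set and score the test set a single time.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
```

All model feedback comes from `cv_scores`, which are computed from the train split alone. The test score is read once, at the end, and is not fed back into model selection.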
The picture below shows a wrong and a correct example:
