Splitting strategies
Validation is the method you use to evaluate your model's errors correctly and to measure whether its performance improves or degrades from one experiment to the next.
Number of splits
Ideally, you should have three splits:
- Training set: This is the data your model learns from directly. The model sees these examples during training and updates its weights based on them.
- Validation set: This data is used during training to tune hyperparameters (such as the learning rate) and to make decisions such as when to stop training. The model does not learn from this data directly, but you use its performance on this set to guide training decisions. In fastai, the metrics reported after each epoch are computed on this set, and callbacks such as early stopping monitor the validation loss by default. The learning rate finder, by contrast, works from the training loss.
- Test set: This is a completely held-out dataset that you use only once, at the very end of model development. It provides an unbiased evaluation of your final model. Jeremy Howard (fastai) emphasizes that once you make decisions based on test set performance, it is no longer a true test set.
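The three splits described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `three_way_split` and the 70/15/15 default proportions are assumptions made for the example, not something prescribed by fastai or any other library.

```python
import random

def three_way_split(n_samples, valid_frac=0.15, test_frac=0.15, seed=42):
    """Return shuffled index lists (train, valid, test) for n_samples rows.

    Hypothetical helper for illustration only.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible

    n_test = round(n_samples * test_frac)
    n_valid = round(n_samples * valid_frac)

    test_idx = indices[:n_test]                    # set aside first, used once at the end
    valid_idx = indices[n_test:n_test + n_valid]   # guides tuning and early stopping
    train_idx = indices[n_test + n_valid:]         # the only rows the model learns from
    return train_idx, valid_idx, test_idx

train_idx, valid_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(valid_idx), len(test_idx))  # 700 150 150
```

In practice you would more likely reach for an existing utility such as scikit-learn's `train_test_split`, applied twice. The point of the sketch is only that the test indices are carved off before any tuning begins and are never touched again until the final evaluation.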
Recommended approach to train-test splits
The recommended approach to train-test splits focuses on creating validation sets that truly represent your production environment. Here are the key principles:
- Time-based splits for time series data: If your data has a time component, split chronologically. Use older data for training and newer data for validation to simulate how the model will perform in production.
- Random splits for most other cases: For data without a time component, a random split is usually appropriate, provided each split preserves the overall data distribution (for example, the class balance).
- k-fold cross-validation: For smaller datasets, k-fold cross-validation (typically 5-fold) gives more robust performance estimates. To keep the other practices described here while still using k-fold cross-validation, combine it with the three-split technique (fastai).
- Split proportions: A common recommendation is:
  - 80% training, 20% validation for larger datasets
  - Adding a separate test set (e.g., 70/15/15 or 80/10/10) for final evaluation
- Check whether your dataset is independent and identically distributed (i.i.d.). You can get hints by experimenting with stratified schemes: if stratifying on a certain feature changes the results decisively, for instance, the data is probably not i.i.d. with respect to that feature.
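Two of the principles above, chronological splitting and k-fold cross-validation alongside a held-out test set, can be sketched as follows. The helper names `chronological_split` and `k_fold_indices`, the toy data, and the 15% test fraction are all assumptions made for illustration. Holding out the test set first and cross-validating only the remainder is one plausible way to combine k-fold with three splits, not necessarily the exact fastai recipe.

```python
import random
from datetime import date, timedelta

def chronological_split(rows, time_key, valid_frac=0.2):
    """Train on the oldest rows, validate on the newest ones."""
    ordered = sorted(rows, key=time_key)
    cut = len(ordered) - round(len(ordered) * valid_frac)
    return ordered[:cut], ordered[cut:]

def k_fold_indices(indices, k=5, seed=0):
    """Yield (train, valid) index lists; each index is validated exactly once."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k roughly equal folds
    for i, valid in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid

# Toy time series: one row per day.
rows = [{"day": date(2024, 1, 1) + timedelta(days=i), "y": i} for i in range(100)]
train_rows, valid_rows = chronological_split(rows, time_key=lambda r: r["day"])
# Every training row predates every validation row.
assert max(r["day"] for r in train_rows) < min(r["day"] for r in valid_rows)

# k-fold on everything except a held-out test set.
all_idx = list(range(100))
random.Random(1).shuffle(all_idx)
test_idx, dev_idx = all_idx[:15], all_idx[15:]
fold_sizes = []
for train_idx, valid_idx in k_fold_indices(dev_idx, k=5):
    assert set(valid_idx).isdisjoint(test_idx)  # test rows never enter a fold
    fold_sizes.append(len(valid_idx))
```

scikit-learn's `KFold` and `TimeSeriesSplit` cover the same ground with more options. The hand-rolled version here is only meant to make the mechanics visible.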