Designing a great validation set (fastai)
Following the fastai approach associated with Jeremy Howard, you can combine k-fold cross-validation with the three-way split (training, validation, and test sets), provided the test set is kept out of every fold.
Here's how it works with k-fold cross validation:
- First split off a true test set: Before doing any k-fold CV, set aside a completely untouched test set (e.g., 10-20% of your data).
- Perform k-fold CV on the remaining data: Take the remaining 80-90% and divide it into k folds (typically 5). In each of the k iterations:
  - Use k-1 folds as training data
  - Use the remaining fold as validation data, so that every fold serves as the validation set exactly once
- Final evaluation: After selecting your best model configuration through k-fold CV, train on all k folds combined and evaluate on your original held-out test set.
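The three steps above can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not a fastai recipe: the synthetic dataset, the LogisticRegression model (standing in for whatever model you actually train), and the candidate_C grid are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split

# Synthetic stand-in data; replace with your own features and labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Step 1: set aside the test set before any cross-validation happens.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: k-fold CV on the remaining data, once per candidate configuration.
candidate_C = [0.01, 0.1, 1.0]  # hypothetical hyperparameter grid
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
mean_scores = {}
for C in candidate_C:
    fold_scores = []
    for train_idx, valid_idx in kfold.split(X_dev):
        model = LogisticRegression(C=C, max_iter=1000)
        model.fit(X_dev[train_idx], y_dev[train_idx])
        fold_scores.append(model.score(X_dev[valid_idx], y_dev[valid_idx]))
    # Average validation accuracy across the k folds.
    mean_scores[C] = float(np.mean(fold_scores))

best_C = max(mean_scores, key=mean_scores.get)

# Step 3: retrain on all k folds combined, then score once on the test set.
final_model = LogisticRegression(C=best_C, max_iter=1000)
final_model.fit(X_dev, y_dev)
test_accuracy = final_model.score(X_test, y_test)

print(f"best C: {best_C}")
print(f"mean CV accuracy: {mean_scores[best_C]:.3f}")
print(f"test accuracy: {test_accuracy:.3f}")
```

The test set is touched exactly once, after the configuration has been chosen, which is what keeps the final number an honest estimate.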
You can apply all the best practices within this framework:
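- Time-based considerations: For time-ordered data, use forward-chaining splits (e.g., sklearn.model_selection.TimeSeriesSplit) so that each validation fold covers a later time period than the data used to train on it; ordinary shuffled folds would let the model train on the future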
- Stratification: Ensure each fold maintains the same class distribution for classification tasks
- Group integrity: Keep related samples (e.g., same patient) in the same fold
- Hyperparameter tuning: Use the average performance across all k validation sets to select hyperparameters
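As a rough illustration of the stratification, group, and time-ordering points, scikit-learn ships a splitter for each case. The toy arrays below are assumptions for the example: the labels are deliberately imbalanced, the groups array plays the role of hypothetical patient IDs, and the rows are assumed to be already sorted by time for the last splitter.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold, TimeSeriesSplit

rng = np.random.default_rng(0)
n = 120
X = rng.normal(size=(n, 3))
y = np.array([0] * 90 + [1] * 30)   # imbalanced labels, 25% positive
groups = np.arange(n) // 6          # hypothetical patient IDs, 6 rows each

# Stratification: each validation fold keeps roughly the overall positive rate.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
strat_rates = [float(y[valid_idx].mean()) for _, valid_idx in skf.split(X, y)]

# Group integrity: no patient appears on both sides of any split.
gkf = GroupKFold(n_splits=5)
group_overlaps = [
    len(set(groups[train_idx]) & set(groups[valid_idx]))
    for train_idx, valid_idx in gkf.split(X, y, groups=groups)
]

# Time order: every validation block comes strictly after its training rows.
tss = TimeSeriesSplit(n_splits=5)
time_ordered = [
    bool(train_idx.max() < valid_idx.min()) for train_idx, valid_idx in tss.split(X)
]

print("positive rate per stratified fold:", strat_rates)
print("shared groups per grouped fold:", group_overlaps)
print("validation after training in every time split:", all(time_ordered))
```

If you need stratification and group integrity at the same time, scikit-learn also provides StratifiedGroupKFold, which takes both y and groups.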
In fastai, while there's no built-in function that does the complete 3-split with k-fold process, you can implement it by:
- First manually creating your test set
- Then using sklearn.model_selection.KFold or similar to create your k training/validation splits
- Using fastai's DataBlock to load each fold configuration
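One possible way to wire these steps together is sketched below. It assumes your items live in a pandas DataFrame and that fastai's IndexSplitter is used to hand each fold's validation indices to the DataBlock. The helper names (split_test_and_folds, fold_dataloaders), the "fname" and "label" columns, and the image/category blocks are hypothetical choices for illustration; adapt them to your data. The fastai import sits inside the function so the index bookkeeping can be tried without fastai installed.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, train_test_split


def split_test_and_folds(df, test_size=0.2, n_splits=5, seed=42):
    """Return (dev_df, test_df, list of validation index lists into dev_df)."""
    dev_df, test_df = train_test_split(df, test_size=test_size, random_state=seed)
    # Reset the index so positional fold indices line up with dev_df rows.
    dev_df = dev_df.reset_index(drop=True)
    test_df = test_df.reset_index(drop=True)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    valid_idxs = [valid_idx.tolist() for _, valid_idx in kfold.split(dev_df)]
    return dev_df, test_df, valid_idxs


def fold_dataloaders(dev_df, valid_idx, bs=64):
    """Build DataLoaders for one fold (requires fastai; columns are assumed)."""
    from fastai.vision.all import (
        CategoryBlock, ColReader, DataBlock, ImageBlock, IndexSplitter,
    )
    dblock = DataBlock(
        blocks=(ImageBlock, CategoryBlock),
        get_x=ColReader("fname"),   # assumed column holding image paths
        get_y=ColReader("label"),   # assumed column holding class labels
        splitter=IndexSplitter(valid_idx),  # rows not listed become training data
    )
    return dblock.dataloaders(dev_df, bs=bs)


# Index bookkeeping on a toy frame (this part does not need fastai).
toy = pd.DataFrame({
    "fname": [f"img_{i}.png" for i in range(50)],
    "label": [i % 2 for i in range(50)],
})
dev_df, test_df, valid_idxs = split_test_and_folds(toy)
print(len(dev_df), len(test_df), [len(v) for v in valid_idxs])
```

With real image files in place, you would loop over valid_idxs, call fold_dataloaders(dev_df, valid_idx) for each fold, train a fresh learner on the result, and average the validation metrics. test_df stays untouched until the final evaluation.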
The key insight from Jeremy Howard is that cross-validation is primarily useful for smaller datasets where a single validation split might not be representative, but the principles of having truly independent test data still apply.