k-fold cross validation with the three-split technique (fastai)

Designing a great validation set (fastai)

According to fastai and Jeremy Howard's approach, you can combine k-fold cross validation with the three-split methodology, though it requires careful implementation.

Here's how it works with k-fold cross validation:

1. First split off a true test set: Before doing any k-fold CV, set aside a completely untouched test set (e.g., 10-20% of your data).
2. Perform k-fold CV on the remaining data: Take the remaining 80-90% and divide it into k folds (typically 5). For each of the k iterations:
   - Use k-1 folds as training data
   - Use the remaining fold as validation data
   - Rotate the validation fold, so the model is trained and evaluated k times and every fold serves as validation data exactly once
3. Final evaluation: After selecting your best model configuration through k-fold CV, retrain on all k folds combined and evaluate once on your original held-out test set.
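
The three steps above can be sketched in plain Python. This is a minimal illustration rather than fastai's or anyone's official recipe: the helper three_way_kfold, the toy data, and the "shrink" hyperparameter are all made up for the example, and the "model" is just a scaled training mean so the control flow stays visible.

```python
import random
from statistics import mean

def three_way_kfold(n_samples, k=5, test_frac=0.2, seed=0):
    """Return (test_idx, dev_idx, folds); folds holds k (train, valid) index lists."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_test = round(n_samples * test_frac)
    # Step 1: lock the test set away before any cross-validation happens.
    test_idx, dev_idx = idx[:n_test], idx[n_test:]
    # Step 2: deal the remaining indices into k near-equal folds.
    chunks = [dev_idx[i::k] for i in range(k)]
    folds = []
    for i, valid in enumerate(chunks):
        train = [j for c, chunk in enumerate(chunks) if c != i for j in chunk]
        folds.append((train, valid))
    return test_idx, dev_idx, folds

# Toy regression target (made-up data): y is roughly 5 plus noise.
rng = random.Random(1)
y = [5 + rng.gauss(0, 1) for _ in range(200)]

def fit(train_idx, shrink):
    """Toy 'model': the training mean scaled by a hypothetical hyperparameter."""
    return shrink * mean(y[i] for i in train_idx)

def mse(pred, idx):
    """Mean squared error of a constant prediction on the given rows."""
    return mean((y[i] - pred) ** 2 for i in idx)

test_idx, dev_idx, folds = three_way_kfold(len(y), k=5, test_frac=0.2)

# Step 2 (continued): score each candidate by its mean validation error over the k folds.
candidates = [0.0, 0.5, 1.0]
cv_error = {s: mean(mse(fit(tr, s), va) for tr, va in folds) for s in candidates}
best = min(cv_error, key=cv_error.get)

# Step 3: retrain on all non-test data, then touch the test set exactly once.
final_model = fit(dev_idx, best)
test_error = mse(final_model, test_idx)
print(f"best shrink={best}, CV MSE={cv_error[best]:.3f}, test MSE={test_error:.3f}")
```

The point of the structure is that the test indices never enter the selection loop: only cv_error decides which candidate wins, and test_error is computed once at the end.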

You can apply all the best practices within this framework:

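Time-based considerations: For time-dependent data, make the folds chronological, with each validation fold covering a later period than the data used to train for it, so the model never trains on data from the future relative to its validation set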
Stratification: Ensure each fold maintains the same class distribution for classification tasks
Group integrity: Keep related samples (e.g., same patient) in the same fold
Hyperparameter tuning: Use the average performance across all k validation sets to select hyperparameters
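
As a rough illustration of the first three practices, scikit-learn ships splitters that handle each case. The sketch below assumes scikit-learn and NumPy are installed; the toy arrays (20 rows sorted by time, 10 hypothetical patients with 2 rows each) are invented for the example.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold, TimeSeriesSplit

# Toy data (all values here are made up for illustration).
n = 20
X = np.arange(n).reshape(-1, 1)        # rows assumed to be sorted by time
y = np.array([0] * 15 + [1] * 5)       # imbalanced labels: 25% positives
groups = np.repeat(np.arange(10), 2)   # 10 hypothetical patients, 2 rows each

# Stratification: each validation fold keeps the overall 3:1 class ratio.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
strat_pos = [int(y[va].sum()) for _, va in skf.split(X, y)]

# Group integrity: all rows of a patient land on the same side of the split.
group_folds = list(GroupKFold(n_splits=5).split(X, y, groups))

# Time-based: training rows always come before the validation rows.
time_folds = list(TimeSeriesSplit(n_splits=4).split(X))

print("positives per stratified validation fold:", strat_pos)
for tr, va in time_folds:
    print(f"train rows 0..{tr.max()}, validate rows {va.min()}..{va.max()}")
```

Note that TimeSeriesSplit is not a rotation of equal folds: the training window grows with each split, which is what keeps future data out of training. Each of these splitters would be applied only to the data left after the test set has been removed.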

In fastai, while there's no built-in function that does the complete 3-split with k-fold process, you can implement it by:

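1. Manually creating your test set first and setting it aside
2. Using sklearn.model_selection.KFold or similar to create your k training/validation splits
3. Using fastai's DataBlock to load each fold configuration, for example by passing that fold's validation indices to a splitter such as IndexSplitter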
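
One possible way to wire these pieces together is sketched below. It is an assumption-laden outline, not a fastai-provided workflow: the helpers make_fold_indices and make_fold_dls are hypothetical, the DataFrame columns fname and label are invented, and an image classification task is assumed purely for concreteness. fastai is imported lazily so the index logic can be checked without it.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

def make_fold_indices(n_items, k=5, test_frac=0.2, seed=42):
    """Return (dev_pos, test_pos, folds); fold indices are positions within dev_pos."""
    # Step 1: carve out the test rows before any folds are built.
    dev_pos, test_pos = train_test_split(
        np.arange(n_items), test_size=test_frac, random_state=seed
    )
    # Step 2: k train/validation splits over the remaining rows only.
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    folds = [(tr, va) for tr, va in kf.split(dev_pos)]
    return dev_pos, test_pos, folds

def make_fold_dls(dev_df, valid_idx, bs=64):
    """Build fastai DataLoaders for one fold (hypothetical `fname`/`label` columns)."""
    # Imported here so the index logic above runs without fastai installed.
    from fastai.vision.all import (CategoryBlock, ColReader, DataBlock,
                                   ImageBlock, IndexSplitter)
    dblock = DataBlock(
        blocks=(ImageBlock, CategoryBlock),
        get_x=ColReader("fname"),   # assumed: column holding full image paths
        get_y=ColReader("label"),   # assumed: column holding class labels
        # Step 3: validation rows are given as positions within dev_df.
        splitter=IndexSplitter([int(i) for i in valid_idx]),
    )
    return dblock.dataloaders(dev_df, bs=bs)

dev_pos, test_pos, folds = make_fold_indices(n_items=100)
print(len(dev_pos), len(test_pos), [len(va) for _, va in folds])
```

In use, one would presumably build dev_df from the full table with something like df.iloc[dev_pos].reset_index(drop=True), call make_fold_dls(dev_df, va) once per fold to train and validate a fresh learner, and keep the rows at test_pos untouched until the final evaluation.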

The key insight from Jeremy Howard is that cross-validation is primarily useful for smaller datasets where a single validation split might not be representative, but the principles of having truly independent test data still apply.