Designing a great validation set (fastai)

According to fastai and Jeremy Howard's approach, you can combine k-fold cross-validation with the three-way split (training, validation and test sets), though it requires careful implementation.

Here's how the two fit together:

  1. First split off a true test set: Before doing any k-fold CV, set aside a completely untouched test set (e.g., 10-20% of your data).
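  2. Perform k-fold CV on the remaining data: Take the remaining 80-90% and divide it into k folds (typically 5). For each iteration, train on k-1 folds and validate on the one fold left out, so that every fold serves as the validation set exactly once.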
  3. Final evaluation: After selecting your best model configuration through k-fold CV, train on all k folds combined and evaluate on your original held-out test set.
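
The three steps above can be sketched with nothing but the Python standard library. This is only an illustration under stated assumptions: the synthetic data, the toy 1-D nearest-neighbour regressor and the helper names (holdout_split, kfold, knn_predict, mse) are all made up for the example and are not part of fastai.

```python
import random
from statistics import mean

def holdout_split(n_items, test_frac=0.2, seed=0):
    """Step 1: shuffle indices once and set aside test_frac of them as the test set."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n_items * test_frac))
    return idx[n_test:], idx[:n_test]  # (dev indices, test indices)

def kfold(indices, k=5):
    """Step 2: yield (train, valid) index lists; each index is in valid exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def knn_predict(train_xy, x, n_neighbors):
    """Toy 1-D nearest-neighbour regressor, standing in for a real model."""
    nearest = sorted(train_xy, key=lambda p: abs(p[0] - x))[:n_neighbors]
    return mean(y for _, y in nearest)

def mse(train_xy, eval_xy, n_neighbors):
    """Mean squared error of the toy model fitted on train_xy, scored on eval_xy."""
    return mean((knn_predict(train_xy, x, n_neighbors) - y) ** 2 for x, y in eval_xy)

# Synthetic data (assumption for the example): y = x^2 plus a little noise.
rng = random.Random(42)
xs = [rng.uniform(-1, 1) for _ in range(200)]
data = [(x, x * x + rng.gauss(0, 0.1)) for x in xs]

# Step 1: the test indices are not touched again until the very end.
dev_idx, test_idx = holdout_split(len(data), test_frac=0.2)

# Step 2: compare candidate configurations by their mean validation error.
cv_scores = {}
for n_neighbors in (1, 5, 25):
    fold_scores = []
    for train_idx, valid_idx in kfold(dev_idx, k=5):
        train = [data[i] for i in train_idx]
        valid = [data[i] for i in valid_idx]
        fold_scores.append(mse(train, valid, n_neighbors))
    cv_scores[n_neighbors] = mean(fold_scores)
best = min(cv_scores, key=cv_scores.get)

# Step 3: refit on all non-test data and evaluate once on the held-out test set.
dev = [data[i] for i in dev_idx]
test = [data[i] for i in test_idx]
test_mse = mse(dev, test, best)
print(f"CV error per config: {cv_scores}")
print(f"best n_neighbors={best}, test MSE={test_mse:.4f}")
```

Note that the test set plays no part in choosing the configuration; it is only used for the final number.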

You can apply all the usual validation-set best practices within this framework.

fastai has no built-in function that performs the complete three-way split with k-fold CV, but you can implement it by:

  1. First manually creating your test set
  2. Then using sklearn.model_selection.KFold or similar to create your k training/validation splits
  3. Using fastai's DataBlock to load each fold configuration, for example by passing the fold's validation indices to a splitter such as IndexSplitter
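
A minimal sketch of those three steps, assuming scikit-learn and NumPy are installed. The helper names split_test_and_folds and fold_splitter are hypothetical, and the DataBlock usage in the trailing comments is only an outline because the blocks, getters and items depend on your data. fastai is imported lazily so the index logic can be run without it.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

def split_test_and_folds(n_items, test_size=0.2, n_splits=5, seed=42):
    """Return (dev_idx, test_idx, folds).

    dev_idx and test_idx index the full dataset. Each entry of folds is a
    (train_pos, valid_pos) pair of positions *within dev_idx*.
    """
    # Step 1: set the test indices aside before any cross-validation.
    dev_idx, test_idx = train_test_split(
        np.arange(n_items), test_size=test_size, random_state=seed)
    # Step 2: k-fold over the remaining (dev) items only.
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    folds = list(kf.split(dev_idx))
    return dev_idx, test_idx, folds

def fold_splitter(valid_pos):
    """Step 3: wrap one fold's validation positions for DataBlock(splitter=...)."""
    from fastai.data.transforms import IndexSplitter  # lazy import; needs fastai
    return IndexSplitter(valid_pos.tolist())

dev_idx, test_idx, folds = split_test_and_folds(n_items=1000)
for fold, (train_pos, valid_pos) in enumerate(folds):
    print(f"fold {fold}: {len(train_pos)} train / {len(valid_pos)} valid")
    # With fastai installed, and `items` standing for your own list of samples
    # (hypothetical), each fold could be loaded roughly like this:
    #   dev_items = [items[i] for i in dev_idx]   # test items never enter the DataBlock
    #   dblock = DataBlock(blocks=..., get_x=..., get_y=...,
    #                      splitter=fold_splitter(valid_pos))
    #   dls = dblock.dataloaders(dev_items)
```

Because IndexSplitter puts every item that is not in the validation list into the training set, the sketch passes only the non-test items to the DataBlock. That keeps the test set out of training by construction.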

The key insight from Jeremy Howard is that cross-validation is primarily useful for smaller datasets where a single validation split might not be representative, but the principles of having truly independent test data still apply.
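
One rough way to see why dataset size matters here is the binomial standard error of an accuracy estimate. This is a standard approximation added for illustration, not something taken from fastai, and it assumes the validation examples are independent and the model's true accuracy is fixed at the value given.

```python
from math import sqrt

def accuracy_standard_error(accuracy, n_valid):
    """Binomial standard error of an accuracy measured on n_valid independent examples."""
    return sqrt(accuracy * (1 - accuracy) / n_valid)

# The same 90% accuracy is far less certain on a small validation set.
for n_valid in (100, 1_000, 10_000):
    se = accuracy_standard_error(0.9, n_valid)
    print(f"n_valid={n_valid:>6}: 0.90 +/- {se:.3f}")
```

With only a hundred validation examples, the uncertainty is several percentage points, which can easily exceed the difference between two candidate models. Averaging over k folds lets every non-test example contribute to the estimate.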