k-fold cross-validation
Designing a great validation set (fastai)
<aside>
💡 Suggestion: building a reliable validation scheme, by favouring more a k-fold over a train-test-split, given its probabilistic nature and ability to generalize to unseen data.
</aside>
Divide dataset. Use test set for all the models that you train using the remaining part of the data.
- very simple
- usually 80 / 20
- Python: train_test_split
Wichtige Anmerkungen
- Since the extraction process is based on randomness, you always have the chance of extracting a non-representative sample (→ especially true for small dataset)
- In addition, to ensure that your test sampling is representative, especially with regard to how the training data relates to the target variable, you can use stratification, which ensures that the proportions of certain features are respected in the sampled data. You can use the stratify parameter in the train_test_split function and provide an array containing the class distribution to preserve.
- Frequently evaluate on the test set can lead to Overfitting .