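<aside> 💡 Suggestion: build a reliable validation scheme by favouring k-fold cross-validation over a single train-test split. Averaging over several splits makes the score less dependent on one random partition, so it gives a better indication of how the model generalizes to unseen data.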

</aside>

State of Competitive Machine Learning 2022 report: almost twice as many winning solutions used k-fold CV as used a fixed validation set.

<aside> 💡 You can apply cross-validation to a full sklearn pipeline, not only to the final estimator, so that the preprocessing steps are re-fit inside each fold.

</aside>
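
A minimal sketch of cross-validating a whole pipeline. The dataset, the scaler and the classifier are illustrative assumptions, not choices made in these notes; any estimator pipeline can be scored the same way.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Example data only (assumption): a small built-in classification dataset.
X, y = load_breast_cancer(return_X_y=True)

# The scaler is part of the pipeline, so it is re-fit on the training part
# of every fold and never sees that fold's validation rows.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# cv=5 runs 5-fold cross-validation and returns one score per fold.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Passing the pipeline rather than pre-scaled data is what keeps validation rows from leaking into the preprocessing step.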

Cross-validation evaluates a model by repeatedly splitting the data into training and validation sets. The model is trained on the training set and scored on the validation set, and the process is repeated with different partitions of the data, giving an overall estimate of the model’s performance. In k-fold cross-validation the data is split into k folds, and each fold serves once as the validation set while the remaining k-1 folds are used for training. Cross-validation does not prevent overfitting by itself, but it helps you detect it, because every score is computed on data the model did not see during training.

The k fold scores are then averaged, and that average is the k-fold validation score: an estimate of the model’s average performance on unseen data. It is one of the most widely used validation strategies.
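
The averaging step can be sketched by writing the k-fold loop out by hand. The dataset and the `Ridge` model are assumptions picked only to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Example data only (assumption): a small built-in regression dataset.
X, y = load_diabetes(return_X_y=True)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X):
    # A fresh model is trained on k-1 folds...
    model = Ridge(alpha=1.0)
    model.fit(X[train_idx], y[train_idx])
    # ...and scored (R^2 for regressors) on the one held-out fold.
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

# The k-fold validation score is the mean of the k per-fold scores.
cv_score = float(np.mean(fold_scores))
print(fold_scores, cv_score)
```

Looking at the spread of `fold_scores`, not just their mean, shows how sensitive the model is to the particular split.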

Can be used to…

compare predictive models
select the hyperparameters that are expected to perform best on the test set
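
Both uses can be sketched with scikit-learn. The dataset, the two candidate models and the `n_neighbors` grid below are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Example data only (assumption).
X, y = load_iris(return_X_y=True)

# Fixing the splitter means every candidate is scored on the same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# 1) Compare predictive models by their mean cross-validated accuracy.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
}
model_scores = {
    name: cross_val_score(model, X, y, cv=cv).mean()
    for name, model in candidates.items()
}
print(model_scores)

# 2) Select a hyperparameter: each value in the grid gets its own
#    k-fold score, and the value with the best mean score is kept.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
    cv=cv,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Because `best_score_` was itself used to pick the winner, it tends to be slightly optimistic; a separate held-out test set gives the final check.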

Which k works best?

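The smaller the k (the minimum is 2), the more pessimistically biased the estimate. Each model is trained on only (k-1)/k of the data, so with k = 2 it learns from just half of it. A model validated with a small k will therefore tend to score worse than the same model validated with a larger k, which trains on more data in each fold.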
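
To make the effect of k concrete, this small sketch prints how much of the data each model trains on for a few values of k. The sample count of 1000 is a hypothetical number chosen for round figures.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 1000  # hypothetical dataset size
X = np.zeros((n_samples, 1))  # placeholder features; only the row count matters

train_fraction = {}
for k in (2, 5, 10):
    # Take the first split; all folds have the same size here
    # because n_samples is divisible by each k.
    train_idx, val_idx = next(KFold(n_splits=k).split(X))
    train_fraction[k] = len(train_idx) / n_samples
    print(f"k={k}: train on {len(train_idx)} rows, validate on {len(val_idx)}")
```

With k = 2 each model trains on half the rows, with k = 10 on nine tenths, which is why scores from a small k tend to understate what a model trained on all the data would achieve.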