No preprocessing needed!

Some models, such as decision trees, need little preprocessing: scaling, dummy variables, and outlier handling are not necessary, as explained in this fastai tutorial.
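To see why scaling does not matter for a tree, here is a minimal sketch using a hand-written one-split regression tree (a "stump") in plain Python. The helper names `fit_stump` and `predict_stump` and the toy numbers are made up for illustration; this is not the fastai code. A split only depends on the order of the feature values, so a rescaled feature gives the same partition and the same predictions.

```python
def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single feature.

    Returns (threshold, left_mean, right_mean) for the split that
    minimizes the summed squared error.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = xs[order[k - 1]], xs[order[k]]
        if lo == hi:
            continue  # cannot split between two identical values
        left = [ys[i] for i in order[:k]]
        right = [ys[i] for i in order[k:]]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, (lo + hi) / 2, lm, rm)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


# Toy data (hypothetical): two clearly separated groups.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [5.0, 6.0, 5.5, 20.0, 21.0, 19.0]
xs_scaled = [(x - 6.5) / 4.5 for x in xs]  # arbitrary shift-and-rescale

raw = fit_stump(xs, ys)
scaled = fit_stump(xs_scaled, ys)
preds_raw = [predict_stump(raw, x) for x in xs]
preds_scaled = [predict_stump(scaled, x) for x in xs_scaled]
print(preds_raw == preds_scaled)
```

The threshold itself moves with the scale, but the samples on each side of it stay the same. The sketch only covers rescaling of one numeric feature; it says nothing about dummy variables or outliers.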

<aside> 💡 Don’t overthink data preprocessing!

</aside>

Fast way to drastically improve your data quality:

Train a model (using cross-validation)
Compute loss over every validation sample
Find the samples with the highest AND lowest loss
Analyze those and throw out the garbage

Many people assume the high-loss samples are the only bad apples, but many data anomalies (missing data, encoding errors, etc.) can cause ~0 loss as well.
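The steps above can be sketched in plain Python as follows. Everything here is an assumption for illustration: the model is a simple least-squares line, the loss is squared error, the data is a toy set with one deliberately corrupted label, and `fit_line` and `per_sample_cv_loss` are hypothetical helper names. With a real dataset you would swap in your own model and loss.

```python
import random


def fit_line(points):
    """Ordinary least squares for y = a*x + b on a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    var = sum((x - mx) ** 2 for x, _ in points)
    a = sum((x - mx) * (y - my) for x, y in points) / var if var else 0.0
    return a, my - a * mx


def per_sample_cv_loss(points, k=5, seed=0):
    """Return one squared-error loss per sample, each computed while
    that sample was held out in a validation fold."""
    idx = list(range(len(points)))
    random.Random(seed).shuffle(idx)
    losses = [None] * len(points)
    for fold in range(k):
        val = idx[fold::k]            # every k-th shuffled index forms a fold
        val_set = set(val)
        train = [points[i] for i in idx if i not in val_set]
        a, b = fit_line(train)
        for i in val:
            x, y = points[i]
            losses[i] = (y - (a * x + b)) ** 2
    return losses


# Toy data: y = 2x + 1 with small deterministic noise,
# plus one corrupted label at index 7.
points = [(float(x), 2.0 * x + 1.0 + 0.1 * ((x * 7) % 5 - 2)) for x in range(20)]
points[7] = (7.0, 100.0)

losses = per_sample_cv_loss(points)
ranked = sorted(range(len(points)), key=lambda i: losses[i])
lowest, highest = ranked[:3], ranked[-3:][::-1]
print("inspect (highest loss):", highest)
print("inspect (lowest loss):", lowest)
```

In this toy set the corrupted sample shows up at the top of the high-loss list, while the low-loss samples are just ordinary points. On real data both ends of the ranking are worth a manual look.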

Data Exploration Tools

First I need to understand the dataset!

Data Exploration (EDA)

How to understand your data?

Visualization / Charts / Report