How to start a new ML project?

Feature Engineering

<aside> 💡 First step when you construct your first model on a new problem

→ make sure your model can overfit!

</aside>
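
A minimal sketch of such an overfit check, assuming scikit-learn and a synthetic dataset as a stand-in for the real problem (the dataset, the model choice, and the 0.99 cutoff are illustrative assumptions, not part of the original advice):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for "a new problem" (assumption: swap in your own X, y).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# No depth limit and no pruning, so the tree is free to memorize the training set.
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Accuracy on the SAME data the model was trained on.
train_accuracy = model.score(X, y)
print(f"training accuracy: {train_accuracy:.3f}")

# If this stays far below 1.0, suspect the pipeline (labels, features, loss)
# before tuning anything else. The 0.99 cutoff is an arbitrary choice.
can_overfit = train_accuracy > 0.99
```

If the model cannot even memorize a small training set, the problem is usually in the data or the training code rather than in the hyperparameters.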

No preprocessing needed!

Some models, such as decision trees, don’t need much preprocessing. Scaling, dummy variables, outlier handling, etc. are not necessary, as explained in this fastai tutorial.
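
One way to see why scaling does not matter for trees: a tree only compares each feature against a threshold, so multiplying a feature by a positive constant moves the threshold but not the split. The sketch below assumes scikit-learn and synthetic data, and the scale factors are arbitrary illustrative values:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption: stands in for any tabular dataset).
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Put the features on very different scales. Powers of two are used so the
# rescaling itself is exact in floating point.
scale = np.array([1024.0, 8.0, 64.0, 1.0])

raw_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
scaled_tree = DecisionTreeClassifier(random_state=0).fit(X_train * scale, y_train)

# Compare predictions of both trees on the (correspondingly scaled) test set.
agreement = np.mean(raw_tree.predict(X_test) == scaled_tree.predict(X_test * scale))
print(f"fraction of identical predictions: {agreement:.3f}")
```

The two trees should agree on essentially every test sample, which is not something you would expect from a distance-based or gradient-based model trained on the same rescaled features.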

<aside> 💡 Don’t overthink data preprocessing!

</aside>

A fast way to drastically improve your data quality:

  1. Train a model (using cross-validation)
  2. Compute loss over every validation sample
  3. Find the samples with the highest AND lowest loss
  4. Analyze those and throw out the garbage

Many people assume the high-loss samples are the only bad apples, but many data anomalies (missing data, encoding errors, etc.) can cause a loss near 0 as well.
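
The four steps above could be sketched roughly as follows, assuming scikit-learn, a binary classification task with labels 0 and 1, and log loss as the per-sample loss (the model, the synthetic data, and `k = 10` are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the real dataset (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Step 1: out-of-fold predictions, so every sample is scored by a model
# that never saw it during training.
proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
)

# Step 2: per-sample log loss = -log(probability assigned to the true class).
# Indexing columns by y works here because the classes are exactly 0 and 1.
p_true = proba[np.arange(len(y)), y]
per_sample_loss = -np.log(np.clip(p_true, 1e-12, 1.0))

# Step 3: indices of the k samples with the lowest and the highest loss.
k = 10
order = np.argsort(per_sample_loss)
lowest = order[:k]
highest = order[::-1][:k]

# Step 4 is manual: look at these rows and decide what is garbage.
for name, idx in (("highest", highest), ("lowest", lowest)):
    print(f"{name}-loss samples:", idx.tolist())
```

On real data, the last step would typically mean pulling up the raw records behind these indices and checking them by hand, since the code can only rank samples, not judge them.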

Data Exploration Tools

First, I need to understand the dataset!

Data Exploration (EDA)

How to understand your data?

Visualization / Charts / Report