<aside> 💡 Keep it simple, as explained here! These are the most basic preprocessing steps everybody should know.

</aside>

<aside> 💡 Pandas brings its own pipeline mechanism for data preprocessing! Look here: Pandas Pipeline for data preprocessing steps. It's a great way to simplify preprocessing!

</aside>
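A minimal sketch of what such a pandas pipeline can look like, assuming the note refers to chaining steps with `DataFrame.pipe`. The helper functions (`drop_missing`, `lowercase_text`, `add_log_feature`) and the toy data are made up for illustration.

```python
import numpy as np
import pandas as pd


def drop_missing(df, columns):
    """Drop rows with a missing value in any of the given columns."""
    return df.dropna(subset=columns)


def lowercase_text(df, column):
    """Return a copy with the text column lower-cased."""
    out = df.copy()
    out[column] = out[column].str.lower()
    return out


def add_log_feature(df, column):
    """Return a copy with an extra log(1 + x) column."""
    out = df.copy()
    out[f"log_{column}"] = np.log1p(out[column])
    return out


# Hypothetical toy data
raw = pd.DataFrame({
    "city": ["Berlin", "PARIS", None, "Rome"],
    "income": [1000.0, 2500.0, 1800.0, 0.0],
})

# Each step takes a DataFrame and returns a new one, so they chain cleanly.
clean = (
    raw
    .pipe(drop_missing, columns=["city"])
    .pipe(lowercase_text, column="city")
    .pipe(add_log_feature, column="income")
)
print(clean)
```

Because every step returns a new DataFrame, `raw` stays untouched and single steps are easy to test or reorder.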

<aside> 💡 First step when you construct your first model on a new problem

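→ make sure your model can overfit! If it cannot fit even a small training sample almost perfectly, the problem is more likely in the data pipeline or the model code than in the tuning.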

</aside>
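One way to run this sanity check, sketched with scikit-learn. The synthetic data and the decision tree are stand-ins chosen for illustration; use your own model and a small slice of your own training data.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with a small slice of your real training set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_small, y_small = X[:50], y[:50]

# An unrestricted tree has enough capacity to memorize 50 distinct samples.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_small, y_small)

# Score on the SAME data the model was trained on: we want this to be near 1.0.
train_accuracy = model.score(X_small, y_small)
print(f"training accuracy on 50 samples: {train_accuracy:.2f}")
```

A training accuracy far below 1.0 on such a tiny sample is a hint to check labels, feature alignment, and the training loop before investing in regularization or tuning.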

<aside> 💡 Preprocessing can cause data leakage if you process the training and test data together before splitting them! Therefore, split first, then fit the preprocessing on the training set only and apply it unchanged to the test set.

</aside>
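A short sketch of the split-first rule with scikit-learn. The random data is made up; the point is that the scaler's statistics come from the training split only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 200 samples, 3 numeric features, binary target
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# 1) Split first ...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2) ... then fit the preprocessing on the training set only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3) Reuse the training statistics on the test set (transform, not fit_transform).
X_test_scaled = scaler.transform(X_test)
```

Calling `fit_transform` on the full data before splitting would let the test set's mean and variance influence the training features, which is exactly the leakage described above.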

Data Preprocessing

Feature Engineering

How to understand your data?

How can we do feature engineering with Python?

Feature-engine

scikit-lego

There are excellent Python libraries for feature engineering. Pandas, Scikit-learn and Feature-engine provide tools for missing data imputation, categorical encoding, variable transformation and discretization. With these libraries we can also create new variables by combining existing features.
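The following sketch shows these steps with pandas and scikit-learn only: a new variable built from two existing ones, imputation, a log transformation, discretization, and one-hot encoding. The column names and values are invented for illustration, and the choice of transformers is just one reasonable combination.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, KBinsDiscretizer, OneHotEncoder

# Hypothetical toy data with missing numeric values
df = pd.DataFrame({
    "age": [22, 35, np.nan, 58, 41, 67],
    "income": [1800, 3200, 2700, np.nan, 4100, 2300],
    "household_size": [1, 2, 4, 2, 3, 1],
    "city": ["Berlin", "Paris", "Rome", "Berlin", "Paris", "Berlin"],
})

# New variable created by combining two existing features
df["income_per_person"] = df["income"] / df["household_size"]

# Imputation followed by a variable transformation (log1p)
log_numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("log", FunctionTransformer(np.log1p)),
])

# Imputation followed by discretization into 3 equal-width bins
age_bins = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("bins", KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")),
])

preprocess = ColumnTransformer(
    [
        ("log", log_numeric, ["income", "income_per_person"]),
        ("age", age_bins, ["age"]),
        ("city", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(df)
print(features.shape)
```

In a real project, `fit_transform` would be called on the training split only and `transform` on the test split.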

Category Encoders is a popular library for encoding categorical features. It offers a particularly extensive set of encoders for transforming categories into numbers.
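To make "transforming categories into numbers" concrete, here is a plain pandas sketch of mean target encoding, one of the encoding schemes such libraries provide. It does not use the Category Encoders API, omits the smoothing that library encoders typically add, and uses invented data and column names.

```python
import pandas as pd

# Hypothetical training and test data
train = pd.DataFrame({
    "city": ["Berlin", "Berlin", "Paris", "Paris", "Rome"],
    "bought": [1, 0, 1, 1, 0],
})
test = pd.DataFrame({"city": ["Paris", "Madrid"]})

# Learn the mapping on the training data only (avoids leakage).
global_mean = train["bought"].mean()
city_means = train.groupby("city")["bought"].mean()

# Replace each category by the mean target value observed for it.
train["city_encoded"] = train["city"].map(city_means)

# Categories unseen during training fall back to the global mean.
test["city_encoded"] = test["city"].map(city_means).fillna(global_mean)
print(test)
```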

For a short tutorial on how to use Python libraries for feature engineering in machine learning, check my article Python libraries for feature engineering.

Featuretools and tsfresh are two Python libraries that offer out-of-the-box automated feature engineering tools. Featuretools creates predictor variables from transactional data and is a great choice when features come from more than one dataset. It automates the aggregation of data into new features that can be used for classification or regression.
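The kind of aggregation that gets automated here can be sketched by hand in plain pandas. This is not the Featuretools API, only an illustration of the idea; the two tables and their columns are invented.

```python
import pandas as pd

# Hypothetical parent table: one row per customer
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "country": ["DE", "FR", "IT"],
})

# Hypothetical child table: many transactions per customer
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 30.0, 5.0, 15.0, 40.0],
})

# Aggregate the child table to one row per customer
aggregated = (
    transactions
    .groupby("customer_id")["amount"]
    .agg(["count", "sum", "mean"])
    .add_prefix("amount_")
    .reset_index()
)

# Attach the aggregates to the parent table as new features
features = customers.merge(aggregated, on="customer_id", how="left")

# Customers without transactions get a count of 0 (sum and mean stay missing)
features["amount_count"] = features["amount_count"].fillna(0).astype(int)
print(features)
```

Writing these aggregations manually for every pair of related tables and every aggregation function quickly becomes tedious, which is the work automated feature engineering tools aim to take over.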