Imputation - (missing data)

How to understand your data?

Handling missing data is a crucial step in data preprocessing, ensuring robust analyses and accurate model training. Missing data can be annoying to deal with. Some algorithms can't handle it, so your first instinct might be to impute the values. It turns out, though, that you can sometimes leave the missing values as they are; it all depends on the machine learning algorithm you plan to use. Vincent explains why tree-based algorithms can handle missing data and why using them is often better than imputing incorrectly → calmcode video

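<aside> 💡 Remember, the key is to select a method that aligns with your data characteristics and analysis objectives. And don’t forget to split the data before imputing your values, so that the imputation statistics are computed on the training set only and then applied to the test set.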

</aside>
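A minimal sketch of why the split comes first, assuming a hypothetical `income` column and a simple positional split: the median is learned from the training rows only and then reused for the test rows.

```python
import numpy as np
import pandas as pd

# Hypothetical feature with gaps; the first four rows act as the training set.
values = pd.Series([10.0, np.nan, 14.0, 12.0, np.nan, 100.0], name="income")
train, test = values.iloc[:4], values.iloc[4:]

# Learn the imputation value from the training rows only...
train_median = train.median()

# ...then apply that same value to both splits.
train_imputed = train.fillna(train_median)
test_imputed = test.fillna(train_median)

# Computing the median on all rows would let test data leak into the statistic.
leaky_median = values.median()

print(train_median, leaky_median)  # 12.0 13.0
```

The two medians differ because the full-data median has seen the test value 100.0, which is exactly the leakage that splitting first avoids.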

Article - main feature engineering techniques

Here are some common methods for imputing missing values:

Mean/Median Imputation: Replace missing values with the mean or median of the available data. Useful when the missingness is random and there are few missing data points.
Arbitrary Value Imputation: Replace missing values with a predefined arbitrary value. Best suited for scenarios where missingness carries meaning and you want to capture that information explicitly. Tree-based models handle this kind of imputation well.
End-of-Tail Imputation: Automates arbitrary value imputation by selecting an imputation value at the far end of the variable’s distribution. The same considerations apply as for arbitrary value imputation.
Frequent Category Imputation: Impute missing categorical data with the most frequent category. Useful when the missingness is random and there are few missing data points.
Create a “missing” Category: Replace missing data with a dedicated string such as “missing”. In general, the go-to imputation method for categorical variables.
Adding Missing Indicators: Create binary indicator variables that flag missing values. This gives the model information about where data was missing. Typically used in combination with mean/median or frequent category imputation.
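The methods listed above can be sketched with plain pandas. The toy columns `age` and `city`, the arbitrary value `-1`, and the end-of-tail rule (mean plus three standard deviations) are all assumptions chosen for illustration; other rules and values are equally valid. For brevity the statistics are computed on the whole toy frame, whereas in practice they would come from the training split.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "age" is numeric, "city" is categorical.
df = pd.DataFrame({
    "age": [22.0, 35.0, np.nan, 41.0, 29.0, np.nan],
    "city": ["Oslo", np.nan, "Oslo", "Bergen", np.nan, "Oslo"],
})

out = pd.DataFrame(index=df.index)

# Missing indicator: record where values were missing before filling them.
out["age_missing"] = df["age"].isna().astype(int)

# Mean / median imputation.
out["age_mean"] = df["age"].fillna(df["age"].mean())
out["age_median"] = df["age"].fillna(df["age"].median())

# Arbitrary value imputation: a value outside the observed range (assumed -1).
out["age_arbitrary"] = df["age"].fillna(-1)

# End-of-tail imputation: here mean + 3 standard deviations (one possible rule).
tail_value = df["age"].mean() + 3 * df["age"].std()
out["age_end_of_tail"] = df["age"].fillna(tail_value)

# Frequent category imputation: fill with the most common category.
most_frequent = df["city"].mode()[0]
out["city_frequent"] = df["city"].fillna(most_frequent)

# Explicit "missing" category.
out["city_missing_label"] = df["city"].fillna("missing")

print(out)
```

Keeping the indicator column next to the mean- or median-imputed column lets a model distinguish a genuinely average value from a filled-in one.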