Data Preprocessing & Feature Engineering
Feature Engineering
Common preprocessing steps
Deriving features with transformations is the simplest approach, and often the most effective. For instance, computing feature ratios (dividing one feature by another) can prove quite effective because many algorithms cannot reproduce a division (for example, gradient boosting) or have a hard time approximating one (for example, deep neural networks). Here are the most common transformations to try out:
- Time feature processing: Splitting a date into its elements (year, month, day); transforming it into week of the year and weekday; computing differences between dates; computing differences with key events (for instance, holidays).
- Numeric feature transformations: Scaling; normalization; logarithmic or exponential transformations; separating the integer and decimal parts; summing, subtracting, multiplying, or dividing two numeric features. Scaling numeric features by standardization (the z-score method used in statistics) or by normalization (also called min-max scaling) makes sense if you are using algorithms that are sensitive to the scale of features, such as neural networks.
- Binning of numeric features: This transforms continuous variables into discrete ones by distributing their values into a number of bins. Binning helps remove noise and errors in data, and when paired with one-hot encoding it allows easy modeling of non-linear relationships between the binned features and the target variable (see, for instance, scikit-learn's KBinsDiscretizer).
- Categorical feature encoding: One-hot encoding; combining two or three categorical features into a single feature; or the more sophisticated target encoding.
- Splitting and aggregating categorical features based on the levels: For instance, in the Titanic competition you can split names and surnames, as well as their initials, to create new features.
- Polynomial features: These are created by raising features to an exponent. See, for instance, scikit-learn's PolynomialFeatures.
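Several of the transformations listed above can be sketched with plain pandas and NumPy. This is only an illustration under assumptions: the toy data, the column names, the holiday date, and the bin edges are all hypothetical, and pandas is used here in place of the scikit-learn transformers mentioned in the list.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; every column name and value is illustrative only.
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-04", "2021-06-15", "2021-12-24"]),
    "price": [10.0, 250.0, 40.0],
    "area": [2.0, 50.0, 4.0],
    "city": ["rome", "milan", "rome"],
    "kind": ["flat", "house", "house"],
})

# Time features: split the date into its elements.
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["weekday"] = df["date"].dt.dayofweek  # Monday = 0
df["week_of_year"] = df["date"].dt.isocalendar().week.astype(int)

# Difference with a key event (an assumed holiday date).
holiday = pd.Timestamp("2021-12-25")
df["days_to_holiday"] = (holiday - df["date"]).dt.days

# Numeric transformations: a ratio of two features and a log transform.
df["price_per_area"] = df["price"] / df["area"]
df["log_price"] = np.log1p(df["price"])

# Standardization (z-score) and normalization (min-max scaling).
df["price_z"] = (df["price"] - df["price"].mean()) / df["price"].std(ddof=0)
price_range = df["price"].max() - df["price"].min()
df["price_minmax"] = (df["price"] - df["price"].min()) / price_range

# Binning with assumed bin edges, followed by one-hot encoding of the bins.
df["price_bin"] = pd.cut(
    df["price"], bins=[0, 20, 100, np.inf], labels=["low", "mid", "high"]
)
bins_onehot = pd.get_dummies(df["price_bin"], prefix="price")

# Categorical encoding: merge two categorical features into one.
df["city_kind"] = df["city"] + "_" + df["kind"]

# Polynomial features: a squared term and an interaction term.
df["price_sq"] = df["price"] ** 2
df["price_x_area"] = df["price"] * df["area"]
```

In a real project, statistics used for scaling (means, standard deviations, minima, and maxima) would normally be computed on the training data only and then applied to validation and test data.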
Although they are data cleaning techniques rather than feature engineering proper, missing data and outlier treatments involve making changes to the data that nevertheless transform your features, and they can help signals from the data emerge:
- Missing values treatment: Make binary features that flag missing values, because sometimes missingness is not random and a missing value could have some important reason behind it. Usually, missingness points out something about the way data is recorded, acting like a proxy variable for something else. It is just like in census surveys: if someone doesn't tell you their income, it often means they are either extremely poor or extremely rich. If required by your learning algorithm, replace the missing values with the mean, median, or mode (it is seldom necessary to use more sophisticated methods). For more, see this reference: A Guide to Handling Missing values in Python.
<aside>
💡 Just keep in mind that some models can handle missing values by themselves, and often do so better than many standard approaches, because the missing-values handling is part of their optimization procedure. The models that can handle missing values by themselves are all gradient boosting based models:
- XGBoost
- CatBoost
- LightGBM
</aside>
- Outlier capping or removal: Exclude, cap to a maximum or minimum value, or modify outlier values in your data. To do so, you can use sophisticated multivariate models, such as those available in scikit-learn. Otherwise, you can simply locate the outlying samples in a univariate fashion, basing your judgment on how many standard deviations they are from the mean, or on their distance from the boundaries of the interquartile range (IQR). In this case, you might simply exclude any points that are above Q3 + 1.5 * IQR (upper outliers) and any points that are below Q1 - 1.5 * IQR (lower outliers). Once you have found the outliers, you can also flag them with a binary variable.
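The missing-value and outlier treatments above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `income` column with one missing entry and one extreme value; median imputation and the 1.5 * IQR rule are only two of the options described in the list.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value and one extreme value.
s = pd.Series([10.0, 12.0, np.nan, 11.0, 13.0, 100.0], name="income")

# Binary indicator of missingness, computed before imputing.
is_missing = s.isna().astype(int)

# Simple imputation with the median of the observed values.
filled = s.fillna(s.median())

# Univariate outlier detection with the IQR rule.
q1, q3 = filled.quantile(0.25), filled.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag the outliers with a binary variable, then cap them to the bounds.
is_outlier = ((filled < lower) | (filled > upper)).astype(int)
capped = filled.clip(lower=lower, upper=upper)
```

As with scaling, the median and the IQR bounds would normally be estimated on the training data only and reused unchanged on validation and test data.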
All these data transformations can add predictive performance to your models, but they are seldom decisive in a competition. Basic feature engineering is necessary, but you cannot rely on it alone. See also Advanced procedures for extracting value from data.