Using feature importance to evaluate your work

Applying too much feature engineering can have side effects. If you create too many correlated features or features that are not important for the problem, models could take too long to complete their training and you may get worse results. This may seem like a paradox, but it is explained by the fact that every variable carries some noise (a random component due to measurement or recording errors) that may be picked up by mistake by the model: the more variables you use, the higher the chance your model will pick up noise instead of signal. Therefore, you should try to keep only the relevant features in the dataset you use for training; consider feature selection as a part of your feature engineering process (the pruning phase).

Figuring out which features to keep is a hard problem because, as the number of available features grows, the number of possible combinations grows exponentially (n features yield 2^n candidate subsets). There are various ways to select features, but first it is important to think about the stage in your data preparation pipeline where the selection has to happen.

In our experience, feature selection is best placed at the end of your data preparation pipeline. Since features share a part of their variance with other features, you cannot evaluate their effectiveness by testing them one at a time; you have to consider them all at once in order to correctly figure out which ones you should use.

In addition, you should test the effectiveness of your selected features using cross-validation. Therefore, once you have all the features prepared, a consistent pipeline, and a working model (it doesn't need to be a fully optimized model, but it should work properly and return acceptable results for the competition), you are ready to test which features should be retained and which could be discarded. At this point, there are various ways to perform feature selection:

Classical approaches used in statistics resort to forward addition or backward elimination by testing each feature entering or leaving the set of predictors. Such an approach can be quite time-consuming, though, because it relies on some measure of internal importance of variables or on their effect on the performance of the model with respect to a specific metric, which you have to recalculate for every feature at every step of the process.
For regression models, lasso selection (L1 regularization) combined with the stability selection procedure can provide a hint about all the important yet correlated features (the procedure may, in fact, retain even highly correlated features). In stability selection, you test multiple times (using a bagging procedure) which features should be retained, considering only the features whose coefficients are non-zero in each test, and then you apply a voting system to keep the ones that are most frequently assigned non-zero coefficients. You can find more details here.
For tree-based models, such as random forests or gradient boosting models, a decrease in impurity or a gain in the target metric based on splits are common ways to rank features. A threshold can then cut away the least important ones.
Again for tree-based models, but easily generalizable to other models, test-based randomization of features (or simple comparisons with random features) helps to distinguish features that genuinely help the model predict correctly from features that are just noise or redundant.
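
The stability selection idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: the helper name `stability_selection`, the synthetic data, the `alpha` value, the number of rounds, and the 0.6 voting threshold are all hypothetical choices made for the example, and plain bootstrap resampling stands in for the bagging step.

```python
import numpy as np
from sklearn.linear_model import Lasso


def stability_selection(X, y, alpha=0.2, n_rounds=50, threshold=0.6, seed=0):
    """Return per-feature selection frequency and the indices kept by voting.

    Hypothetical helper written for illustration only.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    nonzero_counts = np.zeros(n_features)
    for _ in range(n_rounds):
        # Bagging step: draw a bootstrap sample of the rows.
        rows = rng.integers(0, n_samples, size=n_samples)
        lasso = Lasso(alpha=alpha, max_iter=10_000).fit(X[rows], y[rows])
        # Vote for every feature whose coefficient survived the L1 penalty.
        nonzero_counts += lasso.coef_ != 0
    frequency = nonzero_counts / n_rounds
    return frequency, np.flatnonzero(frequency >= threshold)


# Synthetic data (an assumption): features 0-2 drive the target, 3-9 are noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=300)

frequency, selected = stability_selection(X, y)
print("selection frequency:", np.round(frequency, 2))
print("features kept:", selected)
```

In a real pipeline you would tune `alpha` and the threshold with cross-validation rather than fixing them by hand as done here.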

An example of how randomizing features helps in selecting the important ones is provided by Chris Deotte in the Ventilator Pressure Prediction competition. This Notebook tests the role of features in an LSTM-based neural network. First, the model is built and the baseline performance is recorded. Then, one by one, features are shuffled and the model is required to predict again. If the resulting prediction worsens, it suggests that you shuffled an important feature that shouldn't be touched. If, instead, the prediction performance stays the same or even improves, the shuffled feature is not influential or is even detrimental to the model.
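
The shuffle-and-predict loop described above might look like the following sketch. It is not the code from the Notebook: it assumes a random forest instead of an LSTM, a synthetic dataset, and mean absolute error as the metric, purely to keep the example short and self-contained.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption): columns 0 and 1 matter, 2 and 3 are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = 4.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.3, size=600)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# Train once and record the baseline error on held-out data.
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
baseline = mean_absolute_error(y_va, model.predict(X_va))

# Shuffle one validation column at a time and predict again (no retraining).
increase = {}
for j in range(X_va.shape[1]):
    X_shuffled = X_va.copy()
    X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
    increase[j] = mean_absolute_error(y_va, model.predict(X_shuffled)) - baseline

for j, delta in increase.items():
    print(f"feature {j}: error increase {delta:+.3f}")
```

A large positive increase marks a feature the model relies on; values near zero (or negative) mark candidates for removal. In practice it helps to repeat each shuffle several times and average, since a single permutation is noisy.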

<aside> 💡 There is also no free lunch in importance evaluation. Shuffling doesn't require any retraining, which is a great advantage when training a fresh model costs time. However, it can fail in certain situations. Shuffling can sometimes create unrealistic input combinations that make no sense to evaluate. In other cases, it can be fooled by the presence of highly correlated features (incorrectly determining that one is important and the other is not). In these cases, removing the feature (instead of shuffling it), retraining the model, and then evaluating its performance against the baseline is the best solution.

</aside>
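
The remove-and-retrain alternative can be sketched as follows. The ridge model, the synthetic data with a near-duplicate column, and the names `cv_score` and `drop_effect` are assumptions made for illustration; any model and metric from your own pipeline could take their place.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption): x1 is a near-duplicate of x0, x3 is pure noise.
rng = np.random.default_rng(0)
n = 500
x0 = rng.normal(size=n)
x1 = x0 + rng.normal(scale=0.05, size=n)
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x0, x1, x2, x3])
y = 3.0 * x0 + 2.0 * x2 + rng.normal(scale=0.5, size=n)


def cv_score(features, target):
    """Mean 5-fold negative MSE (higher is better); retrains on every fold."""
    scores = cross_val_score(
        Ridge(alpha=1.0), features, target, cv=5, scoring="neg_mean_squared_error"
    )
    return scores.mean()


baseline = cv_score(X, y)

# Drop each column in turn, retrain, and compare against the baseline.
drop_effect = {}
for j in range(X.shape[1]):
    X_reduced = np.delete(X, j, axis=1)
    # Positive value: removing the feature hurt the cross-validated score.
    drop_effect[j] = baseline - cv_score(X_reduced, y)

for j, delta in drop_effect.items():
    print(f"dropping feature {j}: score loss {delta:+.4f}")
```

Here dropping either of the two correlated columns costs very little, because the retrained model can lean on the other one, whereas dropping the independent informative column is clearly harmful. The price is one full retraining (times the number of folds) per feature.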

In another approach based on shuffled features, Boruta uses random features to test the validity of the model in an iterative fashion. An alternative version of the Boruta selection procedure, BorutaShap, leverages SHAP values to combine feature selection with explainability. The resulting selection is usually more reliable than simple rounds of removal or randomization of features, because features are tested multiple times against random features until they can statistically prove their importance. Boruta or BorutaShap may take up to 100 iterations, and they can only be used with tree-based machine learning algorithms.
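
To make the idea of testing features against random ones concrete, here is a deliberately simplified sketch. It is not the Boruta or BorutaShap implementation: it omits their statistical test and their iterative removal of features, and simply counts how often each real feature beats the best shuffled copy (called a shadow feature here). The data, the number of rounds, and the 80% cut-off are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption): columns 0 and 1 matter, 2-4 are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=400)

n_rounds = 10
n_features = X.shape[1]
hits = np.zeros(n_features, dtype=int)

for r in range(n_rounds):
    # Shadow features: each column shuffled independently, which destroys
    # any relationship with the target while keeping the same distribution.
    shadows = np.column_stack([rng.permutation(col) for col in X.T])
    X_aug = np.hstack([X, shadows])
    forest = RandomForestRegressor(n_estimators=50, random_state=r).fit(X_aug, y)
    importances = forest.feature_importances_
    real, shadow = importances[:n_features], importances[n_features:]
    # A real feature scores a hit when it beats the best shadow feature.
    hits += real > shadow.max()

# Simplified voting rule (an assumption): keep features that win in >= 80% of rounds.
kept = np.flatnonzero(hits >= 0.8 * n_rounds)
print("hits per feature:", hits)
print("features kept:", kept)
```

The actual procedures replace the fixed cut-off with a statistical test over the hit counts, which is why they may need many more iterations before deciding on borderline features.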

If you are selecting features for a linear model, Boruta may actually overshoot. This is because it will consider features important both for their main effects and for their interactions with other features (but in a linear model, you care only about the main effects and a selected subset of interactions). You can still effectively use Boruta when selecting for a linear model by using a gradient boosting model whose maximum tree depth is set to one, so you are considering only the main effects of the features and not their interactions.
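
Why a depth-one booster sees only main effects can be checked directly. The sketch below does not run Boruta itself; it only illustrates, on assumed synthetic data, that a gradient boosting model built from single-split trees is additive in its features, while deeper trees can pick up an interaction. The helper `interaction_contrast` is a hypothetical name introduced for this example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data (an assumption): two main effects plus a strong interaction.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = X[:, 0] + X[:, 1] + 3.0 * X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=1000)


def interaction_contrast(model):
    """f(1,1) - f(1,-1) - f(-1,1) + f(-1,-1): zero for any additive model."""
    corners = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
    f11, f10, f01, f00 = model.predict(corners)
    return f11 - f10 - f01 + f00


# Each depth-one tree splits on a single feature, so the sum of trees is additive.
stumps = GradientBoostingRegressor(max_depth=1, n_estimators=200, random_state=0).fit(X, y)
# Depth-three trees can condition one feature on another.
deeper = GradientBoostingRegressor(max_depth=3, n_estimators=200, random_state=0).fit(X, y)

print("contrast with max_depth=1:", interaction_contrast(stumps))
print("contrast with max_depth=3:", interaction_contrast(deeper))
```

For the data-generating function used here the true contrast is 12, so a value near zero for the depth-one model confirms that any importance it assigns reflects main effects only.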

You can have a look at how simple and quick it is to set up a BorutaShap feature selection by following this tutorial Notebook presented during the 30 Days of ML competition.