Categorical features are usually not a challenge to deal with, thanks to simple functions offered by Scikit-learn such as:

These functions can transform categories into numeric features and then into binary features that are easily dealt with by machine learning algorithms. However, when the number of categories to deal with is too large, the dataset resulting from a one-hot encoding strategy becomes sparse (most values in it will be zero values) and cumbersome to handle for the memory and processor of your computer or Notebook. In these situations, we talk about a high-cardinality feature, which requires special handling.

The idea behind this approach is to transform the many categories of a categorical feature into their corresponding expected target value. In the case of a regression, this is the average expected value for that category; for a binary classification, it is the conditional probability given that category; for a multiclass classification, you have instead the conditional probability for each possible outcome.

For instance, in the Titanic GettingStarted competition, where you have to figure out the survival probability of each passenger, target encoding a categorical feature, such as the gender feature, would mean replacing the gender value with its average probability of survival. In this way, the categorical feature is transformed into a numeric one without having to convert the data into a larger and sparser dataset. In short, this is target encoding and it is indeed very effective in many situations because it resembles a stacked prediction based on the high-cardinality feature. Like stacked predictions, however, where you are essentially using a prediction from another model as a feature, target encoding brings about the risk of Overfitting. In fact, when some categories are too rare, using target encoding is almost equivalent to providing the target label. There are ways to avoid this.

Before seeing the implementation you can directly import into your code, let's see an actual code example of target encoding. This code was used for one of the top-scoring submissions of the PetFinder.my Adoption Prediction competition:

<p. 356 of 826 in The Kaggle Book>

(a class which is quite simple to apply)

Untitled

TIPP

Instead of writing your own code, you can use the package and its Target Encoder. It is an out-of-the-box solution that works exactly like the code in this section.

https://www.youtube.com/watch?v=589nCGeWG1w