XGBoost vs CatBoost vs LightGBM

CatBoost Videos

<aside> 💡 The algorithm can handle missing values. This is true of the major gradient boosting libraries (XGBoost, LightGBM, and CatBoost). See also this post.

</aside>

The name CatBoost combines the two words "Category" and "Boosting." Its strong point is its ability to handle categorical variables, which make up most of the information in most relational databases, by adopting a mixed strategy of one-hot encoding and target encoding. Target encoding expresses each categorical level as a numeric value that is meaningful for the problem at hand.

The idea CatBoost uses to encode categorical variables is not new; it is a kind of feature engineering that has been used before, mostly in data science competitions. Target encoding, also known as likelihood encoding, impact coding, or mean encoding, transforms each category level into a number based on its association with the target variable. In a regression problem, you could replace each level with the mean target value observed for that level; in a classification problem, you replace it with the probability of the target class given that level (the probability of your target conditional on each category value). It may look like a simple and smart feature engineering trick, but it has side effects, mostly in terms of overfitting, because you are taking information from the target into your predictors.
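
To make the overfitting risk concrete, here is a minimal sketch of plain (naive) mean target encoding using only the Python standard library. The function name `target_encode` and the toy `colors` / `clicked` data are made up for illustration; this is not CatBoost's implementation.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Naive mean target encoding (illustrative helper, not a CatBoost API).

    Each category level is replaced by the mean target over ALL rows
    that share that level, including the row being encoded.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    level_means = {cat: sums[cat] / counts[cat] for cat in sums}
    return [level_means[cat] for cat in categories], level_means


# Hypothetical toy data: a categorical feature and a binary target.
colors = ["red", "blue", "red", "green", "blue", "red"]
clicked = [1, 0, 1, 0, 1, 0]

encoded, level_means = target_encode(colors, clicked)
for color, value in zip(colors, encoded):
    print(color, round(value, 3))
```

Note that "green" appears only once, so its encoded value is exactly its own target. That is the leakage described above: a rare level ends up carrying the answer directly into the feature.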

CatBoost has quite a few parameters (see here). We have limited our discussion to the eight most important ones:

Parameter Tuning - CatBoost

CatBoost (Part 1)

One of the defining features of CatBoost is its concerted effort to avoid data leakage at all costs. In this section, we'll see how it eliminates a potential threat in target encoding by ordering the data and encoding it sequentially. This ordered approach is central to everything CatBoost does, and we'll see it again in Part 2 when we talk about how it builds trees.
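
The idea of ordering the data and encoding it sequentially can be sketched in a few lines of standard-library Python. This is a simplified illustration, not CatBoost's actual code: the function name, the `prior` and `prior_weight` smoothing parameters, and the toy data are assumptions, and the rows are encoded in the order given, whereas CatBoost works on random permutations of the data.

```python
from collections import defaultdict


def ordered_target_encode(categories, targets, prior=0.5, prior_weight=1.0):
    """Ordered target encoding sketch (illustrative, not a CatBoost API).

    Each row is encoded using only the targets of EARLIER rows with the
    same category, smoothed toward `prior`. The row's own target is
    never used for its own encoding.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    encoded = []
    for cat, y in zip(categories, targets):
        # Statistic built from the rows seen so far only.
        value = (sums[cat] + prior * prior_weight) / (counts[cat] + prior_weight)
        encoded.append(value)
        # Only now does this row's target enter the running totals.
        sums[cat] += y
        counts[cat] += 1
    return encoded


# Hypothetical toy data: a categorical feature and a binary target.
colors = ["red", "blue", "red", "green", "blue", "red"]
clicked = [1, 0, 1, 0, 1, 0]

ordered = ordered_target_encode(colors, clicked)
for color, value in zip(colors, ordered):
    print(color, round(value, 3))
```

The first time a level appears there is no history, so the row receives the prior alone. Later rows of the same level get a value that moves toward the running mean of the targets seen before them.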

https://youtu.be/KXOTSkPL2X4

Application

Gradient boosting is a powerful machine-learning technique that achieves state-of-the-art results in a variety of practical tasks. This tutorial explains the details of using gradient boosting in practice: we will solve a classification problem using the popular GBDT library CatBoost. Link to Code.

https://twitter.com/marktenenholtz/status/1663169915396784128