Feature Encoding Videos
Feature Encoding Articles
Category Encoders
When you're working with scikit-learn you'll often need to deal with categorical data. The way you deal with this type of data really matters. Calmcode shows how to make even better use of categorical features to get even better predictions (dirty cat).
- In fastai there are only two types of data: categorical variables and continuous variables. You can find more in the documentation. The linked video shows how to mark a variable as categorical, so that it is treated in a specific and useful way. No dummy variables needed anymore!
- Here we see how to use OneHotEncoder correctly and what to watch out for.
- This explains what fastai does with unknown categories during inference: they are automatically mapped to OTHER. However, only fastai handles this problem so simply!
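For comparison, scikit-learn's OneHotEncoder does not create a catch-all category on its own, but it can be told not to fail on unseen values. The sketch below is a minimal illustration with made-up color values, assuming scikit-learn and NumPy are installed. With handle_unknown="ignore", a category that was not seen during fit is encoded as an all-zero row rather than raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy training column; the values are made up for illustration.
X_train = np.array([["red"], ["green"], ["blue"], ["green"]])

# handle_unknown="ignore": categories unseen during fit become an all-zero row
# at transform time instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train)

# "purple" never appeared in the training data.
X_new = np.array([["green"], ["purple"]])
encoded = encoder.transform(X_new).toarray()

print(encoder.categories_[0])  # learned categories, sorted alphabetically
print(encoded)                 # second row is all zeros for the unknown value
```

Note that an all-zero row is not the same as a dedicated "unknown" category: the model cannot tell an unseen value apart from the absence of every known category.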
Why is discretization useful?
Several regression and classification models, like decision trees and Naive Bayes, can perform better with discrete values. Decision trees make decisions based on discrete attribute partitions. During training, a decision tree evaluates the values of each feature to determine the ideal cut-point. As a result, the more distinct values a feature has, the longer the decision tree takes to train. Discretizing continuous features can therefore speed up the training process.
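As a rough illustration of the idea, the following sketch discretizes a continuous feature into equal-width bins using only the standard library. The helper names (equal_width_edges, discretize) and the sample ages are hypothetical, and equal-width binning is only one of several possible strategies.

```python
from bisect import bisect_right

def equal_width_edges(values, n_bins):
    """Return the n_bins - 1 interior cut points of equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]

def discretize(values, edges):
    """Map each value to the index of the bin it falls into."""
    return [bisect_right(edges, v) for v in values]

ages = [18, 22, 25, 31, 40, 47, 52, 66, 78]  # made-up sample data
edges = equal_width_edges(ages, n_bins=3)
bins = discretize(ages, edges)

print(edges)  # interior cut points
print(bins)   # bin index per value
print(len(set(ages)), "distinct values ->", len(set(bins)), "distinct bins")
```

After binning, the feature has only three distinct values, so a tree has far fewer candidate cut-points to evaluate for it.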
Feature Encoding Methods in Machine Learning
Binary Encoding
One-Hot Encoding
Label Encoding
Embeddings
Choosing the Right Encoding Method
- Consider the nature of your categorical variable:
  - Ordinal (natural order): Label encoding is appropriate
  - Nominal (no natural order): One-hot encoding or alternatives
- Consider the cardinality (number of unique values):
  - Low cardinality: One-hot encoding works well
  - Medium cardinality: Binary encoding for better efficiency
  - High cardinality: Embeddings or binary encoding
- Consider your model type:
  - Tree-based models (Random Forest, XGBoost): Label encoding often works well
  - Linear models (Linear Regression, Logistic Regression): One-hot encoding is usually better
  - Neural networks: Embeddings often perform best
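To make the cardinality trade-off concrete, here is a hand-rolled sketch of label, one-hot, and binary encoding in plain Python. The function names and the color values are hypothetical, and this is not the implementation used by any particular library (library encoders may number categories differently). It only shows how many columns each method produces for the same input.

```python
from math import ceil, log2

def label_encode(values):
    """Assign each category an integer code (alphabetical order here).
    For a truly ordinal variable you would supply the meaningful order instead."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [index[v] for v in values], categories

def one_hot_encode(values):
    """One column per category, with a single 1 per row."""
    codes, categories = label_encode(values)
    return [[1 if code == j else 0 for j in range(len(categories))] for code in codes]

def binary_encode(values):
    """Write each integer code in binary, one column per bit."""
    codes, categories = label_encode(values)
    width = max(1, ceil(log2(len(categories))))
    return [[(code >> bit) & 1 for bit in reversed(range(width))] for code in codes]

colors = ["red", "green", "blue", "green", "yellow", "purple"]  # made-up data

codes, categories = label_encode(colors)
one_hot = one_hot_encode(colors)
binary = binary_encode(colors)

print(categories)       # 5 distinct categories
print(codes)            # label encoding: 1 column
print(len(one_hot[0]))  # one-hot: 5 columns
print(len(binary[0]))   # binary: 3 columns
```

With k categories, one-hot encoding needs k columns while binary encoding needs roughly log2(k), which is why the latter scales better as cardinality grows.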