<aside> 💡 Code examples in Python Feature Engineering Cookbook
</aside>
Equal-width binning is the process of dividing continuous variables into a predetermined number of equal-width intervals. These are examples of contiguous equal-width intervals: 0-10, 10-20, 20-30, and so on.
The user chooses the number of bins, usually based on the desired granularity of the final variable. The number of bins can also be treated as a parameter to optimize: the goal is to simplify the variable while losing as little information as possible, which in turn helps maximize the performance of a classifier or regression model.
The steps involved in performing equal-width binning are as follows:
The width of each bin (step 3) is given by:
bin_width = (max_value - min_value) / number_of_bins
where ‘max_value’ is the maximum value of the variable being binned, ‘min_value’ is the minimum value, and the ‘number_of_bins’ is the desired number of bins.
Using the bin width, the bin edges or limits (step 4) are calculated as follows:
bin_edge_i = min_value + i * bin_width, for i = 0, 1, ..., number_of_bins
To illustrate this with an example, if we have a variable with minimum and maximum values of 0 and 100, respectively, and we want to sort the values into 5 bins, the bin width is given by (100-0)/5, which is 20. Then, the bins are [0-20], [20-40], [40-60], [60-80], and [80-100].
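The calculation above can be sketched in plain Python. This is a minimal illustration, not the cookbook's own code: the function names `equal_width_edges` and `assign_bin` are hypothetical, and the convention that each bin includes its left edge while the last bin also includes the maximum is an assumption, since the text does not say which bin a value on a shared edge belongs to.

```python
def equal_width_edges(min_value, max_value, number_of_bins):
    """Return the number_of_bins + 1 edges of the equal-width bins."""
    bin_width = (max_value - min_value) / number_of_bins
    return [min_value + i * bin_width for i in range(number_of_bins + 1)]


def assign_bin(value, edges):
    """Return the 0-based index of the bin that contains value.

    Assumed convention: bins are closed on the left and open on the right,
    except the last bin, which also includes the maximum value.
    Values outside [edges[0], edges[-1]] are not handled in this sketch.
    """
    number_of_bins = len(edges) - 1
    bin_width = (edges[-1] - edges[0]) / number_of_bins
    index = int((value - edges[0]) // bin_width)
    return min(index, number_of_bins - 1)


# Minimum 0, maximum 100, 5 bins, as in the example above.
edges = equal_width_edges(0, 100, 5)
print(edges)                  # [0.0, 20.0, 40.0, 60.0, 80.0, 100.0]
print(assign_bin(20, edges))  # 1: a value on a shared edge goes to the upper bin
print(assign_bin(100, edges)) # 4: the maximum falls in the last bin
```

Under a different edge convention (for example, right-closed bins), only `assign_bin` would change; the edges themselves stay the same.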
Sorting a variable's values into bins of equal width preserves the shape of the variable's distribution. Hence, if the variable is skewed, it will still be skewed after the discretization.
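A small sketch makes the point about skew concrete. The values below are invented for illustration (a right-skewed sample), and the left-closed bin convention is the same assumption as before; counting how many observations land in each bin shows that most of them pile up in the first interval.

```python
from collections import Counter

# Hypothetical right-skewed sample, made up for this illustration.
values = [1, 2, 2, 3, 3, 3, 4, 5, 6, 8, 12, 20, 35, 60, 100]

number_of_bins = 5
min_value, max_value = min(values), max(values)
bin_width = (max_value - min_value) / number_of_bins  # (100 - 1) / 5 = 19.8


def bin_index(value):
    """0-based bin index; the maximum value is placed in the last bin."""
    index = int((value - min_value) // bin_width)
    return min(index, number_of_bins - 1)


counts = Counter(bin_index(v) for v in values)
counts_per_bin = [counts.get(i, 0) for i in range(number_of_bins)]
print(counts_per_bin)  # [12, 1, 1, 0, 1]
```

Twelve of the fifteen observations fall into the first bin, so the discretized variable is as skewed as the original one.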