Equal-Width Binning

Equal-width binning is the process of dividing the value range of a continuous variable into a predetermined number of intervals of equal width. These are examples of contiguous equal-width intervals: 0-10, 10-20, 20-30, and so on.

The user chooses the number of bins, usually according to the desired granularity of the resulting variable. The number of bins can also be treated as a parameter to optimize: the goal is to simplify the variable as much as possible while losing as little information as possible, which in turn helps maximize the performance of a classifier or a regression model.

The steps involved in performing equal-width binning are as follows:

1. Define the number of bins you want to create.
2. Find the variable’s minimum and maximum and calculate the value range.
3. Determine the width of each bin by dividing the value range by the number of bins.
4. Define the bins, that is, each bin’s lower and upper limits.
5. Assign the observations to the appropriate bin based on their value.
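
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function name `equal_width_bins` and the sample values are hypothetical, and the sketch assumes that each bin includes its lower limit but not its upper limit, except for the last bin, which also includes the maximum.

```python
def equal_width_bins(values, n_bins):
    """Return (edges, bin_indices) for equal-width binning of `values`."""
    # Step 1: the number of bins is supplied by the caller.
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")

    # Step 2: find the minimum and maximum (their difference is the range).
    lo, hi = min(values), max(values)

    # Step 3: width of each bin.
    width = (hi - lo) / n_bins

    # Step 4: n_bins + 1 edges; the last edge is set to the maximum itself.
    edges = [lo + i * width for i in range(n_bins)] + [hi]

    # Step 5: assign each value to a bin index (0-based).
    # Assumption: bins are closed on the left and open on the right,
    # except the last one, which also contains the maximum.
    indices = []
    for v in values:
        if width == 0:
            idx = 0  # all values identical: everything falls in the first bin
        else:
            idx = min(int((v - lo) / width), n_bins - 1)
        indices.append(idx)
    return edges, indices


values = [0, 5, 20, 37, 64, 99, 100]
edges, indices = equal_width_bins(values, n_bins=5)
print(edges)
print(indices)
```

The clamp with `n_bins - 1` is what places the maximum value in the last bin instead of in a non-existent extra bin.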

The width of each bin (step 3) is given by:

bin_width = (max_value - min_value) / number_of_bins

where ‘max_value’ is the maximum value of the variable being binned, ‘min_value’ is its minimum value, and ‘number_of_bins’ is the desired number of bins.

Using the bin width, the bin edges or limits (step 4) are calculated as follows:

bin_1 = [min_value, min_value + bin_width]
bin_2 = [min_value + bin_width, min_value + 2 * bin_width]
…
bin_n = [min_value + (n - 1) * bin_width, max_value]
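
As a rough sketch of how these limits could be generated programmatically, the snippet below builds the list of (lower, upper) pairs from the minimum, maximum, and number of bins. The helper name `bin_intervals` and the input values are made up for illustration.

```python
def bin_intervals(min_value, max_value, number_of_bins):
    """Return a list of (lower, upper) limits, one pair per bin."""
    bin_width = (max_value - min_value) / number_of_bins
    # Lower limits of every bin: min_value + i * bin_width for i = 0 .. n-1.
    edges = [min_value + i * bin_width for i in range(number_of_bins)]
    # The upper limit of the last bin is the maximum itself, which avoids
    # any floating-point drift from repeated multiplication.
    edges.append(max_value)
    # Pair each edge with the next one: the upper limit of one bin is the
    # lower limit of the following bin.
    return list(zip(edges[:-1], edges[1:]))


intervals = bin_intervals(min_value=2, max_value=10, number_of_bins=4)
for lower, upper in intervals:
    print(lower, upper)
```

Note that neighbouring bins share a limit, so an implementation has to decide which of the two bins a value lying exactly on a limit belongs to.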

To illustrate this with an example, if we have a variable with minimum and maximum values of 0 and 100, respectively, and we want to sort the values into 5 bins, the bin width is given by (100 - 0) / 5, which is 20. Then, the bins are [0–20], [20–40], [40–60], [60–80], and [80–100].
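
The same example can be checked with a short script. It is only a sketch: the sample observations and the `label_for` helper are hypothetical, and it assumes that a value lying exactly on a shared limit (such as 20) goes to the upper of the two bins, while the maximum (100) stays in the last bin.

```python
from bisect import bisect_right

min_value, max_value, number_of_bins = 0, 100, 5
bin_width = (max_value - min_value) / number_of_bins  # 20.0

edges = [0, 20, 40, 60, 80, 100]
labels = ["0-20", "20-40", "40-60", "60-80", "80-100"]


def label_for(value):
    """Return the label of the bin that contains `value`."""
    if not edges[0] <= value <= edges[-1]:
        raise ValueError("value lies outside the binned range")
    # bisect_right counts how many edges are <= value; subtracting 1 gives
    # the bin index. The clamp keeps the maximum in the last bin.
    idx = min(bisect_right(edges, value) - 1, len(labels) - 1)
    return labels[idx]


sample = [0, 19.9, 20, 55, 100]
binned = [label_for(v) for v in sample]
print(binned)
```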

Sorting a variable into bins of equal width preserves the shape of its distribution. Hence, if the variable is skewed, it will still be skewed after discretization.
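
A small experiment can make this visible. The sketch below uses a made-up, right-skewed set of values (the cubes of 0 to 10) and counts how many fall into each of 5 equal-width bins. As in the usual convention assumed here, the maximum is placed in the last bin.

```python
from collections import Counter

# Hypothetical right-skewed data: most values are small, a few are large.
values = [i ** 3 for i in range(11)]  # 0, 1, 8, 27, ..., 729, 1000

n_bins = 5
lo, hi = min(values), max(values)
width = (hi - lo) / n_bins  # 200.0

# Count the observations per bin index; the clamp keeps the maximum in the
# last bin.
counts = Counter(min(int((v - lo) / width), n_bins - 1) for v in values)
histogram = [counts.get(b, 0) for b in range(n_bins)]
print(histogram)
```

Most observations land in the first bin and the remaining bins are sparsely populated, so the binned variable is still concentrated at the low end, just like the original values.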