
In statistics, **binning** is the process of placing numerical values into *bins*.

The most common form of binning is known as **equal-width binning**, in which we divide a dataset into *k* bins of equal width.

A less commonly used form of binning is known as **equal-frequency binning**, in which we divide a dataset into *k* bins that each contain an equal number of observations.
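To make the difference between the two definitions concrete, here is a minimal sketch on a small made-up array. The values and the choice of *k* = 3 are assumptions for illustration only. Equal-width bins come from `np.histogram`, and the equal-frequency groups are formed by sorting the values and splitting them into *k* groups of the same size.

```python
import numpy as np

# Hypothetical toy data: eight small values and one large outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])
k = 3  # number of bins

# Equal-width: split the range [min, max] into k intervals of the same width
width_counts, width_edges = np.histogram(values, bins=k)

# Equal-frequency: sort the values and split them into k groups of the same size
freq_groups = np.array_split(np.sort(values), k)
freq_counts = [len(group) for group in freq_groups]

print(width_edges)   # bin boundaries for equal-width binning
print(width_counts)  # observations per equal-width bin
print(freq_counts)   # observations per equal-frequency group
```

In this toy case the outlier stretches the range, so the equal-width bins hold 8, 0, and 1 observations, while the equal-frequency groups hold 3 observations each.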

This tutorial explains how to perform equal-frequency binning in Python.

**Equal Frequency Binning in Python**

Suppose we have a dataset that contains 100 values:

    import numpy as np
    import matplotlib.pyplot as plt

    #create data
    np.random.seed(1)
    data = np.random.randn(100)

    #view first 5 values
    data[:5]

    array([ 1.62434536, -0.61175641, -0.52817175, -1.07296862,  0.86540763])

**Equal-Width Binning:**

If we create a histogram to display these values, Matplotlib will use 10 equal-width bins by default:

    #create histogram with equal-width bins
    n, bins, patches = plt.hist(data, edgecolor='black')
    plt.show()

    #display bin boundaries and frequency per bin
    bins, n

    (array([-2.3015387 , -1.85282729, -1.40411588, -0.95540447, -0.50669306,
            -0.05798165,  0.39072977,  0.83944118,  1.28815259,  1.736864  ,
             2.18557541]),
     array([ 3.,  1.,  6., 17., 19., 20., 14., 12.,  5.,  3.]))

Each bin has an equal width of approximately 0.4487, but the bins don't contain an equal number of observations. For example:

- The first bin extends from -2.3015387 to -1.85282729 and contains 3 observations.
- The second bin extends from -1.85282729 to -1.40411588 and contains 1 observation.
- The third bin extends from -1.40411588 to -0.95540447 and contains 6 observations.

And so on.
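The equal-width boundaries can also be computed by hand, which shows where the bin width comes from. The following is a minimal sketch, assuming the same seeded data as above and *k* = 10 bins. The variable names `edges`, `width`, and `hist_edges` are introduced here for illustration.

```python
import numpy as np

# Recreate the same data as above
np.random.seed(1)
data = np.random.randn(100)

k = 10  # number of bins

# Equal-width boundaries: k + 1 evenly spaced points from min to max
edges = np.linspace(data.min(), data.max(), k + 1)

# Every bin has the same width: (max - min) / k
width = (data.max() - data.min()) / k

# np.histogram with an integer bin count uses equal-width bins as well
counts, hist_edges = np.histogram(data, bins=k)

print(round(width, 4))                 # width of each bin
print(np.allclose(edges, hist_edges))  # do the hand-computed edges match?
print(counts)                          # observations per bin
```

The printed width should agree with the value of roughly 0.4487 noted above, and the counts show how unevenly the observations fall across bins of equal width.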

**Equal-Frequency Binning:**

To create bins that contain an equal number of observations, we can use the following function:

    #define function to calculate equal-frequency bins
    def equalObs(x, nbin):
        nlen = len(x)
        return np.interp(np.linspace(0, nlen, nbin + 1),
                         np.arange(nlen),
                         np.sort(x))

    #create histogram with equal-frequency bins
    n, bins, patches = plt.hist(data, equalObs(data, 10), edgecolor='black')
    plt.show()

    #display bin boundaries and frequency per bin
    bins, n

    (array([-2.3015387 , -0.93576943, -0.67124613, -0.37528495, -0.20889423,
             0.07734007,  0.2344157 ,  0.51292982,  0.86540763,  1.19891788,
             2.18557541]),
     array([10., 10., 10., 10., 10., 10., 10., 10., 10., 10.]))

The bins no longer have equal widths, but each bin *does* contain an equal number of observations. For example:

- The first bin extends from -2.3015387 to -0.93576943 and contains 10 observations.
- The second bin extends from -0.93576943 to -0.67124613 and contains 10 observations.
- The third bin extends from -0.67124613 to -0.37528495 and contains 10 observations.

And so on.

We can see from the histogram that the bins are clearly not the same width, but each bin does contain the same number of observations, which is confirmed by the fact that every bar has the same height.
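As a cross-check, equal-frequency boundaries can also be derived from quantiles. This is a sketch of an alternative approach, not the function used in this tutorial. It assumes the same seeded data and 10 bins, and it uses `np.quantile`, whose interpolation differs slightly from the function above, so the interior boundaries may not match exactly.

```python
import numpy as np

# Recreate the same data as above
np.random.seed(1)
data = np.random.randn(100)

k = 10  # number of bins

# Quantile levels 0, 0.1, ..., 1.0 mark the bin boundaries
edges = np.quantile(data, np.linspace(0, 1, k + 1))

# Count the observations that fall between consecutive boundaries
counts, _ = np.histogram(data, bins=edges)

print(edges)   # bin boundaries
print(counts)  # observations per bin
```

With 100 distinct values and 10 bins, each bin should again hold 10 observations. If the data contained many tied values, the counts could be uneven under either approach.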