How do I include the upper boundary of the bins in Matplotlib hist()?


When creating a histogram using hist() from matplotlib, the data falls into bins as such:

lb ≤ x < ub. How do I force it to behave like this: lb < x ≤ ub?

Additionally, the frequency table is shifted one bin lower compared to Excel, which produces an inaccurate measurement for my purpose.

import numpy as np
data = np.array([23.5, 28, 29, 29, 29.5, 29.5, 30, 30, 30])
bins = np.array([20, 25, 30])
# Excel               1, 8
# Python          1,  5

Using the table as a reference, how do I force hist() to put values between 25 and 30 in bin 30 and not bin 25?

# in Python: 20 <-> 20 ≤ x < 25
# in Excel:  25 <-> 20 < x ≤ 25

There is 1 answer

scleronomic

numpy.digitize might be interesting for you (from the documentation):

Return the indices of the bins to which each value in input array belongs.

=========  =============  ============================
`right`    order of bins  returned index `i` satisfies
=========  =============  ============================
``False``  increasing     ``bins[i-1] <= x < bins[i]``
``True``   increasing     ``bins[i-1] < x <= bins[i]``
``False``  decreasing     ``bins[i-1] > x >= bins[i]``
``True``   decreasing     ``bins[i-1] >= x > bins[i]``
=========  =============  ============================
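To make the table concrete, here is a small sketch that applies np.digitize to the data from the question with both settings of right. The variable names left_closed and right_closed are only illustrative.

```python
import numpy as np

data = np.array([23.5, 28, 29, 29, 29.5, 29.5, 30, 30, 30])
bins = np.array([20, 25, 30])

# right=False: index i satisfies bins[i-1] <= x < bins[i]
left_closed = np.digitize(data, bins, right=False)
# right=True: index i satisfies bins[i-1] < x <= bins[i]
right_closed = np.digitize(data, bins, right=True)

print(left_closed)   # [1 2 2 2 2 2 3 3 3]
print(right_closed)  # [1 2 2 2 2 2 2 2 2]
```

With right=False the three values equal to 30 get index 3, which is len(bins) and therefore lies beyond the last edge. With right=True they land in index 2, the bin 25 < x ≤ 30.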

Hopefully this also clears up a common misunderstanding when working with bins. The values in bins are the vertices of a grid, and a data point falls between two vertices, i.e. into one bin. A data point therefore does not correspond to a single entry of the bins array but to two of them. The notation also shows that with bins=[20, 25, 30], bin 1 goes from 20 to 25 and bin 2 from 25 to 30. Maybe the notation in Excel is different?
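As a quick illustration of the point about vertices, the following sketch pairs up neighbouring edges. The names edges and intervals are made up for this example.

```python
import numpy as np

edges = np.array([20, 25, 30])

# Each bin is bounded by two neighbouring edges, so n edges define n - 1 bins
intervals = list(zip(edges[:-1].tolist(), edges[1:].tolist()))

print(intervals)       # [(20, 25), (25, 30)]
print(len(intervals))  # 2
```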

Using the keyword right in a custom histogram function results in the following code and plot.

import numpy as np
import matplotlib.pyplot as plt

data = np.array([15,
                 17, 18, 20, 20, 20,
                 23.5, 24, 25, 25,
                 28, 29, 30, 30, 30])
bins = np.array([15, 20, 25, 30])


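def custom_hist(x, bins, right=False):
    # Index of the bin each value falls into (0 .. len(bins), inclusive)
    x_dig = np.digitize(x, bins=bins, right=right)
    u, c = np.unique(x_dig, return_counts=True)
    # One slot per possible index: slot 0 holds values before the first edge
    # (or on it when right=True), slot len(bins) holds values beyond the last edge
    h = np.zeros(len(bins) + 1, dtype=int)
    h[u] = c
    return h


plt.hist(data, bins=bins,  color='b', alpha=0.7, label='plt.hist')
# counts: array([3., 5., 7.])

# Drop the overflow slot, which is empty here because no value exceeds 30
height = custom_hist(x=data, bins=bins, right=True)[:-1]
width = np.diff(bins)
# The bar ending at bins[i] spans bins[i-1]..bins[i]; the first bar has no
# lower edge, so it reuses the first width
width = np.concatenate((width[:1], width))
plt.bar(bins-width, height=height, width=width,
        align='edge', color='r', alpha=0.7, label='np.digitize')
plt.legend()
# This function also allows different sized bins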

[Figure: "custom hist", the plt.hist result (blue) overlaid with the np.digitize bar chart (red)]

Note that with right=True the value 15 falls into the bin ? < x ≤ 15, which produces a fourth bar in the histogram even though that bin is not explicitly listed in bins. If this is not wanted, you have to treat the edge cases separately, for example by adding those values to the first valid bin.

I guess this is also the reason for the unexpected behaviour with your example data. Matplotlib applies lb ≤ x < ub to the bins, but the 30s nevertheless end up in the bin 25-30. If we add an additional bin 30-35, the 30s are put into that bin instead. So the rule lb ≤ x < ub applies to every bin except the last one, which uses lb ≤ x ≤ ub (the numpy.histogram documentation describes all but the last bin as half-open). That is reasonable, but one has to be aware of it.
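One possible way to treat the edge case is sketched below. It is not part of Matplotlib or NumPy: right_closed_hist is a hypothetical helper that counts with lb < x ≤ ub, folds values lying exactly on the lowest edge into the first bin, and ignores values outside the range. It uses np.searchsorted with side='left', which finds the index i with edges[i-1] < x ≤ edges[i].

```python
import numpy as np


def right_closed_hist(x, edges):
    """Count values per bin using lb < x <= ub (hypothetical helper)."""
    x = np.asarray(x)
    edges = np.asarray(edges)
    # side='left' returns i such that edges[i-1] < x <= edges[i]
    idx = np.searchsorted(edges, x, side='left')
    # Fold values sitting exactly on the lowest edge into the first bin
    idx[x == edges[0]] = 1
    # Keep only values inside the overall range of the edges
    inside = (idx >= 1) & (idx <= len(edges) - 1)
    return np.bincount(idx[inside] - 1, minlength=len(edges) - 1)


question_counts = right_closed_hist(
    [23.5, 28, 29, 29, 29.5, 29.5, 30, 30, 30], [20, 25, 30])
answer_counts = right_closed_hist(
    [15, 17, 18, 20, 20, 20, 23.5, 24, 25, 25, 28, 29, 30, 30, 30],
    [15, 20, 25, 30])

print(question_counts)  # [1 8]
print(answer_counts)    # [6 4 5]
```

The resulting counts can be drawn with plt.bar(edges[:-1], counts, width=np.diff(edges), align='edge'). For the data from the question this gives the 1, 8 reported for Excel.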

data = np.array([23.5, 28, 29, 29, 29.5, 29.5, 30, 30, 30])
plt.hist(data, bins=np.array([20, 25, 30]),  color='b', alpha=0.7, label='[20, 25, 30]')
plt.hist(data, bins=np.array([20, 25, 30, 35]),  color='r', alpha=0.7, label='[20, 25, 30, 35]')
plt.legend()

[Figure: "different bins", histograms of the same data for bins [20, 25, 30] (blue) and [20, 25, 30, 35] (red)]
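The same effect can be checked without plotting. plt.hist uses numpy.histogram to bin the data, so the sketch below calls np.histogram directly; the names counts_two_bins and counts_three_bins are only for illustration.

```python
import numpy as np

data = np.array([23.5, 28, 29, 29, 29.5, 29.5, 30, 30, 30])

# The last bin is closed on both sides, so the 30s land in 25-30
counts_two_bins, _ = np.histogram(data, bins=[20, 25, 30])
# With an extra bin, 25-30 becomes half-open and the 30s move to 30-35
counts_three_bins, _ = np.histogram(data, bins=[20, 25, 30, 35])

print(counts_two_bins)    # [1 8]
print(counts_three_bins)  # [1 5 3]
```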