When creating a histogram using hist() from matplotlib, the data falls into bins as follows:
lb ≤ x < ub. How do I force it to behave like this: lb < x ≤ ub?
Additionally, the frequency table is shifted one bin lower compared to Excel, which produces an inaccurate measurement for my purpose.
import numpy as np
data = np.array([23.5, 28, 29, 29, 29.5, 29.5, 30, 30, 30])
bins = np.array([20, 25, 30])
# Excel 1, 8
# Python 1, 5
Using the table as a reference, how do I force hist() to put values between 25 and 30 in bin 30 and not bin 25?
# in Python: 20 <-> 20 ≤ x < 25
# in Excel: 25 <-> 20 < x ≤ 25
Maybe numpy.digitize might be interesting for you (see its documentation).

Hopefully this also clears up a common misunderstanding when working with bins. The bins correspond to the vertices of a grid, and a data point falls between two vertices, i.e. in one bin. Therefore a data point does not correspond to a single point in the bins array but to two. Another thing one can see from this notation is that with bins=[20, 25, 30], bin 1 goes from 20 to 25 and bin 2 from 25 to 30. Maybe the notation in Excel is different?

Using the keyword right for a custom histogram function results in the following code and plot.

Note that in the case of right=True, 15 belongs to the bin ? < x ≤ 15, which gives you a fourth bar in the histogram even though it is not explicitly included in the bins. If this is not wanted, you have to treat the edge cases separately and maybe add the values to the first valid bin.

I guess that this is also the reason why we see unexpected behaviour with your example data. Matplotlib applies lb ≤ x < ub for the bins, but nevertheless the 30s get associated with the bin 25-30. If we add an additional bin 30-35, we can see that the 30s are now put in this bin. I guess that they apply the rule lb ≤ x < ub everywhere except at the edges, where they use lb ≤ x ≤ ub, which is also reasonable, but one has to be aware of it.
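A minimal sketch of such a right-closed histogram, assuming numpy.digitize with right=True does the binning. The helper name hist_right_closed and the extra edge at 35 are made up for illustration and are not the original code. It uses the data from the question and compares the result with numpy.histogram, which follows the lb ≤ x < ub convention except for the last bin.

```python
import numpy as np

data = np.array([23.5, 28, 29, 29, 29.5, 29.5, 30, 30, 30])
# Same edges as in the question, plus an extra edge at 35 (an assumption
# for illustration) so that 30 is no longer the outermost edge.
edges = np.array([20, 25, 30, 35])


def hist_right_closed(values, edges):
    """Count values per bin with lb < x <= ub (hypothetical helper)."""
    # With right=True and increasing edges, digitize returns i such that
    # edges[i-1] < x <= edges[i].
    idx = np.digitize(values, edges, right=True)
    # idx == 0 means x <= edges[0]; idx == len(edges) means x > edges[-1].
    counts = np.bincount(idx, minlength=len(edges) + 1)
    # Keep only the bins between consecutive edges; the two outer
    # "overflow" slots are dropped here and would need separate handling.
    return counts[1:len(edges)]


counts_right = hist_right_closed(data, edges)       # lb < x <= ub
counts_numpy, _ = np.histogram(data, bins=edges)    # lb <= x < ub (last bin closed)

print("right-closed:", counts_right.tolist())
print("np.histogram:", counts_numpy.tolist())
```

With these edges the right-closed counts put the three 30s into the 25-30 bin, whereas numpy.histogram moves them to 30-35. Note that a value exactly equal to the lowest edge (20 here) lands in the dropped underflow slot, which is the edge case mentioned above. The counts can then be drawn with plt.bar(edges[:-1], counts_right, width=np.diff(edges), align='edge') instead of hist().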