When creating a histogram using hist() from matplotlib, the data falls into bins as such: lb ≤ x < ub. How do I force it to behave like this: lb < x ≤ ub?
Additionally, the frequency table is shifted one bin lower compared to Excel, which produces an inaccurate measurement for my purpose.
import numpy as np
data = np.array([23.5, 28, 29, 29, 29.5, 29.5, 30, 30, 30])
bins = np.array([20, 25, 30])
# Excel 1, 8
# Python 1, 5
Using the table as a reference, how do I force hist() to put values between 25 and 30 in bin 30 and not bin 25?
# in Python: 20 <-> 20 ≤ x < 25
# in Excel: 25 <-> 20 < x ≤ 25
Maybe numpy.digitize might be interesting for you. From its documentation, for increasing bins: with right=False (the default) it returns the index i such that bins[i-1] <= x < bins[i], and with right=True it returns the index i such that bins[i-1] < x <= bins[i].

Hopefully this also clears up a common misunderstanding when working with bins. The bins correspond to the vertices of a grid, and a data point falls between two vertices, i.e. in one bin. Therefore a data point does not correspond to one single point in the bins array but to two. Another thing one can see from this notation is that with bins=[20, 25, 30], bin 1 goes from 20 to 25 and bin 2 from 25 to 30. Maybe the notation in Excel is different?

The keyword right can be used to build a custom histogram function. Note that in the case of right=True, a value on the lowest edge (15 in my example) belongs to the bin ? < x ≤ 15, which gives you a fourth bar in the histogram even though that bin is not explicitly included in the bins. If this is not wanted, you have to treat the edge cases separately and maybe add the values to the first valid bin.

I guess that this is also the reason why we see unexpected behaviour with your example data. Matplotlib applies lb ≤ x < ub for the bins, but nevertheless the 30s get associated with the bin 25-30. If we add an additional bin 30-35, we can see that the 30s are now put in this bin. As far as I can tell, the rule lb ≤ x < ub is applied everywhere except in the last bin, where lb ≤ x ≤ ub is used (the numpy.histogram documentation describes all but the last bin as half-open). That is also reasonable, but one has to be aware of it.
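To make the two conventions concrete, here is a minimal sketch of a right-closed count built on numpy.digitize, compared against numpy.histogram on the data from the question. The helper name right_closed_counts is made up for illustration and is not part of NumPy or Matplotlib. It simply drops values outside the bins, which is only one possible way to treat the edge cases.

```python
import numpy as np


def right_closed_counts(data, bins):
    """Count values per bin using lb < x <= ub (hypothetical helper)."""
    data = np.asarray(data)
    bins = np.asarray(bins)
    # With right=True, digitize returns i such that bins[i-1] < x <= bins[i].
    # Index 0 means x <= bins[0]; index len(bins) means x > bins[-1].
    idx = np.digitize(data, bins, right=True)
    counts = np.bincount(idx, minlength=len(bins) + 1)
    # Drop the underflow slot (index 0) and the overflow slot (index len(bins)).
    # Assumption: out-of-range values, including x == bins[0], are discarded.
    return counts[1:len(bins)]


data = np.array([23.5, 28, 29, 29, 29.5, 29.5, 30, 30, 30])

# Two bins: the 30s sit on the last edge, so both conventions agree.
print(right_closed_counts(data, [20, 25, 30]))        # [1 8]
print(np.histogram(data, bins=[20, 25, 30])[0])       # [1 8]

# Three bins: 30 is now an interior edge, so the conventions differ.
print(right_closed_counts(data, [20, 25, 30, 35]))    # [1 8 0]
print(np.histogram(data, bins=[20, 25, 30, 35])[0])   # [1 5 3]
```

The resulting counts can be drawn with a bar plot (for example plt.bar) instead of hist(), since hist() would bin the raw data again with its own rule.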