Python: histogram/binning data from 2 arrays


I have two arrays of data: one holds radius values and the other holds the corresponding intensity reading at each radius.

For example, here is a small section of the data. The first column is the radius and the second is the intensity.

29.77036614 0.04464427
29.70281027 0.07771409
29.63523525 0.09424901
29.3639355 1.322793
29.29596385 2.321502
29.22783249 2.415751
29.15969437 1.511504
29.09139827 1.01704
29.02302068 0.9442765
28.95463729 0.3109002
28.88609766 0.162065
28.81754446 0.1356054
28.74883612 0.03637681
28.68004928 0.05952569
28.61125036 0.05291172
28.54229804 0.08432806
28.4732599 0.09950128
28.43877462 0.1091304
28.40421016 0.09629156
28.36961249 0.1193614
28.33500089 0.102711
28.30037503 0.07161685

How can I bin the radius data and find the average intensity corresponding to each radius bin?

The aim is to then use the average intensity to assign an intensity value to any radius whose intensity reading is missing (NaN).

I've never had to use the histogram functions before and have very little idea of how they work, or whether it's possible to do this with them. The full data set is large, with 336622 data points, so I don't really want to use loops or if statements to achieve this.
Many thanks for any help.


There are 2 answers

tmdavison (best answer)

If you only need to do this for a handful of points, you could do something like this.

If intensities and radius are numpy arrays of your data:

bin_width = 0.1 # Depending on how narrow you want your bins

def get_avg(rad):
    # Mean intensity of all points with rad - bin_width/2 <= radius < rad + bin_width/2
    average_intensity = intensities[(radius>=rad-bin_width/2.) & (radius<rad+bin_width/2.)].mean()
    return average_intensity

# This will return the average intensity in the bin: 27.95 <= rad < 28.05
average = get_avg(28.)
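
If you need the bin average for every point rather than a handful, the same idea can be done without a Python loop. This is only a minimal sketch, assuming missing intensities are stored as NaN. The sample values are made up for illustration, and using numpy.histogram with weights is one possible approach, not the only one.

```python
import numpy as np

# Hypothetical sample data; NaN marks a missing intensity reading.
radius = np.array([28.31, 28.33, 28.37, 28.42, 28.44, 28.47, 28.54])
intensities = np.array([0.07, 0.10, np.nan, 0.10, 0.11, 0.10, 0.08])

bin_width = 0.1  # assumed bin width, as in the snippet above
# Bin edges covering the full radius range
edges = np.arange(radius.min(), radius.max() + bin_width, bin_width)

valid = ~np.isnan(intensities)
# Sum of intensities per bin and number of valid readings per bin
sums, _ = np.histogram(radius[valid], bins=edges, weights=intensities[valid])
counts, _ = np.histogram(radius[valid], bins=edges)

# Mean per bin; bins without any valid reading stay NaN
bin_means = np.full(len(counts), np.nan)
np.divide(sums, counts, out=bin_means, where=counts > 0)

# Bin index of every radius (clipped so the largest radius lands in the last bin)
idx = np.clip(np.digitize(radius, edges) - 1, 0, len(counts) - 1)

# Replace each missing intensity with the mean of its bin
filled = np.where(valid, intensities, bin_means[idx])
```

Here filled keeps every measured intensity and only substitutes the bin mean where the reading was NaN. A bin that contains no valid readings at all will still give NaN, so a wider bin_width may be needed for sparse data.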

Bernhard

It's not really histogramming that you are after. A histogram is more a count of the items that fall into each bin. What you want is more of a group-by operation, where you group your intensities by radius interval and apply some aggregation method, like the average or median, to each group of intensities.

What you are describing, however, sounds a lot more like some sort of interpolation. So I would suggest considering interpolation as an alternative way to solve your problem. Anyway, here's a suggestion for how you can achieve what you asked for (assuming you can use numpy). I'm using random inputs to illustrate:

import random
import numpy

radius = numpy.fromiter((random.random() * 10 for i in range(1000)), dtype=float)
intensities = numpy.fromiter((random.random() * 10 for i in range(1000)), dtype=float)
# 20 equally spaced bin edges spanning the radius range (19 intervals)
bins = numpy.linspace(radius.min(), radius.max(), 20)
groups = numpy.digitize(radius, bins)
# groups[i] now holds the index of the bin into which radius[i] falls
# (1..19 for the intervals; radius.max() itself gets index 20)
# loop through the bin indexes that actually occur and select the corresponding intensities
# perform your aggregation on the selected intensities
# I'm keeping the aggregation for each group in a dict
aggregated = {}
for i in numpy.unique(groups):
    selected_intensities = intensities[groups == i]
    aggregated[i] = selected_intensities.mean()
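
For the interpolation alternative mentioned above, something along these lines might work. It is only a rough sketch, assuming missing intensities are stored as NaN and that linear interpolation between neighbouring radii is acceptable. The sample values are made up for illustration.

```python
import numpy as np

# Hypothetical sample data; NaN marks a missing intensity reading.
radius = np.array([29.36, 29.30, 29.23, 29.16, 29.09, 29.02])
intensities = np.array([1.32, 2.32, np.nan, 1.51, 1.02, 0.94])

valid = ~np.isnan(intensities)

# numpy.interp needs increasing x values, so sort the valid points by radius
order = np.argsort(radius[valid])
known_r = radius[valid][order]
known_i = intensities[valid][order]

# Linearly interpolate the missing readings from their neighbours in radius
filled = intensities.copy()
filled[~valid] = np.interp(radius[~valid], known_r, known_i)
```

Note that numpy.interp does not extrapolate: a missing point outside the range of known radii simply gets the intensity of the nearest end point.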