I have a table of float values and strings such as the following:
| FltVal | Category |
|---|---|
| 0.01 | A |
| 0.02 | A |
| 0.05 | B |
| 0.31 | B |
| 0.36 | A |
| 0.37 | C |
| 0.41 | B |
I would like to produce a new dataframe that bins the float values into bins of size 0.3 (so each bin is (x, x + 0.3]), with the bins overlapping in 0.01 increments. Additionally, I'd like a column tallying the number of records falling into each bin, as well as a column tallying each 'Category' total for the bin.
I first tried to get a basic count going with `groupby` and `pd.cut` (before trying to find counts for each Category):

```python
import numpy as np
import pandas as pd

floats = pd.Series([0.01, 0.02, 0.05, 0.31, 0.36, 0.37, 0.41])
categories = pd.Series(['A', 'A', 'B', 'B', 'A', 'C', 'B'])
data = {"FltVal": floats, "Category": categories}
df = pd.concat(data, axis=1)
grouped_vals = df.groupby(
    pd.cut(df['FltVal'], np.arange(df['FltVal'].min(), df['FltVal'].max(), 0.3))
).count()
```
Output:

```
              FltVal  Category
FltVal
(0.01, 0.31]       3         3
```
I think the problem there was that it never had another bin to consider, let alone overlapping ones?
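That suspicion is easy to confirm: with these values, `np.arange(min, max, 0.3)` yields only two edges, so `pd.cut` can only build the single interval shown above:

```python
import numpy as np

# min = 0.01, max = 0.41: arange stops before max, so a step of 0.3
# produces only two edges -- i.e. a single (0.01, 0.31] bin for pd.cut
edges = np.arange(0.01, 0.41, 0.3)
print(edges)  # [0.01 0.31]
```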
Then, with the idea that I could just throw out nonsensical ranges (like negative ones) afterward, I tried the following:
```python
FltVal_range = np.arange(df['FltVal'].min(), df['FltVal'].max(), 0.01)
FltVal_range_from = FltVal_range - 0.3
FltVal_range_to = FltVal_range
FltVal_intervals = pd.IntervalIndex.from_arrays(FltVal_range_from, FltVal_range_to)
binned_df = df.groupby(pd.cut(df['FltVal'], FltVal_intervals))
```
But got this `ValueError`:

```
ValueError: Overlapping IntervalIndex is not accepted.
```
If pd.cut doesn't accept overlapping bins, I guess it's a non-starter.
Regardless, the desired output is the following:
| FloatBin | Count | Cat_A | Cat_B | Cat_C |
|---|---|---|---|---|
| 0.00, 0.30 | 3 | 2 | 1 | 0 |
| 0.01, 0.31 | 3 | 1 | 2 | 0 |
| 0.02, 0.32 | 2 | 0 | 2 | 0 |
| 0.03, 0.33 | 2 | 0 | 2 | 0 |
| 0.04, 0.34 | 2 | 0 | 2 | 0 |
| 0.05, 0.35 | 1 | 0 | 1 | 0 |
| 0.06, 0.36 | 2 | 1 | 1 | 0 |
| 0.07, 0.37 | 3 | 1 | 1 | 1 |
| 0.08, 0.38 | 3 | 1 | 1 | 1 |
| 0.09, 0.39 | 3 | 1 | 1 | 1 |
| 0.10, 0.40 | 3 | 1 | 1 | 1 |
| 0.11, 0.41 | 4 | 1 | 2 | 1 |
An efficient approach is necessary, as the actual dataframe will have rows in the hundreds of thousands to millions.
I'm thinking of something similar to this answer but suitable for finding counts for all my categories.
All help is much appreciated!
One option uses janitor's `conditional_join`.
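The answer's original code did not survive the copy, so as a sketch of the same overlapping-bin tally without the janitor dependency (a plain pandas/NumPy substitute, using `np.searchsorted` instead of `conditional_join`): each count for a bin (lo, lo + 0.3] is the difference of two binary searches on the sorted values. The `FloatBin` and `Cat_` column names follow the question's desired output.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "FltVal": [0.01, 0.02, 0.05, 0.31, 0.36, 0.37, 0.41],
    "Category": ["A", "A", "B", "B", "A", "C", "B"],
})

width, step = 0.3, 0.01
# Lower edges 0.00, 0.01, ..., up to max(FltVal) - width; each bin is (lo, lo + width].
n_bins = int(round((df["FltVal"].max() - width) / step)) + 1
lo = np.round(np.arange(n_bins) * step, 2)
hi = np.round(lo + width, 2)

def counts_in_bins(values):
    # Number of values in (lo, hi] = count(v <= hi) - count(v <= lo),
    # each found by binary search on the sorted values.
    v = np.sort(np.asarray(values))
    return np.searchsorted(v, hi, side="right") - np.searchsorted(v, lo, side="right")

out = pd.DataFrame({"FloatBin": [f"{a:.2f}, {b:.2f}" for a, b in zip(lo, hi)]})
out["Count"] = counts_in_bins(df["FltVal"])
for cat, grp in df.groupby("Category"):
    out[f"Cat_{cat}"] = counts_in_bins(grp["FltVal"])

print(out)
```

This avoids materializing the full value-by-bin join: each category costs one sort plus two `searchsorted` calls over the bin edges, which should stay manageable at millions of rows.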