Calculate the mean of grouped values using np.arange and pd.cut with a different stop value (and bins) per group, or any other way

I have the following dataframe

import numpy as np
import pandas as pd

df2 = pd.read_csv('arangerr2.csv')
   hgroup   lowgroup    value   max_hg_value
0   A          B           0    39
1   A          B          18    39
2   A          B          38    39
3   A          C           0    39
4   A          C          19    39
5   A          C          39    39

I want to calculate the mean of value for every interval of 20, grouped by lowgroup, using np.arange and pd.cut:

start = 0
stop = df2['value'].max()
step = 20
bins = np.arange(start, stop + 10, step)
# +1 shifts a value of 0 into the first (0, 20] bin, which excludes its left edge
df2['bins'] = pd.cut(df2['value'] + 1, bins)
df2['mean'] = df2['value'].groupby(pd.cut(df2['value'], bins=bins, right=False)).transform('mean')
df2

  hgroup lowgroup  value  max_hg_value      bins   mean
0      A        B      0            39   (0, 20]   9.25
1      A        B     18            39   (0, 20]   9.25
2      A        B     38            39  (20, 40]  38.50
3      A        C      0            39   (0, 20]   9.25
4      A        C     19            39   (0, 20]   9.25
5      A        C     39            39  (20, 40]  38.50

This seems to do the job perfectly. However, it only works when there is a single, fixed stop value. How can this be solved when there are multiple hgroups, each with different lowgroups and a different max value?

What do we need to do to go from this

     Hgroup LowGoup  Value  Max_HG_value
0       A       B      0            39
1       A       B     18            39
2       A       B     38            39
3       A       C      0            39
4       A       C     19            39
5       A       C     39            39
6       B       D      0            50
7       B       D     17            50
8       B       D     34            50
9       B       D     55            50
10      B       E      0            50
11      B       E     14            50
12      B       E     22            50
13      B       E     50            50
14      C       F      0            69
15      C       F     10            69
16      C       F     25            69
17      C       F     50            69
18      C       F     65            69
19      C       G      0            69
20      C       G      9            69
21      C       G     30            69
22      C       G     48            69
23      C       G     69            69

to this

      Hgroup LowGoup  Value  Max_HG_value  Mean 
0       A       B      0            39   9.25
1       A       B     18            39   9.25
2       A       B     38            39  38.50
3       A       C      0            39   9.25
4       A       C     19            39   9.25
5       A       C     39            39  38.50
6       B       D      0            50   7.75
7       B       D     17            50   7.75
8       B       D     34            50  28.00
9       B       D     55            50  52.50
10      B       E      0            50   7.75
11      B       E     14            50   7.75
12      B       E     22            50  28.00
13      B       E     50            50  52.50
14      C       F      0            69   4.75
15      C       F     10            69   4.75
16      C       F     25            69  27.50
17      C       F     50            69  49.00
18      C       F     65            69  67.00
19      C       G      0            69   4.75
20      C       G      9            69   4.75
21      C       G     30            69  27.50
22      C       G     48            69  49.00
23      C       G     69            69  67.00

It seems like we need to apply np.arange and pd.cut to every lowgroup within each hgroup. I have tried multiple ways but I can't seem to get it right. Can someone help me?

There are 2 answers

Corralien On

You don't need pd.cut as your interval is evenly spaced:

df2['mean'] = df2.groupby(df2['value'] // 20)['value'].transform('mean')
print(df2)

# Output
  hgroup lowgroup  value  max_hg_value   mean
0      A        B      0            39   9.25
1      A        B     18            39   9.25
2      A        B     38            39  38.50
3      A        C      0            39   9.25
4      A        C     19            39   9.25
5      A        C     39            39  38.50
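
As a rough check of why pd.cut is not needed here: for evenly spaced bins that start at 0, `value // step` gives the same bin index as pd.cut with right=False. The sketch below uses the six values from the question; the names `values`, `step`, `edges`, `by_floor` and `by_cut` are only illustrative.

```python
import numpy as np
import pandas as pd

# The six values from the first example in the question.
values = pd.Series([0, 18, 38, 0, 19, 39], name="value")
step = 20

# Bin index via integer division: 0 for [0, 20), 1 for [20, 40), ...
by_floor = values // step

# The same bins via pd.cut. The "+ 1" keeps a maximum that is an exact
# multiple of step inside the last bin; labels=False returns integer codes.
edges = np.arange(0, values.max() + step + 1, step)  # [0, 20, 40]
by_cut = pd.cut(values, bins=edges, right=False, labels=False)

# True when both approaches assign every row to the same bin.
same_bins = bool((by_floor == by_cut).all())
print(same_bins)
```

If the bins did not start at 0 or were not all the same width, the two would no longer agree and pd.cut would be the safer choice.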

For your second example, it seems you need to group by Hgroup too:

df3['mean'] = df3.groupby(['Hgroup', df3['Value'] // 20])['Value'].transform('mean')
print(df3)

# Output
   Hgroup LowGoup  Value  Max_HG_value   mean
0       A       B      0            39   9.25
1       A       B     18            39   9.25
2       A       B     38            39  38.50
3       A       C      0            39   9.25
4       A       C     19            39   9.25
5       A       C     39            39  38.50
6       B       D      0            50   7.75
7       B       D     17            50   7.75
8       B       D     34            50  28.00
9       B       D     55            50  52.50
10      B       E      0            50   7.75
11      B       E     14            50   7.75
12      B       E     22            50  28.00
13      B       E     50            50  52.50
14      C       F      0            69   4.75
15      C       F     10            69   4.75
16      C       F     25            69  27.50
17      C       F     50            69  49.00
18      C       F     65            69  67.00
19      C       G      0            69   4.75
20      C       G      9            69   4.75
21      C       G     30            69  27.50
22      C       G     48            69  49.00
23      C       G     69            69  67.00
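
If the bin edges were not evenly spaced, the same per-Hgroup grouping should still work with pd.cut in place of `// 20`. This is only a sketch: the `edges` list is made up for illustration, and the frame is a small subset of the second table.

```python
import pandas as pd

# A few rows from the second example (Hgroup A and B only).
df3 = pd.DataFrame({
    "Hgroup": ["A", "A", "A", "B", "B", "B", "B"],
    "Value":  [0, 18, 38, 0, 17, 34, 55],
})

# Hypothetical, unevenly spaced edges (not from the original post).
edges = [0, 20, 50, 100]

# labels=False returns integer bin codes; right=False makes bins [left, right).
bin_code = pd.cut(df3["Value"], bins=edges, right=False, labels=False).rename("bin")

# Mean of Value within each (Hgroup, bin) pair, broadcast back to every row.
df3["mean"] = df3.groupby(["Hgroup", bin_code])["Value"].transform("mean")
print(df3)
```

Using integer codes rather than the categorical intervals keeps the grouping key simple and avoids having to think about unobserved categories.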
Gustavo De Leon On

Thanks, your answer is correct, but I just realised I missed one field in my example. Intervals are unevenly spaced, which is why I'm using pd.cut. Here is a snippet of a real example.

I want to calculate the average Altitude where the values of Distance fall between 0-20, 20-40, 40-60, 60-80, and so on (n to n+20, which is why I used STEP=20 in the previous example), grouped by Hgroup. I have thousands of rows, so I'm looking to calculate all these values programmatically, with STEP as the only variable.

  Hgroup LowGoup Distance Altitude
0   A      B      0        56
1   A      B     18        50
2   A      B     38        20
3   A      C     0         60
4   A      C     19        32
5   A      C     39        12
6   B      D     0         20
7   B      D     17        50
8   B      D     34        21
9   B      D     55        23
10  B      E     0         50
11  B      E     14        60
12  B      E     22        21
13  B      E     50        20
14  C      F     0         60
15  C      F     10        63
16  C      F     25        23
17  C      F     50        21
18  C      F     65        45
19  C      G     0         40
20  C      G     9         24
21  C      G    30         23
22  C      G    48         56
23  C      G    69         60

I'd like to get to this answer

 Hgroup LowGoup Distance Altitude Mean_altitude
0   A      B       0      56      49.50
1   A      B       18     50      49.50
2   A      B      38      20      16.00
3   A      C       0      60      49.50
4   A      C       19     32      49.50
5   A      C       39     12      16.00
6   B      D       0      20      45.00
7   B      D       17     50      45.00
8   B      D       34     21      21.00
9   B      D       55     23      21.50
10  B      E       0      50      45.00
11  B      E       14     60      45.00
12  B      E       22     21      21.00
13  B      E       50     20      21.50
14  C      F       0      60      46.75
15  C      F       10     63      46.75
16  C      F       25     23      23.00
17  C      F       50     21      38.50
18  C      F       65     45      52.50
19  C      G       0      40      46.75
20  C      G       9      24      46.75
21  C      G       30     23      23.00
22  C      G       48     56      38.50
23  C      G       69     60      52.50

It would be great if you could take another look at this dataset and help me out. @Corralien
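
For what it's worth, here is a minimal sketch of one way to get that table, assuming the Distance bins are all STEP wide and start at 0 (as in the 0-20, 20-40, ... list above). It reuses the floor-division idea from the earlier answer, but bins on Distance and averages Altitude. The name `distance_bin` is only illustrative.

```python
import pandas as pd

STEP = 20  # width of each Distance bin; the only value that needs changing

# The example table from the post.
df = pd.DataFrame({
    "Hgroup":   ["A"] * 6 + ["B"] * 8 + ["C"] * 10,
    "LowGoup":  ["B"] * 3 + ["C"] * 3 + ["D"] * 4 + ["E"] * 4
                + ["F"] * 5 + ["G"] * 5,
    "Distance": [0, 18, 38, 0, 19, 39,
                 0, 17, 34, 55, 0, 14, 22, 50,
                 0, 10, 25, 50, 65, 0, 9, 30, 48, 69],
    "Altitude": [56, 50, 20, 60, 32, 12,
                 20, 50, 21, 23, 50, 60, 21, 20,
                 60, 63, 23, 21, 45, 40, 24, 23, 56, 60],
})

# Bin index per row: 0 for [0, STEP), 1 for [STEP, 2*STEP), and so on.
distance_bin = (df["Distance"] // STEP).rename("distance_bin")

# Mean Altitude within each (Hgroup, Distance bin), broadcast to every row.
df["Mean_altitude"] = (
    df.groupby(["Hgroup", distance_bin])["Altitude"].transform("mean")
)
print(df)
```

Because the bin index is computed row by row, each Hgroup can have a different maximum Distance without building a separate np.arange per group.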