Create bins and dataframe including a calculation

316 views Asked by At
import pandas as pd
import numpy as np
import random
import string

N = 100
J = [2012,2013,2014]
K = ['A','B','C','D','E','F','G','H']
L = ['h','d','a']

df = pd.DataFrame(
    np.random.uniform(1,10,size=(N, 3)),
    columns=list('XYZ')
)

df['ht'] =  pd.Series(random.choice(K) for _ in range(N))
df['at'] =  pd.Series(random.choice(K) for _ in range(N))
df['J'] = pd.Series(random.choice(J) for _ in range(N))
df['R'] = pd.Series(random.choice(L) for _ in range(N))

df1 = (df.X).groupby([df.ht, df.J]).agg(['sum', 'size']).unstack(fill_value=0)

print(df.head())

enter image description here

I'd like to create a new dataframe where column 'X' will be clustered in 10 even bins. Then a sum needs to be calculated per year per cluster: 'R' * 'X', where 'R' is 'h'.

EDIT;

Example of desired endresult:

bins/2012/2013/2014/Total_sum_years/Total_number_'h'

0 < 1.5 /15/8/5/28/7

1

There are 1 answers

1
Ted Petrou On BEST ANSWER

New guess

df_agg  = df.groupby([pd.cut(df.X, 10), 'R', 'J'])['X'].agg(['sum', 'size'])\
                                                       .unstack('J', fill_value=0)\
                                                       .reset_index('R')

df_agg = df_agg.loc[df_agg['R'] == 'h']

df_agg['total_sum_years'] = df_agg['sum'].sum(1)
df_agg['total_number_h'] = df_agg['size'].sum(1)

output

                 R        sum                       size            \
J                        2012       2013       2014 2012 2013 2014   
X                                                                    
(1.203, 2.0842]  h   1.421185   2.660724   3.380401    1    2    2   
(2.0842, 2.956]  h   4.984133   5.044891   0.000000    2    2    0   
(2.956, 3.828]   h   0.000000   3.190256   6.644137    0    1    2   
(3.828, 4.7]     h   4.086577   0.000000   0.000000    1    0    0   
(4.7, 5.572]     h   0.000000   9.595351   0.000000    0    2    0   
(5.572, 6.444]   h  11.659066   6.037559  12.452256    2    1    2   
(6.444, 7.316]   h   6.535510   0.000000   6.820929    1    0    1   
(7.316, 8.188]   h   0.000000   0.000000  23.259200    0    0    3   
(8.188, 9.0605]  h   8.944386   8.352764  25.645607    1    1    3   
(9.0605, 9.933]  h  18.863608  29.606962   9.222994    2    3    1   

                total_sum_years total_number_h  
J                                               
X                                               
(1.203, 2.0842]        7.462311              5  
(2.0842, 2.956]       10.029024              4  
(2.956, 3.828]         9.834393              3  
(3.828, 4.7]           4.086577              1  
(4.7, 5.572]           9.595351              2  
(5.572, 6.444]        30.148881              5  
(6.444, 7.316]        13.356440              2  
(7.316, 8.188]        23.259200              3  
(8.188, 9.0605]       42.942758              5  
(9.0605, 9.933]       57.693565              6  

I'm taking a shot here. Still not sure exactly what you want. I first redid your dataframe creation to be better.

N = 100
J = [2012,2013,2014]
K = ['A','B','C','D','E','F','G','H']
L = ['h','d','a']

df = pd.DataFrame(
    {'X': np.random.uniform(1,10,N),
     'Y': np.random.uniform(1,10,N),
     'Z': np.random.uniform(1,10,N),
     'ht':np.random.choice(K, N),
     'at':np.random.choice(K, N),
     'J':np.random.choice(J, N),
     'R':np.random.choice(L, N)
    })

I then grouped X with pd.cut and also grouped by J and ht

df.groupby([pd.cut(df.X, 10), 'ht', 'J'])['X'].sum().unstack('J', fill_value=0)

Which produced the following. I think this gets you close to what you want.

J                        2012       2013       2014
X               ht                                 
(1.203, 2.0842] A    2.076946   1.360880   1.544429
                B    0.000000   0.000000   1.434798
                C    0.000000   1.313596   1.835972
                D    0.000000   1.212149   1.280920
                F    0.000000   0.000000   3.545768
                H    1.421185   1.299844   0.000000
(2.0842, 2.956] A    2.331453   0.000000   0.000000
                B    0.000000   2.489030   0.000000
                C    5.689538   2.555860   0.000000
                D    5.338711   0.000000   0.000000
                H    0.000000   0.000000   6.428545
(2.956, 3.828]  A    0.000000   6.692342   0.000000
                B    0.000000   0.000000   3.211878
                C    0.000000   3.673353   3.062432
                D    0.000000   0.000000   3.432259
                E    3.789112   0.000000   3.612064
                G    0.000000   3.190256   3.117251
(3.828, 4.7]    E    8.758016   0.000000   4.302206
                G    4.086577   0.000000   0.000000
(4.7, 5.572]    A    5.268921   0.000000   0.000000
                B    0.000000   4.845556   0.000000
                C    0.000000   4.990270   5.078201
                E    0.000000   4.749795   0.000000
                F    0.000000   0.000000   5.260480
                G    4.811551   0.000000   0.000000
                H    0.000000   0.000000   4.817087
.....