Pandas Calculate Sum of Multiple Columns Given Multiple Conditions


I have a wide table in a format as follows (for up to 10 people):

person1_status | person2_status | person3_status | person1_type | person2_type | person3_type
       0       |        1       |        0       |        7     |       4      |       6

Status can be 0 or 1 (first 3 columns).

Type can be a number ranging from 4 to 7. The type here corresponds to another table that maps each type to a value. So...

Type | Value
 4   |   10
 5   |   20
 6   |   30
 7   |   40
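For reference, the example row and lookup table above can be reproduced like this (a minimal sketch; `lookup` is my own name for the type-to-value table, and the column names follow the question's layout):

```python
import pandas as pd

# The single example row from the question, in its wide layout.
df = pd.DataFrame({'person1_status': [0], 'person2_status': [1], 'person3_status': [0],
                   'person1_type':   [7], 'person2_type':   [4], 'person3_type':   [6]})

# Type -> value mapping table.
lookup = pd.Series({4: 10, 5: 20, 6: 30, 7: 40})
```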

I need to calculate two columns, 'A', and 'B', where:

  1. A is the sum of values of each person's type (in that row) where status = 0.
  2. B is the sum of values of each person's type (in that row) where status = 1.

For example, the resulting columns 'A', and 'B' would be as follows:

A  | B
70 | 10

An explanation of this:

'A' has value 70 because person1 and person3 have "status" 0 and corresponding types of 7 and 6 (which correspond to values 40 and 30, so 40 + 30 = 70).

Similarly, column 'B' has the value 10 because only person2 has status "1", and their type is 4 (which has a corresponding value of 10).

This is probably a simple question, but how do I do this in a vectorized way? I don't want to use a for loop since it would be less efficient.

I hope that made sense... could anyone help me? I'm stuck trying to figure this out.

For simpler calculated columns I was getting away with just np.where, but I'm stuck here since I need to sum values from multiple columns, under certain conditions, while pulling those values in from a separate table...



There are 2 answers

Ted Petrou (best answer)

Use the filter method, which selects the columns whose names contain a given string.

Make a dataframe for the lookup values other_table and set the index as the type column.

df_status = df.filter(like='status')  # columns containing 'status'
df_type = df.filter(like='type')      # columns containing 'type'
df_type_lookup = df_type.applymap(lambda x: other_table.loc[x]).values  # type -> value

df['A'] = np.sum((df_status == 0).values * df_type_lookup, axis=1)
df['B'] = np.sum((df_status == 1).values * df_type_lookup, axis=1)
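The last two lines are plain boolean masking: `df_status == 0` gives a 0/1 mask, and multiplying it element-wise by the looked-up values zeroes out the people that don't match before summing across each row. A minimal numeric sketch using the single example row from the question:

```python
import numpy as np

status = np.array([[0, 1, 0]])
values = np.array([[40, 10, 30]])  # types 7, 4, 6 passed through the lookup table

A = ((status == 0) * values).sum(axis=1)  # mask keeps persons 1 and 3 -> 40 + 30 = 70
B = ((status == 1) * values).sum(axis=1)  # mask keeps person 2 -> 10
```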

Full example below:

Create fake data

df = pd.DataFrame({'person_1_status':np.random.randint(0, 2,1000) , 
                   'person_2_status':np.random.randint(0, 2,1000), 
                   'person_3_status':np.random.randint(0, 2,1000), 
                   'person_1_type':np.random.randint(4, 8,1000), 
                   'person_2_type':np.random.randint(4, 8,1000),
                   'person_3_type':np.random.randint(4, 8,1000)},
                 columns= ['person_1_status', 'person_2_status', 'person_3_status',
                           'person_1_type', 'person_2_type', 'person_3_type'])

 person_1_status  person_2_status  person_3_status  person_1_type  \
0                1                0                0              7   
1                0                1                0              6   
2                1                0                1              7   
3                0                0                0              7   
4                0                0                1              4   

   person_2_type  person_3_type  
0              5              5  
1              7              7  
2              7              7  
3              7              7  
4              7              7 

Make other_table

other_table = pd.Series({4:10, 5:20, 6:30, 7:40})

4    10
5    20
6    30
7    40
dtype: int64

Filter out status and type columns to their own dataframes

df_status = df.filter(like = 'status')
df_type = df.filter(like = 'type')

Make lookup table

df_type_lookup = df_type.applymap(lambda x: other_table.loc[x]).values

Apply element-wise multiplication of the status mask with the lookup values, then sum across rows.

df['A'] = np.sum((df_status == 0).values * df_type_lookup, axis=1)
df['B'] = np.sum((df_status == 1).values * df_type_lookup, axis=1)

Output

 person_1_status  person_2_status  person_3_status  person_1_type  \
0                0                0                1              7   
1                0                1                0              4   
2                0                1                1              7   
3                0                1                0              6   
4                0                0                1              5   

   person_2_type  person_3_type   A   B  
0              7              5  80  20  
1              6              4  20  30  
2              5              5  40  40  
3              6              4  40  30  
4              7              5  60  20  
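As a side note, the `applymap` lookup above calls a Python function once per cell. If that becomes a bottleneck, a fully vectorized alternative (my own sketch, not part of the original answer) is to push the type codes through `DataFrame.replace` with a dict in one pass:

```python
import numpy as np
import pandas as pd

# Small illustrative frame in the answer's layout (two people, two rows).
df = pd.DataFrame({'person_1_status': [1, 0], 'person_2_status': [0, 1],
                   'person_1_type':   [7, 4], 'person_2_type':   [5, 6]})
other_table = pd.Series({4: 10, 5: 20, 6: 30, 7: 40})

df_status = df.filter(like='status')
df_type = df.filter(like='type')

# One vectorized replace instead of a per-cell applymap.
df_type_lookup = df_type.replace(other_table.to_dict()).to_numpy()

df['A'] = np.sum((df_status == 0).to_numpy() * df_type_lookup, axis=1)
df['B'] = np.sum((df_status == 1).to_numpy() * df_type_lookup, axis=1)
```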
piRSquared

Consider the dataframe df:

mux = pd.MultiIndex.from_product([['status', 'type'], ['p%i' % i for i in range(1, 6)]])
data = np.concatenate([np.random.choice((0, 1), (10, 5)), np.random.rand(10, 5)], axis=1)
df = pd.DataFrame(data, columns=mux)
df

(screenshot of df omitted)

The way this is structured, we can do this for status == 1

df.status.mul(df.type).sum(1)

0    0.935290
1    1.252478
2    1.354461
3    1.399357
4    2.102277
5    1.589710
6    0.434147
7    2.553792
8    1.205599
9    1.022305
dtype: float64

and for status == 0

df.status.rsub(1).mul(df.type).sum(1)

0    1.867986
1    1.068045
2    0.653943
3    2.239459
4    0.214523
5    0.734449
6    1.291228
7    0.614539
8    0.849644
9    1.109086
dtype: float64

You can get your columns into this format using the following code:

df.columns = df.columns.str.split('_', expand=True)
df = df.swaplevel(0, 1, 1)
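Putting it together on the question's own example row (a sketch; the `.replace` lookup step is my own addition, since the answer leaves the type-to-value mapping implicit):

```python
import pandas as pd

df = pd.DataFrame({'person1_status': [0], 'person2_status': [1], 'person3_status': [0],
                   'person1_type':   [7], 'person2_type':   [4], 'person3_type':   [6]})

# Split names into a two-level column index: (status|type, person1..person3).
df.columns = df.columns.str.split('_', expand=True)
df = df.swaplevel(0, 1, axis=1)

# Map types to values, then apply the same mul/sum formulas as above.
values = df['type'].replace({4: 10, 5: 20, 6: 30, 7: 40})
A = df['status'].rsub(1).mul(values).sum(axis=1)  # status == 0 -> 40 + 30
B = df['status'].mul(values).sum(axis=1)          # status == 1 -> 10
```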