I am new to python but have been in the programming world for a while. I have already tried to do the following with if else loop using python dataframes and iloc and was successful. I want to use vectorization. The problem is that once a data changes its state based upon rule 1 (b < a) I want it to remain in that state as long as rule 2 (c < a) applies.

So the input and output will look like the following:

TABLE

I have tried numpy select and numpy where with shift but the compare happens upon the same column so I couldn't get it to work. It works the first time when the rule 2 applies not after that. since above is the sample problem I created from the one I have so that code is hard to provide

The Flag is what I expect but i don't get it

df is with loops and df1 is with numpy - I can't get it to look same

import pandas as pd
import numpy as np


df = pd.DataFrame({"a": [20,25,30,32,25,26,30,24,25,30,30,35,20,22,25,28,29,31,25,21], "b": [26,28,26,35,34,34,31,26,28,29,31,32,18,24,30,35,32,35,36,30], "c":[18,19,20,22,23,24,34,22,23,31,32,38,22,21,22,21,18,22,28,31]})

df1 = df

position=''
Flag = False
df =  df.assign(flag=Flag) 

for id1, id2 in zip(df.iterrows(), df.iloc[1:].iterrows()):

   if id1[1]['b'] < id1[1]['a'] :
        Flag=True
        position='rule1True' 
   elif ( id1[1]['c'] <  id1[1]['a'] and (position =='rule1True' or position == 'rule2True') ):
        Flag = True
        position = 'rule2True'
   else :
        Flag = False
        position = ''
   df.at[id1[0],'flag'] = Flag


print(df)

df1['rule1'] = np.select([df1['b'] < df1['a']],[True], default= False)
df1['rule2'] = np.select([( df1['rule1'].shift(1) & (df1['c'] < df1['a']))],[True], default= False)
df1['flag'] = np.select([( df1['rule1'] | df1['rule2'])],[True], default= False)

print(df1)

1 Answers

0
run-out On Best Solutions

My for loop solution was very slow. Here it is a proper vector solution, this works. Very fast.

mask1 = df['b'] < df['a']
mask2 = df['c'] < df['a']
mask3 = (mask1 == False) & (mask2 == False)

This True/False flag changes when mask1 mask3 alternate.

df.loc[mask1,'flag'] = True

df.loc[mask3,'flag'] = False

Use forward fill to fill in the NaNs.

df['flag'].ffill(inplace=True)

The first few rows might be Nan, they will be false. Fill them in with False

df['flag'] = df['flag'].fillna(False)

Tested on a million rows:

CPU times: user 917 ms, sys: 7.99 ms, total: 925 ms

Sample output:

     a   b   c   flag
0   22  21  37   True
1   21  31  30  False
2   25  23  32   True
3   34  36  22   True
4   31  36  30   True
5   20  32  36  False
6   25  24  24   True
7   24  20  29   True
8   36  36  22   True
9   36  24  25   True
10  22  24  20   True
11  22  24  27  False
12  31  37  26  False
13  37  24  22   True
14  28  22  26   True
15  27  27  32  False
16  26  32  36  False
17  32  37  30  False
18  28  37  36  False
19  37  22  24   True

Full code:

df = pd.DataFrame(np.random.randint(20, 38, [1000000,3]), columns=['a', 'b', 'c'])

mask1 = df['b'] < df['a']
mask2 = df['c'] < df['a']
mask3 = (mask1 == False) & (mask2 == False)

df.loc[mask1,'flag'] = True

df.loc[mask3,'flag'] = False

df['flag'].ffill(inplace=True)

df['flag'] = df['flag'].fillna(False)