I have two dataframes, df1 and df2. Both have the same IDs. In df2, one ID is in multiple rows (none or one of the columnB can have value 'a' and on every row, there is another value in `columnC).

df1 = pd.DataFrame({'ID': ['111.111', '222.222', '333.333','444.444], 'columnA': [np.nan,np.nan,np.nan,np.nan]}) # or columnA does not has to be defined yet
df2 = pd.DataFrame({'ID': ['111.111', '111.111','111.111', '222.222' ,'222.222', '333.333', '333.333', '444.444'],
'columnB': [np.nan,'a',np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'columnC':['0432', '0987','5460','0955','4356','6029','7329','9006']})

That looks like:

df1                   df2
ID       columnA       ID     columnB   columnC
111.111              111.111             0432
222.222              111.111     a       0987
333.333              111.111             5460
444.444              222.222             0955
                     222.222             4356
                     333.333             6029
                     333.333             7329
                     444.444             9006

My goal: If columnB == 'a' OR none of columnC isin(validValues), I want to put the value 'AA' in the columnA of df1 with the same ID where one of the conditions gave true.

I need an efficient way for this.

Let's say validValues = {'0987', '2638', '5460', '9386'}

expected df1:

ID      columnA
111.111
222.222    AA
333.333
444.444    AA

I was trying the following 'til now:

df2['temp']=df2['columnB']

tmp_df = df2.groupby('ID').temp.apply(lambda x: 'AA' if (x=='a').any() else ' ')


tmp_df=tmp_df.to_frame()

df1['columnA']=df1.merge(tmp_df, how='outer',
               left_on=df1.ID, 
               right_on=tmp_df.index).drop('key_0', axis=1)['temp']


df2['temp']=df2['columnc']
tmp_df = df2.groupby('ID').temp.apply(lambda x: 'AA' if (x.isin(validValues)).any() else ' ')
tmp_df=tmp_df.to_frame()

df1['columnA']=df1.merge(tmp_df, how='outer',
               left_on=df1.ID, 
               right_on=tmp_df.index).drop('key_0', axis=1)['temp']

One problem is, that the 2nd merge overrides the first one. The solution is bad at all.

Also, it would be nice to delete thoose IDs where 'AA' was putted in in df1 from df2.

1 Answers

0
Ben Pap On
df2IDs = df2[(df2['columnB'] == 'a') | (df2['columnC'].isin(validValues))][ID].tolist()

df1.loc[df1['ID'].isin(df2IDs), 'columnA'] = 'a'

1) First filter df2 for the where columnB is a or columnC is in your validvalues, look at the ID column and save it to a list.

2) Take that list, and look in df1 for where the ID's match, look at columnA and set it equal to 'a'.

EDIT

df2IDs = df2.groupby('ID').filter(lambda x:any(x.columnB == 'a') | all(np.logical_not(x.columnC.isin(validValues))))['ID']

df1.loc[df1['ID'].isin(df2IDs), 'columnA'] = 'a'

I misunderstood your question. The first statement will return the groups whenever any of the member contain 'a' or that all the member's are not in valid values.