Fill NaN if values in another column are identical

75 views Asked by At

I have the following dataframe:

Out[117]: mydata
  author           email           ri                   oi
0      X1             NaN          NaN  0000-0001-8437-498X
1      X2             NaN          NaN                  NaN
2      X3       [email protected]  K-5448-2012  0000-0001-8437-498X
3      X4      [email protected]          NaN  0000-0001-8437-498X
4      X5       [email protected]          NaN  0000-0001-8437-498X

where column ri represents an author's ResearcherID, and oi the ORCID. One author may has more than one email address, so column email has duplicates.

Firstly, I'm trying to fill na in ri if the corresponding rows in oi share the same value, using a non-NaN value in ri. The result I want is:

  author           email           ri                   oi
0      X1             NaN  K-5448-2012  0000-0001-8437-498X
1      X2             NaN          NaN                  NaN
2      X3       [email protected]  K-5448-2012  0000-0001-8437-498X
3      X4      [email protected]  K-5448-2012  0000-0001-8437-498X
4      X5       [email protected]  K-5448-2012  0000-0001-8437-498X

Secondly, merging emails and using the merged value to fill na in column email, if the values in ri (or oi) are identical. I want to get a dataframe like the following one:

  author                  email           ri                       oi
0      X1   [email protected];[email protected]  K-5448-2012           0000-0001-8437-498X
1      X2                    NaN          NaN                           NaN
2      X3   [email protected];[email protected]  K-5448-2012           0000-0001-8437-498X
3      X4   [email protected];[email protected]  K-5448-2012           0000-0001-8437-498X
4      X5   [email protected];[email protected]  K-5448-2012           0000-0001-8437-498X

I've tried the following code:

final_df = pd.DataFrame()
na_df = mydata[mydata.oi.isna()]
for i in set(mydata.oi.dropna()):
    fill_df = mydata[mydata.oi == i]
    fill_df.ri = fill_df.ri.fillna(method='ffill')
    fill_df.ri = fill_df.ri.fillna(method='bfill')
    null_df = pd.concat([null_df, fill_df])
final_df = pd.concat([final_df, na_df])

This code returned the one I want in the the frist step, but is there an elegent way to approach this? Furthermore, how to get the merged value in email and then use the merged value as an input in the process of filling na?

1

There are 1 answers

2
Andy L. On

Try 2 transform. One for each column. On ri, use first. On email, use combination of dropna, unique, and join

g = df.dropna(subset=['oi']).groupby('oi')
df['ri']    = g.ri.transform('first')
df['email'] = g.email.transform(lambda x: ';'.join(x.dropna().unique()))

Out[79]:
  author                 email           ri                   oi
0     X1  [email protected];[email protected]  K-5448-2012  0000-0001-8437-498X
1     X2                   NaN          NaN                  NaN
2     X3  [email protected];[email protected]  K-5448-2012  0000-0001-8437-498X
3     X4  [email protected];[email protected]  K-5448-2012  0000-0001-8437-498X
4     X5  [email protected];[email protected]  K-5448-2012  0000-0001-8437-498X