Pandas Boolean Filter with Assignment resulting in NaN

I'm curious why this toy example for simultaneous boolean index + assignment in Pandas doesn't work:

import pandas as pd

df = pd.DataFrame({'Source': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'Period': ['1 hr', '1 hr', '1 hr', '24 hr', '24 hr', '24 hr'],
                   'CO': [1.1, 1.2, 1.3, 2.1, 2.2, 2.3],
                   'DPM': [11.1, 11.2, 11.3, 12.1, 12.2, 12.3],
                   'NOx': [21.1, 21.2, 21.3, 22.1, 22.2, 22.3]})

The resulting toy DataFrame is here:

  Source Period   CO   DPM   NOx
0      A   1 hr  1.1  11.1  21.1
1      B   1 hr  1.2  11.2  21.2
2      C   1 hr  1.3  11.3  21.3
3      A  24 hr  2.1  12.1  22.1
4      B  24 hr  2.2  12.2  22.2
5      C  24 hr  2.3  12.3  22.3

Now, I want to take the 24 hr values and assign them to the 1 hr rows for sources A and B. The final DataFrame should look like this:

  Source Period   CO   DPM   NOx
0      A   1 hr  2.1  12.1  22.1
1      B   1 hr  2.2  12.2  22.2
2      C   1 hr  1.3  11.3  21.3
3      A  24 hr  2.1  12.1  22.1
4      B  24 hr  2.2  12.2  22.2
5      C  24 hr  2.3  12.3  22.3

I tried the following command:

df.loc[df['Source'].isin(['A', 'B']) & (df['Period'] == '1 hr'), ['CO', 'DPM', 'NOx']] = \
    df.loc[df['Source'].isin(['A', 'B']) & (df['Period'] == '24 hr'), ['CO', 'DPM', 'NOx']]

but afterwards the targeted values in my DataFrame have been replaced with NaNs:

  Source Period   CO   DPM   NOx
0      A   1 hr  NaN   NaN   NaN
1      B   1 hr  NaN   NaN   NaN
2      C   1 hr  1.3  11.3  21.3
3      A  24 hr  2.1  12.1  22.1
4      B  24 hr  2.2  12.2  22.2
5      C  24 hr  2.3  12.3  22.3

The filter expressions on both the LHS and RHS of the assignment select the correct rows, and both select the same number of rows, so it seems the assignment itself is where things go wrong. How do I do this correctly? Note that I only want the CO, DPM, and NOx values to change, not any other columns.

There is 1 answer

AudioBubble On BEST ANSWER

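The problem is that the indexes don't match. When you assign a DataFrame through .loc, pandas aligns the right-hand side to the left-hand side by index label, not by position. The 1 hr rows have labels 0 and 1, while the 24 hr rows have labels 3 and 4, so no labels overlap and the targeted cells are filled with NaN. You can get around that issue by using the underlying numpy array, which has no index and is therefore assigned by position: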

msk = (df['Period'] == '24 hr')
cols = ['CO', 'DPM', 'NOx']
df.loc[~msk & df['Source'].isin(['A','B']), cols] = df.loc[msk & df['Source'].isin(['A','B']), cols].to_numpy()

Output:

  Source Period   CO   DPM   NOx
0      A   1 hr  2.1  12.1  22.1
1      B   1 hr  2.2  12.2  22.2
2      C   1 hr  1.3  11.3  21.3
3      A  24 hr  2.1  12.1  22.1
4      B  24 hr  2.2  12.2  22.2
5      C  24 hr  2.3  12.3  22.3

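Note that this only works as you expect if there is a one-to-one relation between "1 hr" and "24 hr" for each "Source" type, and if the sources appear in the same order in both sets of rows, because the values are assigned purely by position.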
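If the row order cannot be relied on, one possible variation is to look the 24 hr values up by Source before dropping the index. This is only a sketch, not part of the original answer: the names `targets` and `lookup` are made up for illustration, and it assumes each source has exactly one 24 hr row.

```python
import pandas as pd

df = pd.DataFrame({'Source': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'Period': ['1 hr', '1 hr', '1 hr', '24 hr', '24 hr', '24 hr'],
                   'CO': [1.1, 1.2, 1.3, 2.1, 2.2, 2.3],
                   'DPM': [11.1, 11.2, 11.3, 12.1, 12.2, 12.3],
                   'NOx': [21.1, 21.2, 21.3, 22.1, 22.2, 22.3]})

cols = ['CO', 'DPM', 'NOx']

# Rows to overwrite: the 1 hr rows for sources A and B.
targets = df['Source'].isin(['A', 'B']) & (df['Period'] == '1 hr')

# 24 hr values keyed by Source (assumes one 24 hr row per source).
lookup = df.loc[df['Period'] == '24 hr'].set_index('Source')[cols]

# Pick the 24 hr row matching each target row's Source, then strip the
# index with to_numpy() so the assignment is positional.
wanted = df.loc[targets, 'Source'].to_numpy()
df.loc[targets, cols] = lookup.loc[wanted].to_numpy()

print(df)
```

Because the values are fetched per Source, this does not depend on the 1 hr and 24 hr rows being sorted the same way. It raises a KeyError if a targeted source has no 24 hr row.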

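You could also use groupby + transform('last'). The result keeps the original index, so the assignment aligns correctly, but it assumes the 24 hr row is the last row within each Source group:

cols = ['CO', 'DPM', 'NOx']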
filt = df['Source'].isin(['A','B'])
df.loc[filt, cols] = df[filt].groupby('Source')[cols].transform('last')
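
To see the index mismatch behind the NaNs directly, it may help to compare the index labels of the two selections from the question. This is a small illustrative sketch: the names `in_ab`, `lhs`, `rhs`, and `aligned` are made up here, and `reindex` is used only to mimic the label alignment that happens during the assignment.

```python
import pandas as pd

df = pd.DataFrame({'Source': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'Period': ['1 hr', '1 hr', '1 hr', '24 hr', '24 hr', '24 hr'],
                   'CO': [1.1, 1.2, 1.3, 2.1, 2.2, 2.3],
                   'DPM': [11.1, 11.2, 11.3, 12.1, 12.2, 12.3],
                   'NOx': [21.1, 21.2, 21.3, 22.1, 22.2, 22.3]})

cols = ['CO', 'DPM', 'NOx']
in_ab = df['Source'].isin(['A', 'B'])

lhs = df.loc[in_ab & (df['Period'] == '1 hr'), cols]   # rows being written to
rhs = df.loc[in_ab & (df['Period'] == '24 hr'), cols]  # rows being read from

print(lhs.index.tolist())  # [0, 1]
print(rhs.index.tolist())  # [3, 4]

# Aligning rhs to the labels of lhs finds no matching rows,
# so every cell comes back as NaN.
aligned = rhs.reindex(lhs.index)
print(aligned.isna().all().all())  # True
```

Both selections have the same shape, which is why the filters look correct, but the labels never overlap.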