Pandas data masking where the conditions come from other variables

80 views Asked by At

I have a dataframe and two lists as below:

seller1 = [5, 4, 3]
seller2 = [4, 2, 1] 

df = {'customer': [1, 1, 1, 2, 2, 2], 'time': [1,2,3,1,2,3], 'location': [3,4,2,4,3,3], 'demand':[10,12,15,20,8,16], 'price':[3,4,4,5,2,1]}
df = pd.DataFrame(df)

Which results in the following table:

   customer  time    location  demand price   
0   1          1        3        10     3   
1   1          2        4        12     4
2   1          3        2        15     4            
3   2          1        4        20     5 
4   2          2        3         8     2 
5   2          3        3        16     1 

The seller1 and seller2 lists show where the sellers are at time 1,2, and 3. I want to know the demand and the price if one of the sellers is there at the exact time and mask the demand data otherwise. For example, at time 1, seller one is at location 5 and seller 2 is at location 4. Likewise, customer 1 is at location 3 and customer 2 is at location 4. So, the sellers meet the first customer but not the second at t=1.

The end table I want to have is

   customer  time    location  demand  price   
0   1          1        3        None    None   
1   1          2        4        12      4
2   1          3        2        None    None            
3   2          1        4        20      5 
4   2          2        3        None    None 
5   2          3        3        16      1 

So far, I have

for i in range(df.shape[0]):
    if df["location"][i] != seller1[int(df["time"][i])-1] and df["location"][i] != seller2[int(df["time"][i])-1]:
        df["demand"][i] = np.nan
        df["price"][i] = np.nan

This is producing a SettingWithCopyWarning: and it doesn't look efficient with the for loop, either.

Is there a way to do this with df.mask()?

2

There are 2 answers

0
Shubham Sharma On BEST ANSWER

Annotated Code

# Create tuples corresponding to time and location for each seller
c = ['time', 'location']
sellers = [(i, x) 
           for s in (seller1, seller2) 
           for i, x in enumerate(s, 1)]

# Identify the rows where tuple pairs match with
# the time and location from the given dataframe
mask = df.set_index(c).index.isin(sellers)

# Mask the rows where codition doesn;t hold true
c = ['demand', 'price']
df.loc[~mask, c] = np.nan

Result

   customer  time  location  demand  price
0         1     1         3     NaN    NaN
1         1     2         4    12.0    4.0
2         1     3         2     NaN    NaN
3         2     1         4    20.0    5.0
4         2     2         3     NaN    NaN
5         2     3         3    16.0    1.0
0
jjsantoso On

It looks to me that it's better if you make the sellers list into a DataFrame and merge it to the main df, using time variable, so you can compare location of customer and sellers.

This should work:

import pandas as pd

seller1 = [5, 4, 3]
seller2 = [4, 2, 1]

data = {'customer': [1, 1, 1, 2, 2, 2], 'time': [1,2,3,1,2,3], 'location': [3,4,2,4,3,3], 'demand':[10,12,15,20,8,16], 'price':[3,4,4,5,2,1]}
df = pd.DataFrame(data)
sellers = pd.DataFrame(dict(seller1=seller1, seller2=seller2, time=range(1, len(seller1) + 1)))
df_sellers = df.merge(sellers, on='time').sort_values(['customer', 'time'])
match_sellers = (df_sellers['location'] == df_sellers['seller1']) | (df_sellers['location'] == df_sellers['seller2'])
df_sellers.loc[match_sellers, 'price_match'] = df_sellers.loc[match_sellers, 'price']

resulting dataframe