Pandas data masking where the conditions come from other variables

Question

Pandas data masking where the conditions come from other variables

80 views Asked by A Doe At 21 March 2023 at 03:22

I have a dataframe and two lists as below:

seller1 = [5, 4, 3]
seller2 = [4, 2, 1] 

df = {'customer': [1, 1, 1, 2, 2, 2], 'time': [1,2,3,1,2,3], 'location': [3,4,2,4,3,3], 'demand':[10,12,15,20,8,16], 'price':[3,4,4,5,2,1]}
df = pd.DataFrame(df)

Which results in the following table:

   customer  time    location  demand price   
0   1          1        3        10     3   
1   1          2        4        12     4
2   1          3        2        15     4            
3   2          1        4        20     5 
4   2          2        3         8     2 
5   2          3        3        16     1

The seller1 and seller2 lists show where the sellers are at time 1,2, and 3. I want to know the demand and the price if one of the sellers is there at the exact time and mask the demand data otherwise. For example, at time 1, seller one is at location 5 and seller 2 is at location 4. Likewise, customer 1 is at location 3 and customer 2 is at location 4. So, the sellers meet the first customer but not the second at t=1.

The end table I want to have is

   customer  time    location  demand  price   
0   1          1        3        None    None   
1   1          2        4        12      4
2   1          3        2        None    None            
3   2          1        4        20      5 
4   2          2        3        None    None 
5   2          3        3        16      1

So far, I have

for i in range(df.shape[0]):
    if df["location"][i] != seller1[int(df["time"][i])-1] and df["location"][i] != seller2[int(df["time"][i])-1]:
        df["demand"][i] = np.nan
        df["price"][i] = np.nan

This is producing a SettingWithCopyWarning: and it doesn't look efficient with the for loop, either.

Is there a way to do this with df.mask()?

Original Q&A

There are 2 answers

jjsantoso On 21 March 2023 at 04:12

It looks to me that it's better if you make the sellers list into a DataFrame and merge it to the main df, using time variable, so you can compare location of customer and sellers.

This should work:

import pandas as pd

seller1 = [5, 4, 3]
seller2 = [4, 2, 1]

data = {'customer': [1, 1, 1, 2, 2, 2], 'time': [1,2,3,1,2,3], 'location': [3,4,2,4,3,3], 'demand':[10,12,15,20,8,16], 'price':[3,4,4,5,2,1]}
df = pd.DataFrame(data)
sellers = pd.DataFrame(dict(seller1=seller1, seller2=seller2, time=range(1, len(seller1) + 1)))
df_sellers = df.merge(sellers, on='time').sort_values(['customer', 'time'])
match_sellers = (df_sellers['location'] == df_sellers['seller1']) | (df_sellers['location'] == df_sellers['seller2'])
df_sellers.loc[match_sellers, 'price_match'] = df_sellers.loc[match_sellers, 'price']

**Shubham Sharma** · Accepted Answer · 2023-03-21T04:26:44+00:00

Annotated Code

# Create tuples corresponding to time and location for each seller
c = ['time', 'location']
sellers = [(i, x) 
           for s in (seller1, seller2) 
           for i, x in enumerate(s, 1)]

# Identify the rows where tuple pairs match with
# the time and location from the given dataframe
mask = df.set_index(c).index.isin(sellers)

# Mask the rows where codition doesn;t hold true
c = ['demand', 'price']
df.loc[~mask, c] = np.nan

Result

   customer  time  location  demand  price
0         1     1         3     NaN    NaN
1         1     2         4    12.0    4.0
2         1     3         2     NaN    NaN
3         2     1         4    20.0    5.0
4         2     2         3     NaN    NaN
5         2     3         3    16.0    1.0

TechQA.

Pandas data masking where the conditions come from other variables

There are 2 answers

Annotated Code

Related Questions in PYTHON

Related Questions in PANDAS

Related Questions in DATAFRAME

Related Questions in DATA-MASKING

Popular Questions

Trending Questions