Sampling with conditions using Pandas


Here is my df:

scenario    month   id type
       A  2023-01  A01   HR
       A  2023-02  A02   LR
       A  2023-04  A04   HR
       A  2023-04  A06   HR
       B  2023-01  B01   LR
       B  2023-02  B02   LR
       B  2023-03  B03   HR
       B  2023-03  B04   LR
       B  2023-03  B05   HR
       B  2023-03  B06   HR
       B  2023-04  B07   HR

and the required sample size per scenario (df2):

scenario  sample_num
       A           2
       B           4
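
For a reproducible setup, the two frames can be built like this (a minimal sketch; the variable names df and df2 match those used in the answer below):

import pandas as pd

# main data
df = pd.DataFrame({
    'scenario': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
    'month':    ['2023-01', '2023-02', '2023-04', '2023-04', '2023-01', '2023-02',
                 '2023-03', '2023-03', '2023-03', '2023-03', '2023-04'],
    'id':       ['A01', 'A02', 'A04', 'A06', 'B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07'],
    'type':     ['HR', 'LR', 'HR', 'HR', 'LR', 'LR', 'HR', 'LR', 'HR', 'HR', 'HR'],
})

# required sample size per scenario
df2 = pd.DataFrame({'scenario': ['A', 'B'], 'sample_num': [2, 4]})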

I want to take samples per scenario, and the number of samples drawn from each 'month' and each 'type' should be equal (or close to equal).

If the required sample size is smaller than the number of unique values of 'month', then 'month' doesn't matter as long as the condition on 'type' is met.

The desired result should be like this:

scenario    month   id type
       A  2023-01  A01   HR
       A  2023-02  A02   LR
       B  2023-01  B01   LR
       B  2023-02  B02   LR
       B  2023-03  B03   HR
       B  2023-04  B07   HR

I have thought of many solutions, but none really solves the problem.

1 Answer

Answered by mozway:

general logic

Make a mapping Series, and use groupby.apply and sample:

# map each scenario to its required sample size
mapper = df2.set_index('scenario')['sample_num']
# draw sample_num random rows within each scenario
out = (df.groupby('scenario', group_keys=False)
         .apply(lambda g: g.sample(n=mapper[g.name]))  # g.name is the scenario label
      )

Or by shuffling the whole DataFrame, then selecting with a groupby.cumcount:

mapper = df2.set_index('scenario')['sample_num']

out = (
 df.sample(frac=1)   # shuffle all rows
   # within each scenario, keep only the first sample_num rows of the shuffled order
   .loc[lambda d: d.groupby('scenario').cumcount().lt(d['scenario'].map(mapper))]
)

Example output:

   scenario    month   id type
2         A  2023-04  A04   HR
3         A  2023-04  A06   HR
10        B  2023-04  B07   HR
6         B  2023-03  B03   HR
7         B  2023-03  B04   LR
5         B  2023-02  B02   LR
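
Note that sample draws rows at random, so each run can return a different selection (as in the example above). For reproducible results, DataFrame.sample accepts a random_state seed; for example, seeding the first approach:

out = (df.groupby('scenario', group_keys=False)
         .apply(lambda g: g.sample(n=mapper[g.name], random_state=0))
      )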

taking into account the type

Now keep the same logic, but group by type as well and sample sample_num//n rows per type, where n is the number of types:

n_types = df['type'].nunique()
# split the per-scenario quota evenly across the types
mapper = df2.set_index('scenario')['sample_num']//n_types

out = (df.groupby(['scenario', 'type'], group_keys=False)
         .apply(lambda g: g.sample(n=mapper[g.name[0]]))  # g.name is (scenario, type)
       )

Or:

n_types = df['type'].nunique()
mapper = df2.set_index('scenario')['sample_num']//n_types

(df.sample(frac=1)
   .loc[lambda d: d.groupby(['scenario', 'type'])
                   .cumcount().lt(d['scenario'].map(mapper))]
)

Or with a custom function, if the number of types should be computed independently within each scenario group:

mapper = df2.set_index('scenario')['sample_num']

def sample_equal(group, n):
    # number of distinct types within this scenario's group
    n_types = group['type'].nunique()
    # share the quota n evenly across those types
    return group.groupby('type').sample(n=n//n_types)

out = (df.groupby('scenario', group_keys=False)
         .apply(lambda g: sample_equal(g, n=mapper[g.name]))
      )

Example output:

  scenario    month   id type
2        A  2023-04  A04   HR
1        A  2023-02  A02   LR
6        B  2023-03  B03   HR
8        B  2023-03  B05   HR
7        B  2023-03  B04   LR
4        B  2023-01  B01   LR
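
To sanity-check the balance, you can count the sampled rows per scenario and type; for the example output above this gives one row per type for scenario A and two per type for scenario B:

out.groupby(['scenario', 'type']).size()
# scenario  type
# A         HR      1
#           LR      1
# B         HR      2
#           LR      2
# dtype: int64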