Sampling with conditions using Pandas


Here is my df:

scenario    month   id type
       A  2023-01  A01   HR
       A  2023-02  A02   LR
       A  2023-04  A04   HR
       A  2023-04  A06   HR
       B  2023-01  B01   LR
       B  2023-02  B02   LR
       B  2023-03  B03   HR
       B  2023-03  B04   LR
       B  2023-03  B05   HR
       B  2023-03  B06   HR
       B  2023-04  B07   HR

and the required sample size per scenario (df2):

scenario  sample_num
       A           2
       B           4
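
For a reproducible setup, the two frames can be built like this (a minimal sketch; the variable names df and df2 match those used in the answer below):

import pandas as pd

# main data
df = pd.DataFrame({
    'scenario': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
    'month':    ['2023-01', '2023-02', '2023-04', '2023-04', '2023-01', '2023-02',
                 '2023-03', '2023-03', '2023-03', '2023-03', '2023-04'],
    'id':       ['A01', 'A02', 'A04', 'A06', 'B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07'],
    'type':     ['HR', 'LR', 'HR', 'HR', 'LR', 'LR', 'HR', 'LR', 'HR', 'HR', 'HR'],
})

# required sample size per scenario
df2 = pd.DataFrame({'scenario': ['A', 'B'], 'sample_num': [2, 4]})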

I want to take samples per scenario, and the number of samples drawn from each 'month' and each 'type' should be equal (or close to equal).

If the required sample size is smaller than the number of unique values of 'month', then 'month' doesn't matter as long as the condition on 'type' is met.

The desired result should be like this:

scenario    month   id type
       A  2023-01  A01   HR
       A  2023-02  A02   LR
       B  2023-01  B01   LR
       B  2023-02  B02   LR
       B  2023-03  B03   HR
       B  2023-04  B07   HR

I have thought of many solutions, but none really solves the problem.

1 Answer

Answered by mozway:

general logic

Make a mapping Series, and use groupby.apply and sample:

# map each scenario to its required sample size
mapper = df2.set_index('scenario')['sample_num']
# draw sample_num random rows within each scenario
out = (df.groupby('scenario', group_keys=False)
         .apply(lambda g: g.sample(n=mapper[g.name]))  # g.name is the scenario label
      )

Or by shuffling the whole DataFrame, then selecting with a groupby.cumcount:

mapper = df2.set_index('scenario')['sample_num']

out = (
 df.sample(frac=1)   # shuffle all rows
   # within each scenario, keep only the first sample_num rows of the shuffled order
   .loc[lambda d: d.groupby('scenario').cumcount().lt(d['scenario'].map(mapper))]
)

Example output:

   scenario    month   id type
2         A  2023-04  A04   HR
3         A  2023-04  A06   HR
10        B  2023-04  B07   HR
6         B  2023-03  B03   HR
7         B  2023-03  B04   LR
5         B  2023-02  B02   LR
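
Note that sample draws rows at random, so each run can return a different selection (as in the example above). For reproducible results, DataFrame.sample accepts a random_state seed; for example, seeding the first approach:

out = (df.groupby('scenario', group_keys=False)
         .apply(lambda g: g.sample(n=mapper[g.name], random_state=0))
      )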

taking into account the type

Now keep the same logic, but group by type as well and sample sample_num//n rows per type, where n is the number of types:

n_types = df['type'].nunique()
# split the per-scenario quota evenly across the types
mapper = df2.set_index('scenario')['sample_num']//n_types

out = (df.groupby(['scenario', 'type'], group_keys=False)
         .apply(lambda g: g.sample(n=mapper[g.name[0]]))  # g.name is (scenario, type)
       )

Or:

n_types = df['type'].nunique()
mapper = df2.set_index('scenario')['sample_num']//n_types

(df.sample(frac=1)
   .loc[lambda d: d.groupby(['scenario', 'type'])
                   .cumcount().lt(d['scenario'].map(mapper))]
)

Or with a custom function, if the number of types should be computed independently within each scenario group:

mapper = df2.set_index('scenario')['sample_num']

def sample_equal(group, n):
    # number of distinct types within this scenario's group
    n_types = group['type'].nunique()
    # share the quota n evenly across those types
    return group.groupby('type').sample(n=n//n_types)

out = (df.groupby('scenario', group_keys=False)
         .apply(lambda g: sample_equal(g, n=mapper[g.name]))
      )

Example output:

  scenario    month   id type
2        A  2023-04  A04   HR
1        A  2023-02  A02   LR
6        B  2023-03  B03   HR
8        B  2023-03  B05   HR
7        B  2023-03  B04   LR
4        B  2023-01  B01   LR
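
To sanity-check the balance, you can count the sampled rows per scenario and type; for the example output above this gives one row per type for scenario A and two per type for scenario B:

out.groupby(['scenario', 'type']).size()
# scenario  type
# A         HR      1
#           LR      1
# B         HR      2
#           LR      2
# dtype: int64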