Here is my df:
| scenario | month | id | type |
|---|---|---|---|
| A | 2023-01 | A01 | HR |
| A | 2023-02 | A02 | LR |
| A | 2023-04 | A04 | HR |
| A | 2023-04 | A06 | HR |
| B | 2023-01 | B01 | LR |
| B | 2023-02 | B02 | LR |
| B | 2023-03 | B03 | HR |
| B | 2023-03 | B04 | LR |
| B | 2023-03 | B05 | HR |
| B | 2023-03 | B06 | HR |
| B | 2023-04 | B07 | HR |
And a second df with the required sample size per scenario:

| scenario | sample_num |
|---|---|
| A | 2 |
| B | 4 |
I want to take samples per scenario, such that the number of samples drawn from each 'month' and each 'type' is equal (or as close to equal as possible).
If the required sample size is smaller than the number of unique values of 'month', then 'month' doesn't matter as long as the condition on 'type' is met.
The desired result should be like this:
| scenario | month | id | type |
|---|---|---|---|
| A | 2023-01 | A01 | HR |
| A | 2023-02 | A02 | LR |
| B | 2023-01 | B01 | LR |
| B | 2023-02 | B02 | LR |
| B | 2023-03 | B03 | HR |
| B | 2023-04 | B07 | HR |
I have thought of many solutions, but none really solves the problem.
## general logic
Make a mapping Series, and use `groupby.apply` and `sample`:
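A minimal sketch of that approach (the frame names `df` and `sizes` are placeholders; the data is the one from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'scenario': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
    'month': ['2023-01', '2023-02', '2023-04', '2023-04', '2023-01', '2023-02',
              '2023-03', '2023-03', '2023-03', '2023-03', '2023-04'],
    'id': ['A01', 'A02', 'A04', 'A06', 'B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07'],
    'type': ['HR', 'LR', 'HR', 'HR', 'LR', 'LR', 'HR', 'LR', 'HR', 'HR', 'HR'],
})
sizes = pd.DataFrame({'scenario': ['A', 'B'], 'sample_num': [2, 4]})

# Mapping Series: scenario -> required sample size.
sample_sizes = sizes.set_index('scenario')['sample_num']

# Sample each scenario's quota (capped at the group size).
out = (
    df.groupby('scenario', group_keys=False)
      .apply(lambda g: g.sample(min(len(g), sample_sizes[g.name]),
                                random_state=0))
)
```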
Or by shuffling the whole DataFrame, then selecting with a `groupby.cumcount`:
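A sketch of the shuffle-and-cumcount variant, reusing `df` and `sample_sizes` from above:

```python
# Shuffle all rows, keep the first `sample_num` rows of each scenario,
# then restore the original row order.
out = (
    df.sample(frac=1, random_state=0)
      .loc[lambda d: d.groupby('scenario').cumcount()
                     < d['scenario'].map(sample_sizes)]
      .sort_index()
)
```

Example output (one possible draw; the exact rows depend on the random seed):

| scenario | month | id | type |
|---|---|---|---|
| A | 2023-01 | A01 | HR |
| A | 2023-04 | A06 | HR |
| B | 2023-01 | B01 | LR |
| B | 2023-03 | B04 | LR |
| B | 2023-03 | B05 | HR |
| B | 2023-04 | B07 | HR |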
## taking into account the type
Now keep the same logic, but give each type 1/n of the scenario's quota, where n is the number of types:
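One way to sketch that, splitting each scenario's quota evenly over the global number of types (same placeholder names as above):

```python
n_types = df['type'].nunique()  # the "n" from the sentence above

out = (
    df.groupby(['scenario', 'type'], group_keys=False)
      .apply(lambda g: g.sample(min(len(g),  # g.name is the (scenario, type) tuple
                                    round(sample_sizes[g.name[0]] / n_types)),
                                random_state=0))
)
```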
Or:
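Possibly the shuffle-and-cumcount form again, this time counting rows per `(scenario, type)` pair (a guess at the intended alternative):

```python
out = (
    df.sample(frac=1, random_state=0)
      .loc[lambda d: d.groupby(['scenario', 'type']).cumcount()
                     < d['scenario'].map(sample_sizes) / n_types]
      .sort_index()
)
```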
Or with a custom function if the types should be considered independently per group:
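A sketch in which the type count is computed within each scenario, so a scenario that lacks some types still spreads its full quota over the types it does have (`sample_by_type` is a hypothetical helper name):

```python
def sample_by_type(g):
    # Number of types present in *this* scenario only.
    k = g['type'].nunique()
    target = max(1, round(sample_sizes[g.name] / k))
    return (g.groupby('type', group_keys=False)
             .apply(lambda t: t.sample(min(len(t), target), random_state=0)))

out = df.groupby('scenario', group_keys=False).apply(sample_by_type)
```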
Example output (one possible draw; with a suitable seed it matches the desired result above):
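| scenario | month | id | type |
|---|---|---|---|
| A | 2023-01 | A01 | HR |
| A | 2023-02 | A02 | LR |
| B | 2023-01 | B01 | LR |
| B | 2023-02 | B02 | LR |
| B | 2023-03 | B03 | HR |
| B | 2023-04 | B07 | HR |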