Downsample a pandas dataframe keeping same proportion of target in every month

65 views Asked by At

I have a pandas dataframe df with column 'TARGET' which takes values of 0 or 1 and column 'MONTH' which collects different months:

MONTH #_OBS_TARGET=0 #_OBS_TARGET=1
202207 44619 52960
202208 48093 55399
202209 50161 56528

I want to downsample my dataframe to have the same number of observations with TARGET = 0 and TARGET = 1 for every value of MONTH:

MONTH #_OBS_TARGET=0 #_OBS_TARGET=1
202207 44619 44619
202208 48093 48093
202209 50161 50161

I tred the following

for m in df['MONTH'].unique():
    number_of_ones = len(df[(df['MONTH']==m) & (df['TARGET']==1)])
    number_of_zeros = len(df[(df['MONTH']==m) & (df['TARGET']==0)])
    n_obs_to_drop = number_of_ones - number_of_zeros 
    df[df['MONTH']==m].drop(df[(df['MONTH']==m) & (df['TARGET']==1)].sample(n_obs_to_drop).index, inplace = True)

But clearly it is not deleting anything and I get the following warning (EDIT2):

/opt/conda/envs/librerias_cbi/lib/python3.9/site-packages/pandas/core/frame.py:4901: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(

How should I get it? Also a diffent approach is welcomed.

Notice that there are duplicate values of the index in diffent values of MONTH. There are also many more columns in df which should be kept in the downsampled dataframe.

EDIT1:

I'm adding a reproducible example

import pandas as pd
data = {
"MONTH": [202207, 202207, 202207, 202207, 202208, 202208, 202208, 202209, 202209, 202209, 202209],
"TARGET": [1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0],
"other_column1": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110],  # Example additional columns
"other_column2": [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100]
}
    
df = pd.DataFrame(data)

pd.crosstab(df['MONTH'],df['TARGET'])

TARGET  0   1
MONTH       
202207  1   3
202208  1   2
202209  2   2
2

There are 2 answers

2
rpanai On

Not sure if this is the most elegant way but I'll go for

Get the number of sample for each month

grp = pd.crosstab(df['MONTH'],df['TARGET'])\
    .min(1)\
    .reset_index(name='size')
    MONTH  size
0  202207     1
1  202208     1
2  202209     2

Merge with original df

df1 = pd.merge(df, grp)

Take a sample using size defined before

df2 = df1.groupby(['MONTH', 'TARGET'])\
    .apply(lambda x: x.sample(n=x['size'].iloc[0]))\
    .reset_index(drop=True)
0
Ale On

I guess the problem was applying the .drop() to each slice of df, or something related to duplicate values in the index (for different values of MONTH). I solved the problem this way:

df_temp = pd.DataFrame() # create an empty df that will be filled in each iteration of the loop

for mes in df['MONTH'].unique():
    df_slice = df[df["MONTH"]==mes].copy() # create a temporary df where the index is unique
    
    number_of_ones = len(df_slice[df_slice['TARGET']==1])
    number_of_zeros = len(df_slice[df_slice['TARGET']==0])
    n_obs_to_drop = number_of_ones - number_of_zeros
   
    df_slice.drop(df_slice[df_slice['TARGET']==1].sample(n_obs_to_drop).index, inplace = True) # drop rows only for the partition
    
    df_temp = pd.concat([df_temp,df_slice]) # concatenate the partitions

df = df_temp
del df_temp
del df_slice