How to generate random categorical data from existing ones to fill missing values - Python

1.6k views Asked by At

I have a column with missing categorical data and I am trying to replace them by existing categorical variables from the same column.

I do not want to use the mode because I have too many missing data, it will skew the data and I would rather not drop the rows with missing data.

I think the ideal way would be to get the proportion of each variables for my column and then replace the missing proportionally by the existing categorical variables.

Example dataframe:

   ClientId    Apple_cat    Region    Price
0  21          cat_1        Reg_A     5
1  15          cat_2        Nan       6
2  6           Nan          Reg_B     7
3  91          cat_3        Reg_A     3
4  45          Nan          Reg_C     7
5  89          cat_2        Nan       6

Note: Ideally, I'd like to avoid hardcoding each category and region name.

3

There are 3 answers

4
Cameron Riddell On BEST ANSWER

You can roll your own function for a neat and vectorized method to solving this:

def na_randomfill(series):
    na_mask = pd.isnull(series)   # boolean mask for null values
    n_null = na_mask.sum()        # number of nulls in the Series
    
    if n_null == 0:
        return series             # if there are no nulls, no need to resample
    
    # Randomly sample the non-null values from our series
    #  only sample this Series as many times as we have nulls 
    fill_values = series[~na_mask].sample(n=n_null, replace=True, random_state=0)

    # This ensures our new values will replace NaNs in the correct locations
    fill_values.index = series.index[na_mask]
    
    return series.fillna(fill_values) 

This solution works on 1 Series at time and can be called like so:

out = na_randomfill(df["Apple_cat"])

print(out)
0    cat_1
1    cat_2
2    cat_3
3    cat_3
4    cat_2
5    cat_2
Name: Apple_cat, dtype: object

Alternatively you can use apply to call it on each of your columns. Note that because of the if statement in our function, we do not need to specify null-containing columns in advance before calling apply:

out = df.apply(na_randomfill)

print(out)
   ClientId Apple_cat Region  Price
0        21     cat_1  Reg_A      5
1        15     cat_2  Reg_A      6
2         6     cat_3  Reg_B      7
3        91     cat_3  Reg_A      3
4        45     cat_2  Reg_C      7
5        89     cat_2  Reg_C      6
0
HotMailRob On

The hard part of replacing NaN in a pandas Dataframe is that the fillna() method will replace all the nans with the same number, even if you put some kind of randomness in the call.

import random
df['Apple_cat'].applymap(lambda x: x if not np.isnan(x) else np.random.choice(random.choice(list(x.dropna(axis=0))))

This will replace the Nans with a random sample from the rest of th column. Maybe you will need to replace np.nan by the other type of nan.

2
ATIF ADIB On

You can fill the missing values based on the probability distribution of the filled rows.

import numpy as np

df[‘<your_column_name>’] = df[‘<your_column_name>’].fillna(‘TBD’)
possible_values = df[‘<your_column_name>’].value_counts().to_dict()

possible_values.pop(‘TBD’)
total_items = sum(possible_values.keys())
possible_values = [(k,v) for k,v in possible_values.items()]
prob_dist = [i[1]/total_items for i in possible_values]

def fill_missing_values(item):
    if item != ‘TBD’:
        index = np.random.choice(np.arange(len(prob_dist), p=prob_dist)
        return possible_values[index]
    return item

df[‘<your_column_name>’] = df[‘<your_column_name>’].apply(lambda x: fill_missing_values(x))