I have a column with missing categorical data, and I am trying to replace the missing entries with existing categories from the same column.
I do not want to use the mode: with so many missing values it would skew the data, and I would rather not drop the rows with missing data.
I think the ideal way would be to get the proportion of each category in the column and then fill the missing values proportionally from the existing categories.
Example dataframe:
   ClientId Apple_cat Region  Price
0        21     cat_1  Reg_A      5
1        15     cat_2    Nan      6
2         6       Nan  Reg_B      7
3        91     cat_3  Reg_A      3
4        45       Nan  Reg_C      7
5        89     cat_2    Nan      6
Note: Ideally, I'd like to avoid hardcoding each category and region name.
You can roll your own function for a neat, vectorized way of solving this:
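The original code for this function did not survive in the post, so what follows is only a minimal sketch of one way it could look. The name `na_randomfill` and the `random_state` parameter are assumptions made for illustration. The idea is that sampling the observed values with replacement draws each category with a probability equal to its observed share, so no category or region name has to be hardcoded.

```python
import numpy as np
import pandas as pd


def na_randomfill(series, random_state=None):
    """Fill nulls in a Series by resampling its own non-null values.

    Hypothetical helper: name and signature are illustrative assumptions.
    """
    na_mask = series.isna()        # boolean mask marking the null entries
    n_null = int(na_mask.sum())    # how many values need to be filled

    # Nothing to fill, or nothing to sample from: return the Series as-is.
    if n_null == 0 or n_null == len(series):
        return series

    # Draw one observed value per null, with replacement, so each category
    # is picked with probability equal to its observed proportion.
    fill_values = series[~na_mask].sample(
        n=n_null, replace=True, random_state=random_state
    )

    # Re-label the sampled values so they align with the null positions.
    fill_values.index = series.index[na_mask]

    return series.fillna(fill_values)
```

Because the fill is random, the resulting proportions only match the observed ones approximately, and more closely as the number of missing values grows. Pass `random_state` if you need reproducible results.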
This solution works on one Series at a time and can be called like so:
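As a rough illustration of a single-column call, the snippet below rebuilds the example dataframe from the question and fills only `Apple_cat`. The helper name `na_randomfill` is an assumption, and a compact version of it is restated here so the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def na_randomfill(series, random_state=None):
    # Hypothetical helper: fill nulls by resampling the non-null values.
    na_mask = series.isna()
    n_null = int(na_mask.sum())
    if n_null == 0 or n_null == len(series):
        return series
    fill_values = series[~na_mask].sample(
        n=n_null, replace=True, random_state=random_state
    )
    fill_values.index = series.index[na_mask]
    return series.fillna(fill_values)


# Example data from the question, with real NaN values for the gaps.
df = pd.DataFrame({
    "ClientId": [21, 15, 6, 91, 45, 89],
    "Apple_cat": ["cat_1", "cat_2", np.nan, "cat_3", np.nan, "cat_2"],
    "Region": ["Reg_A", np.nan, "Reg_B", "Reg_A", "Reg_C", np.nan],
    "Price": [5, 6, 7, 3, 7, 6],
})

# Fill one column; every other column is left untouched.
df["Apple_cat"] = na_randomfill(df["Apple_cat"])
```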
Alternatively, you can use `apply` to call it on each of your columns. Note that because of the `if` statement in our function, we do not need to specify the null-containing columns in advance before calling `apply`:
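A sketch of the whole-frame call, again assuming a helper named `na_randomfill` (restated compactly so the snippet is self-contained). `DataFrame.apply` passes each column to the function as a Series, and columns without nulls come back unchanged through the early return.

```python
import numpy as np
import pandas as pd


def na_randomfill(series, random_state=None):
    # Hypothetical helper: fill nulls by resampling the non-null values.
    na_mask = series.isna()
    n_null = int(na_mask.sum())
    if n_null == 0 or n_null == len(series):
        return series  # the early return that makes column pre-selection unnecessary
    fill_values = series[~na_mask].sample(
        n=n_null, replace=True, random_state=random_state
    )
    fill_values.index = series.index[na_mask]
    return series.fillna(fill_values)


df = pd.DataFrame({
    "ClientId": [21, 15, 6, 91, 45, 89],
    "Apple_cat": ["cat_1", "cat_2", np.nan, "cat_3", np.nan, "cat_2"],
    "Region": ["Reg_A", np.nan, "Reg_B", "Reg_A", "Reg_C", np.nan],
    "Price": [5, 6, 7, 3, 7, 6],
})

# Apply column by column (axis=0 is the default).
filled = df.apply(na_randomfill)
```

Each column is resampled independently here, so any relationship between `Apple_cat` and `Region` in the missing rows is not preserved.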