pandas frequency table with missing values


I am trying to calculate the number of identical rows in a pandas DataFrame (i.e. a frequency table), which is then used to calculate the k-anonymity of a dataset.

I have a special requirement regarding the counting of missing values: a missing value should count towards all other classes (as the missing value "could" be any value). In addition, the count of a record with missing values is the number of possible combinations with respect to its missing values. Values should be treated as categorical.
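
In other words (a minimal sketch of that matching rule, using a hypothetical is_match helper): two records fall into the same class whenever every column either agrees or is missing on at least one side.

import pandas as pd

def is_match(row_a, row_b):
    # A missing value is compatible with anything, so two rows are
    # counted together when every column agrees or is NaN on either side.
    return all(pd.isna(a) or pd.isna(b) or a == b
               for a, b in zip(row_a, row_b))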

Given such a DataFrame, the count (denoted f_k below) should look like this:

[image: the example DataFrame with the expected f_k column]

With pandas value_counts, I get

import pandas as pd
import numpy as np

d = {
    'key1': [1,1,2,np.nan],
    'key2': [1,1,1,1],
    'key3': [3,np.nan,3,np.nan]
    }

df = pd.DataFrame(data=d)
df["key1"] = df["key1"].astype("Int64").astype('category')
df["key2"] = df["key2"].astype('Int64').astype('category')
df["key3"] = df["key3"].astype('Int64').astype('category')

(df
 .value_counts(dropna=False)
 .reset_index())

[image: the resulting value_counts output]

Any idea how to achieve this in pandas?


There is 1 answer

Answered by Bluegirl FK:

This works, but it is time-consuming:

import pandas as pd
import numpy as np


data = {
    'key1': [1, 1, 2, np.nan],
    'key2': [1, 1, 1, 1],
    'key3': [3, np.nan, 3, np.nan]
}
df = pd.DataFrame(data)

fk_lst = []
for index, row in df.iterrows():
    # Compare only on the columns where the current row has a value;
    # a missing value in the current row matches every class anyway.
    non_nan_columns = row[row.notna()].index.tolist()
    sub = df[non_nan_columns].copy()
    # A missing value in another row could be the current row's value,
    # so fill it in before comparing.
    for col in non_nan_columns:
        sub[col] = sub[col].fillna(row[col])
    # Count the rows that equal the current row on all compared columns
    # (taking the first row of value_counts would return the largest
    # class, which is not necessarily the current row's class).
    count = int((sub == row[non_nan_columns]).all(axis=1).sum())
    fk_lst.append(count)

df['f_k'] = fk_lst
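
For larger inputs, a vectorized sketch of the same matching rule avoids the row-by-row loop. It assumes the keys are numeric so missing values can be detected with np.isnan; truly categorical columns would need pd.isna and a comparison on the category codes instead.

import numpy as np
import pandas as pd

data = {
    'key1': [1, 1, 2, np.nan],
    'key2': [1, 1, 1, 1],
    'key3': [3, np.nan, 3, np.nan]
}
df = pd.DataFrame(data)

vals = df.to_numpy(dtype=float)          # shape (n_rows, n_cols)
missing = np.isnan(vals)
# Two rows are compatible in a column if the values are equal
# or at least one of them is missing.
equal = vals[:, None, :] == vals[None, :, :]
compatible = equal | missing[:, None, :] | missing[None, :, :]
# f_k of row i = number of rows that are compatible with it in every column.
df['f_k'] = compatible.all(axis=2).sum(axis=1)

On the example data this should give the same f_k column as the loop above; note that the intermediate arrays have n² × m elements, so this only pays off for moderately sized frames.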