I just started learning Python so any help is much appreciated.

So the overarching purpose here is for data exploration + data cleaning.

The function I've written below outputs a dataframe that shows the percentage of missing values from every column.

def missing_values_table(df):
    missing_vals = df.isnull().sum()
    # Boolean check of all value to True for all null values, then sums for total count.
    percent_conversion = 100 * df.isnull().sum()/len(df)
    # Percent conversion.
    combined_table = pd.concat([missing_vals, percent_conversion], axis=1)
    # Merging dataframes.
    table_renamed = combined_table.rename(columns = 
        {0:'Missing Values', 1:'Percentage'})
    # Giving column labels.
    table_renamed.sort_values(['Percentage'], ascending=False, inplace=True)
    # Sort descending.
    return table_renamed

Problematic output (it's missing index that will show me where they are in the original dataframe...which is massive):

                          Missing Values  Percentage
Engine_Horsepower                 375906   93.712932
Pushblock                         375906   93.712932
Enclosure_Type                    375906   93.712932
Blade_Width                       375906   93.712932
[...]

Desired output:

                          Missing Values  Percentage
32 Engine_Horsepower                 375906   93.712932
15 Pushblock                         375906   93.712932
3  Enclosure_Type                    375906   93.712932
17 Blade_Width                       375906   93.712932
[...]

The numbers correspond to the column number from the original dataframe, pre-sorted.

Only after investigating these individually to confirm that these columns can be deleted, I'll delete the columns based on a threshold (50%+ null values, delete).

1

There are 1 answers

7
Ted Petrou On BEST ANSWER

To keep your integer location of the columns, make the columns MultiIndex

df.columns = pd.MultiIndex.from_arrays([range(len(df.columns)), df.columns])

And then the filtering and summary of nulls will keep the location

threshold = .4
df[df.columns[df.isnull().mean() < threshold]]

And this will do the first part of your function

df_null_summary = pd.concat([df.isnull().sum(), df.isnull().mean()], axis=1, keys=['Missing Values', 'Percentage'])