I just started learning Python so any help is much appreciated.
So the overarching purpose here is for data exploration + data cleaning.
The function I've written below outputs a dataframe that shows the percentage of missing values from every column.
def missing_values_table(df):
missing_vals = df.isnull().sum()
# Boolean check of all value to True for all null values, then sums for total count.
percent_conversion = 100 * df.isnull().sum()/len(df)
# Percent conversion.
combined_table = pd.concat([missing_vals, percent_conversion], axis=1)
# Merging dataframes.
table_renamed = combined_table.rename(columns =
{0:'Missing Values', 1:'Percentage'})
# Giving column labels.
table_renamed.sort_values(['Percentage'], ascending=False, inplace=True)
# Sort descending.
return table_renamed
Problematic output (it's missing index that will show me where they are in the original dataframe...which is massive):
Missing Values Percentage
Engine_Horsepower 375906 93.712932
Pushblock 375906 93.712932
Enclosure_Type 375906 93.712932
Blade_Width 375906 93.712932
[...]
Desired output:
Missing Values Percentage
32 Engine_Horsepower 375906 93.712932
15 Pushblock 375906 93.712932
3 Enclosure_Type 375906 93.712932
17 Blade_Width 375906 93.712932
[...]
The numbers correspond to the column number from the original dataframe, pre-sorted.
Only after investigating these individually to confirm that these columns can be deleted, I'll delete the columns based on a threshold (50%+ null values, delete).
To keep your integer location of the columns, make the columns MultiIndex
And then the filtering and summary of nulls will keep the location
And this will do the first part of your function