I have started a data science course that requires me to handle missing data, either by deleting the rows that contain NaN in the "price" column or by replacing the NaN with the mean value. However, neither my dropna() nor my replace() call seems to work. What could be the problem?

I went through a lot of solutions on Stack Overflow, but none of them solved my problem. I also searched pandas.pydata.org, where I learned about dropna() arguments such as thresh and how='any', but nothing helped.

import pandas as pd

import numpy as np


url="https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
df=pd.read_csv(url,header=None)


'''
Our data comes without any header or column names, hence we assign each column a header name.
'''


headers=["symboling","normalized-losses","make","fuel-type","aspiration","num-of-doors","body-style","drive-wheels","engnie-location","wheel-base","length","width","height","curb-weight","engine-type","num-of-cylinders","engine-size","fuel-system","bore","stroke","compression-ratio","horsepower","peak-rpm","city-mpg","highway-mpg","price"]
df.columns=headers


'''
Now we have to eliminate the rows containing NaN or ? in the "price" column of our data
'''

df.dropna(subset=["price"], axis=0, inplace=True) 

df.head(12)

#or

df.dropna(subset=["price"], how='any') 

df.head(12)

#also to replace

mean=df["price"].mean()

df["price"].replace(np.nan,mean)

df.head(12)

I expected all the rows containing NaN or "?" in the "price" column to be deleted by dropna() or replaced by replace(). However, the data does not seem to change.

1 Answer

yaho cho

Please use the following code to drop the ? values:

df['price'] = pd.to_numeric(df['price'], errors='coerce')
df = df.dropna(subset=['price'])

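The "?" entries are read in as strings, not as NaN, so dropna() on its own finds nothing to remove.

to_numeric converts its argument to a numeric type.

With errors='coerce', values that cannot be parsed as numbers, such as "?", are set to NaN.

dropna can then remove the records that contain NaN.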
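
As a rough illustration of the steps above, here is a minimal sketch that uses a few made-up rows instead of the real dataset (the sample values and the names raw, dropped and filled are assumptions for the example only). It shows both options from the question: dropping the rows with a missing price, and filling them with the column mean. Note that dropna() and fillna() return new objects unless the result is assigned back, which is probably why the unassigned calls in the question appeared to do nothing.

```python
import io

import pandas as pd

# Made-up sample rows (not taken from the real dataset); "?" marks a missing price.
raw = "make,price\naudi,13950\nbmw,?\nmazda,5195\nvolvo,?\n"
df = pd.read_csv(io.StringIO(raw))

# "?" is read as text, so the column is not numeric and holds no NaN yet.
print(pd.api.types.is_numeric_dtype(df["price"]))
print(df["price"].isna().sum())

# Convert to numbers; entries that cannot be parsed, such as "?", become NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Option 1: drop only the rows whose price is missing.
# dropna() returns a new DataFrame, so the result has to be kept.
dropped = df.dropna(subset=["price"]).reset_index(drop=True)

# Option 2: keep every row and fill the missing prices with the column mean.
filled = df.assign(price=df["price"].fillna(df["price"].mean()))

print(dropped)
print(filled)
```

Passing na_values="?" to read_csv should give a similar result at load time, since the placeholder is then parsed as NaN directly.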