I'm trying to implement some machine learning algorithms, but I'm having some difficulties putting the data together.
In the example below, I load a example data-set from UCI, remove lines with missing data (thanks to the help from a previous question), and now I would like to try to normalize the data.
For many datasets, I just used:
valores = (valores - valores.mean()) / (valores.std())
But for this particular dataset the approach above doesn't work. The problem is that the mean function is returning inf
, perhaps due to a precision issue. See the example below:
bcw = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', header=None)
for col in bcw.columns:
if bcw[col].dtype != 'int64':
print "Removendo possivel '?' na coluna %s..." % col
bcw = bcw[bcw[col] != '?']
valores = bcw.iloc[:,1:10]
#mean return inf
print valores.iloc[:,5].mean()
My question is how to deal with this. It seems that I need to change the type of this column, but I don't know how to do it.
not so familiar with pandas but if you convert to a numpy array it works, try