Pandas : TypeError: float() argument must be a string or a number


I have a dataframe that contains the following columns:

user_id    date       browser  conversion  test  sex  age  country
   1    2015-12-03       IE        1         0    M   32.0   US

Here is my code:

import pandas as pd
from sklearn import tree
data['date'] = pd.to_datetime(data.date)
columns = [c for c in data.columns.tolist() if c not in ["test"]]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(data[columns], data["test"])

I am getting this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-560-95a8a54aa939> in <module>()
      4 from sklearn import tree
      5 clf = tree.DecisionTreeClassifier(max_depth=2, min_samples_leaf = (len(data)/100) )
----> 6 clf = clf.fit(data[columns],data["test"])

C:\Users\SnehaPriya\Anaconda2\lib\site-packages\sklearn\tree\tree.pyc in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    152         random_state = check_random_state(self.random_state)
    153         if check_input:
--> 154             X = check_array(X, dtype=DTYPE, accept_sparse="csc")
    155             if issparse(X):
    156                 X.sort_indices()

C:\Users\SnehaPriya\Anaconda2\lib\site-packages\sklearn\utils\validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    371                                       force_all_finite)
    372     else:
--> 373         array = np.array(array, dtype=dtype, order=order, copy=copy)
    374 
    375         if ensure_2d:

TypeError: float() argument must be a string or a number

How do I overcome this error?

There are 3 answers

jezrael (best answer)

IIUC, you also need to exclude the date column:

columns = [c for c in columns if c not in ["test", 'date']]

because the date column holds Timestamp values, which cannot be converted to float:

TypeError: float() argument must be a string or a number, not 'Timestamp'
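
As a minimal sketch of the same idea, the datetime columns can be detected by dtype instead of being listed by hand. The first row of the sample frame comes from the question; the second row is made up for illustration. Note that string columns such as browser, sex and country remain in the list and would likely still need encoding before a scikit-learn tree can use them.

```python
import pandas as pd

# Sample data: first row from the question, second row is hypothetical.
data = pd.DataFrame({
    "user_id": [1, 2],
    "date": ["2015-12-03", "2015-12-04"],
    "browser": ["IE", "Chrome"],
    "conversion": [1, 0],
    "test": [0, 1],
    "sex": ["M", "F"],
    "age": [32.0, 27.0],
    "country": ["US", "UK"],
})
data["date"] = pd.to_datetime(data["date"])

# Find every datetime column by dtype rather than by name.
datetime_cols = data.select_dtypes(include=["datetime"]).columns.tolist()

# Exclude the target column and all datetime columns from the features.
columns = [c for c in data.columns if c != "test" and c not in datetime_cols]
print(columns)
```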

niowniow

A solution which keeps the date(time) column:

data['date'] = pd.to_numeric(pd.to_datetime(data['date']))

cottontail

Ideas to preserve datetime as features in the model

If the dates matter only in terms of how much time has elapsed since each observation, you can keep the datetime column as a feature by converting it into the time difference between now and each datetime.

data['date'] = (pd.Timestamp('now') - pd.to_datetime(data['date'])).dt.total_seconds()
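
Because pd.Timestamp('now') changes on every run, the resulting feature is not reproducible. A small sketch with a fixed reference timestamp (the reference date and the sample dates here are assumptions chosen for illustration) might look like this:

```python
import pandas as pd

# Hypothetical sample dates; only the first appears in the question.
dates = pd.Series(["2015-12-03", "2015-12-04"])

# A fixed reference point (an assumption) keeps the feature stable across runs.
reference = pd.Timestamp("2016-01-01")

# Elapsed time between the reference and each date, in seconds (float).
elapsed = (reference - pd.to_datetime(dates)).dt.total_seconds()
print(elapsed.tolist())
```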

Or you can convert the datetimes directly into integers.

data['date'] = pd.to_datetime(data['date']).astype('int64')
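
The integers produced this way are typically nanoseconds since the Unix epoch, so they are very large. If whole seconds are preferred, one possible sketch is to subtract the epoch and floor-divide by a one-second Timedelta; the sample dates below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample dates; only the first appears in the question.
dates = pd.to_datetime(pd.Series(["2015-12-03", "2015-12-04"]))

# Subtract the Unix epoch, then floor-divide by one second
# to get integer seconds regardless of the stored resolution.
epoch = pd.Timestamp("1970-01-01")
seconds = (dates - epoch) // pd.Timedelta(seconds=1)
print(seconds.tolist())
```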

N.B. When converting strings to datetime, passing format= can make the conversion much faster (about 25 times faster in the linked benchmark). See this post for the benchmark, and this post for ideas on passing the format when your datetime column doesn't have a uniform format.
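
For the date layout shown in the question (year-month-day), passing the format might look like the sketch below. The format string is an assumption based on the single sample row; adjust it if your data differs.

```python
import pandas as pd

# Hypothetical sample dates in the same layout as the question's row.
raw = pd.Series(["2015-12-03", "2015-12-04"])

# An explicit format lets pandas skip per-value format inference.
parsed = pd.to_datetime(raw, format="%Y-%m-%d")
print(parsed)
```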