sklearn-LinearRegression: could not convert string to float: '--'


I am trying to use LinearRegression from sklearn and I am getting a 'could not convert string to float' error. All columns of the dataframe are numeric and the output y is also float. I have looked at other posts, and the suggestion there is to convert to float, which I have done.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 789 entries, 158 to 684
Data columns (total 8 columns):
f1     789 non-null float64
f2     789 non-null float64
f3     789 non-null float64
f4     789 non-null float64
f5     789 non-null float64
f6     789 non-null float64
OFF    789 non-null uint8
ON     789 non-null uint8
dtypes: float64(6), uint8(2)
memory usage: 44.7 KB

type(y_train)
pandas.core.series.Series
type(y_train[0])
float

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,Y,random_state=0)
X_train.head()
from sklearn.linear_model import LinearRegression
linreg = LinearRegression().fit(X_train, y_train)

The error I get is:

ValueError                                Traceback (most recent call last)
<ipython-input-282-c019320f8214> in <module>()
      6 X_train.head()
      7 from sklearn.linear_model import LinearRegression
----> 8 linreg = LinearRegression().fit(X_train, y_train)
    510         n_jobs_ = self.n_jobs
    511         X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
--> 512                          y_numeric=True, multi_output=True)
    513 
    514         if sample_weight is not None and np.atleast_1d(sample_weight).ndim > 1:

    527         _assert_all_finite(y)
    528     if y_numeric and y.dtype.kind == 'O':
--> 529         y = y.astype(np.float64)
    530 
    531     check_consistent_length(X, y)

ValueError: could not convert string to float: '--'

Please help.

There are 3 answers

4
this be Shiva On BEST ANSWER

A quick solution would involve using pd.to_numeric to convert whatever strings your data might contain to numeric values. If they're incompatible with conversion, they'll be reduced to NaNs.

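import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Coerce every column; values that cannot be parsed (such as '--') become NaN
X = X.apply(pd.to_numeric, errors='coerce')
# Y is a Series, so convert it in one call
Y = pd.to_numeric(Y, errors='coerce')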

Furthermore, you can choose to fill those values with some default:

X.fillna(0, inplace=True)
Y.fillna(0, inplace=True)

Replace the fill value with whatever's relevant to your problem. I don't recommend dropping these rows, because you might end up dropping different rows from X and Y causing a data-label mismatch.
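
The mismatch mentioned above comes from filtering X and Y separately. If rows are dropped anyway, one way to keep data and labels aligned is to build a single boolean mask and apply it to both. This is a minimal sketch on made-up data; the values and the names Y_num, keep, X_clean and Y_clean are assumptions for illustration, not part of the original question.

```python
import pandas as pd

# Made-up data: the target contains a '--' placeholder, as in the question
X = pd.DataFrame({"f1": [1.0, 2.0, 3.0, 4.0], "f2": [0.5, 1.5, 2.5, 3.5]})
Y = pd.Series([10.0, "--", 30.0, 40.0])

# '--' cannot be parsed, so it becomes NaN
Y_num = pd.to_numeric(Y, errors="coerce")

# One mask built from both X and Y, then applied to both
keep = X.notna().all(axis=1) & Y_num.notna()
X_clean = X.loc[keep]
Y_clean = Y_num.loc[keep]

# Both objects now share exactly the same index labels
print(X_clean.index.equals(Y_clean.index))
```

Whether dropping or filling is appropriate still depends on the problem; the mask only guarantees that the same rows are removed from both sides.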

Finally, split and fit your model:

X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)
clf = LinearRegression().fit(X_train, y_train)
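
The traceback in the question fails at y = y.astype(np.float64) under the check y.dtype.kind == 'O', which suggests the target column has object dtype even though individual entries look like floats. Before coercing, it can help to see which raw values fail to parse. The sketch below uses a made-up Series; the variable names converted and bad are hypothetical.

```python
import pandas as pd

# Made-up target column mixing floats, numeric strings, a missing value and '--'
Y = pd.Series([1.5, "--", "2.0", None, "--"])
print(Y.dtype)  # object, because the column mixes floats and strings

converted = pd.to_numeric(Y, errors="coerce")

# Entries that were present in the raw data but could not be parsed as numbers
bad = Y[converted.isna() & Y.notna()]
print(bad.unique())
print(bad.index.tolist())
```

Listing the offending values first makes it easier to decide whether they should be filled, dropped, or fixed at the data source.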
0
user19070165 On

It is because one of your columns contains string values. I had the same problem: I had been asked to drop a column, but I assumed I didn't have to because the columns had already been deleted.

However, after running this code:

from sklearn.linear_model import LogisticRegressionCV

model = LogisticRegressionCV(solver='lbfgs', cv=5, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

I got this error:

could not convert string to float: 'product_mng'

The reason was that X_train still had the string column, which I thought had been deleted. In conclusion, check AGAIN that NONE of your columns contain strings. If one does, delete it with DataFrame.drop, or label encode (or one-hot encode) that column.
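
The check described above can be sketched with select_dtypes, which lists the columns that are not numeric. The frame below is made up; the column name role and the variable names non_numeric and X_numeric are assumptions for illustration.

```python
import pandas as pd

# Made-up frame; 'role' stands in for a string column that was never dropped
X_train = pd.DataFrame({
    "f1": [0.1, 0.2, 0.3],
    "role": ["product_mng", "sales", "product_mng"],
})

# Names of all columns whose dtype is not numeric
non_numeric = X_train.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# drop returns a new frame, so the result has to be assigned
X_numeric = X_train.drop(columns=non_numeric)
print(X_numeric.columns.tolist())
```

Because drop does not modify the frame in place by default, forgetting to assign the result is one plausible way to end up with a column that was "already deleted" still being present.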

0
Sagar Narula On

I think it's better to convert all the string columns to binary (0, 1) using label encoding or one-hot encoding; after that, linear regression will behave much better.
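
As a rough illustration of the one-hot option, pandas.get_dummies turns a string column into 0/1 indicator columns. The data and the column name state are made up for this sketch; the OFF and ON columns in the question look like the result of a similar step, but that is an assumption.

```python
import pandas as pd

# Made-up frame with one numeric column and one string column
df = pd.DataFrame({"f1": [0.1, 0.2, 0.3], "state": ["ON", "OFF", "ON"]})

# Replace 'state' with one indicator column per distinct value
encoded = pd.get_dummies(df, columns=["state"], dtype=int)

print(encoded.columns.tolist())
print(encoded["state_ON"].tolist())
```

Note that this only helps with string features; a placeholder such as '--' in the target column still has to be coerced or cleaned separately.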