I'm learning how to use machine learning algorithms for forecasting with darts, following a video from Kish Manani (https://www.youtube.com/watch?v=9QtL7m3YS9I).
I'm trying to use TimeSeries.from_group_dataframe() to create a couple of different series for a linear regression model. My data looks like this:
date | country | volume |
---|---|---|
2020-01-01 | UK | 2121 |
2020-01-01 | DE | 300 |
2020-01-02 | UK | 2150 |
2020-01-02 | DE | 243 |
The issue is that I am getting a ValueError whose cause I cannot work out:
Traceback (most recent call last):
File "C:\Users\[redacted]\Desktop\Scripts\git\scikitlearn-models\More advanced models.py", line 41, in <module>
model.fit(y)
File "C:\Users\[redacted]\AppData\Local\Programs\Python\Python39\lib\site-packages\darts\models\forecasting\regression_model.py", line 722, in fit
self._fit_model(
File "C:\Users\[redacted]\AppData\Local\Programs\Python\Python39\lib\site-packages\darts\models\forecasting\regression_model.py", line 544, in _fit_model
self.model.fit(training_samples, training_labels, **kwargs)
File "C:\Users\[redacted]\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\base.py", line 1151, in wrapper
return fit_method(estimator, *args, **kwargs)
File "C:\Users\[redacted]\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\linear_model\_base.py", line 678, in fit
X, y = self._validate_data(
File "C:\Users\[redacted]\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\base.py", line 621, in _validate_data
X, y = check_X_y(X, y, **check_params)
X = check_array(
File "C:\Users\[redacted]\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\utils\validation.py", line 917, in check_array
array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
File "C:\Users\[redacted]\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\utils\_array_api.py", line 380, in _asarray_with_order
array = numpy.asarray(array, order=order, dtype=dtype)
ValueError: could not convert string to float: 'DE'
Code attached:
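import pandas as pd
from darts import TimeSeries
from darts.models import RegressionModel
from sklearn.linear_model import LinearRegression

datasource = pd.read_csv('data/multivariate_test.csv', index_col=False)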
df = pd.DataFrame(datasource)
# Convert the 'date_column' to timestamps with english formatting (day comes first)
df['date'] = pd.to_datetime(df['date'],dayfirst=True)
print("Data Converted to: ")
print(df.dtypes)
df.sort_values(by='date',inplace=True)
df.reset_index(drop=True,inplace=True)
print(df)
# Create a TimeSeries, specifying the time and value columns
y = TimeSeries.from_group_dataframe(df,
                                    group_cols='country',
                                    static_cols='country',
                                    time_col='date',
                                    value_cols=['y'],
                                    fill_missing_dates=False, freq='MS')  # 'MS' stands for Month Start
### REGRESSION MODEL ###
model = RegressionModel(lags=[-1,-2,-12],model = LinearRegression())
model.fit(y)
y_pred = model.predict(n=12, series=y)
I'm not expecting that the darts library has a need to convert my grouping covariates into floats, especially when the purpose of this attribute is to be able to split time series based off of category, or type.
Anyone who knows the library well, or can see an obvious mistake please let me know.
After playing around with your code and data, I ended up at this error:
"RegressionModel can only interpret numeric static covariate data. Consider encoding/transforming categorical static covariates with
darts.dataprocessing.transformers.static_covariates_transformer.StaticCovariatesTransformer
or set use_static_covariates=False
at model creation to ignore static covariates."
This means that if you are going to use RegressionModel, your "country" column has to be converted to a numerical type before it is passed as a static covariate. In this case, you would assign each country a numerical value (for example UK = 1, DE = 2) and replace the country letter codes with the assigned values.
To do this, you can use scikit-learn's ordinal encoder (sklearn.preprocessing.OrdinalEncoder) or, even easier, darts' built-in StaticCovariatesTransformer:
*Note that I also removed your static_cols parameter: group_cols is automatically converted into static covariates, so you do not need both.
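For the scikit-learn route, a rough sketch of what OrdinalEncoder does to the country column, using the four sample rows from the question. The country_code column name is made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# The four sample rows from the question.
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"]),
    "country": ["UK", "DE", "UK", "DE"],
    "volume": [2121, 300, 2150, 243],
})

# OrdinalEncoder expects a 2D input, hence the double brackets.
encoder = OrdinalEncoder()
df["country_code"] = encoder.fit_transform(df[["country"]]).ravel()

# Categories are sorted, so DE maps to 0.0 and UK to 1.0.
print(encoder.categories_[0])
print(df[["country", "country_code"]])
```

Grouping on the encoded column instead of the string column should then give numeric static covariates, though I have not checked this against every darts version.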