scikit-learn for data-driven regression of oscillating data


I have data that roughly follows a y = sin(time) distribution, but that also depends on variables other than time. In terms of correlations, since the target y variable oscillates, there is almost zero statistical correlation with time, even though y obviously depends strongly on time.
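
As a quick illustration with synthetic data (not my actual pipeline), the correlation is indeed close to zero:

import numpy as np

t = np.linspace(0, 20 * np.pi, 1000)   # ten full periods
y = np.sin(t)
print(np.corrcoef(t, y)[0, 1])         # close to zero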

The goal is to predict future values of the target variable. I want to avoid making an explicit model assumption and instead rely on data-driven models and machine learning, so I have tried regression methods from sklearn.

I have tried the following methods (the parameters were blindly copied from examples and other threads):

import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)  # seed copied along with the AdaBoost example

# Note: LogisticRegression and QDA are classifiers, not regressors.
LogisticRegression()
QDA()
GridSearchCV(SVR(kernel='rbf', gamma=0.1), cv=5,
             param_grid={"C": [1e0, 1e1, 1e2, 1e3],
                         "gamma": np.logspace(-2, 2, 5)})
GridSearchCV(KernelRidge(kernel='rbf', gamma=0.1), cv=5,
             param_grid={"alpha": [1e0, 0.1, 1e-2, 1e-3],
                         "gamma": np.logspace(-2, 2, 5)})
GradientBoostingRegressor(loss='quantile', alpha=0.95,
                          n_estimators=250, max_depth=3,
                          learning_rate=.1, min_samples_leaf=9,
                          min_samples_split=9)
DecisionTreeRegressor(max_depth=4)
AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                  n_estimators=300, random_state=rng)
RandomForestRegressor(n_estimators=10, min_samples_split=2, n_jobs=-1)

The results fall into two different categories of failure:

  1. The time field has no effect, probably because the oscillatory behaviour of the target variable removes any simple correlation with time. However, secondary effects from the other variables (which do correlate simply with the target) still allow a modest predictive capability for future time ranges.
  2. When applying predict() to the training time range, the prediction is near perfect with respect to the observations, but when given the future time range (for which no data was used in training) the predicted value stays constant.

Below is how I performed the training and testing:

import datetime

import numpy as np
import pandas as pd

# weather_df is the raw weather DataFrame (loaded elsewhere).
weather_df.index = pd.to_datetime(weather_df.index, unit='D')
weather_df['Days'] = (weather_df.index - datetime.datetime(2005, 1, 1)).days
ts = pd.DataFrame({'Temperature': weather_df['Mean TemperatureC'].loc[:'2015-1-1'],
                   'Humidity': weather_df[' Mean Humidity'].loc[:'2015-1-1'],
                   'Visibility': weather_df[' Mean VisibilityKm'].loc[:'2015-1-1'],
                   'Wind': weather_df[' Mean Wind SpeedKm/h'].loc[:'2015-1-1'],
                   'Time': weather_df['Days'].loc[:'2015-1-1']
                   })
start_test = datetime.datetime(2012, 1, 1)
ts_train = ts[ts.index < start_test]
ts_test = ts  # predict over the full range, including the training period
# Stack the two features into a (2, n_samples) array, then transpose for fit().
data_train = np.array([ts_train.Humidity, ts_train.Time])
data_target = np.array(ts_train.Temperature)
model.fit(data_train.T, data_target)  # model is one of the estimators above
data_test = np.array([ts_test.Humidity, ts_test.Time])
pred = model.predict(data_test.T)
ts_test['Pred'] = pred

Is there a regression model I could/should use for this problem, and if so what would be appropriate options and parameters?

2 Answers

Answer by zemekeneng (accepted):

Here is my guess about what is happening in your two types of results:

.days does not convert your index into a form that repeats between your train and test samples; instead it becomes a unique value for every date in your dataset.

As a consequence, your models either ignore the days feature (first result) or overfit on it (second result), causing the model to perform badly on your test data.

Suggestion:

If your dataset is large enough (it looks like it goes back to 2005), try using dayofyear or weekofyear instead, so that your model has something generalizable to learn from the date information.
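
For example, a minimal sketch of this idea (the date range here is made up for illustration; isocalendar() needs a recent pandas):

import pandas as pd

# A running day count is unique per date, but day-of-year repeats every
# year (1..366), so seasonal structure can generalize to future dates.
dates = pd.date_range('2005-01-01', '2015-01-01', freq='D')
day_of_year = dates.dayofyear
week_of_year = dates.isocalendar().week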

Answer by pyan:

Agree with @zemekeneng that time should be taken modulo the corresponding period, e.g. 24 hours or 12 months.
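
For instance, a minimal illustration (the period value here is hypothetical):

import numpy as np

days = np.arange(3650)     # running day count over ten years
phase = days % 365.25      # wraps around every year, repeating like the signal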

Beyond that, I'd like to recommend using prior knowledge when selecting features or models. Since you already know that your data is highly likely to follow sin(x), that knowledge should be used even in a data-driven approach.

We know that sin(x) can be approximated by its Taylor series x - x^3/3! + x^5/5! - x^7/7!, so these powers of x should be used as features. None of the models you used is likely to have included such features. One way to do this would be to create the higher-order features yourself and concatenate them with your other features. A linear model with regularization may then give you reasonable results.
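
A minimal sketch of that idea on synthetic sin data (the degree and alpha values here are hypothetical and would need tuning):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Fit sin-like data with polynomial features and a regularized linear model.
t = np.linspace(0, 2 * np.pi, 200)[:, np.newaxis]   # one period of the signal
y = np.sin(t).ravel()

model = make_pipeline(PolynomialFeatures(degree=7), Ridge(alpha=1e-3))
model.fit(t, y)
pred = model.predict(t)

Combined with taking time modulo the period, the polynomial only ever has to fit a single cycle, which is exactly where a truncated Taylor expansion is accurate.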