Forecasting with Prophet with an "id" column in the test set


I'm building a forecasting model with Prophet in Python. My training dataset consists of the columns "Date", "Var1", "Var2", and "Y". The test set consists of the columns "id", "Date", "Var1", and "Var2". The "id" column is unique (based on the combination of "Date", "Var1", and "Var2"). Below is my code:

import pandas as pd
from prophet import Prophet

dir_df_train_clean = "data/df_train_clean.csv"
df_train_clean = pd.read_csv(dir_df_train_clean, parse_dates=[0])

dir_df_test_clean = "data/df_test_clean.csv"
df_test_clean = pd.read_csv(dir_df_test_clean, parse_dates=[1])

# "Date" is a regular column here, not an index level, so filter on it directly
split_date = '2019-01-01'
df_train = df_train_clean[df_train_clean['Date'] < split_date]
df_val = df_train_clean[df_train_clean['Date'] >= split_date]

# Prophet expects the date column to be named "ds" and the target "y"
df_train = df_train.rename(columns={'Date':'ds','Y':'y'})
df_val = df_val.rename(columns={'Date':'ds','Y':'y'})
df_test = df_test_clean.rename(columns={'Date':'ds'})

My model:

m = Prophet()
m.add_regressor('Var1')
m.add_regressor('Var2')
m.fit(df_train)

Then I try to predict on my test set:

test_forecast = m.predict(df_test)

But in the result, my "id" column disappears from the "test_forecast" dataframe. How can I keep my id column?

I tried to merge the 'id' column back, but the index had been altered. In the test set, the first row (id='a1') has the date '2022-07-30', but in the prediction result the first row of the date column is '2022-07-30'.


There are 2 answers

1
Guapi-zh On

I think Prophet's predict output contains only the dates and the forecast components; it does not carry over any additional columns from the input dataframe. To keep your "id" column, you have to merge it back after the prediction.
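A minimal sketch of one way to re-attach the ids, under a few assumptions. It assumes that predict returns one row per input row, sorted by "ds" with a fresh index (which would explain the altered order described in the question; this is worth checking against your installed Prophet version). Because "ds" alone is not unique here, the ids are copied back by position rather than merged on the date. The helper name predict_with_ids and the StandInModel class are hypothetical; the stand-in only mimics the assumed sorting so the example runs without Prophet installed.

```python
import pandas as pd


def predict_with_ids(model, df_test, id_col="id"):
    """Run model.predict and copy id_col back onto the forecast by position."""
    # Sort the test rows by ds with a stable sort and reset the index,
    # so the row order matches what predict() is assumed to return.
    ordered = df_test.sort_values("ds", kind="mergesort").reset_index(drop=True)
    forecast = model.predict(ordered.drop(columns=[id_col]))
    # Guard against a silent misalignment before copying ids by position.
    if len(forecast) != len(ordered) or not (
        forecast["ds"].to_numpy() == ordered["ds"].to_numpy()
    ).all():
        raise ValueError("forecast rows do not line up with the test rows")
    forecast = forecast.copy()
    forecast.insert(0, id_col, ordered[id_col].to_numpy())
    return forecast


class StandInModel:
    """Hypothetical stand-in for a fitted Prophet model (demo only)."""

    def predict(self, df):
        # Mimics the assumed behaviour: sort by ds, reset the index,
        # and return only ds plus a made-up prediction column.
        out = df.sort_values("ds", kind="mergesort").reset_index(drop=True)
        return pd.DataFrame({"ds": out["ds"], "yhat": out["Var1"] * 2.0})


# Toy test set whose rows are deliberately not in date order.
df_test = pd.DataFrame({
    "id": ["a1", "a2", "a3"],
    "ds": pd.to_datetime(["2022-07-30", "2022-07-01", "2022-07-15"]),
    "Var1": [1.0, 2.0, 3.0],
    "Var2": [0.5, 0.1, 0.9],
})
result = predict_with_ids(StandInModel(), df_test)
print(result[["id", "ds", "yhat"]])
```

With a real fitted model, the call would be predict_with_ids(m, df_test). The alignment check raises an error instead of attaching ids to the wrong rows if the ordering assumption does not hold.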

0
Anna Andreeva Rogotulka On

I want to emphasize that using IDs as model features can lead to overfitting, especially if ID generation is connected to time. You should think twice about why you need this; maybe consider using indices so that you can later merge the data (without the ID) with the rest.

Negative aspects of using IDs:

Overfitting: if IDs are connected to time or other factors that don't provide value to the model, using them as features can lead to overfitting. The model might memorize the identifiers rather than generalize information.

Complexity: using IDs can make the model less interpretable since the identifiers themselves often don't convey much about the nature of the data.