Also posted as an issue on GitHub.
When a pmdarima model is fit with exogenous variables, the values of X passed to predict_in_sample do not appear to affect the predictions. Even X arrays with the wrong number of rows or columns are accepted without an error. What is going on here? Am I missing something? See the example below:
import pmdarima as pm
from pmdarima import model_selection
import numpy as np
import pandas as pd
np.random.seed(42)
y = pm.datasets.load_wineind()
df = pd.DataFrame(
    {
        "x1": y * np.random.uniform(0, 0.5, len(y)) + np.random.randint(1, 1000, len(y)),
        "x2": y * np.random.uniform(0.5, 0.7, len(y)) + np.random.randint(1, 10000, len(y)),
    }
)
df["y"] = y
train, test = model_selection.train_test_split(df, train_size=150)
arima = pm.auto_arima(
    train["y"],
    train.drop(columns="y"),
    error_action="ignore",
    trace=True,
    suppress_warnings=True,
    maxiter=5,
    seasonal=True,
    m=12,
)
# preds1 uses the same X that the model was fit on
preds1 = arima.predict_in_sample(X=train.drop(columns="y"))
# preds2 uses X with the correct dimensions but different values from those used for preds1
preds2 = arima.predict_in_sample(X=train.drop(columns="y") + 1000)
# preds3 passes only x2 (x1 is dropped), truncated to the first 10 observations
preds3 = arima.predict_in_sample(X=train[:10].drop(columns=["y", "x1"]))
len(preds1) # 150
len(preds2) # 150
len(preds3) # 150
all(preds1 == preds2) # True
all(preds2 == preds3) # True
arima.summary() # To confirm that indeed x1 and x2 are in the model
I expected the values of X passed to predict_in_sample to affect the predictions, and arrays of the wrong shape to raise an error.
Note: it looks like predict_in_sample is using statsmodels.tsa.statespace.sarimax.SARIMAXResultsWrapper.predict() under the hood.
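For reference, the kind of shape check I expected could be sketched roughly as follows. This is only a minimal sketch and not something pmdarima does: `check_exog_shape` is a hypothetical helper of my own, and the expected row and column counts are assumptions the caller has to supply from the training data.

```python
def check_exog_shape(X, n_rows, n_cols):
    """Raise ValueError unless X has exactly n_rows rows and n_cols columns.

    Hypothetical helper, not part of pmdarima. Accepts anything exposing a
    2-D ``.shape`` (NumPy array, pandas DataFrame) or a plain list of rows.
    """
    shape = getattr(X, "shape", None)
    if shape is None:
        # Fall back to nested sequences: each outer item is treated as a row.
        rows = list(X)
        widths = {len(row) for row in rows}
        if len(widths) > 1:
            raise ValueError("X has rows of unequal length")
        shape = (len(rows), widths.pop() if widths else 0)
    if len(shape) != 2:
        raise ValueError(f"X must be 2-D, got shape {tuple(shape)}")
    if tuple(shape) != (n_rows, n_cols):
        raise ValueError(
            f"X has shape {tuple(shape)}, expected ({n_rows}, {n_cols})"
        )
    return X
```

With the example above, wrapping the argument as `check_exog_shape(train[:10].drop(columns=["y", "x1"]), 150, 2)` would raise a ValueError before predict_in_sample is ever called, since that frame has shape (10, 1). It only guards against the wrong shape; it does not explain why different X values give identical predictions.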