How can I get all details of the top x models from forecast::auto.arima in R?

1.1k views Asked by At

My goal is to exactly re-create the top say 3 models from the auto.arima function in R. My example uses the following series:

> data <- c(79, 73, 102, 158, 235, 326, 216, 153, 82, 87, 94, 163, 119, 81, 143, 
          179, 182, 247, 248, 105, 143, 64, 45, 163, 122, 95, 47, 117, 345, 454,
          596, 440, 207, 94, 100, 187, 106, 98, 66, 81, 178, 196, 279, 128, 80,
          178, 112, 175, 91, 52, 35, 14, 20, 332, 386, 121, 167, 53, 94, 224,
          76, 68, 42, 61, 125, 273, 431, 1901, 86, 65, 120, 227, 131, 84, 48, 153, 157)

> time_series <- ts(data, freq=12, start=c(2013, 10))

I use auto.arima function shown below to capture the model information for the best model and see in text the top models.

> library(forecast)
> auto_mod <- auto.arima(time_series, trace=TRUE)

 ARIMA(2,0,2)(1,0,1)[12] with non-zero mean : 1067.045
 ARIMA(0,0,0)            with non-zero mean : 1057.566
 ARIMA(1,0,0)(1,0,0)[12] with non-zero mean : 1057.504
 ARIMA(0,0,1)(0,0,1)[12] with non-zero mean : 1057.727
 ARIMA(0,0,0)            with zero mean     : 1092.138
 ARIMA(1,0,0)            with non-zero mean : 1055.976
 ARIMA(1,0,0)(0,0,1)[12] with non-zero mean : 1057.629
 ARIMA(1,0,0)(1,0,1)[12] with non-zero mean : 1058.261
 ARIMA(2,0,0)            with non-zero mean : 1058.174
 ARIMA(1,0,1)            with non-zero mean : 1058.187
 ARIMA(0,0,1)            with non-zero mean : 1056.126
 ARIMA(2,0,1)            with non-zero mean : Inf
 ARIMA(1,0,0)            with zero mean     : 1070.847

 Best model: ARIMA(1,0,0)            with non-zero mean 

> summary(auto_mod)
Series: time_series 
ARIMA(1,0,0) with non-zero mean 

Coefficients:
         ar1      mean
      0.2170  176.2688
s.e.  0.1105   32.0040

sigma^2 estimated as 49991:  log likelihood=-524.82
AIC=1055.65   AICc=1055.98   BIC=1062.68

Training set error measures:
                    ME    RMSE      MAE       MPE     MAPE     MASE
Training set 0.3042569 220.664 104.5716 -76.32587 96.76548 1.051865
                    ACF1
Training set 0.006605514

I thought I could use the text output to exactly re-create the models other than the top 1 but I have realized that the description of the models in the text output comes close, but does not seem to exactly, identify each model uniquely. I think it gives information about if there is an intercept or if there is drift, but not both. I've found one such example below where two different models, one with an intercept and one without, have the same label.

> mod1 <- Arima(time_series, order=c(1,0,0), seasonal=c(1,0,0), include.drift=TRUE,  
+                    include.mean=TRUE)
> summary(mod1)
Series: time_series 
ARIMA(1,0,0)(1,0,0)[12] with drift 

Coefficients:
         ar1    sar1  intercept   drift
      0.1722  0.1976   132.4150  1.2188
s.e.  0.1175  0.2059    68.9234  1.5233

sigma^2 estimated as 50174:  log likelihood=-524.15
AIC=1058.31   AICc=1059.15   BIC=1070.03

Training set error measures:
                      ME    RMSE      MAE       MPE     MAPE     MASE
Training set -0.05272209 218.099 99.62455 -76.12583 94.97947 1.002104
                    ACF1
Training set 0.004756765
> mod2 <- Arima(time_series, order=c(1,0,0), seasonal=c(1,0,0), include.drift=TRUE,  
+                    include.mean=FALSE)
> summary(mod2)
Series: time_series 
ARIMA(1,0,0)(1,0,0)[12] with drift 

Coefficients:
         ar1    sar1   drift
      0.2014  0.2862  3.6940
s.e.  0.1205  0.1960  0.9009

sigma^2 estimated as 51238:  log likelihood=-525.77
AIC=1059.53   AICc=1060.09   BIC=1068.91

Training set error measures:
                   ME     RMSE      MAE       MPE     MAPE     MASE
Training set 19.14777 221.9052 107.1779 -57.65942 96.53528 1.078082
                    ACF1
Training set -0.01092874

So in summary my question is: Is there a way to use an auto-fitting process in R to get the top x arima models? I'm open to any solution but my order of preference for a solution would be:

  • A way to pull model objects for top x models out of auto.arima
  • A way to pull the information on top x models in text or understand the trace argument better in order to back into creating the models myself.
  • Some other package or functionality

Thanks in advance for any help that can be offered!

1

There are 1 answers

4
Maurits Evers On BEST ANSWER

The information from the forecast::auto.arima trace uniquely determines the model.

Let's explicitly verify this for the third model (using your sample data):

ARIMA(1,0,0)(1,0,0)[12] with non-zero mean : 1057.504

We use base R's arima to manually fit the same model

fit3 <- arima(
    time_series, 
    order = c(1, 0, 0), 
    seasonal = list(order = c(1, 0, 0), period = 12L),
    include.mean = TRUE)

When using ML to estimate model parameters, arima returns the AIC value; the trace of forecast::auto.arima on the other hand returns the corrected (for small sample size) AIC. Thanks to this post on Cross Validated, we can write a custom function that calculates the AICc based on the model parameters.

get_AICc <- function(fit) {
    npar <- length(fit$coef) + 1
    nstar <- length(fit$residuals) - fit$arma[6] - fit$arma[7] * fit$arma[5]
    fit$aic + 2 * npar * (nstar/(nstar - npar - 1) - 1)
}

We confirm that the corrected AIC is the same as the one reported in the trace of forecast::auto.arima:

get_AICc(fit3)
#[1] 1057.504

forecast::Arima automatically returns both AIC and AICc values, and fit results coincide with those from arima (unsurprisingly, as forecast::Arima internally calls arima) and forecast::auto.arima:


Arima(
    time_series,
    order = c(1, 0, 0), 
    seasonal = c(1, 0, 0),
    include.mean = TRUE)
)
#Series: time_series 
#ARIMA(1,0,0)(1,0,0)[12] with non-zero mean 
#
#Coefficients:
#    ar1    sar1      mean
#0.1849  0.1780  179.5156
#s.e.  0.1168  0.2077   36.2321
#
#sigma^2 estimated as 49966:  log likelihood=-524.47
#AIC=1056.95   AICc=1057.5   BIC=1066.32

In comparison to base R's arima, forecast::Arima allows for more flexibility on the inclusion of constants: while arima allows for a constant offset through include.mean, forecast::Arima also allows for a time-dependent (deterministic) drift parameter. For more details, see Constants and ARIMA models in R.


I'm not aware of an already-available way to obtain a list of top n model fits from forecast::auto.arima. The trace only prints intermediate results from the search in parameter space to the console. In the past, I have defined & written my own parameter grid search of SARIMA model parameters to determine an optimal fit model based on one of the information criteria; however, forecast::auto.arima is more sophisticated and also considers (i.e. excludes) model fits with unit roots close to the unit circle. Your best bet is to either define your own grid search and have it return a list of the top n models; or modify the code of forecast::auto.arima to return such a list; or (probably the least elegant solution) capture and parse the output of auto.arima(... trace = TRUE) to extract model parameters and re-fit using arima or forecast::Arima.


Update 1

I did say that capturing the output of auto.arima was probably the least elegant method, but this is what I did end up doing; perhaps this is helpful to you as a starting point:

Let's define a function that parses an ARIMA model string as generated from auto.arima

parse_SARIMA_string <- function(x) {
    include.mean <- str_detect(x, "with non-zero mean")
    x %>% 
        str_remove_all(":.+$") %>%
        str_replace_all("\\D", " ") %>%
        trimws() %>%
        enframe() %>%
        separate(
            value, 
            c("p", "d", "q", "P", "D", "Q", "period"),
            sep = "\\s+",
            fill = "right") %>%
        mutate(include.mean = include.mean)
}

Then we can use capture.output to capture the trace of forecast::auto.arima and parse the model strings

library(tidyverse)
capture.output(auto.arima(time_series, trace = TRUE)) %>%
    str_subset("^ ARIMA") %>%
    parse_SARIMA_string()
## A tibble: 13 x 9
#    name p     d     q     P     D     Q     period include.mean
#   <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>  <lgl>       
# 1     1 2     0     2     1     0     1     12     TRUE        
# 2     2 0     0     0     NA    NA    NA    NA     TRUE        
# 3     3 1     0     0     1     0     0     12     TRUE        
# 4     4 0     0     1     0     0     1     12     TRUE        
# 5     5 0     0     0     NA    NA    NA    NA     FALSE       
# 6     6 1     0     0     NA    NA    NA    NA     TRUE        
# 7     7 1     0     0     0     0     1     12     TRUE        
# 8     8 1     0     0     1     0     1     12     TRUE        
# 9     9 2     0     0     NA    NA    NA    NA     TRUE        
#10    10 1     0     1     NA    NA    NA    NA     TRUE        
#11    11 0     0     1     NA    NA    NA    NA     TRUE        
#12    12 2     0     1     NA    NA    NA    NA     TRUE        
#13    13 1     0     0     NA    NA    NA    NA     FALSE 

Now that you have a data.frame/tibble with the model parameters it's straightforward to use them in arima or forecast::Arima.


Update 2

Regarding the identifiability of a SARIMA model based on the auto.arima model string, let's take a look at some specific models.

With differencing

  • Model 1 is an ARMA model with a constant

    enter image description here

  • Model 2 is an ARIMA model with drift

    enter image description here

    We can expand the model to get

    enter image description here

  • Model 3 is an ARIMA model with drift and a constant

    enter image description here

    When we expand model 3 we notice that the constant term cancels due to the differencing, and we end up with the same model 2.

So d=1 models with drift and drift+constant describe the same process. In this case, reporting a model as e.g. "ARIMA(1,1,0) with drift" would refer to the same model as saying "ARIMA(1,1,0) with drift and non-zero mean".

No differencing

For d=0, using include.drift = TRUE by default results in a drift and constant offset parameter, see e.g. Constants and ARIMA models in R. If you specify include.drift = TRUE and then manually force include.mean = FALSE you get a different model (as you show in mod1 and mod2). From what I understand, this will never happen in forecast::auto.arima, where include.drift = TRUE implicitly always has include.mean = TRUE. So a SARIMA string ARIMA(1,0,0)(1,0,0)[12] with drift from forecast::auto.arima would translate to a SARIMA(1,0,0)(1,0,0)[12] model with drift and offset.