Forecasting models for the following data

153 views Asked by At

I am using R for analysis. My data is as follows:

  id    timestamp   cumsum
1284381 21/01/2015  33
1284381 21/01/2015  57
1284381 2/3/2015    79
1284381 4/3/2015    203
1284381 25/03/2015  475
1284381 11/4/2015   578
1284381 17/04/2015  856
1284381 21/04/2015  1189
1284381 5/5/2015    1214
1284381 10/5/2015   1321
1284381 12/5/2015   1340
1284381 15/05/2015  1529
1284381 18/05/2015  1649
1284381 19/05/2015  1977
1284381 21/05/2015  2385
1284381 23/05/2015  2528
1284381 26/05/2015  2556
1284381 29/05/2015  2705
1284381 1/6/2015    2898
1284381 4/6/2015    2913
1284381 7/6/2015    2921
1284381 13/06/2015  2922
1284381 13/06/2015  3622
1284381 16/06/2015  3834
1284381 19/06/2015  3913
1284895 27/01/2015  6
1284895 27/01/2015  49
1284895 18/03/2015  57
1284895 20/03/2015  58
1284895 23/03/2015  59
1284895 23/03/2015  60
1284895 24/03/2015  62
1284895 29/03/2015  67
1284895 31/03/2015  75
1284895 1/4/2015    76
1284895 2/4/2015    77
1284895 8/4/2015    78
1284895 16/04/2015  80
1284895 21/04/2015  103
1284895 23/04/2015  275
1284895 26/04/2015  293
1284895 27/04/2015  386
1284895 30/04/2015  539
1284895 3/5/2015    807
1284895 8/5/2015    851
1284895 11/5/2015   988
1284895 14/05/2015  1056
1284895 18/05/2015  1157
1284895 21/05/2015  1226
1284895 23/05/2015  1383
1284895 26/05/2015  1501
1284895 30/05/2015  1518
1284895 2/6/2015    1694
1284895 4/6/2015    1695
1284895 8/6/2015    1858
1284895 11/6/2015   1909
1284895 14/06/2015  1917
1284895 17/06/2015  1957
1284895 20/06/2015  1973

The first column is ID, second is date and third is cumulative sum of the value. I want to build a forecasting model to this data, which can provide me a solution of, for a given id, at a future date(say. 08/08/2015), the cumsum would be ?? I have tried forecasting models with two variables. Since it is three variables and also the data is daily data and not continuous, I am facing difficulties in setting up the model.

2

There are 2 answers

0
Vedda On

This should be fairly straightforward, but I'm sure you'll want to tweak this for more detail. Look at the forecast package for more information. It's a great tool.

Sample Data :

df <- structure(list(id = c(1284381L, 1284381L, 1284381L, 1284381L, 
1284381L, 1284381L, 1284381L, 1284381L, 1284381L, 1284381L, 1284381L, 
1284381L, 1284381L, 1284381L, 1284381L, 1284381L, 1284381L, 1284381L, 
1284381L, 1284381L, 1284381L, 1284381L, 1284381L, 1284381L, 1284381L, 
1284895L, 1284895L, 1284895L, 1284895L, 1284895L, 1284895L, 1284895L, 
1284895L, 1284895L, 1284895L, 1284895L, 1284895L, 1284895L, 1284895L, 
1284895L, 1284895L, 1284895L, 1284895L, 1284895L, 1284895L, 1284895L, 
1284895L, 1284895L, 1284895L, 1284895L, 1284895L, 1284895L, 1284895L, 
1284895L, 1284895L, 1284895L, 1284895L, 1284895L, 1284895L), 
    timestamp = structure(c(25L, 25L, 20L, 43L, 32L, 4L, 14L, 
    26L, 45L, 3L, 7L, 11L, 17L, 18L, 27L, 30L, 34L, 38L, 2L, 
    44L, 46L, 8L, 8L, 13L, 19L, 35L, 35L, 16L, 23L, 28L, 28L, 
    31L, 37L, 42L, 1L, 21L, 47L, 12L, 26L, 29L, 33L, 36L, 40L, 
    39L, 48L, 5L, 9L, 17L, 27L, 30L, 34L, 41L, 22L, 44L, 49L, 
    6L, 10L, 15L, 24L), .Label = c("1/4/2015", "1/6/2015", "10/5/2015", 
    "11/4/2015", "11/5/2015", "11/6/2015", "12/5/2015", "13/06/2015", 
    "14/05/2015", "14/06/2015", "15/05/2015", "16/04/2015", "16/06/2015", 
    "17/04/2015", "17/06/2015", "18/03/2015", "18/05/2015", "19/05/2015", 
    "19/06/2015", "2/3/2015", "2/4/2015", "2/6/2015", "20/03/2015", 
    "20/06/2015", "21/01/2015", "21/04/2015", "21/05/2015", "23/03/2015", 
    "23/04/2015", "23/05/2015", "24/03/2015", "25/03/2015", "26/04/2015", 
    "26/05/2015", "27/01/2015", "27/04/2015", "29/03/2015", "29/05/2015", 
    "3/5/2015", "30/04/2015", "30/05/2015", "31/03/2015", "4/3/2015", 
    "4/6/2015", "5/5/2015", "7/6/2015", "8/4/2015", "8/5/2015", 
    "8/6/2015"), class = "factor"), cumsum = c(33L, 57L, 79L, 
    203L, 475L, 578L, 856L, 1189L, 1214L, 1321L, 1340L, 1529L, 
    1649L, 1977L, 2385L, 2528L, 2556L, 2705L, 2898L, 2913L, 2921L, 
    2922L, 3622L, 3834L, 3913L, 6L, 49L, 57L, 58L, 59L, 60L, 
    62L, 67L, 75L, 76L, 77L, 78L, 80L, 103L, 275L, 293L, 386L, 
    539L, 807L, 851L, 988L, 1056L, 1157L, 1226L, 1383L, 1501L, 
    1518L, 1694L, 1695L, 1858L, 1909L, 1917L, 1957L, 1973L)), .Names = c("id", 
"timestamp", "cumsum"), class = "data.frame", row.names = c(NA, 
-59L))

Code :

# Load libraries
library(forecast)
library(dplyr)

# Function for forecasting
fc <- function(id, forecast_out){
  fcast <- filter(df, id == id)
  forecast(fcast$cumsum, h = forecast_out)
  plot(forecast(fcast$cumsum, h = forecast_out))
}

# Forecast id 20 periods out
fc(id = 1284381, 20)

Output :

   Point Forecast        Lo 80     Hi 80       Lo 95     Hi 95
60       1972.852   919.796925  3025.907    362.3432  3583.361
61       1972.852   427.927523  3517.777   -389.9061  4335.610
62       1972.852     2.484834  3943.219  -1040.5645  4986.269
63       1972.852  -400.473573  4346.178  -1656.8363  5602.540
64       1972.852  -798.641352  4744.345  -2265.7813  6211.485
65       1972.852 -1201.836164  5147.540  -2882.4146  6828.119
66       1972.852 -1616.650850  5562.355  -3516.8189  7462.523
67       1972.852 -2048.143856  5993.848  -4176.7305  8122.435
68       1972.852 -2500.592208  6446.296  -4868.6905  8814.395
69       1972.852 -2977.876792  6923.581  -5598.6343  9544.338
70       1972.852 -3483.701691  7429.406  -6372.2267 10317.931
71       1972.852 -4021.730885  7967.435  -7195.0714 11140.775
72       1972.852 -4595.680763  8541.385  -8072.8520 12018.556
73       1972.852 -5209.387818  9155.092  -9011.4359 12957.140
74       1972.852 -5866.861969  9812.566 -10016.9559 13962.660
75       1972.852 -6572.331508 10518.036 -11095.8784 15041.582
76       1972.852 -7330.283276 11275.987 -12255.0656 16200.770
77       1972.852 -8145.500375 12091.204 -13501.8326 17447.537
78       1972.852 -9023.098946 12968.803 -14844.0038 18789.708
79       1972.852 -9968.565065 13914.269 -16289.9694 20235.673

enter image description here

0
Data Junki On

I agree with Amstell's recommendation to check out Rob Hyndman's forecast package in R. It attempts to select the model that will have the least MSE for a 1 step out of sample forecast using AICc as the selection criteria (if I remember correctly). Recently, he added support for a time-series cross-validation procedure that can also be used to select which model might perform best when forecasting out of sample. You can read more about that here: Time-Series Cross-Validation

I've essentially performed the latter of the two in python in order to select from a subset of ARIMAX models. Below is the resulting 1 step ahead forecast in the validation data set for the model I chose as optimal.

Validation data set - 1 step ahead forecast performance

The resulting model is: [6,2,2] or AR = 6, Diff=2, MA = 2

X variable is the date, converted into a day count. I applied a log normal transformation to the Y variable <-- completely arbitrary choice on my part.

1 step error in validation data set: MAE = 88.9, MAPE = 2.8%, rsqr = 97.3%

If your objective is to estimate more than 1 day ahead, then I'd suggest fitting the model such that your cross validation error is minimized for the given number of days ahead you require. Again, in this case, I've assumed 1 day ahead.