Regression with autocorrelated residuals - Add lagged dep. variable or model residuals with AR(1) process?

90 views Asked by At

I am trying to solve a problem where I want to know to what level the dependent variable will converge to if I assume that the independent variable will remain at its current level.

My problem is that the residuals are strongly autocorrelated so I cannot use the "simple" y(t) = a+b*x(t)+eps(t) approach, but I am unsure of how to solve this in the best possible way...

I am no expert within this field, but could think of two possible ways to attack the problem: (1) Add a lagged dependent variable, which would make the equation look like y(t)=a+bx(t)+cy(t-1)+eps(t) (2) Allow the residuals to follow an AR(1) process, which would turn my equation into: y(t)=a+bx(t)+eps(t) where eps(t) = ceps(t-1)+w(t)

The problem for me is that the two approaches above produces significantly different results, something that makes me think that I have done something wrong somewhere...

The input data: The input data

Which is positively correlated: Scatterplot

But with strong correlation between the residuals: ACF

In the supplied code I am trying to find the convergence level for the dependent variable if I assume that the independent variable will remain at its current level for obs_no >= 60.

The time series of the dependent variable as well as the convergence levels for obs_nbr >= 60 when using:

  • A simple linear regression
  • A linear regression with lagged dependent variable
  • A linear regression where the residuals is modelled as an AR(1) process Can be found below:

Actual and estimated convergence level for dep. variable

The R code that I have used can be seen below. I would very much appreciate any guidance of what approach to choose. I assume that I have done something stupid as the results are so different depending on which of the two approaches that I use....

library(forecast)
library(ggplot2)

# Creating the dataframe with the data that I am using... It seems as it is not possible to upload files so I have to do it this way.
obs_nbr <-  1:226
indep_variable <- c(0.0231,0.0226,0.0228,0.0206,0.0208,0.0197,0.0194,0.022,0.0209,0.0215,0.0228,0.0224,0.0201,0.0204,0.0192,0.0231,0.0251,0.0253,0.0256,0.0244,0.0222,0.0224,0.0221,0.0211,0.0237,0.0238,0.0229,0.0244,0.0248,0.0251,0.026,0.0242,0.0248,0.0233,0.0236,0.0228,0.0234,0.022,0.0245,0.0264,0.0267,0.0279,0.0275,0.0279,0.0278,0.0302,0.0358,0.0321,0.0246,0.0332,0.0308,0.0268,0.0329,0.03,0.0256,0.0293,0.0279,0.0265,0.0283,0.0258,0.0278,0.0283,0.0304,0.0289,0.0256,0.0243,0.0245,0.0267,0.0219,0.024,0.0203,0.0237,0.026,0.0298,0.0284,0.0271,0.026,0.026,0.0259,0.021,0.0186,0.0172,0.0144,0.0138,0.0136,0.0122,0.0136,0.0157,0.0132,0.0093,0.0106,0.0087,0.0103,0.0106,0.0093,0.0095,0.0102,0.012,0.0128,0.0131,0.0097,0.0154,0.0185,0.0196,0.0199,0.0193,0.0193,0.0218,0.0227,0.0191,0.0186,0.0183,0.0165,0.0148,0.0152,0.0145,0.0127,0.0165,0.0147,0.0137,0.0121,0.0104,0.011,0.0111,0.0119,0.0125,0.0139,0.0125,0.0147,0.0151,0.0137,0.014,0.0158,0.0153,0.0145,0.0113,0.0108,0.0114,0.0091,0.0095,0.0084,0.009,0.0101,0.0124,0.0131,0.0118,0.0094,0.0095,0.0089,0.0088,0.0104,0.0118,0.0102,0.0113,0.0111,0.0111,0.0093,0.0094,0.0105,0.0095,0.0092,0.0097,0.009,0.0097,0.0101,0.0115,0.0147,0.0143,0.0138,0.013,0.0134,0.0113,0.0118,0.0093,0.0104,0.0103,0.0058,0.0073,0.0085,0.0074,0.009,0.0054,0.004,0.0032,0.0022,0.0033,0.0028,0.0002,0.0014,0.0008,0.0016,-0.0007,0.0003,0.0011,0.0046,0.0056,0.0055,0.0046,0.0028,0.0028,0.0019,0.0017,0.0014,-0.001,0.0007,0.0034,0.0047,0.0061,0.0095,0.0144,0.0184,0.0166,0.0161,0.0232,0.0213,0.019,0.0193,0.0183,0.0182,0.0195,0.0195,0.0195,0.0182,0.0209,0.022,0.0256)
dep_variable <- c(0.0639,0.0655,0.0643,0.0656,0.0664,0.0648,0.066,0.0644,0.0665,0.0674,0.0697,0.0675,0.0681,0.0671,0.0676,0.0675,0.0673,0.0704,0.0711,0.0715,0.0705,0.0694,0.0679,0.0674,0.0666,0.0658,0.0678,0.0674,0.0656,0.0643,0.0663,0.0697,0.0694,0.0673,0.0665,0.0689,0.069,0.0727,0.0751,0.0753,0.0709,0.0709,0.0781,0.0778,0.0762,0.083,0.0904,0.0917,0.0842,0.0821,0.0884,0.0797,0.072,0.0695,0.0727,0.0694,0.0683,0.0675,0.0715,0.0695,0.0698,0.0747,0.0735,0.0706,0.0727,0.0806,0.0861,0.0821,0.087,0.0805,0.0787,0.0797,0.076,0.0761,0.0746,0.076,0.0758,0.0781,0.08,0.0828,0.0883,0.0948,0.0844,0.0852,0.0847,0.0813,0.0785,0.0769,0.0789,0.0846,0.0815,0.0798,0.079,0.0772,0.0785,0.0786,0.0785,0.075,0.075,0.0727,0.0716,0.0708,0.0724,0.0694,0.0722,0.0705,0.0677,0.0662,0.0647,0.0673,0.0647,0.0647,0.0651,0.0644,0.0637,0.0658,0.0638,0.0652,0.063,0.0616,0.0608,0.0611,0.0578,0.0593,0.059,0.059,0.0605,0.0597,0.0637,0.0654,0.0601,0.0602,0.0613,0.0633,0.0633,0.0595,0.0595,0.0593,0.0597,0.0581,0.0587,0.059,0.0607,0.059,0.0585,0.0579,0.056,0.0565,0.0567,0.0566,0.0569,0.0562,0.0567,0.056,0.0555,0.0545,0.0549,0.0558,0.0589,0.061,0.0619,0.0614,0.0617,0.0605,0.0593,0.0595,0.0644,0.0632,0.0691,0.0633,0.0615,0.0606,0.0589,0.0636,0.0595,0.0589,0.0601,0.0592,0.0578,0.056,0.0546,0.0552,0.0599,0.0613,0.0485,0.0462,0.0462,0.0456,0.0437,0.0464,0.049,0.0455,0.0441,0.0466,0.0466,0.0456,0.0461,0.0472,0.047,0.0477,0.0472,0.0497,0.047,0.0482,0.0466,0.05,0.0522,0.0513,0.0569,0.0576,0.0631,0.0572,0.0599,0.0657,0.0598,0.0565,0.0598,0.0555,0.0569,0.0552,0.0546,0.0551,0.0518,0.0506,0.0527,0.0558)

regression_data <- as.data.frame(cbind(obs_nbr, indep_variable, dep_variable))
colnames(regression_data) <- c("OBS_NBR", "INDEPENDENT_VARIABLE", "DEPENDENT_VARIABLE")


# Creating some graphs to see what kind of data issues we have...
ggplot(data = regression_data, aes(x=OBS_NBR)) +
  geom_line(aes(y=INDEPENDENT_VARIABLE, color = "INDEPENDENT_VARIABLE")) +
  geom_line(aes(y=DEPENDENT_VARIABLE, color = "DEPENDENT_VARIABLE")) +
  ylab(" ") +
  scale_color_manual(values= c("INDEPENDENT_VARIABLE" = "orange", "DEPENDENT_VARIABLE"="black"))

ggplot(data = regression_data, aes(x=INDEPENDENT_VARIABLE, y = DEPENDENT_VARIABLE)) +
  geom_point()

lin_model <- lm(regression_data$DEPENDENT_VARIABLE ~ regression_data$INDEPENDENT_VARIABLE)
acf(lin_model$residuals)

# Adding the lagged dependent variable to the dataframe
regression_data$DEPENDENT_VARIABLE_LAGGED <- c(NA, regression_data$DEPENDENT_VARIABLE[seq_along(regression_data$DEPENDENT_VARIABLE) -1])

# Adding columns for storing output when using a simple linear model...

empty_vector <-  rep(NA, nrow(regression_data))

regression_data$LM_INTERCEPT <- empty_vector
regression_data$LM_COEFF_INDEPENDENT_VARIABLE <- empty_vector
regression_data$LM_CONVERGENCE_LEVEL <- empty_vector

# Adding columns for storing output when using linear model with lagged ependent variable...
regression_data$LM_W_LAG_INTERCEPT <- empty_vector
regression_data$LM_W_LAG_COEFF_INDEPENDENT_VARIABLE <- empty_vector
regression_data$LM_W_LAG_COEFF_LAGGED_DEPENDENT_VARIABLE <- empty_vector
regression_data$LM_W_LAG_CONVERGENCE_LEVEL <- empty_vector


# Adding columns for storing output when using linear model with residuals following an AR(1) process...
regression_data$LM_AR1_RES_INTERCEPT <- empty_vector
regression_data$LM_AR1_RES_COEFF_INDEPENDENT_VARIABLE <- empty_vector
regression_data$LM_AR1_RES_COEFF_AR1 <- empty_vector
regression_data$LM_AR1_RES_CONVERGENCE_LEVEL <- empty_vector



for (i in 61: nrow(regression_data)){
  
  ###################################################################
  # Simple linear model
  lin_model <- lm(regression_data$DEPENDENT_VARIABLE[1:i] ~ regression_data$INDEPENDENT_VARIABLE[1:i])
  
  regression_data$LM_INTERCEPT[i] <-  lin_model$coefficients[1]
  regression_data$LM_COEFF_INDEPENDENT_VARIABLE[i] <-  lin_model$coefficients[2]
  regression_data$LM_CONVERGENCE_LEVEL[i] <-  lin_model$coefficients[1] + lin_model$coefficients[2] * regression_data$INDEPENDENT_VARIABLE[i]
  
  ###################################################################
  # Linear model with lagged dependent variable
  lin_model_w_lag <-lm(regression_data$DEPENDENT_VARIABLE[2:i] ~ cbind(regression_data$INDEPENDENT_VARIABLE[2:i], regression_data$DEPENDENT_VARIABLE_LAGGED[2:i]))

  regression_data$LM_W_LAG_INTERCEPT[i] <- lin_model_w_lag$coefficients[1]
  regression_data$LM_W_LAG_COEFF_INDEPENDENT_VARIABLE[i] <- lin_model_w_lag$coefficients[2]
  regression_data$LM_W_LAG_COEFF_LAGGED_DEPENDENT_VARIABLE[i] <- lin_model_w_lag$coefficients[3]

  # y(t) = a+b*x(t)+c*y(t-1) => y(t)-c*y(t-1) = a+b*x(t) => 
  # => [It is assumed that x remains at the current level, when convergence is reached (at t=T) then  y(T) = y(T-1)] => 
  # => y(T)-c*y(T) = a+b*x(T) =>
  # => y(T)*(1-c) = a+b*x(T) =>
  # => y(T) = a/(1-c) + b/(1-c)*x(T) 
  regression_data$LM_W_LAG_CONVERGENCE_LEVEL[i] <- 
    lin_model_w_lag$coefficients[1]/(1-lin_model_w_lag$coefficients[3]) +
    lin_model_w_lag$coefficients[2]/(1-lin_model_w_lag$coefficients[3]) * regression_data$INDEPENDENT_VARIABLE[i]
    
  ###################################################################
  # Linear model with residuals following an AR(1) process
  x_values <- regression_data$INDEPENDENT_VARIABLE[1:i]
  y_values <- regression_data$DEPENDENT_VARIABLE[1:i] 
  
  arima_model <- arima(y_values, xreg= x_values, order = c(1,0,0))
  regression_data$LM_AR1_RES_INTERCEPT[i] <-  arima_model$coef["intercept"]
  regression_data$LM_AR1_RES_COEFF_INDEPENDENT_VARIABLE[i] <-  arima_model$coef["x_values"]  
  regression_data$regression_data$LM_AR1_RES_COEFF_AR1[i] <-  arima_model$coef["ar1"]    
  
  regression_data$LM_AR1_RES_CONVERGENCE_LEVEL[i] <- arima_model$coef["intercept"] + arima_model$coef["x_values"] *regression_data$INDEPENDENT_VARIABLE[i]
}

plot_data <- as.data.frame(cbind(
                            regression_data$OBS_NBR,
                            regression_data$DEPENDENT_VARIABLE,
                            regression_data$LM_CONVERGENCE_LEVEL,
                            regression_data$LM_W_LAG_CONVERGENCE_LEVEL,
                            regression_data$LM_AR1_RES_CONVERGENCE_LEVEL
                            ))

colnames(plot_data) <- c("Obs_nbr", "DependentVar", "Conv_Lin_Model", "Conv_Lin_Model_w_Lag", "Conv_Lin_Model_w_AR1Res")                 

ggplot(data = plot_data, aes(x=Obs_nbr)) +
  geom_line(aes(y=DependentVar, color = "DependentVar")) +
  geom_line(aes(y=Conv_Lin_Model, color = "Conv_Lin_Model")) + 
  geom_line(aes(y=Conv_Lin_Model_w_Lag, color = "Conv_Lin_Model_w_Lag")) +
  geom_line(aes(y=Conv_Lin_Model_w_AR1Res, color = "Conv_Lin_Model_w_AR1Res")) +
  ylab(" ") + 
  scale_color_manual(values= c("DependentVar"="black", "Conv_Lin_Model"="red", "Conv_Lin_Model_w_Lag"="green", "Conv_Lin_Model_w_AR1Res"="blue"))
0

There are 0 answers