I've spent days searching for optimal models that fulfill all of the standard OLS assumptions (normally distributed residuals, homoscedasticity, no multicollinearity) in R, but with 12 variables it's impractical to find the optimal variable combination by hand. So I tried to create a script that would automate this process.

Here is sample code for the calculations:

x1 <- runif(100, 0, 10)
x2 <- runif(100, 0, 10)
x3 <- runif(100, 0, 10)
x4 <- runif(100, 0, 10)
x5 <- runif(100, 0, 10)

df <- data.frame(x1, x2, x3, x4, x5)

library(lmtest)
library(car)

model <- lm(x1~x2+x3+x4+x5, data = df)

# check residuals for normality (Shapiro-Wilk test)
rs_sd <- rstandard(model)
shapiro.test(rs_sd)

# check for heteroskedasticity (Breusch-Pagan test)
bptest(model)

# check for multicollinearity
vif(model)

#-------------------------------------------------------------------------------
# models without outliers
# identify outliers (calculate Cook's distance; if D > 4/(n-k-1) --> outlier)
cooks <- round(cooks.distance(model), digits = 4)
df_no_out <- cbind(df, cooks)
df_no_out <- subset(df_no_out, cooks < 4/(100-4-1))

model_no_out <- lm(x1~x2+x3+x4+x5, data = df_no_out)

# check for normal distribution
rs_sd_no_out<- rstandard(model_no_out)
shapiro.test(rs_sd_no_out)

# check for heteroskedasticity
bptest(model_no_out)

# check for multicollinearity
vif(model_no_out)

What I have in mind is to loop through all of the variable combinations and get the p-values of shapiro.test() and bptest(), or the VIF values, for every model created, so I can compare the significance values or the multicollinearity respectively. (In my dataset multicollinearity shouldn't be a problem anyway, and since the VIF check produces one value per variable, which would probably be more challenging to implement, the p-values of shapiro.test() and bptest() would suffice.)

I've tried to write several scripts to automate the process, but without success (unfortunately I'm not a programmer). I know there are already some threads dealing with this problem:

How to run lm models using all possible combinations of several variables and a factor

Finding the best combination of variables for high R-squared values

but I haven't found a script that also calculates just the p-values.

The tests on the models without outliers are especially important, because after removing the outliers the OLS assumptions are fulfilled in many cases.

I would really appreciate any suggestions or help with this.


There are 2 answers

Answer from Rui Barradas:

The following automates the model fitting and the tests you ran afterwards.

There is one function that fits all possible models. Then a series of calls to the *apply functions will get the values you want.

library(lmtest)
library(car)


fitAllModels <- function(data, resp, regr){
  # fit one lm() per column of M, where each column of M
  # holds the names of the regressors of one model
  f <- function(M){
    apply(M, 2, function(x){
      fmla <- paste(resp, paste(x, collapse = "+"), sep = "~")
      fmla <- as.formula(fmla)
      lm(fmla, data = data)
    })
  }
  # keep only the candidate regressors that exist in the data
  regr <- names(data)[names(data) %in% regr]
  # all combinations of 1, 2, ..., length(regr) regressors
  regr_list <- lapply(seq_along(regr), function(n) combn(regr, n))
  # fit every model and flatten into a single list of lm objects
  models_list <- lapply(regr_list, f)
  unlist(models_list, recursive = FALSE)
}

Now the data.

# Make up a data.frame to test the function above.
# Don't forget to set the RNG seed to make the
# results reproducible
set.seed(7646)
x1 <- runif(100, 0, 10)
x2 <- runif(100, 0, 10)
x3 <- runif(100, 0, 10)
x4 <- runif(100, 0, 10)
x5 <- runif(100, 0, 10)

df <- data.frame(x1, x2, x3, x4, x5)

First, fit all models with "x1" as the response and the other variables as possible regressors. The function can be called with one response and any number of possible regressors.

fit_list <- fitAllModels(df, "x1", names(df)[-1])
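
Since fitAllModels() enumerates every non-empty subset of the four candidate regressors, fit_list should hold 2^4 - 1 = 15 models:

length(fit_list)
# [1] 15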

And now the sequence of tests.

# Normality test, standardized residuals
rs_sd_list <- lapply(fit_list, rstandard)
sw_list <- lapply(rs_sd_list, shapiro.test)
sw_pvalues <- sapply(sw_list, '[[', 'p.value')

# check for heteroskedasticity (Breusch-Pagan-Test)
bp_list <- lapply(fit_list, bptest)
bp_pvalues <- sapply(bp_list, '[[', 'p.value')

# check for multicollinearity;
# VIF is only computed for models with 2 or more regressors
vif_values <- lapply(fit_list, function(fit){
  regr <- attr(terms(fit), "term.labels")
  if(length(regr) < 2) NA else vif(fit)
})
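
With all the p-values collected, they can be put next to a readable model label for side-by-side comparison. This is a minimal sketch; model_labels and results are names introduced here, not part of the code above:

# one label per model, e.g. "x2+x4+x5"
model_labels <- sapply(fit_list, function(fit){
  paste(attr(terms(fit), "term.labels"), collapse = "+")
})
results <- data.frame(model = model_labels,
                      shapiro_p = sw_pvalues,
                      bp_p = bp_pvalues)
# e.g. keep the models where neither test rejects at the 5% level
subset(results, shapiro_p > 0.05 & bp_p > 0.05)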

A note on Cook's distance. In your code you subset the original data.frame, producing a new one without the outliers. This duplicates data. I have opted for a list of indices of the df's rows only. If you prefer the duplicated data.frames, uncomment the marked line in the anonymous function below and comment out the last one.

# models without outliers
# identify outliers (calculating the 
# Cooks distance, if x > 4/(n - k - 1) --> outlier

df_no_out_list <- lapply(fit_list, function(fit){
  cooks <- cooks.distance(fit)
  regr <- attr(terms(fit), "term.labels")
  k <- length(regr)
  inx <- cooks < 4/(nrow(df) - k - 1)
  # df[inx, ]   # uncomment to return the data.frame itself
  which(inx)
})

# This tells how many rows each data set without outliers has
sapply(df_no_out_list, NROW)

# The data.frame without outliers for model number 8.
# The two code lines below could become a one-liner.
i <- df_no_out_list[[8]]
df[i, ]
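
And since the question stresses the models without outliers, here is a minimal sketch that refits every model on its own outlier-free rows and recomputes the two p-values. The names fit_no_out_list, sw_pvalues_no_out and bp_pvalues_no_out are introduced here:

# refit each model using only its non-outlier rows
fit_no_out_list <- lapply(seq_along(fit_list), function(i){
  lm(formula(fit_list[[i]]), data = df[df_no_out_list[[i]], ])
})

sw_pvalues_no_out <- sapply(fit_no_out_list, function(fit)
  shapiro.test(rstandard(fit))$p.value)
bp_pvalues_no_out <- sapply(fit_no_out_list, function(fit)
  bptest(fit)$p.value)
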
Answer from Stephen:


you are scratching the surface of what is now referred to as Statistical learning. the intro text is "Statistical Learning with applications in R" and the grad level text is "The Elements of Statistical learning". to do what you need you use regsubsets() function from the "leaps" package. However if you read at least chapter 6 from the intro book you will discover about cross-validation and bootstrapping which are the modern way of doing model selection.