I´ve spent days searching for the optimal models which would fulfill all of the standard OLS assumptions (normal distribution, homoscedasticity, no multicollinearity) in R but with 12 variables, it´s impossible to find the optimal var combination. So I was trying to create a script which would automatize this process.
Here the sample code for calculations:
x1 <- runif(100, 0, 10)
x2 <- runif(100, 0, 10)
x3 <- runif(100, 0, 10)
x4 <- runif(100, 0, 10)
x5 <- runif(100, 0, 10)
df <- as.data.frame(cbind(x1,x2,x3,x4,x5))
library(lmtest)
library(car)
model <- lm(x1~x2+x3+x4+x5, data = df)
# check for normal distribution (Shapiro-Wilk-Test)
rs_sd <- rstandard(model)
shapiro.test(rs_sd)
# check for heteroskedasticity (Breusch-Pagan-Test)
bptest(model)
# check for multicollinearity
vif(model)
#-------------------------------------------------------------------------------
# models without outliers
# identify outliers (calculating the Cooks distance, if x > 4/(n-k-1) --> outlier
cooks <- round(cooks.distance(model), digits = 4)
df_no_out <- cbind(df, cooks)
df_no_out <- subset(df_no_out, cooks < 4/(100-4-1))
model_no_out <- lm(x1~x2+x3+x4+x5, data = df_no_out)
# check for normal distribution
rs_sd_no_out<- rstandard(model_no_out)
shapiro.test(rs_sd_no_out)
# check for heteroskedasticity
bptest(model_no_out)
# check for multicollinearity
vif(model_no_out)
What I have in mind is to loop through all of the var combinations and get the P-VALUES for the shapiro.test() and the bptest() or the VIF-values for all models created so I can compare the significance values or the multicollinearity resp. (in my dataset, the multicollinearity shouldn´t be a problem and since to check for multicollinearity the VIF test produces more values (for each var 1xVIF factor) which will be probably more challenging for implementing in the code), the p-values for shapiro.test + bptest() would suffice…).
I´ve tried to write several scripts which would automatize the process but without succeed (unfortunately I´m not a programmer). I know there´re already some threads dealing with this problem
How to run lm models using all possible combinations of several variables and a factor
Finding the best combination of variables for high R-squared values
but I haven´t find a script which would also calculate JUST the P-VALUES.
Especially the tests for models without outliers are important because after removing the outliers the OLS assumptions are fullfilled in many cases.
I would really very appreciate any suggestions or help with this.
The following automates the models fitting and the tests you made afterwards.
There is one function that fits all possible models. Then a series of calls to the
*apply
functions will get the values you want.Now the data.
First fit all models with
"x1"
as response and the other variables as possible regressors. The function can be called with one response and any number of possible regressors you want.And now the sequence of tests.
A note on the Cook's distance. In your code, you are subsetting the original data.frame, producing a new one without outliers. This will duplicate data. I have opted for a list of indices of the df's rows only. If you prefer the duplicated data.frames, uncomment the line in the anonymous function below and comment out the last one.