linear regression using dataset with missing values

Question

60 views Asked by SOF_helps At 04 July 2023 at 16:10

I have data on the effect sizes for 14 variables (var1-var14). Each value is the effect size of a specific treatment on a certain variable. Values are missing because some articles did not consider certain variables. A positive value indicates a promoting effect, and a negative value an inhibiting effect, of that treatment on the variable. I want to (1) run a pairwise linear regression across every pair of variables to check whether they are associated, and (2) treat var1 as the dependent variable and var2-var14 as independent variables to find the best-fit model (maybe using the glmulti package?) and show which variables matter most for changes in var1.

Here is some sample data:

set.seed(123)

# Create the dataset with effect sizes and missing values

mydata <- data.frame(
  Var1 = sample(c(-20:14, NA), 64, replace = TRUE),
  Var2 = sample(c(-20:14, NA), 64, replace = TRUE),
  Var3 = sample(c(-20:14, NA), 64, replace = TRUE),
  Var4 = sample(c(-20:14, NA), 64, replace = TRUE),
  Var5 = sample(c(-20:14, NA), 64, replace = TRUE),
  Var6 = sample(c(-20:14, NA), 64, replace = TRUE),
  Var7 = sample(c(-20:14, NA), 64, replace = TRUE),
  Var8 = sample(c(-20:14, NA), 64, replace = TRUE),
  Var9 = sample(c(-20:14, NA), 64, replace = TRUE),
  Var10 = sample(c(-20:14, NA), 64, replace = TRUE),
  Var11 = sample(c(-20:14, NA), 64, replace = TRUE),
  Var12 = sample(c(-20:14, NA), 64, replace = TRUE),
  Var13 = sample(c(-20:14, NA), 64, replace = TRUE),
  Var14 = sample(c(-20:14, NA), 64, replace = TRUE)
)

# Set at least 50% of the values in each column to NA (32 of 64 rows, plus any NA already sampled)
for (col in 1:14) {
  missing_indices <- sample(1:64, size = 32)
  mydata[missing_indices, col] <- NA
}

Is it possible to do all this with such a dataset (i.e., one with this many missing values)? Thanks!

There is 1 answer

**I_O** · Answer 1 · 2023-07-04T19:54:46+00:00

d being your example data:

d <- 
  paste0('Var_', 1:14) |>
  Map(f = \(.) sample(c(-20:14, NA),
                      size = 64,
                      prob = c(rep(.49/35, 35), .51),
                      replace = TRUE
                      )
      ) |>
  as.data.frame()

... you get the pairwise associations in terms of the correlation matrix like so:

d |> cor(use = 'pairwise.complete.obs')
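If you want actual pairwise regressions rather than correlations, here is a minimal base-R sketch. It assumes the same construction of d as above; the seed and the names obs, n_pairs, vars and slopes are only illustrative. It also counts how many rows each pair really shares, which is worth checking with this much missingness:

```r
set.seed(1)  # arbitrary seed, assumed here only to make the sketch reproducible

# Same construction as `d` above: 14 columns, roughly 51% NA in each
d <- as.data.frame(Map(
  \(.) sample(c(-20:14, NA), 64, replace = TRUE, prob = c(rep(.49 / 35, 35), .51)),
  paste0('Var_', 1:14)
))

# Number of rows in which both variables of a pair are observed
obs <- !is.na(as.matrix(d))
storage.mode(obs) <- 'double'
n_pairs <- crossprod(obs)

# slopes[y, x] is the slope of the simple regression y ~ x;
# lm() drops rows with NA in either variable (default na.action = na.omit)
vars <- names(d)
slopes <- sapply(vars, \(x) sapply(vars, \(y) {
  if (x == y || n_pairs[y, x] < 3) return(NA_real_)  # skip self-pairs and tiny samples
  unname(coef(lm(d[[y]] ~ d[[x]]))[2])
}))
```

Each slope is based on a different subset of rows, so look at n_pairs before reading much into any single entry.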

... and a basic column-wise imputation (replacing NA with the mean value) this way:

d_imputed <- d |>
  apply(2, \(var) replace(var, is.na(var), mean(var, na.rm = TRUE)))
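To see why some form of imputation is needed before a regression on all columns, and what mean imputation costs you, here is a small sketch under the same assumptions about d (the seed and the names n_complete, p_complete, n_obs and var_ratio are only illustrative). With about 51% missing in each of 14 columns, a row is complete with probability of roughly 0.49^14 if missingness is independent, so there are very likely no complete rows at all:

```r
set.seed(1)  # arbitrary seed, assumed here only to make the sketch reproducible

# Same construction as `d` above: 14 columns, roughly 51% NA in each
d <- as.data.frame(Map(
  \(.) sample(c(-20:14, NA), 64, replace = TRUE, prob = c(rep(.49 / 35, 35), .51)),
  paste0('Var_', 1:14)
))

# Rows without any NA, and the expected share of such rows
n_complete <- sum(complete.cases(d))
p_complete <- 0.49^14

# Mean imputation keeps all 64 rows ...
d_imputed <- as.data.frame(
  lapply(d, \(v) replace(v, is.na(v), mean(v, na.rm = TRUE)))
)

# ... but shrinks each column's variance, because the filled-in values
# sit exactly at the column mean: the ratio equals (n_obs - 1) / (64 - 1)
n_obs <- colSums(!is.na(d))
var_ratio <- sapply(d_imputed, var) / sapply(d, var, na.rm = TRUE)
```

The shrunken variances are one reason mean imputation is only a stopgap.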

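Finally, you can obtain the regression coefficients of the predictors (all other columns) for each column like so:

d_df <- as.data.frame(d_imputed)
# `.` expands to every column except the response; leaving the response in would give a trivially perfect fit
sapply(names(d_df), \(v) coef(lm(reformulate('.', response = v), d_df)), simplify = FALSE)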
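For the second part of the question (Var_1 as the response, a best-fit model, and some ranking of predictors), here is a rough sketch that swaps in stats::step for glmulti, so it is a stepwise AIC search and not an exhaustive one. It assumes the same d and the same mean imputation as above; the seed and the names d_imp, full, best, beta_std and ranking are only illustrative:

```r
set.seed(1)  # arbitrary seed, assumed here only to make the sketch reproducible

# Same construction as `d` above, followed by column-wise mean imputation
d <- as.data.frame(Map(
  \(.) sample(c(-20:14, NA), 64, replace = TRUE, prob = c(rep(.49 / 35, 35), .51)),
  paste0('Var_', 1:14)
))
d_imp <- as.data.frame(
  lapply(d, \(v) replace(v, is.na(v), mean(v, na.rm = TRUE)))
)

# Full model: Var_1 on all other (mean-imputed) columns
full <- lm(Var_1 ~ ., data = d_imp)

# Stepwise search by AIC, starting from the full model; trace = 0 silences the log
best <- step(full, direction = 'both', trace = 0)

# Standardized coefficients of the full model as a crude importance ranking
z <- as.data.frame(scale(d_imp))
beta_std <- coef(lm(Var_1 ~ ., data = z))[-1]  # drop the intercept
ranking <- sort(abs(beta_std), decreasing = TRUE)
```

formula(best) shows which predictors survived the search. On purely random example data like this, neither the selected model nor the ranking means anything, and both inherit the distortions of mean imputation.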

A word of caution: the above is just a technical answer to your literal question. For a statistically sound solution, I'd recommend reading up over at Cross Validated on imputation, dimensionality reduction, predictor selection and such (see Ben Bolker's comment).
