I used multiple imputation (the mice package in R) to deal with my missing data. Then I used the with() and pool() functions to fit a linear regression on each imputed dataset and get a pooled estimate. I am trying to predict a score across two groups (intervention and control).
Because I have so many variables and scores, I ran the imputations in blocks: the questions belonging to one scale are imputed together, then the questions of the next scale, and so on.
Now I want to get standardized coefficients.
I tried standardizing my dataset before the imputation, but the standardized estimate is very close to the unstandardized estimate (1.53 vs. -1.82). Does that make sense?
When I standardize the final scale directly instead of standardizing each question and then summing them at the regression step, I get a very small standardized coefficient (-0.24).
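A short derivation may help show why these two numbers need not agree. This is only a sketch, assuming that each of the $k = 6$ items has variance close to 1 after scaling and imputation; $z_i$ denotes a standardized item and $r_{ij}$ the correlation between items $i$ and $j$ (notation introduced here for illustration).

$$
\operatorname{Var}\Big(\sum_{i=1}^{k} z_i\Big) = \sum_{i=1}^{k} \operatorname{Var}(z_i) + 2\sum_{i<j} \operatorname{Cov}(z_i, z_j) \approx k + 2\sum_{i<j} r_{ij}
$$

If the items are non-negatively correlated, the summed score has a standard deviation somewhere between $\sqrt{6} \approx 2.45$ and $6$, not 1. A slope for the sum of z-scored items is therefore expressed in units of that sum rather than in standard deviations of the total score, and it would be expected to be larger in magnitude than a slope for the z-scored total.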
My two questions are:
- Which method is more accurate: standardizing each question, or standardizing the final scale?
- How can I obtain standardized betas after imputation?
Here is my code for the first approach described above (standardizing each question before imputation).
```r
library(mice)   # mice(), with(), pool()
library(dplyr)  # %>%, select(), all_of()

# Read the data
data <- read.csv("post_for_imputation.csv")

# Columns to impute
columns_to_check4 <- c(
  "post_BSocialMAddictionS_Q1", "post_BSocialMAddictionS_Q2",
  "post_BSocialMAddictionS_Q3",
  "post_BSocialMAddictionS_Q4", "post_BSocialMAddictionS_Q5",
  "post_BSocialMAddictionS_Q6")

# Keep only those columns
selected_columns <- data %>%
  select(all_of(columns_to_check4))

# Standardize each item with scale(), then convert back to a data frame
j <- scale(selected_columns)
j_df <- as.data.frame(j)

# Add my independent variable. It is categorical and does not work with
# scale(), which is why I add it after scaling. It has no missing values.
column_to_add <- data$group_post
j_df <- cbind(j_df, group_post = column_to_add)

# Run the imputation
imputed_data <- mice(j_df, m = 5, maxit = 10, seed = 500)

# Fit the regression in each imputed dataset, then pool the results
X <- with(imputed_data, lm(
  I(as.numeric(post_BSocialMAddictionS_Q1) +
    as.numeric(post_BSocialMAddictionS_Q2) +
    as.numeric(post_BSocialMAddictionS_Q3) +
    as.numeric(post_BSocialMAddictionS_Q4) +
    as.numeric(post_BSocialMAddictionS_Q5) +
    as.numeric(post_BSocialMAddictionS_Q6))
  ~ group_post))
summary(pool(X))
```
This method gives me a standardized coefficient that is very close to the unstandardized one. Is there a better way to do this? Is it even accurate? And which result should I report: the standardized coefficient from standardizing the summed score directly, or the one from summing the standardized items at the regression step (as in the code above)?
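For comparison, the second approach (imputing the raw items, then z-scoring the summed score inside each completed dataset) might look roughly like the sketch below. It is a minimal illustration under assumptions rather than the actual analysis: the data are simulated, and the item names `Q1` to `Q6`, the sample size, the group effect, and the 10% missingness rate are all hypothetical. Only `mice()`, `with()`, `pool()`, and base R's `scale()` are used.

```r
library(mice)

set.seed(500)
n <- 200

# Hypothetical stand-in data: a two-level group and six correlated items
group_post <- factor(rep(c("control", "intervention"), each = n / 2))
latent <- rnorm(n) - 0.3 * (group_post == "intervention")
items <- as.data.frame(sapply(1:6, function(i) latent + rnorm(n)))
names(items) <- paste0("Q", 1:6)

# Set roughly 10% of each item's responses to missing at random
for (q in names(items)) {
  items[[q]][runif(n) < 0.10] <- NA
}
dat <- cbind(items, group_post = group_post)

# Impute the raw (unstandardized) items
imp <- mice(dat, m = 5, maxit = 10, seed = 500, printFlag = FALSE)

# In each completed dataset: sum the items, z-score the total,
# and regress the z-scored total on group
fit <- with(imp, lm(
  as.numeric(scale(Q1 + Q2 + Q3 + Q4 + Q5 + Q6)) ~ group_post))

# Pool the per-dataset estimates
pooled <- summary(pool(fit))
print(pooled)
```

Because the outcome is z-scored within each completed dataset and the predictor is a two-level factor, the pooled group coefficient in this sketch is the group difference expressed in standard deviations of the total score. Whether that is the right definition of a "standardized beta" for a given analysis is a separate judgment call.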