Create Simulated Data Multiple Regression in R

3.7k views Asked by At

I have a data set that I am exploring using multiple regression in R. My model is as follows:

model<-lm(Trait~Noise+PC1+PC2)

where Noise, PC1, and PC2 are continuous covariates that predict a particular Trait that is also continuous.

The summary(model) call shows that both Noise and PC1 significantly affect changes in Trait, just in opposite ways. Trait increases as 'Noise' increases, but decreases as PC1 increases.

To tease apart this relationship, I want to create simulated data sets based on the sample size (45) of my original data set and by manipulating Noise and PC1 within the parameters seen in my data set, so: high levels of both, low levels of both, high of one and low of the other, etc...

Can someone offer up some advice on how to do this? I am not overly familiar with R, so I apologize if this question is overly simple.

Thank you for your time.

2

There are 2 answers

0
keegan On

It's a bit unclear what you're looking for (this should probably be on Cross Validated), but here's a start and an approximate description of linear regression.

Let's say I have some datapoints that are 3 dimensional (Noise, PC1, PC2), and you say there's 45 of them.

x=data.frame(matrix(rnorm(3*45),ncol=3))
names(x)<-c('Noise','PC1','PC2')

These data are randomly distributed around this 3 dimensional space. Now we imagine there's another variable that we're particularly interested in called Trait. We think that the variations in each of Noise, PC1, and PC2 can explain some of the variation observed in Trait. In particular, we think that each of those variables is linearly proportional to Trait, so it's just the basic old y=mx+b linear relationship you've seen before, but there's a different slope m for each of the variables. So in total we imagine Trait = m1*Noise + m2*PC1 + m3*PC2 +b plus some added noise (it's a shame one of your variables is named Noise, that's confusing).

So going back to simulating some data, we'll just pick some values for these slopes and put them in a vector called beta.

beta<-c(-3,3,.1) # these are the regression coefficients

So the model Trait = m1 Noise + m2 PC1 + m3 PC2 +b might also be expressed with simple matrix multiplication, and we can do it in R with,

trait<- as.matrix(x)%*%beta + rnorm(nrow(x),0,1)

where we've added Gaussian noise of standard deviation equal to 1.

So this is the 'simulated data' underlying a linear regression model. Just as a sanity check, let's try

l<-lm(trait~Noise+PC1+PC2,data=x)
summary(l)
 Coefficients:
        Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.13876    0.11159   1.243    0.221    
Noise       -3.08264    0.12441 -24.779   <2e-16 ***
PC1          2.94918    0.11746  25.108   <2e-16 ***
PC2         -0.01098    0.10005  -0.110    0.913  

So notice that the slope we picked for PC2 was so small (0.1) relative to the overall variability in the data, that it isn't detected as a statistically significant predictor. And the other two variables have opposite effects on Trait. So in simulating data, you might adjust the observed ranges of the variables, as well at the magnitudes of the regression coefficients beta.

0
Shaikh Tanvir Hossain On

Here is a simple simulation and then fitting. I am not sure whether this answers your question. But it's a very simple way to simulate

# create a random matrix X
N <- 500 # obs = 500
p <- 20 # 20 predictors
X <- matrix(rnorm(N*p), ncol = p) # design matrix
X.scaled <- scale(X) # scale the columns to make mean 0 and variance 1
X <- cbind(matrix(1, nrow = N), X.scaled)  # add intercept

# create coeff matrix
b <- matrix(0, nrow = p+1) 
b[1, ] <- 5      # intercept      
b[2:6, ] <- 3    # first 5 predictors are 3
b[7:11, ] <- -3  #  next 5 predictors are -3

# create noise
eps <- matrix(rnorm(N), nrow = N)

# generate the response           
y = X%*%b + eps        # response vector

#--------------------------------------------

# fit the model
X <- X[, -1] # remove the column one's before fitting
colnames(X) <- paste ("x", seq(1:p), sep="") # name the columns
colnames(y) <- "y" # name the response


data <- data.frame(cbind(y, X)) # make a dataframe

lm_res <- lm(y~., data) # fit with lm()

# the output
> lm_res$coeff 
# (Intercept)           x1           x2           x3           x4           x5 
# 4.982574286  2.917753373  3.021987926  3.067855616  3.135165773  2.997906784 
#          x6           x7           x8           x9          x10          x11 
#-2.997272333 -2.927680633 -2.944796765 -3.070785884 -2.910920487 -0.051975284 
#         x12          x13          x14          x15          x16          x17 
# 0.085147066 -0.040739293  0.054283243  0.009348675 -0.021794971  0.005577802 
#         x18          x19          x20 
# 0.079043493 -0.024066912 -0.007653293 
#