Centering Variables in R


Do centered variables have to stay in matrix form when using them in a regression equation?

I have centered a few variables using the scale function with center=T and scale=F. I then converted those variables to numeric vectors so that I can manipulate the data frame for other purposes. However, when I run an ANOVA, I get slightly different F values for that variable only; everything else is the same.

Edit:

What's the difference between these two:

df$A <- scale(df$A, center=TRUE, scale=FALSE)

Which will embed a matrix within your data.frame

AND

df$A <- scale(df$A, center=TRUE, scale=FALSE)
df$A <- as.numeric(df$A)

Which makes variable A numeric, and removes the matrix notation within the variable?

Here is an example of what I am trying to do, although it does not reproduce the problem I am having:

library(car)
library(MASS)
mtcars$wt_c <- scale(mtcars$wt, center=TRUE, scale=FALSE)
mtcars$gear <- as.factor(mtcars$gear)
mtcars1     <- as.data.frame(mtcars)
# Part 1
rlm.mpg   <- rlm(mpg~wt_c+gear+wt_c*gear, data=mtcars1)
anova.mpg <- Anova(rlm.mpg, type="III")
# Part 2
# Make wt_c Numeric
mtcars1$wt_c <- as.numeric(mtcars1$wt_c)
rlm.mpg2     <- rlm(mpg~wt_c+gear+wt_c*gear, mtcars1)
anova.mpg2   <- Anova(rlm.mpg2, type="III")


Answer by Steve Bronder

I'll attempt to answer both of your questions.

  1. Do centered variables have to stay in matrix form when using them in a regression equation?

I'm not sure what you mean by this, but if you are referring to the "scaled:center" and "scaled:scale" attributes that scale() attaches to its result, you can strip them. As the example below shows, you get the same answer whether the variable is in 'matrix form' or not.

  2. What's the difference between these two:

df$A <- scale(df$A, center=TRUE, scale=FALSE)

Which will embed a matrix within your data.frame

AND

 df$A <- scale(df$A, center=TRUE, scale=FALSE)
 df$A <- as.numeric(df$A)

From the help file for scale() we see that it returns,

"For scale.default, the centered, scaled matrix."

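You are getting back a matrix that carries the attributes "scaled:center" and "scaled:scale" (only "scaled:center" when scale=FALSE). Calling as.numeric() on that result strips those attributes, along with the matrix dimensions, and that is the difference between your first and second method; the values themselves are unchanged. c() does the same thing here. as.numeric() is identical to as.double(), which drops all attributes, whereas c() drops all attributes except names.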
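To make the attribute stripping concrete, here is a minimal sketch. The vector x and the names x_c and x_num are made up for illustration and are not taken from the question's data.

```r
# Hypothetical vector, used only for illustration
x <- c(2, 4, 6, 8)

# Center only; the result is a one-column matrix
x_c <- scale(x, center = TRUE, scale = FALSE)
is.matrix(x_c)                # TRUE
names(attributes(x_c))        # "dim" and "scaled:center"

# as.numeric() drops every attribute, leaving a plain vector
x_num <- as.numeric(x_c)
is.null(attributes(x_num))    # TRUE

# The centered values themselves are unchanged
all.equal(x_num, x - mean(x)) # TRUE
```

Only the wrapper around the numbers changes, so any calculation that uses just the values should see the same input either way.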

 set.seed(1234)

 test <- data.frame(matrix(runif(10*5),10,5))

 head(test)
         X1        X2         X3        X4        X5
1 0.1137034 0.6935913 0.31661245 0.4560915 0.5533336
2 0.6222994 0.5449748 0.30269337 0.2651867 0.6464061
3 0.6092747 0.2827336 0.15904600 0.3046722 0.3118243
4 0.6233794 0.9234335 0.03999592 0.5073069 0.6218192
5 0.8609154 0.2923158 0.21879954 0.1810962 0.3297702
6 0.6403106 0.8372956 0.81059855 0.7596706 0.5019975

 # center and scale
 testVar <- scale(test[,1])

 testVar
             [,1]
 [1,] -1.36612292
 [2,]  0.48410899
 [3,]  0.43672627
 [4,]  0.48803808
 [5,]  1.35217501
 [6,]  0.54963231
 [7,] -1.74522210
 [8,] -0.93376661
 [9,]  0.64339300
[10,]  0.09103797
attr(,"scaled:center")
[1] 0.4892264
attr(,"scaled:scale")
[1] 0.2748823

 # put testvar back with its friends
 bindVar <- cbind(testVar,test[,2:5])

 # run a regression with 'matrix form' y var
 testLm1 <- lm(testVar~.,data=bindVar)

 # strip non-name attributes
 testVar <- as.numeric(testVar)

 # rebind and regress
 bindVar <- cbind(testVar,test[,2:5])

 testLm2 <- lm(testVar~.,data=bindVar)

 # check for equality
 all.equal(testLm1, testLm2)
[1] TRUE

lm() returns the same fit in both cases, so the two forms appear to be equivalent for regression.
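The same check can be run on a column assigned directly into the data frame, as in the question, so that the column really is stored as a matrix. This is only a sketch: it uses lm() rather than rlm() and Anova(), and the names cars, fit_matrix and fit_numeric are made up for illustration.

```r
# Work on a copy of the built-in mtcars data
cars <- mtcars

# Assigning with $<- keeps the one-column matrix returned by scale()
cars$wt_c <- scale(cars$wt, center = TRUE, scale = FALSE)
is.matrix(cars$wt_c)   # TRUE

fit_matrix <- lm(mpg ~ wt_c, data = cars)

# Convert the same column to a plain numeric vector and refit
cars$wt_c <- as.numeric(cars$wt_c)
fit_numeric <- lm(mpg ~ wt_c, data = cars)

# Compare the estimates, ignoring coefficient names
all.equal(unname(coef(fit_matrix)), unname(coef(fit_numeric)))
```

If the comparison returns TRUE for lm() but the F values from Anova() on an rlm() fit still differ, the difference probably comes from how those functions handle the matrix column rather than from the centered values.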