I've started using R lately, and I want to get a correlation matrix for a certain set of variables. My dataset has over 150 variables, but I only need a few of them. How can I choose which ones to include in the matrix? Thanks in advance!

2 Answers

1
G. Grothendieck On

This computes the correlation matrix of the 2nd, 3rd and 4th columns of the built-in data frame anscombe:

cor(anscombe[2:4])
##      x2   x3   x4
## x2  1.0  1.0 -0.5
## x3  1.0  1.0 -0.5
## x4 -0.5 -0.5  1.0

So does this (assuming they have the indicated names):

cor(anscombe[c("x2", "x3", "x4")])
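
With 150+ variables it can help to keep the wanted names in one vector and check them before calling cor(). The following is only a sketch: the vector vars and the choice of columns are made up for illustration, and use = "complete.obs" is an optional extra (anscombe has no missing values, but a real dataset might).

```r
# Hypothetical selection; any names present in names(anscombe) would work.
vars <- c("x2", "x3", "y2")

# Fail early if a name is misspelled or a column is not numeric,
# since cor() only accepts numeric input.
stopifnot(all(vars %in% names(anscombe)))
sub <- anscombe[vars]
stopifnot(all(sapply(sub, is.numeric)))

# use = "complete.obs" drops rows that contain missing values.
cm <- cor(sub, use = "complete.obs")
round(cm, 2)
```

The result is a square matrix whose row and column names are the entries of vars, so changing the selection only means editing that one vector.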
0
Odysseus210 On

I like using the dplyr package. For instance, assuming your dataset is called dataset, first load the package:

library(dplyr)

Then let's pretend your dataset is:

dataset <- data.frame(x = c(1, 2, 3), 
                      y = c(4, 5, 6), 
                      z = c(100, 50, 20))

Then:

dataset %>%
  as.data.frame() %>%                
  select(x, z) %>%                   # select the variables
  as.matrix() %>%                   
  cor()                              # the correlation matrix

#            x          z
# x  1.0000000 -0.9897433
# z -0.9897433  1.0000000

This method is robust to the input type. We don't know whether your dataset is currently a data frame or a matrix, which affects which code you would use. The as.data.frame() step takes that into account, because select() then works in both cases.
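
If you would rather not load dplyr, the same idea can be sketched in base R. Note that cor_subset is a hypothetical helper written for this example, not a function from base R or dplyr; it assumes the input has column names and that the selected columns are numeric.

```r
# Hypothetical helper: accepts a data frame or a matrix with column names.
cor_subset <- function(data, vars) {
  df <- as.data.frame(data)   # leaves a data frame as is, converts a matrix
  cor(df[vars])               # correlation matrix of the chosen columns only
}

dataset <- data.frame(x = c(1, 2, 3),
                      y = c(4, 5, 6),
                      z = c(100, 50, 20))

from_df  <- cor_subset(dataset, c("x", "z"))
from_mat <- cor_subset(as.matrix(dataset), c("x", "z"))

# Both input types should give the same correlation matrix.
all.equal(from_df, from_mat)
```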