How to loop through values in R using subsets and logical statements?


I'm trying to correlate sulfate and nitrate values in my dataset (a) by ID value, subject to a condition (see the code below). The dataset contains three columns (ID, sulfate, nitrate). The code works when I run each ID value individually, but now I'm trying to set up a loop that runs through all the ID values and collects the correlations into a single vector. The loop does not print any correlation values, presumably because I am not saving them correctly. How can I modify the code below to produce a vector of correlation values, one per ID value?

for (i in 1:5) {
    if (a$ID==i && length(a$ID==i) > 10) {
        cor(a$sulfate[a$ID==i], a$nitrate[a$ID==i])
    }
}

1 answer

Pierre L (best answer)

Try instead:

res <- c()
for(i in 1:5) {
  res[i] <- cor(a$sulfate[a$ID==i], a$nitrate[a$ID==i])
}
res
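
The loop above assumes the IDs are exactly 1 through 5. As a minimal sketch, the same idea can loop over whatever IDs are actually present and label each result. The data frame below is made up for illustration; only the column names come from the question.

```r
# Hypothetical example data with the same three columns as the question
a <- data.frame(ID      = c(1, 1, 1, 2, 2, 2),
                sulfate = c(4, 3, 5, 5, 1, 3),
                nitrate = c(10, 8, 12, 2, 4, 3))

ids <- unique(a$ID)

# Preallocate a named numeric vector, one slot per ID
res <- setNames(rep(NA_real_, length(ids)), ids)

for (i in seq_along(ids)) {
  rows <- a$ID == ids[i]   # logical index selecting the rows for this ID
  res[i] <- cor(a$sulfate[rows], a$nitrate[rows])
}
res
```

Preallocating with NA_real_ keeps the result the same length as the ID list even if a later version of the loop skips some IDs.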

Explanation

# Example data frame (named a, to match the question)
a <- data.frame(ID = c(1, 1, 2, 2), sulfate = c(4, 3, 5, 1), nitrate = c(10, 8, 2, 4), stringsAsFactors = FALSE)
a
  ID sulfate nitrate
1  1       4      10
2  1       3       8
3  2       5       2
4  2       1       4

We attempt a logical test. Return the output of 'yes' if ID equals 1:

if(a$ID==1) 'yes'
[1] "yes"
Warning message:
In if (a$ID == 1) "yes" :
  the condition has length > 1 and only the first element will be used

We get 'yes', but we also get a warning (in R 4.2.0 and later, a condition of length greater than one is an error rather than a warning). The reason:

a$ID==1
[1]  TRUE  TRUE FALSE FALSE

The comparison is element-wise: it returns one TRUE or FALSE for each element of a$ID. That's a problem for the if statement, which needs a single value. R cannot know which TRUE or FALSE to use for the test, so it just uses the first.

In your code, you are passing vectors like that to your if statement. You want the condition in your if statement to be a single TRUE or FALSE value. Or avoid the if statement altogether.
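
One way to hand `if` a single TRUE or FALSE is to reduce the logical vector first, for example with `sum()`, which counts the TRUE values. Note also that `length(a$ID == i)` returns the length of the whole column, not the number of matching rows. The sketch below uses made-up data and a hypothetical threshold `min_n` standing in for the 10 in the question.

```r
# Hypothetical example data; ID 3 has only one row
a <- data.frame(ID      = c(1, 1, 1, 2, 2, 2, 3),
                sulfate = c(4, 3, 5, 5, 1, 3, 7),
                nitrate = c(10, 8, 12, 2, 4, 3, 9))

min_n <- 2                  # hypothetical minimum number of rows per ID
res <- rep(NA_real_, 3)     # stays NA for IDs that fail the condition

for (i in 1:3) {
  rows <- a$ID == i         # logical vector, one element per row of a
  if (sum(rows) > min_n) {  # sum() counts the TRUEs, giving a single number
    res[i] <- cor(a$sulfate[rows], a$nitrate[rows])
  }
}
res
```

IDs with too few rows keep their NA, so the positions in res still line up with the ID values.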

Vectorization

As you become more advanced, you can avoid writing this loop by hand: split the data by ID and apply the function to each piece.

sapply(split(a, a$ID), function(x) cor(x$sulfate, x$nitrate))
 1  2 
 1 -1 
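
If the minimum-observations condition from the question still matters, it can go inside the function applied to each group. This sketch swaps in `vapply()`, a stricter relative of `sapply()` that requires declaring the type and length of each result. The data and the `min_n` threshold are made up for illustration.

```r
# Hypothetical example data; ID 3 has only one row
a <- data.frame(ID      = c(1, 1, 1, 2, 2, 2, 3),
                sulfate = c(4, 3, 5, 5, 1, 3, 7),
                nitrate = c(10, 8, 12, 2, 4, 3, 9))

min_n <- 2   # hypothetical minimum number of rows per ID

# split() returns a list of data frames, one per ID
groups <- split(a, a$ID)

# numeric(1) declares that each group must yield exactly one number
res <- vapply(groups, function(x) {
  if (nrow(x) > min_n) cor(x$sulfate, x$nitrate) else NA_real_
}, numeric(1))
res
```

Here nrow(x) is already a single number, so the if condition has length one by construction.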

Some R users have written great packages for this kind of grouped computation. Two of them are dplyr and data.table, both of which need to be installed first. Here are two quick alternatives.

library(dplyr)
a %>%
  group_by(ID) %>%
  summarize(Cor = cor(sulfate, nitrate))
Source: local data table [2 x 2]

  ID Cor
1  1   1
2  2  -1

library(data.table)
setDT(a)[, .(cor(sulfate, nitrate)), ID]
   ID V1
1:  1  1
2:  2 -1