R editing dataframe based on column value

79 views Asked by At

Suppose I have a table of around 1M rows with the following format:

id  paid_2000  paid_2001  paid_2002  paid_2003  censor_yr
1   10         20         10         20         2001
2   15         25         15         15         2003

What would be an efficient way to set each observation to NA where the paid year is greater than or equal to the censor year? In particular, I would like the table to look like this:

id  paid_2000  paid_2001  paid_2002  paid_2003  censor_yr
1   10         NA         NA         NA         2001
2   15         25         15         NA         2003
2

There are 2 answers

0
Tunn On BEST ANSWER

Using dplyr:

library(dplyr)
df %>%
  gather(paid_yr, value, grep("paid", names(.))) %>%
  mutate(value = ifelse(as.numeric(gsub(".*_", "", paid_yr)) >= censor_yr, 
                        NA, value)) %>%
  spread(paid_yr, value)

Throw a %>% select chain at the bottom to move censor_yr back to the end.

Can explain how it works if you'd like. May or may not be easier to read than akrun's answers.

0
akrun On

We create an index of columns that are 'paid' ('pi'), subset the columns using 'pi', create a matrix of logical values by comparing the year substring from the column names of the 'paid' column with that of 'censor_yr' column and assign it to NA.

pi <- grep("paid", names(df1))
df1[pi][matrix(as.numeric(sub(".*_", "", names(df1)[pi]))[col(df1[pi])] >=
                       df1$censor_yr, nrow=2)] <- NA
df1
#   id paid_2000 paid_2001 paid_2002 paid_2003 censor_yr
#1  1        10        NA        NA        NA      2001
#2  2        15        25        15        NA      2003

Or we can do this in data.table set which would be more efficient. Get the year substring from the names of the 'paid' column names, convert to data.table (setDT(df1)), loop through the columns in 'pi' and set the values to NA which meets the condition in 'i'.

library(data.table)
nm1 <- as.numeric(sub(".*_", "", names(df1)[pi]))
setDT(df1)
for(j in seq_along(pi)){
   set(df1, i = which(nm1[j] >= df1$censor_yr), j= pi[j], value = NA)
}