I have a large dataset of this format. I would like to a) identify those IDs/rows with the following sequence of values 1 - 1 - >1 - >1 anywhere between x1 and x10; and b) generate a new variable ("event") that identifies the beginning of the sequence, taking the value X1,...,X10.
my_df <- data.frame(ID = c("a","b","c","d","e","f","g","h"),
replicate(8,sample(1:4,8,rep=TRUE)))
For a), I have replaced values >1 with 2, then pasted the values from X1 to X10, and then filtered for the sequence 1 - 1 - 2 - 2. For b), I created the variable "event" using nested ifelse() to identify where the sequence begins. This works ok with only 8 columns. Is there a way to increase efficiency for datasets with more columns?
I highly appreciate any pointers!
df_seq <- my_df%>%
mutate_at(vars(starts_with('X')), funs(ifelse(. > 1, 2, .)))%>%
mutate(seq = paste(X1,"-",X2,"-",X3,"-",X4,"-",X5,"-",X6,"-",X7,"-",X8))%>%
filter(grepl("1 - 1 - 2 - 2", seq))%>%
mutate(event = ifelse(X1 == 1 & X2 == 1 & X3 == 2 & X4 == 2,"X1",
ifelse(X2 == 1 & X3 == 1 & X4 == 2 & X5 == 2,"X2",
ifelse(X3 == 1 & X4 == 1 & X5 == 2 & X6 == 2,"X3",
ifelse(X4 == 1 & X5 == 1 & X6 == 2 & X7 == 2,"X4","X5")))))
You can use
regexprto capture the first occurrence of consecutive1s followed by values greater than1, for examplegives