I want to create an overview for a longitudinal analysis for which I require all participants from my data, which have valid responses on Variable X at any point and a valid response on variable Y at a later point. Data is formatted long, so each participant has multiple entries with different years where the survey took place as a variable.

Omitting all NAs leaves me with no data.

As my data is sensitive I would consider the following data frame as an example:

   ID year  X  Y
1   1 2000 NA NA
2   1 2002 NA  3
3   1 2003  2  4
4   2 2000  3 NA
5   2 2002 NA  2
6   2 2003  4  3
7   3 2000  1  3
8   3 2002  2 NA
9   3 2003  3  4
10  4 2000  2 NA
11  4 2002 NA NA
12  4 2003  4  2
13  5 2000  4  1
14  5 2002 NA NA
15  5 2003 NA NA

I want to only keep participants which have a valid response on variable X and Y, where the valid response on X needs to be before the one on Y. Here it would be ID 2, 3 and 4.

1

There are 1 answers

0
MrFlick On

Using dplyr you can do

dd %>% 
  filter(any(lag(cumany(!is.na(X))) & !is.na(Y)), .by=ID)

We use cumany(!is.na(X)) to see if there are any valid X values up to that row and then we use lag to shift the value "down" a row so we don't get records with the Y value is on the same row as the X value. So now that we have a true value after the first valid X, we look for a non-missing Y. If any of those rows in the group are true, keep the whole group.

Tested with

dd <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 
4L, 4L, 5L, 5L, 5L), year = c(2000L, 2002L, 2003L, 2000L, 2002L, 
2003L, 2000L, 2002L, 2003L, 2000L, 2002L, 2003L, 2000L, 2002L, 
2003L), X = c(NA, NA, 2L, 3L, NA, 4L, 1L, 2L, 3L, 2L, NA, 4L, 
4L, NA, NA), Y = c(NA, 3L, 4L, NA, 2L, 3L, 3L, NA, 4L, NA, NA, 
2L, 1L, NA, NA)), class = "data.frame", row.names = c("1", "2", 
"3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", 
"15"))