I want to create an overview for a longitudinal analysis for which I require all participants from my data, which have valid responses on Variable X at any point and a valid response on variable Y at a later point. Data is formatted long, so each participant has multiple entries with different years where the survey took place as a variable.
Omitting all NAs leaves me with no data.
As my data is sensitive I would consider the following data frame as an example:
ID year X Y
1 1 2000 NA NA
2 1 2002 NA 3
3 1 2003 2 4
4 2 2000 3 NA
5 2 2002 NA 2
6 2 2003 4 3
7 3 2000 1 3
8 3 2002 2 NA
9 3 2003 3 4
10 4 2000 2 NA
11 4 2002 NA NA
12 4 2003 4 2
13 5 2000 4 1
14 5 2002 NA NA
15 5 2003 NA NA
I want to only keep participants which have a valid response on variable X and Y, where the valid response on X needs to be before the one on Y. Here it would be ID 2, 3 and 4.
Using
dplyr
you can doWe use
cumany(!is.na(X))
to see if there are any valid X values up to that row and then we uselag
to shift the value "down" a row so we don't get records with the Y value is on the same row as the X value. So now that we have a true value after the first valid X, we look for a non-missing Y. If any of those rows in the group are true, keep the whole group.Tested with