I have a rather messy looking dataset with over 200 records (participants) and many columns denoting the treatment a participant was on, together with the start date and end date for that respective treatment.
Here is an example below of a small portion of my dataset, where PATID is participant id, x1DRG is drug (either 1 or 2), x1SDT is the date when they started taking that drug and x1EDT is date when they stopped. Same for the rest of columns.
So for example, for the first record x1DRG=2 (participant was taking drug 2) starting with X1SDT=11/6/2019, no x1EDT (meaning he's currently taking it), x2DRG is still 2, drug was started on x2SDT = 12/7/2016, drug was terminated on x2EDT = 1/9/2017, and finally x3DRG (still 2) was started again on x3SDT=1/9/2017 and ended on x3EDT=6/5/2018.
What I want to do is see how many participants stopped the study drug, based on those start and end dates. So I would ultimately get a SAS dataset with the participant ID, a STOP column with 1 or 0, based on whether he stopped the drug or not, a DRUG column with the drug that was stopped (1 or 2), a column with CURRENT DRUG he's on if he didn't stop, and a column with the STOP DATE when he stopped the drug, if he stopped it. So for example, for PATID 1, the STOP column would be 0 (didn't stop it), CURRENT DRUG would be 2, STOP DATE would be empty since he's still on drug 2. For PATID 6, we'd have STOP is 1 (he stopped), DRUG would be 1, CURRENT DRUG would be empty (currently not on any drug), STOP DATE would be 11/24/2019.
What I was thinking of doing was to transpose the dataset and then do something like, if for each participant the oldest end date isn't followed by a start date, then that means that participant has not stopped treatment.
However, the problem is that as you can see for some participants (rows highlighted in red, patids 1, 7, 8, 9) the sequence of start and end dates aren't in chronological order. For example, for participant 1, the oldest start date in the sequence is x2SDT (12/7/2016) when he first started taking drug 2, then he stopped on x2EDT (1/9/2017), then started again on x3SDT (1/9/2017), stopped on x3EDT (6/5/2018), and finally started again on x1SDT(11/6/2019) and is currently on drug 2 since there's no x1EDT. So, in this case and many others, x1SDT doesn't correspond to the oldest date, as it should so you have to visually inspect the sequence and kinda figure out what's the correct order and based on that, decide whether he's still on a drug or not.. Now given that I have close to 500 records, I obviously don't want to (nor have the time) to manually go through all of them and decide. I've been thinking about this for days now but unfortunately I'm still at a pretty basic beginner level with SAS and have not figured out how to solve this problem programatically. If anybody has any sample code/suggestions for me, I'd really appreciate it!
So, I'm thinking for the final dataset I can get something like this maybe? Though not sure if that's possible and either way, the final format shouldn't matter too much as long as I can get the program to give me what I need.
Presuming the messy data was imported from Excel into SAS.
Usually this data shape is from a data entry worksheet that was focused on patient on a row concepts, or a pivot table created from categorical data stored elsewhere.
You want to reshape the data into a structure that has four columns;
PatId drug start_dt end_dt
.Proc TRANSPOSE
does not have a syntax for directly pivoting row-wise N groups of M columns into N rows of M columns. You can use an approach called DATA step ARRAY pivoting. That phrase is one you might find in conference papers but not in SAS documentation.Example:
6 groups of 3 variables are to be pivoted row-wise. Use 3 arrays of 6 elements each to arrange the original variables for easy reshaping.
Once the data is in categorical form you can do
BY
group processing in DATA step, PROC step or SQL.Your date ranges appear to be mutually exclusive -- in other words there are no date ranges that overlap another. Processing is more complicated if overlaps can occur. Further more it appears that a patient has only ONE active use drug wherein the end date is missing.
Sort the data to eliminate the original willy-nilly date ordering.
Use
BY
group processing to assign a date ordered sequence number and state variables and flags for the last date range of the patient.