how to arrange SAS dataset in chronological order of pair of start date/end date

1.1k views Asked by At

I have a rather messy looking dataset with over 200 records (participants) and many columns denoting the treatment a participant was on, together with the start date and end date for that respective treatment.

Here is an example below of a small portion of my dataset, where PATID is participant id, x1DRG is drug (either 1 or 2), x1SDT is the date when they started taking that drug and x1EDT is date when they stopped. Same for the rest of columns.

So for example, for the first record x1DRG=2 (participant was taking drug 2) starting with X1SDT=11/6/2019, no x1EDT (meaning he's currently taking it), x2DRG is still 2, drug was started on x2SDT = 12/7/2016, drug was terminated on x2EDT = 1/9/2017, and finally x3DRG (still 2) was started again on x3SDT=1/9/2017 and ended on x3EDT=6/5/2018.

What I want to do is see how many participants stopped the study drug, based on those start and end dates. So I would ultimately get a SAS dataset with the participant ID, a STOP column with 1 or 0, based on whether he stopped the drug or not, a DRUG column with the drug that was stopped (1 or 2), a column with CURRENT DRUG he's on if he didn't stop, and a column with the STOP DATE when he stopped the drug, if he stopped it. So for example, for PATID 1, the STOP column would be 0 (didn't stop it), CURRENT DRUG would be 2, STOP DATE would be empty since he's still on drug 2. For PATID 6, we'd have STOP is 1 (he stopped), DRUG would be 1, CURRENT DRUG would be empty (currently not on any drug), STOP DATE would be 11/24/2019.

What I was thinking of doing was to transpose the dataset and then do something like, if for each participant the oldest end date isn't followed by a start date, then that means that participant has not stopped treatment.

However, the problem is that as you can see for some participants (rows highlighted in red, patids 1, 7, 8, 9) the sequence of start and end dates aren't in chronological order. For example, for participant 1, the oldest start date in the sequence is x2SDT (12/7/2016) when he first started taking drug 2, then he stopped on x2EDT (1/9/2017), then started again on x3SDT (1/9/2017), stopped on x3EDT (6/5/2018), and finally started again on x1SDT(11/6/2019) and is currently on drug 2 since there's no x1EDT. So, in this case and many others, x1SDT doesn't correspond to the oldest date, as it should so you have to visually inspect the sequence and kinda figure out what's the correct order and based on that, decide whether he's still on a drug or not.. Now given that I have close to 500 records, I obviously don't want to (nor have the time) to manually go through all of them and decide. I've been thinking about this for days now but unfortunately I'm still at a pretty basic beginner level with SAS and have not figured out how to solve this problem programatically. If anybody has any sample code/suggestions for me, I'd really appreciate it!

enter image description here

So, I'm thinking for the final dataset I can get something like this maybe? Though not sure if that's possible and either way, the final format shouldn't matter too much as long as I can get the program to give me what I need. enter image description here

1

There are 1 answers

0
Richard On BEST ANSWER

Presuming the messy data was imported from Excel into SAS.

Usually this data shape is from a data entry worksheet that was focused on patient on a row concepts, or a pivot table created from categorical data stored elsewhere.

data have; * slightly tweaked data;
attrib
  patid length=8
  x1drg informat=best8. x1sdt x1edt format=mmddyy10. informat=anydtdte.
  x2drg informat=best8. x2sdt x2edt format=mmddyy10. informat=anydtdte.
  x3drg informat=best8. x3sdt x3edt format=mmddyy10. informat=anydtdte.
  x4drg informat=best8. x4sdt x4edt format=mmddyy10. informat=anydtdte.
  x5drg informat=best8. x5sdt x5edt format=mmddyy10. informat=anydtdte.
  x6drg informat=best8. x6sdt x6edt format=mmddyy10. informat=anydtdte.
;
infile datalines missover;
input patid -- x6edt;
datalines;
1 2 11/06/2019 .           2 12/07/2016 01/09/2017  2 01/09/2017 06/05/2018
2 2 11/06/2019 .           . .          .           . 
3 1 01/06/2019 .           . .          .           .
4 2 12/20/2019 02/12/2020  1 03/03/2020 .           .
5 1 11/11/2019 .           . .          .           .
6 1 06/03/2019 11/15/2019  1 11/24/2019 11/24/2019  .
7 2 03/27/2019 .           2 05/08/2018 03/27/2019  2 04/18/2018 05/08/2018  2 04/12/2018 04/18/2018
8 1 06/25/2019 .           2 06/07/2019 06/24/2019  2 01/16/2019 06/07/2019  1 09/20/2018 01/15/2019
9 2 08/09/2019 12/06/2019  2 05/08/2019 08/08/2019  2 12/07/2019 12/07/2019  2 12/08/2019 01/15/2020  2 01/16/2020 01/19/2020  2 01/25/2020 .
;

You want to reshape the data into a structure that has four columns; PatId drug start_dt end_dt. Proc TRANSPOSE does not have a syntax for directly pivoting row-wise N groups of M columns into N rows of M columns. You can use an approach called DATA step ARRAY pivoting. That phrase is one you might find in conference papers but not in SAS documentation.

Example:

6 groups of 3 variables are to be pivoted row-wise. Use 3 arrays of 6 elements each to arrange the original variables for easy reshaping.

* transpose by data step array method;
data have_categorical;
  set have;

  array drugs  x1drg x2drg x3drg x4drg x5drg x6drg;
  array starts x1sdt x2sdt x3sdt x4sdt x5sdt x6sdt;
  array ends   x1edt x2edt x3edt x4edt x5edt x6edt;

  do index = 1 to dim(drugs);
    drug = drugs(index);
    start_dt = starts(index);
    end_dt = ends(index);

    if not missing(drug) then OUTPUT;
  end;

  attrib start_dt end_dt format=yymmdd10. informat=anydtdte.;

  keep patid drug start_dt end_dt;
run;

Once the data is in categorical form you can do BY group processing in DATA step, PROC step or SQL.

Your date ranges appear to be mutually exclusive -- in other words there are no date ranges that overlap another. Processing is more complicated if overlaps can occur. Further more it appears that a patient has only ONE active use drug wherein the end date is missing.

Sort the data to eliminate the original willy-nilly date ordering.

proc sort data=have;
  by patid start_dt end_dt;
run;

Use BY group processing to assign a date ordered sequence number and state variables and flags for the last date range of the patient.


data want;
  set have;
  by patid;

  if first.patid
    then seqNum = 1;  * row is first date range for patid, reset sequence number;
    else seqNum + 1;  * row is next date range for patid, increment sequence number;

  if last.patid then do;
    * set the flag value at the last date range of patid;
    * presume EDT will be either missing (open range, still using), or
    * the date when usage stopped;

    stop_date = EDT;

    if missing(EDT) then do;
      STOP = 0;
      CURRENT_DRUG = DRUG;
    end;
    else 
       STOP = 1;
  end;

  format stop_date mmddyy10. seqnum stop current_drug 4.;
run;