Create an order column in a pandas dataframe based on one date column relative to another date column

72 views Asked by At

I have an extract where I'm required to identify a certain type of surgery X seen in the Surg Type column.

I need to retain medical appointments seen as distinct rows over a window/period of time, 3 appointment's prior (-3, -2, -1) and 3 appointment's post (+1, +2, +3)

I must include this order as an additional column.

On top of this I need to exclude any appointments outside of the window and any other type of Surg Type, in this example any surgery's denoted Z.

In this example 7/9 rows/record's I want retained and an additional column Prior Post

*** UPDATED EXAMPLE ***

Original Df

| Patient ID | Surg ID | Surg Type | Surg Date  | Medical Appt Date |
|------------|---------|-----------|------------|-------------------|
| 1          | 1       | X         | 2022-09-03 | 2022-01-01        |
| 1          | 1       | X         | 2022-09-03 | 2022-03-04        |
| 1          | 1       | X         | 2022-09-03 | 2022-05-04        |
| 1          | 1       | X         | 2022-09-03 | 2022-06-04        |
| 1          | 1       | X         | 2022-09-03 | 2022-11-04        |
| 1          | 1       | X         | 2022-09-03 | 2022-11-29        |
| 1          | 2       | Z         | 2022-12-01 | 2022-11-01        |
| 1          | 1       | X         | 2022-09-03 | 2023-01-02        |
| 1          | 1       | X         | 2022-09-03 | 2023-01-13        |



Desired Df

| Patient ID | Surg ID | Surg Type | Surg Date  | Medical Appt Date | Inclusion |
|------------|---------|-----------|------------|-------------------|-----------|
| 1          | 1       | X         | 2022-09-03 | 2022-01-01        | Exclude   |
| 1          | 1       | X         | 2022-09-03 | 2022-03-04        | -3        |
| 1          | 1       | X         | 2022-09-03 | 2022-05-04        | -2        |
| 1          | 1       | X         | 2022-09-03 | 2022-06-04        | -1        |
| 1          | 1       | X         | 2022-09-03 | 2022-11-04        | 1         |
| 1          | 1       | X         | 2022-09-03 | 2022-11-29        | 2         |
| 1          | 2       | Z         | 2022-12-01 | 2022-11-01        | -1        |
| 1          | 1       | X         | 2022-09-03 | 2023-01-02        | 3         |
| 1          | 1       | X         | 2022-09-03 | 2023-01-13        | Exclude   |
2

There are 2 answers

0
mozway On BEST ANSWER

You can filter the surgeries X, then compute the time difference between the appointment and the surgery and compute the rank for negative and positive numbers per group with groupby.apply. Finally, mask the ranks outside of threshold.

# number of medical appointments to keep before/after a surgery
N = 3

# columns to use a grouper
group_cols = ['Patient ID', 'Surg ID']

# ensure datetime
df[['Surg Date', 'Medical Appt Date']] = df[['Surg Date', 'Medical Appt Date']].apply(pd.to_datetime)

# custom function to compute the rank of the Appt Date
def flag_nearest(g, N):
    diff = g['Medical Appt Date'].sub(g['Surg Date'])
    m = diff>'0'
    ranks = (diff.where(m).rank()
             .fillna(-diff.mask(m).rank(ascending=False))
             # optional, to have integers in the output
             .astype(int).astype(object)
            )
    return ranks.where(ranks.abs()<=N, 'Excluded').to_frame()

# identify non-X rows
m = df['Surg Type'].eq('X')

# compute the rank and flag columns outside of threshold
df['Inclusion'] = (df[m].groupby(group_cols)
                        .apply(flag_nearest, N)
                        .droplevel(group_cols)
                        # optional, to fill the non-X
                        .reindex(df.index, fill_value='non-X')
                  )

Output:

   Patient ID  Surg ID Surg Type  Surg Date Medical Appt Date Inclusion
0           1        1         X 2022-09-03        2022-01-01  Excluded
1           1        1         X 2022-09-03        2022-03-04        -3
2           1        1         X 2022-09-03        2022-05-04        -2
3           1        1         X 2022-09-03        2022-06-04        -1
4           1        1         X 2022-09-03        2022-11-04         1
5           1        1         X 2022-09-03        2022-11-29         2
6           1        2         Z 2022-12-01        2022-11-01     non-X
7           1        1         X 2022-09-03        2023-01-02         3
8           1        1         X 2022-09-03        2023-01-13  Excluded

Extended example:

    Patient ID  Surg ID Surg Type  Surg Date Medical Appt Date Inclusion
0            1        1         X 2022-09-03        2022-01-01  Excluded
1            1        1         X 2022-09-03        2022-03-04        -3
2            1        1         X 2022-09-03        2022-05-04        -2
3            1        1         X 2022-09-03        2022-06-04        -1
4            1        1         X 2022-09-03        2022-11-04         1
5            1        1         X 2022-09-03        2022-11-29         2
6            1        2         Z 2022-12-01        2022-11-01     non-X
7            1        1         X 2022-09-03        2023-01-02         3
8            1        1         X 2022-09-03        2023-01-13  Excluded
9            2        1         X 2022-09-05        2022-01-01  Excluded
10           2        1         X 2022-09-05        2022-01-02  Excluded
11           2        1         X 2022-09-05        2022-03-04        -3
12           2        1         X 2022-09-05        2022-05-04        -2
13           2        1         X 2022-09-05        2022-06-04        -1
14           2        1         X 2022-09-05        2022-11-04         1
15           2        1         X 2022-09-05        2022-11-29         2
16           2        2         Z 2022-12-02        2022-11-01     non-X
17           2        1         X 2022-09-05        2023-01-02         3
18           2        1         X 2022-09-05        2023-01-13  Excluded
0
Panda Kim On

Code

at first make datetime

df['Surg Date'] = pd.to_datetime(df['Surg Date'], errors='coerce')
df['Medical Appt Date'] = pd.to_datetime(df['Medical Appt Date'], errors='coerce')

make condition and use np.select

import numpy as np
cond1 = df['Surg Type'].eq('X')
cond2 = df['Surg Date'].gt(df['Medical Appt Date'])
cond3 = df['Surg Date'].lt(df['Medical Appt Date'])
s1 = (cond1 & cond2)[::-1].cumsum().mul(-1)
s2 = (cond1 & cond3).cumsum()

condlist = [~cond1 | s1.lt(-3) | s2.gt(3), cond2, cond3]

df['Inclusion '] = np.select(condlist, ['Exclude Row', s1[::-1], s2], np.NaN)

df:

enter image description here


Example Code

import pandas as pd
data1 = {'Patient ID': [1, 1, 1, 1, 1, 1, 1, 1, 1], 'Surg ID': [1, 1, 1, 1, 1, 1, 2, 1, 1], 'Surg Type': ['X', 'X', 'X', 'X', 'X', 'X', 'Z', 'X', 'X'], 'Surg Date': ['2022-09-03', '2022-09-03', '2022-09-03', '2022-09-03', '2022-09-03', '2022-09-03', '2022-12-01', '2022-09-03', '2022-09-03'], 'Medical Appt': ['Y', 'Y', 'Y', 'N', 'Y', 'Y', 'N', 'Y', 'Y'], 'Medical Appt Date': ['2022-01-01', '2022-03-04', '2022-05-04', 'NaT', '2022-11-04', '2022-11-29', 'NaT', '2023-01-02', '2023-01-13']}
df = pd.DataFrame(data1)