Set unique identifier for cases with correpsonding previous index / same trace

170 views Asked by At

Suppose I have the following dataset:

Personalnumber Category Year Month Index_ID Previous_Index_ID
1 100 2022 8 42100
1 100 2022 9 9534 42100
1 9400 2023 9 4
1 9400 2023 10 485 4
2 100 2022 1 214 102
2 100 2022 2 194231 214
3 200 2022 2 2111
3 200 2022 3 1012 2111
3 200 2022 4 9876 1012
3 200 2022 5 8794 9876
3 200 2022 6 24142 8794
4 100 2022 4 42100
4 200 2022 7 12
4 200 2022 8 14 12
4 200 2022 9 485 14

The first column (Personalnumber) is a number that specifies a person. There is an additional column (Category) that gives a category. There is an entry for year and month (Year Month). There is an index column (Index_ID) and most importantly a column stating a reference, the previous index a case might relate to (Previous_Index_ID). So, let's make it more understandable:

The first case belongs to person 1 within category 100. We have to entries that belong to this case. It starts with the index 42100. The next record has the index 9534, it is related to the first one, as the column "Previous_Index_ID" has the value 42100.

The second case belongs to person 1 within category 9400. We have two entries that belong to this case. It starts with the index 4. The next record has the index 485, it is related to the first one, as the column "Previous_Index_ID" has the value 4.

The third case:

2;100;2022;1;214;102
2;100;2022;2;194231;214

belongs to person 2 within category 100. Here we can see that we do not have the first record that would have index 102 in our dataset.

It continues like this, for example person 3 has 5 records:

3;200;2022;2;2111;
3;200;2022;3;1012;2111
3;200;2022;4;9876;1012
3;200;2022;5;8794;9876
3;200;2022;6;24142;8794

This is one case.

Now I want to add a column with an unique identifier for each case.

My code is as follows:

import pandas as pd

myfile = pd.read_csv(r"C:\pathtofile\testfile.csv", sep=";")

myfile['newID'] = myfile.groupby(['Personalnumber','Category'], sort=False).ngroup().add(1)
print(myfile)

And indeed the output is as desired:

    Personalnumber  Category  Year  Month  Index_ID  Previous_Index_ID  newID
0                1       100  2022      8     42100                NaN      1
1                1       100  2022      9      9534            42100.0      1
2                1      9400  2023      9         4                NaN      2
3                1      9400  2023     10       485                4.0      2
4                2       100  2022      1       214              102.0      3
5                2       100  2022      2    194231              214.0      3
6                3       200  2022      2      2111                NaN      4
7                3       200  2022      3      1012             2111.0      4
8                3       200  2022      4      9876             1012.0      4
9                3       200  2022      5      8794             9876.0      4
10               3       200  2022      6     24142             8794.0      4
11               4       100  2022      4     42100                NaN      5
12               4       200  2022      7        12                NaN      6
12               4       200  2022      8        14                12       6
12               4       200  2022      9        485               14       6

The column newID shows the correct case numbering.

Now an additional case comes into play:

1;100;2022;8;101;
1;100;2022;9;204;101
1;100;2022;10;4344;204
1;100;2022;11;2069;4344

This case also belongs to person 1, category 100. Now the data looks like this:

Personalnumber Category Year Month Index_ID Previous_Index_ID
1 100 2022 8 42100
1 100 2022 8 101
1 100 2022 9 9534 42100
1 100 2022 9 204 101
1 100 2022 10 4344 204
1 100 2022 11 2069 4344
1 9400 2023 9 4
1 9400 2023 10 485 4
2 100 2022 1 214 102
2 100 2022 2 194231 214
3 200 2022 2 2111
3 200 2022 3 1012 2111
3 200 2022 4 9876 1012
3 200 2022 5 8794 9876
3 200 2022 6 24142 8794
4 100 2022 4 42100
4 200 2022 7 12
4 200 2022 8 14 12
4 200 2022 9 485 14

As you can see it gets mixed up and my code leads to wrong results. Reason is that the new case falls into the same "place", it also has category 100 and belongs to person 1. However, from the column Index_ID and Previous_Index_ID it is clear that this is another case. These two columns show the traces from which one can differentiate between them and see that these are two different cases. (Of course there could be also even further cases that "fall into the same place", so it is not limited to just two as here in this example.) So my problem now is to get the following desired output:

    Personalnumber  Category  Year  Month  Index_ID  Previous_Index_ID  newID
0                1       100  2022      8     42100                NaN      1
1                1       100  2022      8       101                NaN      2
2                1       100  2022      9      9534            42100.0      1
3                1       100  2022      9       204              101.0      2
4                1       100  2022     10      4344              204.0      2
5                1       100  2022     11      2069             4344.0      2
6                1      9400  2023      9         4                NaN      3
7                1      9400  2023     10       485                4.0      3
8                2       100  2022      1       214              102.0      4
9                2       100  2022      2    194231              214.0      4
10               3       200  2022      2      2111                NaN      5
11               3       200  2022      3      1012             2111.0      5
12               3       200  2022      4      9876             1012.0      5
13               3       200  2022      5      8794             9876.0      5
14               3       200  2022      6     24142             8794.0      5
15               4       100  2022      4     42100                NaN      6
16               4       200  2022      7        12                NaN      7
16               4       200  2022      8        14                12       7
16               4       200  2022      9        485               14       7

How can I do this?

The Index_ID is not unique over the complete dataset, it is only unique per year and month. So you can see that the Index_ID 42100 occurs in 2022 month 8 (personalnumber 1) and also in 2022 in month 4 (personalnumber 4). Or Index_ID 485 occurcs in 2023 month 10 (personalnumber 1) and also in 2022 month 9 (personalnumber 4). However, of course it is unique within a year and month.

(The index numbers are set completey random. So sorting ascending or descending on the Index_ID or Previous_Index_Id column is not a solution.)

EDIT regarding my comment below to the answer from Muhammed Samed Özmen:

Consider the following example:

Personalnumber;Category;Year;Month;Index_ID;Previous_Index_ID
398;14;2022;1;10708;1
398;14;2022;2;50242;10708
398;14;2022;3;76850;50242
398;14;2022;4;120861;76850
398;14;2022;5;110883;120861
398;14;2022;6;188043;110883
398;14;2022;7;9432;188043
398;14;2022;8;175715;9432
398;14;2022;9;142837;175715
398;14;2022;10;152659;142837
398;14;2022;11;52335;152659
398;14;2022;12;156366;52335
398;14;2023;1;16416;156366
398;14;2023;2;163499;16416
398;14;2023;3;1;163499

With the last line (398;14;2023;3;1;163499) the code throws a recursion error. I think the recursion error might arise due to 398;14;2022;1;10708;1 and 398;14;2023;3;1;163499.

However if I change the last record to Index_ID = 2, like this:

Personalnumber;Category;Year;Month;Index_ID;Previous_Index_ID
398;14;2022;1;10708;1
398;14;2022;2;50242;10708
398;14;2022;3;76850;50242
398;14;2022;4;120861;76850
398;14;2022;5;110883;120861
398;14;2022;6;188043;110883
398;14;2022;7;9432;188043
398;14;2022;8;175715;9432
398;14;2022;9;142837;175715
398;14;2022;10;152659;142837
398;14;2022;11;52335;152659
398;14;2022;12;156366;52335
398;14;2023;1;16416;156366
398;14;2023;2;163499;16416
398;14;2023;3;2;163499

Then it works and it sets a newID for this case as it should (all these records belong to one case).

2

There are 2 answers

0
Vitalizzare On BEST ANSWER

Let's stick to the statement:

The Index_ID is not unique over the complete dataset, it is only unique per year and month.

This means that Previous_Index_ID points to an Index_ID at some fixed past period. Based on the data provided, I assume that Previous_Index_ID is the Index_ID for the previous calendar month.

The plan is to join the indices of the corresponding previous records to the current ones, then replace group roots with unique ids and pointers to previous rows with the corresponding data in the records they're pointing at.

First, let's prepare the data to work with:

import pandas as pd
from io import StringIO

data = '''Personalnumber,Category,Year,Month,Index_ID,Previous_Index_ID
1,100,2022,8,42100,
1,100,2022,8,101,
1,100,2022,9,9534,42100.0
1,100,2022,9,204,101.0
1,100,2022,10,4344,204.0
1,100,2022,11,2069,4344.0
1,9400,2023,9,4,
1,9400,2023,10,485,4.0
2,100,2022,1,214,102.0
2,100,2022,2,194231,214.0
3,200,2022,2,2111,
3,200,2022,3,1012,2111.0
3,200,2022,4,9876,1012.0
3,200,2022,5,8794,9876.0
3,200,2022,6,24142,8794.0
4,100,2022,4,42100,
4,200,2022,7,12,
4,200,2022,8,14,12.0
4,200,2022,9,485,14.0'''

df = pd.read_csv(StringIO(data))

Then check if the initial statement is true (just in case):

index_by_date = df.groupby(['Year', 'Month'])['Index_ID']
assert index_by_date.count().eq(index_by_date.nunique()).all()

Prepare supporting data such as dates shifted one month forward, unique record identifiers (by year, month, index ID), and a group starting point marker:

year, month = divmod(df['Month'], 12)
year += df['Year']
month += 1
index_id = pd.MultiIndex.from_arrays([year, month, df['Index_ID']])
root = df.index.min() - 1     # group starting point marker

Now [left] join the corresponding previous row numbers to the data:

df = (
    df     # join the indices of the previous records as group_id
    .join(pd.Series(df.index, index_id, name='group_id'),   
          on=['Year', 'Month', 'Previous_Index_ID'])
    .fillna({'group_id': root})  
    .astype({'group_id': int})
)

And the final touch, iterate over the new column, replacing the marked starting points with unique group identifiers and populating the corresponding cells in the chain with them:

for current, previous in df['group_id'].items():
    df.loc[current, 'group_id'] = (current if previous == root 
                                   else df.at[previous, 'group_id'])

Note that in the for-loop we can replace ...= current if ... with some other provider of unique group identifiers if necessary, for example:

group_id = 0
for current, ...
    df.loc[current, ...] = (group_id:=group_id+1) if ...

# or

from itertools import count
group_id = count(1)
for current, ...
    df.loc[current, ...] = next(group_id) if ...

We can as well rewrite this loop in Numba to run faster, as far as the data index is a range index starting at zero:

from numba import jit

@jit(nopython=True) 
def set_group_ids(column, root_marker=-1): 
    group_id = 1
    for i in range(column.shape[0]):
        if column[i] == root_marker:
            column[i] = group_id
            group_id += 1
        else:
            column[i] = column[column[i]]

set_group_ids(df['group_id'].values, root)

Code to experiment with:

import pandas as pd
from io import StringIO

data = '''Personalnumber,Category,Year,Month,Index_ID,Previous_Index_ID
1,100,2022,8,42100,
1,100,2022,8,101,
1,100,2022,9,9534,42100.0
1,100,2022,9,204,101.0
1,100,2022,10,4344,204.0
1,100,2022,11,2069,4344.0
1,9400,2023,9,4,
1,9400,2023,10,485,4.0
2,100,2022,1,214,102.0
2,100,2022,2,194231,214.0
3,200,2022,2,2111,
3,200,2022,3,1012,2111.0
3,200,2022,4,9876,1012.0
3,200,2022,5,8794,9876.0
3,200,2022,6,24142,8794.0
4,100,2022,4,42100,
4,200,2022,7,12,
4,200,2022,8,14,12.0
4,200,2022,9,485,14.0'''

df = pd.read_csv(StringIO(data))


index_by_date = df.groupby(['Year', 'Month'])['Index_ID']
assert index_by_date.count().eq(index_by_date.nunique()).all(), \
    "Index_ID is not unique within (Year, Month)"


year, month = divmod(df['Month'], 12)
year += df['Year']
month += 1
index_id = pd.MultiIndex.from_arrays([year, month, df['Index_ID']])
root = df.index.min() - 1     # group starting point marker

df = (df.join(pd.Series(df.index, index_id, name='group_id'),   
              on=['Year','Month','Previous_Index_ID'])
      .fillna({'group_id': root})  
      .astype({'group_id': int}))

try:
    assert df.index.equals(pd.RangeIndex(len(df))), \
        "Unable to use Numba due to inappropriate indexing"
    
    from numba import jit

    @jit(nopython=True) 
    def set_group_ids(column, root_marker=-1): 
        group_id = 1
        for i in range(column.shape[0]):
            if column[i] == root_marker:
                column[i] = group_id
                group_id += 1
            else:
                column[i] = column[column[i]]

    set_group_ids(df['group_id'].values, root)
    
except (AssertionError, ModuleNotFoundError, ImportError) as e:
    
    from warnings import warn
    from itertools import count
    
    warn(repr(e))
    group_id = count(1)
    for curr, prev in df['group_id'].items():
        df.loc[curr, 'group_id'] = (next(group_id) if prev == root 
                                    else df.at[prev, 'group_id'])

P.S.

The described algorithm works only if all previous indexes physically precede the subsequent ones inside the table. Otherwise, we must first fill in all the starting points (roots) and then populate the chains with the appropriate group identifiers. For this purpose, we can use 2 reference sequences with indices of the next and previous entries (one to identify roots, the other to fill in the chains):

date = pd.to_datetime(df[['Year', 'Month']].assign(Day=1))

previous = pd.Series(
    df.index, 
    pd.MultiIndex.from_arrays([date + pd.offsets.MonthBegin(), df['Index_ID']]), 
    name='previous'
).reindex(
    pd.MultiIndex.from_arrays([date, df['Previous_Index_ID']]), 
    fill_value=-1
).reset_index(drop=True)

following = (
    previous[previous >= 0]
    .rename_axis('following')
    .reset_index()
    .set_index('previous')
    .reindex(previous.index, fill_value=-1)
    .squeeze()
)

roots = previous < 0
grouper = pd.Series(0, df.index, name='GroupID')
grouper[roots] = range(1, 1+sum(roots))

for root, group_id in grouper[roots].items():
    successor = following[root]
    while successor >= 0:
        grouper[successor] = group_id
        successor = following[successor]

df = df.join(grouper)
10
Muhammed Samed Özmen On

If the current row is not empty, it extract the Previous_Index_ID from the row. If the Previous_Index_ID is not NaN (i.e., it has a valid value), it means that there is a previous index to follow. In this case, it recursively calls the find_root_index function with the personalnumber, category, and prev_index_id. This recursive call continues until it either find a root index or reach the end of the chain.

apply a lambda function to each row of the DataFrame to calculate the newID for that row. It retrieves the Personalnumber, Category, and Index_ID for the current row. It calls the find_root_index function to find the root index for the current case. It uses the root_to_newID dictionary to map the root index to the corresponding newID. If the root index is not found, it assigns None as the newID. Finally, the newID column is assigned the values computed by the lambda function

import pandas as pd

data = {
    "Personalnumber": [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4],
    "Category": [100, 100, 100, 100, 100, 100, 9400, 9400, 100, 100, 200, 200, 200, 200, 200, 100, 200, 200, 200],
    "Year": [2022, 2022, 2022, 2022, 2022, 2022, 2023, 2023, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022],
    "Month": [8, 8, 9, 9, 10, 11, 9, 10, 1, 2, 2, 3, 4, 5, 6, 4, 7, 8, 9],
    "Index_ID": [42100, 101, 9534, 204, 4344, 2069, 4, 485, 214, 194231, 2111, 1012, 9876, 8794, 24142, 42100, 12, 14, 485],
    "Previous_Index_ID": [None, None, 42100, 101, 204, 4344, None, 4, 102, 214, None, 2111, 1012, 9876, 8794, None, None, 12, 14]
}
myfile = pd.DataFrame(data)

root_to_newID = {}
current_id = 1

def find_root_index(personalnumber, category, index_id):
    row = myfile[(myfile['Personalnumber'] == personalnumber) & 
                (myfile['Category'] == category) & 
                (myfile['Index_ID'] == index_id)]
        
    if row.empty:
        return index_id
        
    prev_index_id = row['Previous_Index_ID'].iloc[0]
        
    if pd.notna(prev_index_id):
        return find_root_index(personalnumber, category, prev_index_id)
        
    return index_id

for _, row in myfile.iterrows():
    root_index = find_root_index(row['Personalnumber'], row['Category'], row['Index_ID'])
    if root_index is not None and (row['Personalnumber'], row['Category'], root_index) not in root_to_newID:
        root_to_newID[(row['Personalnumber'], row['Category'], root_index)] = current_id
        current_id += 1

myfile['newID'] = myfile.apply(lambda row: root_to_newID.get((row['Personalnumber'], row['Category'], find_root_index(row['Personalnumber'], row['Category'], row['Index_ID'])), None), axis=1)

print(myfile)

Output will be like

     Personalnumber  Category  Year  Month  Index_ID  Previous_Index_ID  newID
    0                1       100  2022      8     42100                NaN      1
    1                1       100  2022      8       101                NaN      2
    2                1       100  2022      9      9534            42100.0      1
    3                1       100  2022      9       204              101.0      2
    4                1       100  2022     10      4344              204.0      2
    5                1       100  2022     11      2069             4344.0      2
    6                1      9400  2023      9         4                NaN      3
    7                1      9400  2023     10       485                4.0      3
    8                2       100  2022      1       214              102.0      4
    9                2       100  2022      2    194231              214.0      4
    10               3       200  2022      2      2111                NaN      5
    11               3       200  2022      3      1012             2111.0      5
    12               3       200  2022      4      9876             1012.0      5
    13               3       200  2022      5      8794             9876.0      5
    14               3       200  2022      6     24142             8794.0      5
    15               4       100  2022      4     42100                NaN      6
    16               4       200  2022      7        12                NaN      7
    17               4       200  2022      8        14               12.0      7
    18               4       200  2022      9       485               14.0      7

for new example set.

    import pandas as pd

data = {
    "Personalnumber": [398]*14,
    "Category": [14]*14,
    "Year": [2022]*12 + [2023]*2,
    "Month": list(range(1, 13)) + [1, 2],
    "Index_ID": [10708, 50242, 76850, 120861, 110883, 188043, 9432, 175715, 142837, 152659, 52335, 156366, 16416, 163499],
    "Previous_Index_ID": [1, 10708, 50242, 76850, 120861, 110883, 188043, 9432, 175715, 142837, 152659, 52335, 156366, 16416]
}

myfile = pd.DataFrame(data)
root_to_newID = {}
current_id = 1

def find_root_index(personalnumber, category, index_id):
    row = myfile[(myfile['Personalnumber'] == personalnumber) & 
                (myfile['Category'] == category) & 
                (myfile['Index_ID'] == index_id)]
        
    if row.empty:
        return index_id
        
    prev_index_id = row['Previous_Index_ID'].iloc[0]
        
    if pd.notna(prev_index_id):
        return find_root_index(personalnumber, category, prev_index_id)
        
    return index_id

for _, row in myfile.iterrows():
    root_index = find_root_index(row['Personalnumber'], row['Category'], row['Index_ID'])
    if root_index is not None and (row['Personalnumber'], row['Category'], root_index) not in root_to_newID:
        root_to_newID[(row['Personalnumber'], row['Category'], root_index)] = current_id
        current_id += 1

myfile['newID'] = myfile.apply(lambda row: root_to_newID.get((row['Personalnumber'], row['Category'], find_root_index(row['Personalnumber'], row['Category'], row['Index_ID'])), None), axis=1)

print(myfile)

Example output is

    Personalnumber  Category  Year  Month  Index_ID  Previous_Index_ID  newID
0              398        14  2022      1     10708                  1      1
1              398        14  2022      2     50242              10708      1
2              398        14  2022      3     76850              50242      1
3              398        14  2022      4    120861              76850      1
4              398        14  2022      5    110883             120861      1
5              398        14  2022      6    188043             110883      1
6              398        14  2022      7      9432             188043      1
7              398        14  2022      8    175715               9432      1
8              398        14  2022      9    142837             175715      1
9              398        14  2022     10    152659             142837      1
10             398        14  2022     11     52335             152659      1
11             398        14  2022     12    156366              52335      1
12             398        14  2023      1     16416             156366      1
13             398        14  2023      2    163499              16416      1