I am working with the following table:
+------+------+------+------+---------------+---------+-------+
| ID 1 | ID 2 | Date | Type | Marked_Latest | Updated | Notes |
+------+------+------+------+---------------+---------+-------+
| 1 | 100 | 2001 | SMT | | | |
| 1 | 101 | 2005 | SMT | | | |
| 1 | 102 | 2020 | SMT | Latest | | |
| 1 | 103 | 2020 | SMT | | | |
| 1 | 103 | 2020 | ABT | | | |
| 2 | 201 | 2009 | CMT | Latest | | |
| 2 | 202 | 2022 | SMT | | | |
| 2 | 203 | 2022 | SMT | | | |
+------+------+------+------+---------------+---------+-------+
I am trying to perform the following steps using a df.query()
but since there are so many caveats I am not sure how to fit them all in.
Step 1: Only looking at Type == "SMT" or Type == "CMT"
, group by ID 1 and identify latest date
, compare this (grouped ID 1 data) to date
of Marked_Latest == "Latest
(essentially, just verifying that the date is correct)
Step 2: If the date
values are the same, do nothing. If different, then supply ID 2
next to original Marked_Latest == "Latest"
in Updated
Step 3: If multiple Latest
have the same max Date
, put a note in Notes
that says "multiple".
This will result in the following table:
+------+------+------+------+---------------+---------+----------+
| ID 1 | ID 2 | Date | Type | Marked_Latest | Updated | Notes |
+------+------+------+------+---------------+---------+----------+
| 1 | 100 | 2001 | SMT | | | |
| 1 | 101 | 2005 | SMT | | | |
| 1 | 102 | 2020 | SMT | Latest | | multiple |
| 1 | 103 | 2020 | SMT | | | multiple |
| 1 | 103 | 2020 | ABT | | | |
| 2 | 201 | 2009 | CMT | Latest | 203 | |
| 2 | 202 | 2022 | SMT | | | multiple |
| 2 | 203 | 2022 | SMT | | | multiple |
+------+------+------+------+---------------+---------+----------+
To summarize: check that the latest date is actually marked as latest date. If it is not marked as latest date, write the updated ID 2
next to the original (incorrect) latest date. And when there are multiple cases of latest date, inputting "multiple" for each ID of latest date.
I have gotten only as far as identifying the actual latest date, using
q = df.query('Type' == "SMT" or 'Type' == "CMT").groupby('ID 1').last('ID 2')
q
This will return a subset with the latest dates marked, but I am not sure how to proceed from here, i.e. how to now compare this dataframe with the date
field corresponding to Marked_Latest
.
All help appreciated.
Try:
Output: