I have a data connection source that creates two datasets:
- Dataset X (Snapshot)
- Dataset Y (Incremental)
The two datasets pull from the same source. Dataset X
consists of the current state of all rows in the source table. Dataset Y
pulls all rows that have been updated since the last build. These two datasets are then merged downstream into dataset Z
with dataset Z
being either dataset X
or the most recent version of each row from dataset Y
. This allows us to both have low latency updates and maintain good partitioning.
When rows are deleted in the source table, the rows are no longer present in dataset X
but are still present in dataset Y
.
What would be the best way keep these 'deleted' rows in dataset Z
? Ideally I would also be able to snapshot dataset Y
without losing any of the 'deleted' rows.
Good question! From what I understand, you want dataset
Z
to have only the most up-to-date rows, including the most up-to-date deleted rows. Both updated rows and deleted rows are present inY
. In this case, I would suggest first unioningY
andX
together, so that all rows, including deleted rows are present in the union dataset. Then, use a window function over a date column in order to get the most recent version of each row. Here is an outline of the pyspark code I would suggest for this:Note that this solution does not get around the issue of what happens if Y snapshots.