I have a stream of records with GPS coordinates.
Schema:
latitude: Float
longitude: Float
I want to use PySpark to calculate the distance between the current record and the previous record in real time, as a new column in the DataFrame (e.g. distanceTraveled). I know I can use geopy to calculate the distance, but I'm at a loss as to how to get the latitude and longitude from the previous record.
The output should look something like this:
Schema:
latitude: Float
longitude: Float
distanceTraveled: Float
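To make the desired result concrete, here is a minimal plain-Python sketch of the logic on an in-memory list of records. It is not PySpark and not streaming; it only illustrates what distanceTraveled should contain. The helper names (haversine_km, add_distance_traveled) are made up for this illustration, the haversine formula stands in for geopy, and the choice of kilometres and of 0.0 for the first record are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against tiny floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def add_distance_traveled(records):
    """Return copies of the records with a distanceTraveled field added.

    The first record has no predecessor, so it gets 0.0 (an assumption).
    """
    out = []
    prev = None
    for rec in records:
        if prev is None:
            dist = 0.0
        else:
            dist = haversine_km(prev["latitude"], prev["longitude"],
                                rec["latitude"], rec["longitude"])
        out.append({**rec, "distanceTraveled": dist})
        prev = rec  # remember this record for the next iteration
    return out
```

The part I can't work out is the equivalent of the prev variable when the records arrive through a PySpark stream rather than a Python loop.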
The only solution I found that worked was to store the latest entry in a file (e.g. a pickle) and update it whenever a new entry arrives. However, that doesn't work with EMR Serverless, since those files aren't persisted between runs. It also doesn't seem like it can possibly be the proper way to do this.
Ideally, I'd also like to extend the solution so that, as part of the same job, I can calculate the distance not just since the last entry but also over a set time period (e.g. over the last hour).
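For the time-window case, this is roughly the behaviour I have in mind, again as a plain-Python sketch rather than a PySpark job. It assumes each record also carries a timestamp (epoch seconds), which is not in the schema shown here; the RollingDistance class and its field names are hypothetical, and the haversine formula again stands in for geopy.

```python
import math
from collections import deque

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


class RollingDistance:
    """Total distance covered by segments that ended within the last window."""

    def __init__(self, window_seconds=3600):
        self.window_seconds = window_seconds
        self.segments = deque()  # (timestamp of the later point, segment km)
        self.prev = None         # (latitude, longitude) of the last record seen

    def update(self, timestamp, latitude, longitude):
        """Add one record (assumed to arrive in time order) and return the windowed total."""
        if self.prev is not None:
            seg = haversine_km(self.prev[0], self.prev[1], latitude, longitude)
            self.segments.append((timestamp, seg))
        self.prev = (latitude, longitude)

        # Drop segments that ended at or before the start of the window.
        cutoff = timestamp - self.window_seconds
        while self.segments and self.segments[0][0] <= cutoff:
            self.segments.popleft()

        return math.fsum(seg for _, seg in self.segments)
```

Here the state (prev and segments) lives in a Python object, which is exactly what I don't know how to keep between records or runs in a PySpark job on EMR Serverless.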