Pyspark structured streaming - data from previous record

36 views Asked by David Cunningham At 15 December 2023 at 20:35

I have a use case where I have a stream of records with gps coordinates.

Schema:
 latitude: Float
 longitude: Float

I want to use pyspark to calculate the distance between my current record and my previous record in real time as a new column in the dataframe (e.g. distanceTraveled). I know i can use the geopy to calculate the distance but I'm at a loss on how to get the latitude and longitude from my previous record.

The output should look something like

Schema:
  latitude: Float
  longitude: Float
  distanceTraveled: Float

The only solution that I figured out that worked was to store the entry in some kind of file like a pickle and then whenever I get a new entry update the record. However, that doesn't seem to work with EMR Serverless since those files aren't persisted between runs. That also just seems like it can't possibly be the proper way to do this.

Ideally I'd also like to be able to expand my solution so that I could do the same thing not just since the last entry but over a set time period as well as part of the same job (like over the last hour).

Original Q&A

TechQA.

Pyspark structured streaming - data from previous record

There are 0 answers

Related Questions in APACHE-SPARK

Related Questions in PYSPARK

Related Questions in APACHE-SPARK-SQL

Related Questions in AMAZON-EMR

Related Questions in EMR-SERVERLESS

Popular Questions

Popular Tags

Trending Questions