Spark: avoid rebuilding the RDD every time


I have data in S3. I can load it as an RDD, apply some transformations to convert it to a DataFrame, and run Spark SQL queries against it. But whenever new data is added to S3, I have to load the entire dataset as an RDD again, convert it to a DataFrame, and rerun the queries. Is there a way to avoid loading the entire dataset and load only the new data, i.e. have the new data appended to the existing RDD instead of rebuilding it from scratch?

2 Answers

abhi (best solution)

After several tryouts, I concluded that there is no way to avoid rebuilding the RDD. I now rebuild it periodically so that new files in S3 are included. Alternatively, I can query the data in S3 through a Glue table using Spark, but this is slow, since every query internally constructs the RDD/DataFrame again.

Steve Loughran

Take a look at Spark Streaming: one of its sources monitors directories for new files, so only the newly arrived data is read in each batch.