I have data in S3. I am able to load it as an RDD, apply some transformations to convert it to a DataFrame, and run Spark SQL queries. But whenever new data is added to S3, I need to load the entire dataset as an RDD again, convert it to a DataFrame, and rerun the queries. Is there a way to avoid loading the entire dataset and load only the new data, i.e. so that the new data gets added to the RDD instead of the whole RDD being rebuilt?
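For context, a minimal sketch of the full-reload pattern I am doing today. The path `s3a://my-bucket/data/`, the comma-separated layout, and the `Record` case class are just illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative record layout; the real data shape is an assumption.
case class Record(id: Int, value: String)

object FullReload {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("full-reload").getOrCreate()
    import spark.implicits._

    // Load everything from S3 as an RDD, transform it, convert to a
    // DataFrame -- the whole dataset is reread every time this runs.
    val rdd = spark.sparkContext.textFile("s3a://my-bucket/data/")
    val df = rdd.map { line =>
      // Assumes "id,value" lines
      val Array(id, value) = line.split(",", 2)
      Record(id.trim.toInt, value)
    }.toDF()

    df.createOrReplaceTempView("records")
    spark.sql("SELECT COUNT(*) FROM records").show()
  }
}
```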
2 Answers

Take a look at Spark Streaming: one of its input sources monitors a directory for new files.
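For example, a minimal Structured Streaming sketch of that idea (the `s3a://my-bucket/data/` path, the CSV format, and the schema are assumptions, not from the question). The file source tracks which files it has already processed and only reads files that are new since the last trigger:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object StreamNewFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("s3-file-stream").getOrCreate()

    // A schema must be supplied explicitly for a streaming file source.
    val schema = new StructType()
      .add("id", IntegerType)
      .add("value", StringType)

    // The file source monitors the directory; files already processed are
    // not reread, only newly arrived files are picked up on later triggers.
    val stream = spark.readStream
      .schema(schema)
      .csv("s3a://my-bucket/data/")

    // Write each micro-batch to an in-memory table so it can be queried
    // with Spark SQL (the memory sink is only suitable for small data / demos).
    val query = stream.writeStream
      .format("memory")
      .queryName("records")
      .outputMode("append")
      .start()

    spark.sql("SELECT COUNT(*) FROM records").show()

    query.awaitTermination()
  }
}
```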
After several tries, I concluded that there is no way to avoid rebuilding the RDD. I now rebuild the RDD periodically so that new files in S3 are also included. Alternatively, I can query the data in S3 via a Glue table using Spark, but this is slow, since an RDD/DataFrame is constructed internally for every query.
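A rough sketch of that periodic rebuild, assuming a hypothetical `s3a://my-bucket/data/` path with JSON files; the whole dataset is still reread on every refresh, which is exactly the cost I could not avoid:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PeriodicReload {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("periodic-reload").getOrCreate()

    var current: Option[DataFrame] = None

    // Reload the full dataset from S3, cache it, and swap the temp view.
    def refresh(): Unit = {
      val df = spark.read.json("s3a://my-bucket/data/") // path and format are assumptions
      df.cache()
      df.createOrReplaceTempView("records")
      current.foreach(_.unpersist()) // release the previously cached copy
      current = Some(df)
    }

    // Rebuild every 10 minutes so newly added S3 files show up in queries.
    while (true) {
      refresh()
      spark.sql("SELECT COUNT(*) FROM records").show()
      Thread.sleep(10 * 60 * 1000L)
    }
  }
}
```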