I am building a lambda architecture: I have coded the streaming layer and am now working on the batch layer. For that purpose, I am using Spark 2 as the batch processor and HDFS as the master dataset.
To read data from HDFS, I wrote the following piece of code:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
    .builder()
    .appName("JavaWordCount")
    .master("local")
    .config("spark.sql.warehouse.dir", "file:///c:/tmp/spark-warehouse")
    .getOrCreate();

// Batch read: takes a one-time snapshot of the files matching the path.
JavaRDD<String> msg = spark.read().textFile("hdfs://mypath/*").javaRDD();
However, with this code, data inserted into HDFS after Spark starts is not read. How can I achieve that?
Is Structured Streaming (http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) the only solution, or is there another one?
Yes, in my opinion, Spark 2.x Structured Streaming enables this: you can point a streaming query at an HDFS directory, and Spark will pick up new files as they arrive instead of taking a one-time snapshot.
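As a minimal sketch (the path and app name are placeholders taken from your question, and the console sink is just for illustration), you could replace the batch read with a streaming one:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

SparkSession spark = SparkSession
    .builder()
    .appName("JavaWordCount")
    .master("local")
    .getOrCreate();

// Treat the directory as a streaming source: every new file that lands
// under the path becomes part of the next micro-batch.
Dataset<Row> lines = spark
    .readStream()
    .text("hdfs://mypath/");   // placeholder path from the question

// Print each micro-batch to the console for illustration; in a real
// batch layer you would write to a sink such as Parquet on HDFS.
StreamingQuery query = lines
    .writeStream()
    .outputMode("append")
    .format("console")
    .start();

query.awaitTermination();

Note that the file source detects files that are atomically placed in the directory (e.g. written elsewhere and then moved in); appends to already-processed files are not picked up.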
I would also advise you to watch this presentation from Spark Summit 2017: https://www.youtube.com/watch?list=PLTPXxbhUt-YVEyOqTmZ_X_tpzOlJLiU2k&v=IJmFTXvUZgY