Batch layer: How does Spark read and process new data from Master Data?


I am building a lambda architecture: I have coded the streaming layer and am now working on the batch layer. For that purpose, I am using Spark 2 as the batch processor and HDFS as the master dataset.

To read data from HDFS, I wrote the following piece of code:

      import org.apache.spark.api.java.JavaRDD;
      import org.apache.spark.sql.SparkSession;

      SparkSession spark = SparkSession
                .builder()
                .appName("JavaWordCount")
                .master("local")
                .config("spark.sql.warehouse.dir", "file:///c:/tmp/spark-warehouse")
                .getOrCreate();

      // read() is a one-shot batch read: only files already present at this
      // moment are loaded (note the lowercase "hdfs://" scheme)
      JavaRDD<String> msg = spark.read().textFile("hdfs://mypath/*").javaRDD();

However, with this code, data inserted into HDFS after the Spark job starts is never read. How can I read that new data as well?

Is Structured Streaming (http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) the only solution, or is there another one?

1 Answer

Answered by JS G. (accepted):

Yes, in my opinion, Spark 2.x Structured Streaming makes this possible.

I would advise you to watch this presentation from Spark Summit 2017: https://www.youtube.com/watch?list=PLTPXxbhUt-YVEyOqTmZ_X_tpzOlJLiU2k&v=IJmFTXvUZgY
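To make that concrete, here is a minimal sketch of what the read could look like with Structured Streaming's file source (available in Spark 2.1+ as readStream().textFile()). The path, app name, and console sink are illustrative placeholders, not code from the talk:

      import org.apache.spark.sql.Dataset;
      import org.apache.spark.sql.SparkSession;
      import org.apache.spark.sql.streaming.StreamingQuery;

      SparkSession spark = SparkSession
                .builder()
                .appName("JavaWordCountStreaming")
                .master("local[*]")
                .getOrCreate();

      // The file source monitors the directory: files that land in it
      // after the query starts are picked up on each micro-batch.
      Dataset<String> lines = spark
                .readStream()
                .textFile("hdfs://mypath/");

      // Console sink, just to watch new lines arrive; a real batch layer
      // would aggregate and write to a durable sink instead.
      StreamingQuery query = lines
                .writeStream()
                .outputMode("append")
                .format("console")
                .start();

      query.awaitTermination();

Note that the file source only detects files atomically placed in the directory (e.g., written elsewhere and then moved in); appends to existing files are not picked up.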