I am building a lambda architecture: I have coded the streaming layer and am now working on the batch layer. For that purpose, I am using Spark 2 as the batch processor and HDFS as the master dataset.
To read data from HDFS, I wrote the following piece of code:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
    .builder()
    .appName("JavaWordCount")
    .master("local")
    .config("spark.sql.warehouse.dir", "file:///c:/tmp/spark-warehouse")
    .getOrCreate();

// Batch read: takes a one-time snapshot of the files matching the path.
JavaRDD<String> msg = spark.read().textFile("hdfs://mypath/*").javaRDD();
However, with this code, data inserted into HDFS after Spark starts is not read. How can I achieve that?
Is Structured Streaming (http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) the only solution, or is there another one?
Yes, in my opinion, Spark 2.x Structured Streaming enables this: you can point a streaming query at an HDFS directory, and Spark will pick up new files as they arrive instead of taking a one-time snapshot.
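As a minimal sketch (the path and app name are placeholders taken from your question, and the console sink is just for illustration), you could replace the batch read with a streaming one:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

SparkSession spark = SparkSession
    .builder()
    .appName("JavaWordCount")
    .master("local")
    .getOrCreate();

// Treat the directory as a streaming source: every new file that lands
// under the path becomes part of the next micro-batch.
Dataset<Row> lines = spark
    .readStream()
    .text("hdfs://mypath/");   // placeholder path from the question

// Print each micro-batch to the console for illustration; in a real
// batch layer you would write to a sink such as Parquet on HDFS.
StreamingQuery query = lines
    .writeStream()
    .outputMode("append")
    .format("console")
    .start();

query.awaitTermination();

Note that the file source detects files that are atomically placed in the directory (e.g. written elsewhere and then moved in); appends to already-processed files are not picked up.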
I would also advise you to watch this presentation from Spark Summit 2017: https://www.youtube.com/watch?list=PLTPXxbhUt-YVEyOqTmZ_X_tpzOlJLiU2k&v=IJmFTXvUZgY