Spark stream unable to read files created from flume in hdfs

Question

988 views Asked by Y0gesh Gupta At 09 June 2015 at 04:13

I have built a real-time application that uses Flume to write data streams from weblogs to HDFS and then processes that data with Spark Streaming. However, while Flume is creating and writing new files in HDFS, Spark Streaming is unable to process them. If I copy files into the HDFS directory with the put command, Spark Streaming reads and processes them correctly. Any help with this would be great.

Original Q&A

There are 3 answers

Erik Schmiegelow On 12 June 2015 at 13:51

In addition to frb's answer, which is correct: Spark Streaming with Flume acts as an Avro RPC server, so you'll need to configure an AvroSink that points to your Spark Streaming instance.
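As a rough illustration of that AvroSink setup, a Flume agent fragment might look like the one below. The agent, sink and channel names (`agent`, `sparkSink`, `memoryChannel`) and the host and port values are assumptions made for this sketch, not values from the question; the port has to match the one the Spark Streaming receiver listens on.

```properties
# Hypothetical names and values: adjust to your own agent definition.
agent.sinks = sparkSink
agent.sinks.sparkSink.type = avro
agent.sinks.sparkSink.channel = memoryChannel
# Host and port where the Spark Streaming receiver is listening (assumed values)
agent.sinks.sparkSink.hostname = spark-receiver-host
agent.sinks.sparkSink.port = 9999
```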

CarloV On 23 November 2017 at 15:12

With Spark 2 you can now connect Spark Streaming directly to Flume (see the official docs) and then write to HDFS once, at the end of the process.

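 import org.apache.spark.streaming.flume._
 // streamingContext is your existing StreamingContext; replace the bracketed placeholders with the host and port the Flume Avro sink sends to
 val flumeStream = FlumeUtils.createStream(streamingContext, [chosen machine's hostname], [chosen port])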
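A slightly fuller sketch of the same idea, for reference. It assumes the separate `spark-streaming-flume` artifact is on the classpath; the host, port, batch interval and output prefix are placeholder values chosen for illustration, and `decodeBody` is a helper introduced here, not part of the Flume integration.

```scala
import java.nio.charset.StandardCharsets

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumeToHdfs {
  // Hypothetical helper: decode a Flume event body (raw bytes) as UTF-8 text.
  def decodeBody(bytes: Array[Byte]): String =
    new String(bytes, StandardCharsets.UTF_8)

  def main(args: Array[String]): Unit = {
    // Assumed values: the host/port must match what the Flume Avro sink
    // sends to, and the output prefix is an arbitrary HDFS location.
    val host = "spark-receiver-host"
    val port = 9999
    val outputPrefix = "hdfs:///tmp/weblogs/batch"

    val conf = new SparkConf().setAppName("FlumeToHdfs")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Push-based receiver: Spark listens on host:port and Flume pushes events to it.
    val flumeStream = FlumeUtils.createStream(ssc, host, port)

    // Each element is a SparkFlumeEvent wrapping the Avro event; keep only the body.
    val lines = flumeStream.map(e => decodeBody(e.event.getBody.array()))

    // Write each batch to HDFS once, after processing.
    lines.saveAsTextFiles(outputPrefix)

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With this layout Flume never writes to HDFS itself, so Spark never has to read a file that is still being written.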

frb (Accepted Answer) On 11 June 2015 at 09:20

You have identified the problem yourself: while the stream of data continues, the HDFS file is "locked" and cannot be read by any other process. By contrast, as you have seen, if you put a batch of data (that's what your file is: a batch, not a stream), it is ready to be read as soon as it is uploaded.

That said, and without being an expert on Spark Streaming, it seems from the Overview section of the Spark Streaming Programming Guide that your deployment is not set up the intended way. Judging from the picture shown there, the stream (in this case generated by Flume) should be sent directly to the Spark Streaming engine, and the results are then written to HDFS.

Nevertheless, if you want to keep your current deployment, i.e. Flume -> HDFS -> Spark, my suggestion is to create mini-batches of data in temporary HDFS folders. Once a mini-batch is complete, start storing new data in a second mini-batch and pass the first one to Spark for analysis.
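One possible variant of this hand-off, sketched under assumptions: instead of separate temporary folders, it relies on the file name to tell finished files from in-progress ones. To my knowledge the Flume HDFS sink marks files it is still writing with a `.tmp` suffix by default and renames them when it rolls the file, so the filter below skips anything that still looks in progress. The directory path, batch interval and the `isFinished` helper are hypothetical, and whether renamed files are picked up also depends on how Spark tracks file modification times, so treat this as a starting point rather than a guaranteed fix.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FinishedFilesOnly {
  // Hypothetical helper: true only for files that no longer look "in progress".
  def isFinished(name: String): Boolean =
    !name.startsWith(".") && !name.startsWith("_") && !name.endsWith(".tmp")

  def main(args: Array[String]): Unit = {
    // Assumed path: the directory the Flume HDFS sink writes to.
    val watchedDir = "hdfs:///flume/weblogs"

    val conf = new SparkConf().setAppName("FinishedFilesOnly")
    val ssc = new StreamingContext(conf, Seconds(30))

    // Monitor the directory, but only accept files that pass the filter.
    val lines = ssc
      .fileStream[LongWritable, Text, TextInputFormat](
        watchedDir,
        (p: Path) => isFinished(p.getName),
        newFilesOnly = true)
      .map { case (_, text) => text.toString }

    // Placeholder analysis: print the number of lines in each batch.
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

On the Flume side, the size of each mini-batch would then be governed by the HDFS sink's roll settings, since a file only becomes visible to this filter once it has been rolled and renamed.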

HTH
