I have a requirement to process a file as-is, meaning the file content should be processed in the same order it appears in the file.

For example: I have a file of 700 MB. How can we make sure the file is processed in order, given that processing depends on DataNode availability and, in some cases, a DataNode may process its portion slowly (low-spec hardware)?

One way to fix this is to add a unique id/key to each record in the file, but we don't want to add anything new to the file.

Any thoughts :)
You can guarantee that only one mapper processes the content of the file by writing your own FileInputFormat which sets isSplitable to false. For more examples of how to do it, I'd recommend a GitHub project. Depending on your Hadoop version, slight changes might be necessary.
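A minimal sketch of that idea, assuming the Hadoop 2.x `mapreduce` API (the class name `WholeFileTextInputFormat` is made up for illustration): subclassing TextInputFormat and overriding isSplitable to return false means each file becomes a single InputSplit, so one mapper reads all its records in file order.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// An input format that refuses to split its input files, so the entire
// 700 MB file is handed to a single mapper and read sequentially.
public class WholeFileTextInputFormat extends TextInputFormat {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Returning false forces Hadoop to create one InputSplit per file
        // instead of one per HDFS block.
        return false;
    }
}
```

You would then register it on the job, e.g. `job.setInputFormatClass(WholeFileTextInputFormat.class);`. Note the trade-off: you lose parallelism for that file, since a single mapper now does all the work.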