How to make Hadoop MR to read only files instead of folders in input path

Question

Asked by Abhinay on 16 February 2016 at 09:53

As per our requirement, the output of one job is the input of another job.

Using MultipleOutputs, we create a new folder inside the output path and write some of the records into that folder. The output looks like this:

OPFolder1/MultipleOP/SplRecords-m-0000*
OPFolder1/part-m-0000* files

When the next job uses OPFolder1 as its input, I get the error below:

org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:298)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:85)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:548)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:786)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
    org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): Path is not a file: /user/abhime01/OPFolder1/MultiplOP/

Is there any way, or any property, to make Hadoop read only the files in the input path and skip the folders?

There are 3 answers

**Remus Rusanu** · Answer 1 · 2016-02-16T10:00:00+00:00

Set mapreduce.input.fileinputformat.input.dir.recursive to true. See the related question "FileInputFormat doesn't read files recursively in the input path dir".
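As a rough sketch of how that property could be set from the driver of the second job, assuming the new `org.apache.hadoop.mapreduce` API: the class name `RecursiveInputExample` and the job name are made up for illustration. `FileInputFormat.setInputDirRecursive` writes the same property named above.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Hypothetical driver helper; only the recursive-input setting is shown.
public class RecursiveInputExample {

    public static Job configure(Configuration conf) throws IOException {
        // "second-job" is a placeholder job name.
        Job job = Job.getInstance(conf, "second-job");

        // Equivalent to setting
        // mapreduce.input.fileinputformat.input.dir.recursive=true
        FileInputFormat.setInputDirRecursive(job, true);
        return job;
    }
}
```

Note that with this setting the job descends into sub-directories, so the files under OPFolder1/MultipleOP would also be read as input rather than skipped. If the driver uses ToolRunner, passing `-D mapreduce.input.fileinputformat.input.dir.recursive=true` on the command line should have the same effect.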

**Jyadav** · Answer 2 · 2016-02-16T11:52:30+00:00

One way to achieve this is to create a custom input format by subclassing FileInputFormat (or one of its subclasses, such as TextInputFormat) and overriding the listStatus method. In your listStatus implementation you just need to ignore the directories inside your input dir.

Example:

 for (int i = 0; i < len; ++i) {
     FileStatus file = files[i];
     // Keep only regular files; skip sub-directories.
     if (!file.isDirectory()) {
         newFiles.add(file);
     }
 }

Hope that will help you.
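A fuller sketch of this idea, assuming line-oriented text input and the new `org.apache.hadoop.mapreduce` API, might look like the following. The class name `FilesOnlyTextInputFormat` and the helper `keepFilesOnly` are hypothetical names chosen for illustration, not part of Hadoop.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical input format that drops directories from the input listing.
public class FilesOnlyTextInputFormat extends TextInputFormat {

    @Override
    protected List<FileStatus> listStatus(JobContext job) throws IOException {
        // Let the parent class expand the input paths, then filter the result.
        return keepFilesOnly(super.listStatus(job));
    }

    // Returns a new list containing only the entries that are not directories.
    static List<FileStatus> keepFilesOnly(List<FileStatus> statuses) {
        List<FileStatus> filesOnly = new ArrayList<>();
        for (FileStatus status : statuses) {
            if (!status.isDirectory()) {
                filesOnly.add(status);
            }
        }
        return filesOnly;
    }
}
```

The second job would then select it with `job.setInputFormatClass(FilesOnlyTextInputFormat.class)` and keep OPFolder1 as its input path.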

**vefthym** · Answer 3 · 2016-02-17T03:12:31+00:00

Instead of using the root directory as the input path, you could use the glob OPFolder1/part-m*, which matches all the files in this directory whose names start with part-m.
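In the driver this could be sketched roughly as follows, assuming the new `org.apache.hadoop.mapreduce` API. The class and method names (`GlobInputExample`, `addPartFiles`) are hypothetical; the glob itself is the one suggested above.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Hypothetical driver helper that registers a glob instead of a directory.
public class GlobInputExample {

    public static void addPartFiles(Job job, String outputDir) throws IOException {
        // The pattern is expanded when the job lists its input,
        // so only entries named part-m* directly under outputDir match.
        FileInputFormat.addInputPath(job, new Path(outputDir, "part-m*"));
    }
}
```

For the layout in the question, `addPartFiles(job, "OPFolder1")` would pick up the part-m-0000* files and leave the MultipleOP folder out, since its name does not match the pattern.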
