How to make Hadoop MR read only files instead of folders in the input path


In our setup, the output of one job is the input of the next job.

Using MultipleOutputs, we create a new folder inside the output path and write some records into that folder. The resulting layout looks like this:

OPFolder1/MultipleOP/SplRecords-m-0000*
OPFolder1/part-m-0000* files

When the next job uses OPFolder1 as its input, I get the following error:

org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:298)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:85)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:548)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:786)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
    org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): Path is not a file: /user/abhime01/OPFolder1/MultiplOP/

Is there any way, or any property, to make Hadoop read only the files and skip the folders?


There are 3 answers

2
Remus Rusanu

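Set mapreduce.input.fileinputformat.input.dir.recursive to true, so that FileInputFormat descends into subdirectories instead of treating them as files. See the related question "FileInputFormat doesn't read files recursively in the input path dir".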
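
As a rough sketch, the property could be set in the driver of the second job like this. The class name RecursiveInputDriver, the method name and the job name are made up for illustration; FileInputFormat.setInputDirRecursive is the new-API setter for the same property. Keep in mind that with recursion enabled the job also reads the files under OPFolder1/MultipleOP, so this only fits if those records belong in the input.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import java.io.IOException;

// Hypothetical driver fragment for the second job (names are assumptions).
class RecursiveInputDriver {
    static Job configure(Configuration conf) throws IOException {
        Job job = Job.getInstance(conf, "second-job");
        // Same effect as setting
        // mapreduce.input.fileinputformat.input.dir.recursive=true
        FileInputFormat.setInputDirRecursive(job, true);
        // The whole output folder of the first job, subfolders included.
        FileInputFormat.addInputPath(job, new Path("OPFolder1"));
        return job;
    }
}
```

If the driver goes through ToolRunner, passing -D mapreduce.input.fileinputformat.input.dir.recursive=true on the command line should have the same effect without a code change.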

0
Jyadav

One way to achieve this is to create a custom input format by subclassing FileInputFormat (or one of its subclasses, such as TextInputFormat) and overriding the listStatus method. In your listStatus implementation, you just need to skip any directories inside the input directory.

Example:

for (int i = 0; i < len; ++i) {
    FileStatus file = files[i];
    // Keep plain files only; skip subdirectories.
    if (!file.isDirectory()) {
        newFiles.add(file);
    }
}
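
For context, the loop above could sit inside a complete input format along these lines. This is only a sketch: it assumes the job reads text through the new-API TextInputFormat (the stack trace in the question shows LineRecordReader), and the class name FilesOnlyTextInputFormat is invented for the example.

```java
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical name: behaves like TextInputFormat but ignores subdirectories.
class FilesOnlyTextInputFormat extends TextInputFormat {
    @Override
    protected List<FileStatus> listStatus(JobContext job) throws IOException {
        List<FileStatus> filesOnly = new ArrayList<>();
        // The parent lists every entry under the input paths, folders included.
        for (FileStatus status : super.listStatus(job)) {
            if (!status.isDirectory()) { // drop entries such as MultipleOP
                filesOnly.add(status);
            }
        }
        return filesOnly;
    }
}
```

It would then be registered in the driver with job.setInputFormatClass(FilesOnlyTextInputFormat.class).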

Hope that will help you.

0
vefthym

Instead of using the root directory as the input path, you could use the glob OPFolder1/part-m*, which matches all files in that directory whose names start with part-m.
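
A minimal sketch of what that might look like in the driver, assuming the new mapreduce API. The class name GlobInputDriver, the method name and the job name are placeholders, not anything from the question.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import java.io.IOException;

// Hypothetical driver fragment for the second job (names are assumptions).
class GlobInputDriver {
    static Job configure(Configuration conf) throws IOException {
        Job job = Job.getInstance(conf, "second-job");
        // FileInputFormat expands the glob when it lists the input, so only
        // files named part-m* directly under OPFolder1 are picked up and the
        // MultipleOP folder is never opened.
        FileInputFormat.setInputPaths(job, new Path("OPFolder1/part-m*"));
        return job;
    }
}
```

This needs no custom input format, but it relies on the first job keeping the default part-m naming for its main output.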