Hadoop MapReduce's CombineFileInputFormat works great for reading a large number of small files. However, I have noticed that the job sometimes fails with the following exception:
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
	at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.initNextRecordReader(CombineFileRecordReader.java:164)
	at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:67)
	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
	at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
	at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.
I have noticed that this happens only when another MapReduce job is running on the same data at the same time; otherwise it works as expected.
The same exception is thrown when I run a Hive query under similar conditions.
I have been searching for a solution or a probable cause.
I finally found the cause. I was using CombineFileInputFormat with gzip, so the first running job was extracting the gzip file into the same folder and deleting the extracted file on completion. When I ran another job in parallel, it also picked up the file unzipped by the first job as part of its input.
Partway through the second job's execution, the first job deleted that unzipped file, and this is what caused the error.
The same applies to Hive.
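One possible workaround, as a rough sketch rather than a confirmed fix: make each job ignore anything in the input folder that is not one of the original .gz files, so a file temporarily unzipped by another job never ends up in its input. The class GzInputFilter and its method names below are hypothetical and use plain Java only. In a real job the same check would presumably live inside an org.apache.hadoop.fs.PathFilter registered through FileInputFormat.setInputPathFilter, which I have not tested against this exact setup.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper (not part of Hadoop): decides which files a job should read.
class GzInputFilter {

    // Accept only the original compressed inputs; a file unzipped next to them
    // by another job has no ".gz" suffix and is therefore skipped.
    static boolean accept(String fileName) {
        return fileName.endsWith(".gz");
    }

    // Apply the filter to a directory listing, keeping the original order.
    static List<String> selectInputs(List<String> listing) {
        return listing.stream()
                .filter(GzInputFilter::accept)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "logs-01" stands for a file temporarily unzipped by a parallel job.
        List<String> listing = Arrays.asList("logs-01.gz", "logs-01", "logs-02.gz");
        System.out.println(selectInputs(listing)); // prints [logs-01.gz, logs-02.gz]
    }
}
```

This assumes the inputs always carry a .gz suffix and the extracted copies do not. If that does not hold, the other option is to avoid running two jobs on the same folder at the same time.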