I have a requirement to process a file as-is, meaning the file content should be processed in the same order it appears in the file.

For example: I have a file of 700 MB. How can we make sure the file is processed in order, given that processing depends on DataNode availability and, in some cases, a DataNode may process its portion slowly (low-spec hardware)?

One way to fix this is to add a unique id/key to each record in the file, but we don't want to add anything new to the file.

Any thoughts :)
You can guarantee that only one mapper processes the content of the file by writing your own FileInputFormat which sets isSplitable to false. For more examples of how to do it, I'd recommend a GitHub project. Depending on your Hadoop version, slight changes might be necessary.
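A minimal sketch of that idea, assuming the Hadoop 2.x `mapreduce` API (the class name `WholeFileTextInputFormat` is made up for illustration): subclassing TextInputFormat and overriding isSplitable to return false means each file becomes a single InputSplit, so one mapper reads all its records in file order.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// An input format that refuses to split its input files, so the entire
// 700 MB file is handed to a single mapper and read sequentially.
public class WholeFileTextInputFormat extends TextInputFormat {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Returning false forces Hadoop to create one InputSplit per file
        // instead of one per HDFS block.
        return false;
    }
}
```

You would then register it on the job, e.g. `job.setInputFormatClass(WholeFileTextInputFormat.class);`. Note the trade-off: you lose parallelism for that file, since a single mapper now does all the work.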