I am seeing that as the size of the input file increases, failed shuffles increase and job completion time grows non-linearly.
e.g.:
75GB took 1h
86GB took 5h
I also see the average shuffle time increase roughly tenfold,
e.g.:
75GB 4min
85GB 41min
Can someone point me in a direction to debug this?
If you are sure your algorithms are correct, disk-volume partitioning or fragmentation problems may be appearing somewhere past that 75 GB threshold, since you are probably using the same filesystem for caching the intermediate results.
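If that turns out to be the case, spreading the intermediate shuffle output across separate physical volumes and reducing spills to disk can help. Here is a minimal sketch assuming Hadoop MapReduce (your shuffle terminology suggests it); the property names are the Hadoop 2.x ones (older releases use mapred.local.dir and io.sort.mb), and the disk paths are placeholders for your own mount points, so treat it as illustrative rather than a drop-in fix:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Spill map output to several independent disks instead of one
        // filesystem, so intermediate shuffle data does not compete with
        // everything else on a single volume. Paths here are placeholders.
        conf.set("mapreduce.cluster.local.dir",
                 "/disk1/mapred/local,/disk2/mapred/local");

        // A larger in-memory sort buffer (in MB) means fewer spill files
        // per map task, and therefore less merge work during the shuffle.
        conf.setInt("mapreduce.task.io.sort.mb", 256);

        Job job = Job.getInstance(conf, "shuffle-tuning-example");
        // ... set mapper/reducer/input/output as usual, then submit:
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

It is also worth comparing the job counters (failed shuffle fetches, spilled records) between the 75 GB and 86 GB runs before changing anything, to confirm that spills or local-disk contention are actually what grows at the larger size.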