Cascading join two files very slow


I am using Cascading to HashJoin two 300 MB files, with the following workflow:

// select the field I need from the first file
Fields f1 = new Fields( "id_1" );
docPipe1 = new Each( docPipe1, scrubArguments, new ScrubFunction( f1 ), Fields.RESULTS );

// select the fields I need from the second file
Fields f2 = new Fields( "id_2", "category" );
docPipe2 = new Each( docPipe2, scrubArguments, new ScrubFunction( f2 ), Fields.RESULTS );

// hash join
Pipe tokenPipe = new HashJoin( docPipe1, new Fields( "id_1" ),
                     docPipe2, new Fields( "id_2" ), new LeftJoin() );

// count the occurrences of each "category", matching id_1 to id_2
Pipe pipe = new Pipe( tokenPipe );
pipe = new GroupBy( pipe, new Fields( "category" ) );
pipe = new Every( pipe, Fields.ALL, new Count(), Fields.ALL );

I am running this Cascading program on a Hadoop cluster with 3 datanodes, each with 8 GB of RAM and 4 cores (I set mapred.child.java.opts to 4096 MB), but it takes about 30 minutes to get the final result. That seems too slow to me, yet I don't see a problem in my program or in the cluster. How can I make this Cascading join faster?


There are 2 answers

Nagendra kumar

As given in the Cascading user guide:

HashJoin attempts to keep the entire right-hand stream in memory for rapid comparison (not just the current grouping, as no grouping is performed for a HashJoin). Thus a very large tuple stream in the right-hand stream may exceed a configurable spill-to-disk threshold, reducing performance and potentially causing a memory error. For this reason, it's advisable to use the smaller stream on the right-hand side.
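In the question's code, docPipe2 (the stream carrying "id_2" and "category") sits on the right-hand side and is therefore held in memory. If docPipe1 turns out to be the smaller stream after scrubbing, one option is to swap the sides and flip the joiner; a sketch (assuming docPipe1 really is smaller — RightJoin preserves the original left-join semantics, keeping docPipe1 as the outer side):

```java
// Sketch: HashJoin holds the right-hand stream in memory, so put the
// smaller stream (assumed here to be docPipe1) on the right. Swapping
// the sides turns the original LeftJoin into a RightJoin so docPipe1
// remains the outer side of the join.
Pipe tokenPipe = new HashJoin( docPipe2, new Fields( "id_2" ),
                               docPipe1, new Fields( "id_1" ),
                               new RightJoin() );
```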

Alternatively, using CoGroup might help.
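A drop-in CoGroup version of the join in the question would look like this sketch (same pipes and fields as above; CoGroup is a reduce-side join, so neither stream needs to fit in memory, at the cost of an extra shuffle):

```java
// Sketch: reduce-side join; avoids HashJoin's in-memory requirement
// but adds a shuffle/sort of both streams.
Pipe tokenPipe = new CoGroup( docPipe1, new Fields( "id_1" ),
                              docPipe2, new Fields( "id_2" ),
                              new LeftJoin() );
```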

Neha Kumari

It may be that your Hadoop cluster is busy or dedicated to some other job, hence the time taken. I don't think replacing HashJoin with CoGroup will help: CoGroup is a reduce-side join, while HashJoin does a map-side join, so HashJoin should be more performant than CoGroup. I suggest trying again on a less busy cluster, since your code looks fine.