How to create splits from a sequence file in Hadoop?

I have a 3 GB sequence file in Hadoop and want to process it in parallel, so I plan to run 8 map tasks and therefore create 8 FileSplits.

The FileSplit class has constructors that require:

Path of the file
Start position
Length

For example, the first split can start at 0 with length 3GB/8, the next split at 3GB/8 with length 3GB/8, and so forth.

SequenceFile.Reader also has a constructor that takes the same arguments:

Path of the file
Start position
Length

For the first split (from 0 with length 3GB/8), SequenceFile.Reader was able to read the data, because that range contains the file header with the compression type and the key and value classes.

For the other splits, however, SequenceFile.Reader was not able to read the data. I think this is because those portions of the file do not contain the sequence file header (the split does not start at 0), so a NullPointerException is thrown when I try to use the reader.

So is there a way to make file splits from the sequence file?

1 answer

Answered by Mosab Shaheen

The start and length parameters of SequenceFile.Reader are not meant for selecting a portion of a sequence file. They specify where a complete sequence file really begins and how far it extends. For example, if a container file holds five sequence files back to back and you want to read one of them, you pass the start and length of that sequence file inside the container. You can also read from the beginning of a sequence file up to a specific length. What you cannot do is set start to the middle of a sequence file: the reader would skip the header and fail with a "not a sequence file" error. The start parameter must therefore point at the beginning of a sequence file.

Therefore, the solution is to create your file split in your InputFormat as usual:

new FileSplit(path, start, span, hosts);
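The start and span values for each split can be computed with plain arithmetic. The following is a minimal sketch, not part of the original answer: the class name SplitPlanner and the method plan are hypothetical, and no Hadoop classes are used so that it runs on its own. Each {start, length} pair would be handed to new FileSplit(path, start, length, hosts) in getSplits. The boundaries do not have to coincide with record boundaries, because the reader aligns itself afterwards.

```java
public class SplitPlanner {
    /** Returns {start, length} pairs that cover [0, fileLength) in numSplits contiguous ranges. */
    public static long[][] plan(long fileLength, int numSplits) {
        long base = fileLength / numSplits;
        long[][] ranges = new long[numSplits][2];
        for (int i = 0; i < numSplits; i++) {
            long start = i * base;
            // The last range absorbs the remainder so that the whole file is covered.
            long length = (i == numSplits - 1) ? fileLength - start : base;
            ranges[i][0] = start;
            ranges[i][1] = length;
        }
        return ranges;
    }

    public static void main(String[] args) {
        long threeGiB = 3L * 1024 * 1024 * 1024; // the 3 GB file from the question
        for (long[] r : plan(threeGiB, 8)) {
            System.out.println("start=" + r[0] + " length=" + r[1]);
        }
    }
}
```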

Then create the SequenceFile.Reader in your RecordReader as usual (there is no need to specify start or length):

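reader = new SequenceFile.Reader(fs, path, conf); // opens at the file start and reads the header
start = split.getStart();
if (start > reader.getPosition()) {
    reader.sync(start); // seek to the first sync marker after the split start
}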

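The key is "sync": reader.sync(start) seeks to the next sync marker after the byte offset "start", so the reader begins on a record boundary at the start of the split, even though the header was read from the beginning of the file.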


And for the nextKeyValue of the RecordReader:

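    // Records between the split end and the next sync marker still belong to this split,
    // because the reader of the following split starts at that sync marker.
    long pos = reader.getPosition();
    boolean hasNext = reader.next(key, value);
    // Stop only when the record just read began past the split end and followed a sync marker.
    return hasNext && !(pos >= start + span && reader.syncSeen());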
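
To see why aligning on sync markers gives every record to exactly one split, here is a toy model that uses no Hadoop classes. It is only a sketch under stated assumptions: the class SyncSplitDemo, the record offsets, and the marker positions are all made up for illustration, and every split boundary after the first is assumed to lie beyond the header. It follows the stop rule used by Hadoop's own SequenceFileRecordReader, which stops only once the position is past the split end and the record just read followed a sync marker (reader.syncSeen()).

```java
import java.util.ArrayList;
import java.util.List;

public class SyncSplitDemo {
    // Toy file: record i begins at OFFSETS[i]. AFTER_SYNC[i] is true when a sync marker sits
    // directly in front of the record (the offset then points at that marker).
    static final long[] OFFSETS = {100, 140, 180, 220, 260, 300, 340, 380};
    static final boolean[] AFTER_SYNC = {false, false, true, false, false, true, false, false};

    /** Returns the indices of the records a reader for split [start, start + span) emits. */
    public static List<Integer> readSplit(long start, long span) {
        long end = start + span;
        int next = 0; // a fresh reader sits on the first record, just past the header
        if (start > OFFSETS[0]) {
            // Models reader.sync(start): jump to the first sync marker at or after 'start'.
            while (next < OFFSETS.length && !(AFTER_SYNC[next] && OFFSETS[next] >= start)) {
                next++;
            }
        }
        List<Integer> emitted = new ArrayList<>();
        while (next < OFFSETS.length) {
            long pos = OFFSETS[next];            // models reader.getPosition() before next()
            boolean syncSeen = AFTER_SYNC[next]; // models reader.syncSeen() after next()
            if (pos >= end && syncSeen) {
                break; // this record belongs to the following split
            }
            emitted.add(next++);
        }
        return emitted;
    }

    public static void main(String[] args) {
        long fileLength = 420; // hypothetical file size in bytes
        long span = fileLength / 3;
        for (int i = 0; i < 3; i++) {
            System.out.println("split " + i + " -> records " + readSplit(i * span, span));
        }
    }
}
```

In this model the first split reads record 1 even though it begins exactly at the split end, because no sync marker precedes it; the second split then starts at the marker in front of record 2, so nothing is read twice or skipped.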