Custom record reader for custom binary format


In Hadoop v2 I need to create a custom RecordReader and/or InputFormat for some large binary files stored in HDFS. The files are essentially concatenated records with the following structure:

4-byte constant string "FOOO"
8-byte integer record length n1
n1-byte rest of the record

4-byte constant string "FOOO"
8-byte integer record length n2
n2-byte rest of the record

4-byte constant string "FOOO"
8-byte integer record length n3
n3-byte rest of the record

4-byte constant string "FOOO"
8-byte integer record length n4
n4-byte rest of the record
...
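Read sequentially, the framing is trivial; pulling a single record off a stream in plain Java would look roughly like the sketch below. The names are my own, and I'm assuming the 8-byte length is big-endian (as `DataInput.readLong` reads it) and that a record fits in a byte array:

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FoooReader {
    private static final byte[] MAGIC = "FOOO".getBytes(StandardCharsets.US_ASCII);

    // Returns the payload of the next record, or null at a clean end of stream.
    public static byte[] readRecord(DataInputStream in) throws IOException {
        int first = in.read();
        if (first == -1) {
            return null;                        // no more records
        }
        byte[] magic = new byte[4];
        magic[0] = (byte) first;
        in.readFully(magic, 1, 3);              // rest of the 4-byte marker
        if (!Arrays.equals(magic, MAGIC)) {
            throw new IOException("Expected FOOO marker");
        }
        long length = in.readLong();            // 8-byte record length
        byte[] payload = new byte[(int) length]; // assumes records < 2 GB
        in.readFully(payload);                  // n-byte rest of the record
        return payload;
    }
}
```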

To know all of the boundary points, I'd therefore need to scan through the entire file.

Are there any examples of custom readers/formats that address structures like this?

I'm hoping to avoid pre-computing all the split points in advance; I'd rather stream each record in as the mapper needs it, so I don't waste an extra pass through the data. But even if I do have to pre-compute the split points, I don't know how to write a custom splitter, so I'd appreciate a pointer to something like that too if possible.
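For the streaming side, what I have in mind is something like the sketch below: mark the format non-splittable so each whole file goes to a single mapper, and let the reader walk the records lazily. All the class names are mine, and again I'm assuming a big-endian length field:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FoooInputFormat extends FileInputFormat<LongWritable, BytesWritable> {

    private static final byte[] MAGIC = "FOOO".getBytes(StandardCharsets.US_ASCII);

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // whole file to one mapper, so no split logic needed
    }

    @Override
    public RecordReader<LongWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new FoooRecordReader();
    }

    public static class FoooRecordReader
            extends RecordReader<LongWritable, BytesWritable> {

        private FSDataInputStream in;
        private long pos;
        private long end;
        private final LongWritable key = new LongWritable();
        private final BytesWritable value = new BytesWritable();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException {
            FileSplit fileSplit = (FileSplit) split;
            Path path = fileSplit.getPath();
            FileSystem fs = path.getFileSystem(context.getConfiguration());
            in = fs.open(path);
            pos = fileSplit.getStart();      // 0 here, since the file is not split
            end = pos + fileSplit.getLength();
            in.seek(pos);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (pos >= end) {
                return false;
            }
            byte[] magic = new byte[4];
            in.readFully(magic);
            if (!Arrays.equals(magic, MAGIC)) {
                throw new IOException("Bad record header at offset " + pos);
            }
            long length = in.readLong();           // 8-byte record length
            byte[] payload = new byte[(int) length]; // assumes records < 2 GB
            in.readFully(payload);
            key.set(pos);                          // key = record's byte offset
            value.set(payload, 0, payload.length);
            pos += 4 + 8 + length;
            return true;
        }

        @Override
        public LongWritable getCurrentKey() { return key; }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() {
            return end == 0 ? 0.0f : Math.min(1.0f, pos / (float) end);
        }

        @Override
        public void close() throws IOException {
            if (in != null) {
                in.close();
            }
        }
    }
}
```

The obvious catch is that isSplitable = false gives up all parallelism within a file, which is exactly why I'm asking about splits.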

One point to note: the "payload" of each record is essentially arbitrary binary data and, as far as I know, may itself contain the 4-byte constant "FOOO". So if an input split falls somewhere in the middle of a record, I can't necessarily just advance to the next occurrence of "FOOO" to find the next record. Nor would that be an efficient way to find records anyway, since it means scanning all of the data rather than reading just the headers and seeking to the necessary locations.
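In case it helps frame an answer, my rough guess at the header-only pre-scan is a getSplits override like the one below, which seeks past each payload so only the 12 header bytes of each record are ever read, and cuts a split roughly every HDFS block. It's only a sketch; in particular I'm leaving the locality hosts null:

```java
// Goes inside FoooInputFormat from the sketch above; additionally needs
// java.util.ArrayList, java.util.List, and org.apache.hadoop.fs.FileStatus.
@Override
public List<InputSplit> getSplits(JobContext job) throws IOException {
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (FileStatus status : listStatus(job)) {
        Path path = status.getPath();
        FileSystem fs = path.getFileSystem(job.getConfiguration());
        long fileLen = status.getLen();
        long targetSize = status.getBlockSize(); // aim for ~one HDFS block per split
        FSDataInputStream in = fs.open(path);
        try {
            long splitStart = 0;
            long pos = 0;
            while (pos < fileLen) {
                in.seek(pos + 4);            // skip the 4-byte "FOOO" marker
                long recLen = in.readLong(); // 8-byte record length
                pos += 4 + 8 + recLen;       // jump straight past the payload
                if (pos - splitStart >= targetSize || pos >= fileLen) {
                    // null hosts forfeits locality; real code would map offsets
                    // to block locations via fs.getFileBlockLocations()
                    splits.add(new FileSplit(path, splitStart, pos - splitStart, null));
                    splitStart = pos;
                }
            }
        } finally {
            in.close();
        }
    }
    return splits;
}
```

Since every split produced this way starts exactly on a "FOOO" header, a per-split version of the reader above could just seek to split.getStart() and stream records until it reaches the split's end; and because this replaces the default getSplits logic entirely, the isSplitable workaround wouldn't be needed. Is this roughly the right shape, or is there a better pattern?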
