In Hadoop v2 I need to create a RecordReader and/or an InputFormat for some large binary files stored in HDFS. The files are basically concatenated records with the following structure:
4-byte constant string "FOOO"
8-byte integer record length n1
n1-byte rest of the record
4-byte constant string "FOOO"
8-byte integer record length n2
n2-byte rest of the record
4-byte constant string "FOOO"
8-byte integer record length n3
n3-byte rest of the record
4-byte constant string "FOOO"
8-byte integer record length n4
n4-byte rest of the record
...
To know all of the boundary points, I'd therefore need to scan through the entire file.
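Reading the records sequentially from the front of the file is easy enough; a full scan would look roughly like the sketch below (the class and method names are mine, and I'm assuming the 8-byte length is a big-endian signed long whose value fits in an int-sized array):

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

public class FoooReader {
    // Reads one record: 4-byte "FOOO" marker, 8-byte length, then the payload.
    // Returns null at a clean end of stream.
    public static byte[] readRecord(DataInputStream in) throws IOException {
        byte[] magic = new byte[4];
        try {
            in.readFully(magic);
        } catch (EOFException eof) {
            return null;                           // no more records
        }
        if (magic[0] != 'F' || magic[1] != 'O' || magic[2] != 'O' || magic[3] != 'O') {
            throw new IOException("Corrupt record header");
        }
        long length = in.readLong();               // assumes big-endian length
        byte[] payload = new byte[(int) length];   // assumes length < 2 GB
        in.readFully(payload);                     // rest of the record
        return payload;
    }
}
```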
Are there any examples of custom readers/formats that address structures like this?
I'm hoping to avoid pre-computing all the split points; I'd rather stream each record in as the mapper needs it, so I don't have to waste a pass over the data. But even if I do have to pre-compute the split points, I don't know how to write a custom splitter, so I'd appreciate a pointer to something like that too if possible.
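To make it concrete, here's roughly what I picture the "no pre-computed splits" version looking like: mark the files unsplittable so each mapper streams one whole file, record by record. All the class names are placeholders of mine and the same endianness/size assumptions apply; the obvious downside is that I lose parallelism within a single file:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FoooInputFormat extends FileInputFormat<LongWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // one split per file, so the reader can just stream records
    }

    @Override
    public RecordReader<LongWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new FoooRecordReader();
    }

    public static class FoooRecordReader extends RecordReader<LongWritable, BytesWritable> {
        private FSDataInputStream in;
        private long start, end;
        private final LongWritable key = new LongWritable();
        private final BytesWritable value = new BytesWritable();

        @Override
        public void initialize(InputSplit genericSplit, TaskAttemptContext context)
                throws IOException {
            FileSplit split = (FileSplit) genericSplit;
            Configuration conf = context.getConfiguration();
            Path path = split.getPath();
            FileSystem fs = path.getFileSystem(conf);
            in = fs.open(path);
            start = split.getStart();
            end = start + split.getLength();
            in.seek(start);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            long pos = in.getPos();
            if (pos >= end) {
                return false;                         // consumed the whole split
            }
            byte[] magic = new byte[4];
            in.readFully(magic);                      // "FOOO" marker (could be validated here)
            long length = in.readLong();              // 8-byte record length
            byte[] payload = new byte[(int) length];  // assumes a record fits in an int-sized array
            in.readFully(payload);                    // rest of the record
            key.set(pos);                             // key = byte offset of the record
            value.set(payload, 0, payload.length);
            return true;
        }

        @Override public LongWritable getCurrentKey() { return key; }
        @Override public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() throws IOException {
            return end == start ? 1.0f : (in.getPos() - start) / (float) (end - start);
        }

        @Override public void close() throws IOException { if (in != null) in.close(); }
    }
}
```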
One point to note: the "payload" of each record is essentially arbitrary binary data and, as far as I know, may itself contain the 4-byte "FOOO" constant. So if an input split falls somewhere in the middle of a record, I can't necessarily just advance to the next occurrence of "FOOO" to find the next record, and that wouldn't be an efficient way to find record boundaries anyway, since it means scanning all the data rather than reading just the headers and seeking to the necessary locations.
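If I do end up having to pre-compute split points, I'm picturing an indexing pass that reads only the 12-byte headers and seeks over each payload, roughly like the sketch below (again my own names and the big-endian assumption), but I still wouldn't know how to turn those offsets into proper InputSplits:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FSDataInputStream;

public class FoooHeaderScan {
    // Collects the byte offset of every record by reading only the headers
    // (4-byte marker + 8-byte length) and seeking past each payload.
    public static List<Long> recordOffsets(FSDataInputStream in, long fileLength)
            throws IOException {
        List<Long> offsets = new ArrayList<>();
        long pos = 0;
        while (pos < fileLength) {
            offsets.add(pos);
            in.seek(pos + 4);              // skip the "FOOO" marker
            long length = in.readLong();   // 8-byte record length
            pos += 4 + 8 + length;         // jump over the payload
        }
        return offsets;
    }
}
```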