How are keys, values, and records delimited in Hadoop streaming, typedbytes, and/or rawbytes?

I understand that text records in Hadoop streaming are delimited by the newline character and that there is a configurable delimiter between keys and values (tab by default).

1) The structure of the rawbytes format suggests that no record or key/value delimiters are necessary, but can someone confirm that this is the case?

2) In the typedbytes format, how are keys and values delimited, and how are records delimited?

3) Also, how are keys sorted in the typedbytes and rawbytes formats?

Answered by piccolbo
  1. Correct. In rawbytes each key and each value is preceded by its length, so no record or key/value delimiters are needed.
  2. Length information in the header makes delimiters unnecessary, and the spec does not use them, with one exception: the list type (typecode 9), which is terminated by a 255 byte. A record is simply a key followed by a value.
  3. No sort order is specified. In my experience the default comparator in MapReduce sorts keys as raw bytes: numerically byte by byte, and lexicographically across the byte array. The comparator is pluggable, so you can change this with your own Java class.
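
To illustrate what point 3 implies, here is a small Python sketch. It assumes the keys are typedbytes integers (typecode 3 followed by a 32-bit big-endian signed integer) and that the comparator treats each byte as unsigned. The helper name `encode_int` is made up for this example, and the code only mimics a byte-wise comparison; it is not Hadoop's comparator.

```python
import struct

def encode_int(n):
    """Serialize n as a typedbytes integer: typecode 3 + 32-bit big-endian signed int."""
    return struct.pack(">Bi", 3, n)

keys = [10, -1, 2, 0]

# Ordering by numeric value.
by_value = sorted(keys)

# Ordering by serialized form: Python compares bytes objects
# lexicographically, treating each byte as an unsigned number.
by_bytes = sorted(keys, key=encode_int)

print(by_value)  # [-1, 0, 2, 10]
print(by_bytes)  # [0, 2, 10, -1]
```

Under these assumptions negative integers sort after positive ones, because their two's-complement encoding starts with a 0xFF byte. If you need numeric order, that is a case where you would plug in your own comparator.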

See https://hadoop.apache.org/docs/current2/api/org/apache/hadoop/typedbytes/package-summary.html
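
As a rough illustration of why no delimiters are needed, here is a minimal Python sketch of a reader and writer for a few of the typecodes listed on that page (0 = bytes, 3 = int, 7 = string, 9 = list terminated by 255). The function names are my own, only a subset of types is handled, and I am assuming the stream is just key, value, key, value back to back; treat it as a sketch, not a reference implementation.

```python
import io
import struct

END_OF_LIST = object()  # sentinel returned when the 255 list terminator is read

def write_typed(out, obj):
    """Write obj as a typed bytes sequence (subset: bytes, int, str, list)."""
    if isinstance(obj, bytes):
        out.write(struct.pack(">Bi", 0, len(obj)) + obj)
    elif isinstance(obj, int):
        out.write(struct.pack(">Bi", 3, obj))
    elif isinstance(obj, str):
        data = obj.encode("utf-8")
        out.write(struct.pack(">Bi", 7, len(data)) + data)
    elif isinstance(obj, list):
        out.write(b"\x09")           # typecode 9: list
        for item in obj:
            write_typed(out, item)
        out.write(b"\xff")           # 255 marks the end of the list
    else:
        raise TypeError("type not covered by this sketch")

def read_typed(inp):
    """Read one typed bytes sequence; the typecode tells us how much to consume."""
    code = inp.read(1)
    if not code:
        raise EOFError
    t = code[0]
    if t == 0:
        (n,) = struct.unpack(">i", inp.read(4))
        return inp.read(n)
    if t == 3:
        return struct.unpack(">i", inp.read(4))[0]
    if t == 7:
        (n,) = struct.unpack(">i", inp.read(4))
        return inp.read(n).decode("utf-8")
    if t == 9:
        items = []
        while True:
            item = read_typed(inp)
            if item is END_OF_LIST:
                return items
            items.append(item)
    if t == 255:
        return END_OF_LIST
    raise ValueError("typecode %d not covered by this sketch" % t)

def read_pairs(inp):
    """Yield (key, value) records until the stream ends."""
    while True:
        try:
            key = read_typed(inp)
        except EOFError:
            return
        yield key, read_typed(inp)

buf = io.BytesIO()
for k, v in [("a", 1), ("b", [2, "x"])]:
    write_typed(buf, k)
    write_typed(buf, v)
buf.seek(0)
pairs = list(read_pairs(buf))
print(pairs)  # [('a', 1), ('b', [2, 'x'])]
```

The reader never looks for a separator: the typecode and, where present, the length prefix say exactly where each key and value ends. The list type is the only one that needs a terminator.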

Antonio