JSON object spans multiple lines, How to split input in Hadoop


I need to ingest large JSON files whose records may span multiple lines (not files); it depends entirely on how the data provider is writing them.

Elephant-Bird assumes LZO compression, which I know the data provider will not be using.

The DZone article http://java.dzone.com/articles/hadoop-practice assumes that each JSON record will be on a single line.

Any ideas, other than squishing the JSON onto single lines (the file will be huge), on how to properly split the file so that the JSON records do not break?

Edit: lines, not files


There are 2 answers

Chris White

Short of any other suggestions, and depending on how the JSON is being formatted, you may have an option.

The problem, as pointed out in the Dzone article, is that JSON has no end element that you can easily locate when you jump to a split point.

Now if your input JSON has 'pretty' or standard formatting, you can take advantage of this in a custom input format implementation.

For example, taking the sample JSON from the Dzone example:

{
  "results" :
    [
      {
        "created_at" : "Thu, 29 Dec 2011 21:46:01 +0000",
        "from_user" : "grep_alex",
        "text" : "RT @kevinweil: After a lot of hard work by ..."
      },
      {
        "created_at" : "Mon, 26 Dec 2011 21:18:37 +0000",
        "from_user" : "grep_alex",
        "text" : "@miguno pull request has been merged, thanks again!"
      }
    ]
}

With this format, you know (hope?) that each new record starts on a line with 6 leading spaces and an opening brace. A record ends in a similar way: 6 spaces and a closing brace.

So your logic in this case: consume lines until you find a line with 6 spaces and an opening brace. Then buffer content until you find a line with 6 spaces and a closing brace. Then use whatever JSON deserializer you want to turn that into a Java object (or just pass the multi-line Text to your mapper).
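
To make that concrete, here is a rough sketch of an input format built along those lines, using the new-API (org.apache.hadoop.mapreduce) LineRecordReader to do the physical line reading. The class names and the hard-coded 6-space markers are my own illustrations of the idea above, not an existing library class, and a record that straddles a split boundary would still need extra handling:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical input format: emits one pretty-printed JSON record per map input value.
public class PrettyJsonInputFormat extends FileInputFormat<LongWritable, Text> {

    // Markers taken from the sample formatting: 6 spaces + a brace start/end a record.
    private static final String RECORD_START = "      {";
    private static final String RECORD_END   = "      }";

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        return new PrettyJsonRecordReader();
    }

    public static class PrettyJsonRecordReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader lineReader = new LineRecordReader();
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            lineReader.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            // 1. Skip ahead to the next line that opens a record.
            while (true) {
                if (!lineReader.nextKeyValue()) {
                    return false;            // ran off the end of the split
                }
                if (lineReader.getCurrentValue().toString().startsWith(RECORD_START)) {
                    break;
                }
            }
            key.set(lineReader.getCurrentKey().get());

            // 2. Buffer lines until the matching close marker.
            StringBuilder record = new StringBuilder(lineReader.getCurrentValue().toString());
            while (lineReader.nextKeyValue()) {
                String line = lineReader.getCurrentValue().toString();
                record.append('\n').append(line);
                if (line.startsWith(RECORD_END)) {
                    value.set(record.toString());
                    return true;
                }
            }
            return false;                     // record was truncated at the split end; dropped here
        }

        @Override
        public LongWritable getCurrentKey() { return key; }

        @Override
        public Text getCurrentValue() { return value; }

        @Override
        public float getProgress() throws IOException { return lineReader.getProgress(); }

        @Override
        public void close() throws IOException { lineReader.close(); }
    }
}

In your driver you would then call job.setInputFormatClass(PrettyJsonInputFormat.class), and each map invocation would receive one complete JSON record as its value, ready to hand to whatever JSON deserializer you prefer.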

Sudarshan Thitte

The best way for you to split and parse multi-line JSON data would be to extend the NLineInputFormat class and define your own notion of what constitutes an InputSplit (for example, 1000 JSON records could constitute one split).

Then, you would need to extend the LineRecordReader class and define your own notion of what constitutes one line (in this case, one record).

This way, you would get well-defined splits, each containing 'N' JSON records, which can then be read using the same LineRecordReader, and each of your map tasks would receive one record to process at a time.
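
If it helps, here is a minimal skeleton of that structure. The names are made up for illustration, it assumes the input is a stream of top-level JSON objects rather than one big wrapping object, and the naive brace counting ignores braces inside quoted strings, which a real reader would need to handle:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

// Hypothetical skeleton: NLineInputFormat computes the splits (N physical lines per split),
// while a LineRecordReader subclass redefines what one "line" (i.e. one record) means.
public class MultiLineJsonInputFormat extends NLineInputFormat {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        return new JsonRecordReader();
    }

    public static class JsonRecordReader extends LineRecordReader {
        private final Text record = new Text();

        @Override
        public boolean nextKeyValue() throws IOException {
            // Accumulate physical lines until one whole JSON object has been read.
            StringBuilder buffer = new StringBuilder();
            int depth = 0;
            boolean started = false;
            while (super.nextKeyValue()) {
                String line = super.getCurrentValue().toString();
                // Naive brace counting; ignores braces that appear inside string values.
                for (int i = 0; i < line.length(); i++) {
                    char c = line.charAt(i);
                    if (c == '{') { depth++; started = true; }
                    if (c == '}') { depth--; }
                }
                if (started) {
                    buffer.append(line).append('\n');
                }
                if (started && depth == 0) {   // object closed: one record is complete
                    record.set(buffer.toString());
                    return true;
                }
            }
            return false;
        }

        @Override
        public Text getCurrentValue() {
            return record;
        }
    }
}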

Charles Menguy's reply to How does Hadoop process records split across block boundaries? explains the nuances of this approach very well.

For a sample extension of NLineInputFormat, check out http://hadooped.blogspot.com/2013/09/nlineinputformat-in-java-mapreduce-use.html

A similar multi-line CSV input format for Hadoop can be found here: https://github.com/mvallebr/CSVInputFormat

Update: I found a relevant multi-line JSON input format for Hadoop here: https://github.com/Pivotal-Field-Engineering/pmr-common/blob/master/PivotalMRCommon/src/main/java/com/gopivotal/mapreduce/lib/input/JsonInputFormat.java