I need to ingest large JSON files whose records may span multiple lines (not files); this depends entirely on how the data provider is writing them.
Elephant-Bird assumes LZO compression, which I know the data provider will not be using.
The Dzone article http://java.dzone.com/articles/hadoop-practice assumes that each JSON record will be on a single line.
Any ideas, other than squishing the JSON (the file will be huge), on how to properly split the file so that the JSON does not break?
Edit: lines, not files
Short of any other suggestions, and depending on how the JSON is being formatted, you may have an option.
The problem, as pointed out in the Dzone article, is that JSON has no end element that you can easily locate when you jump to a split point.
Now, if your input JSON has 'pretty' or standard formatting, you can take advantage of this in a custom input format implementation.
For example, the sample JSON from the Dzone article is pretty-printed roughly along these lines (the snippet below is an illustrative placeholder with made-up field names, not the article's actual data):
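{
  "records" :
    [
      {
        "field_1" : "some value",
        "field_2" : "another value"
      },
      {
        "field_1" : "some value",
        "field_2" : "another value"
      }
    ]
}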
With this format, you know (hope?) that each new record starts on a line with 6 whitespace characters and an open bracket. A record ends in a similar pattern: 6 spaces and a closing bracket.
So your logic in this case is: consume lines until you find a line with 6 spaces and an open bracket, then buffer content until you find the line with 6 spaces and a closing bracket, then use whatever JSON deserializer you want to turn that into a Java object (or just pass the multi-line Text straight to your mapper). A rough sketch of that record reader is below.
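To make that concrete, here is a sketch of what the record-reader side could look like with the new mapreduce API. The class name, the hard-coded six-space markers, and the simplified split handling are all my own assumptions; in particular it does not deal with a record that straddles a split boundary, which a production version would have to.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

// Sketch only: emits each "6 spaces + { ... 6 spaces + }" block as one Text record.
public class MultiLineJsonRecordReader extends RecordReader<LongWritable, Text> {

    private static final String RECORD_OPEN  = "      {";   // 6 spaces + open bracket
    private static final String RECORD_CLOSE = "      }";   // 6 spaces + closing bracket

    private LineReader in;
    private long start, end, pos;
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();
    private final Text line = new Text();

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context)
            throws IOException {
        FileSplit split = (FileSplit) genericSplit;
        Configuration conf = context.getConfiguration();
        Path file = split.getPath();
        FSDataInputStream stream = file.getFileSystem(conf).open(file);

        start = split.getStart();
        end = start + split.getLength();
        stream.seek(start);
        pos = start;
        in = new LineReader(stream, conf);
        // A real implementation would also skip a partial record at 'start'
        // and read past 'end' to finish a record that straddles the boundary.
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        // 1. Consume lines until we hit the line that opens a record.
        boolean foundOpen = false;
        while (pos < end) {
            int bytesRead = in.readLine(line);
            if (bytesRead == 0) {
                return false;                     // end of file
            }
            pos += bytesRead;
            if (line.toString().startsWith(RECORD_OPEN)) {
                foundOpen = true;
                break;
            }
        }
        if (!foundOpen) {
            return false;
        }

        // 2. Buffer lines until the matching close line, then hand the whole
        //    multi-line record to the mapper as a single Text value.
        StringBuilder record = new StringBuilder(line.toString());
        while (true) {
            int bytesRead = in.readLine(line);
            if (bytesRead == 0) {
                return false;                     // truncated record at EOF
            }
            pos += bytesRead;
            record.append('\n').append(line.toString());
            if (line.toString().startsWith(RECORD_CLOSE)) {
                break;
            }
        }
        key.set(pos);
        value.set(record.toString());
        return true;
    }

    @Override
    public LongWritable getCurrentKey() { return key; }

    @Override
    public Text getCurrentValue() { return value; }

    @Override
    public float getProgress() {
        return end == start ? 1.0f : Math.min(1.0f, (pos - start) / (float) (end - start));
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}

You would wire this up through a FileInputFormat subclass whose createRecordReader() returns this reader. The fiddly part in practice is the split-boundary handling the comments mention; the source of Hadoop's LineRecordReader is a reasonable model for how to handle it.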