Jackson JsonParser: restart parsing in broken JSON


I am using Jackson to process JSON that comes in chunks in Hadoop. That is, the input consists of big files that are cut up into blocks (128 MB in my case, but the exact size doesn't really matter). For efficiency reasons, I need the parsing to be streaming (it is not possible to build the whole tree in memory).

I am using a mixture of JsonParser and ObjectMapper to read from my input. At the moment, I am using a custom InputFormat that is not splittable, so I can read my whole JSON.

The structure of the (valid) JSON is something like:

[    {    "Rep":
        {
        "date":"2013-07-26 00:00:00",
        "TBook":
        [
            {
            "TBookC":"ABCD",            
            "Records":
            [
                {"TSSName":"AAA", 
                    ... 
                },
                {"TSSName":"AAB", 
                    ... 
                },
                {"TSSName":"ZZZ", 
                ... 
                }
            ] } ] } } ]

The records I want to read in my RecordReader are the elements inside the "Records" array. The "..." means that there is more info there, which makes up my record. If there is only one split, there is no problem at all. I use a JsonParser for fine-grained control (reading the headers and moving to the "Records" token), and then I use an ObjectMapper together with the JsonParser to read the records as objects. In detail:

MappingJsonFactory factory = new MappingJsonFactory();
// keep the underlying input stream open between reads
factory.configure(JsonParser.Feature.AUTO_CLOSE_SOURCE, false);
mapper = new ObjectMapper(factory);
mapper.configure(DeserializationConfig.Feature.FAIL_ON_UNKNOWN_PROPERTIES, false);
mapper.configure(SerializationConfig.Feature.FAIL_ON_EMPTY_BEANS, false);
parser = factory.createJsonParser(iStream); // iStream: the split's input stream
mapper.readValue(parser, JsonNode.class);
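
To make this concrete, the read loop is roughly the following (Record is a hypothetical stand-in for the POJO I map each "Records" element to; parser and mapper are the ones set up above):

// Skip over the headers until the "Records" array starts.
while (parser.nextToken() != null) {
    if (parser.getCurrentToken() == JsonToken.FIELD_NAME
            && "Records".equals(parser.getCurrentName())) {
        parser.nextToken(); // now positioned on START_ARRAY
        break;
    }
}
// Read each element of the array as one record.
while (parser.nextToken() == JsonToken.START_OBJECT) {
    Record record = mapper.readValue(parser, Record.class);
    // ... emit record as a key/value pair from the RecordReader
}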

Now, let's imagine I have a file with two input splits (i.e. there are a lot of elements in "Records"). The valid JSON starts on the first split, and I read and keep the headers (which I need for each record, in this case the "date" field).

The split boundary can fall anywhere inside the "Records" array. So let's assume I get a second split like this:

                ... 
                },
                {"TSSName":"ZZZ", 
                ... 
                },
                {"TSSName":"ZZZ2", 
                ... 
                }
            ] } ] } } ]

Before I start parsing, I can move the InputStream (FSDataInputStream) to the beginning ("{") of the record containing the next "TSSName" (this part already works fine; a simplified sketch of the scan follows the snippet below). It's fine to discard the leading "garbage" at the start of the split. So we end up with this:

                {"TSSName":"ZZZ", 
                ... 
                },
                {"TSSName":"ZZZ2", 
                ... 
                },
                ...
            ] } ] } } ]
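
For reference, the seek is roughly like the following simplified sketch (the helper name and the naive byte matching are just illustrative; as said, this part already works):

// Scan forward from the split start for the bytes of '{"TSSName"'
// and leave the stream positioned on the '{'. The naive matching is
// enough here because '{' does not recur inside the marker.
long seekToNextRecord(FSDataInputStream in, long start, long end)
        throws IOException {
    byte[] marker = "{\"TSSName\"".getBytes("UTF-8");
    in.seek(start);
    long pos = start;
    int matched = 0;
    int b;
    while (pos < end + marker.length && (b = in.read()) != -1) {
        pos++;
        if (b == marker[matched]) {
            if (++matched == marker.length) {
                long recordStart = pos - marker.length;
                in.seek(recordStart); // rewind to the '{'
                return recordStart;
            }
        } else {
            matched = (b == marker[0]) ? 1 : 0;
        }
    }
    return -1; // no record begins in this split
}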

Then I hand it to the JsonParser/ObjectMapper pair seen above. The first object, "ZZZ", is read OK. But for the next one, "ZZZ2", it breaks: the JsonParser complains about malformed JSON, because it encounters a "," that, as far as it knows, is not inside an array. So it fails, and I cannot keep reading my records.

How could this problem be solved, so that I can keep reading my records from the second (and nth) split? Could I make the parser ignore these errors on the commas, or let the parser know in advance that it is reading the contents of an array?

1 Answer

Answered by xmar:

It seems it's OK to just catch the exception: the parser recovers and is able to keep reading objects via the ObjectMapper.
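
A minimal sketch of what that looks like, reusing the parser and mapper from the question (Record again stands in for the record POJO; the loop relies on the observation that the parser has already consumed the offending character by the time it throws):

// Catch-and-continue: the stray ',' between records (and the trailing
// '] } ] } } ]') trigger JsonParseExceptions, but the bad character is
// already consumed, so asking for the next token simply moves on.
while (true) {
    JsonToken token;
    try {
        token = parser.nextToken();
    } catch (JsonParseException e) {
        continue; // skip the malformed separator and retry
    }
    if (token == null) {
        break; // end of input
    }
    if (token == JsonToken.START_OBJECT) {
        Record record = mapper.readValue(parser, Record.class);
        // ... emit record
    }
}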

I don't really like this approach - I would prefer an option where the parser would not throw exceptions on nonstandard or even broken JSON. So I don't know if this fully answers the question, but I hope it helps.
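
As for the other idea in the question (letting the parser know in advance that it is reading the contents of an array), one untested possibility is to prepend a synthetic "[" to the stream once it has been positioned on the first "{". The records then parse as elements of a well-formed array, and the trailing garbage is never reached:

// Untested sketch: wrap the already-positioned stream so the parser
// sees '[' first and treats the records as array elements.
InputStream wrapped = new SequenceInputStream(
        new ByteArrayInputStream(new byte[] { '[' }), iStream);
JsonParser p = factory.createJsonParser(wrapped);
p.nextToken(); // START_ARRAY of the synthetic array
while (p.nextToken() == JsonToken.START_OBJECT) {
    Record record = mapper.readValue(p, Record.class);
    // ... emit record
}
// The real ']' that closes "Records" also closes the synthetic array,
// so the loop exits before the trailing '} ] } } ]' is ever parsed.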