Using ijson.parse() and ijson.items() to load a big JSON file - why does this work?

I am trying to load JSON files that are too big for json.load. I have spent a while looking into ijson and many Stack Overflow posts, and used the following code, mostly stolen from https://stackoverflow.com/a/58148422/11357695 :

import ijson

def extract_json(filename):
    listJ=[]
    with open(filename, 'rb') as input_file:
        jsonobj = ijson.items(input_file, 'records.item', use_float=True)
        jsons = (o for o in jsonobj)
        for j in jsons:
            listJ.append(j)
    return listJ

My JSON file is read in as a dict, with 6 keys, one of which is 'records'. The above function only replicates the contents of this 'records' key's value. I looked into this a bit more and came to the conclusion that ijson.items uses a prefix ('records.item'). So it's not surprising it's only replicating this key's value. But I'd like to get everything.

To achieve this, I looked at using ijson.parse to give a list of prefixes. When I fed all of the prefixes produced by the parser generator below into ijson.items() in a loop, I got a MemoryError pretty quickly from the ijson.items() statement. I also got IncompleteJSONError in earlier iterations of the code, which does not appear with the current version. However, if I remove the except ijson.IncompleteJSONError statement I get a MemoryError:

def loadBigJsonBAD(filename):
    with open(filename, 'rb') as input_file:
        parser = ijson.parse(input_file)
        prefixes=[]
        for prefix , event, value in parser:
            prefixes.append(prefix)
    listJnew=[]
    with open(filename, 'rb') as input_file:
        for prefix in prefixes:
            jsonobjn = ijson.items(input_file, prefix, use_float=True)
            try:
                jsonsn = (o for o in jsonobjn)
                for jn in jsonsn:
                    listJnew.append(jn)
            except ijson.IncompleteJSONError:
                continue
    return listJnew

I then tried keeping only the prefixes that don't start with 'records', to see if this would at least give me the rest of the dictionary. However, it actually worked perfectly and made a list whose first object is the same as the object generated by json.load (which worked in this case as I was using a small file to test the code):

def loadBigJson(filename):
    with open(filename, 'rb') as input_file:
        parser = ijson.parse(input_file)
        prefixes=[]
        for prefix , event, value in parser:
            if prefix[0:len('records')] != 'records':
                prefixes.append(prefix)
    listJnew=[]
    with open(filename, 'rb') as input_file:
        for prefix in prefixes:
            jsonobjn = ijson.items(input_file, prefix, use_float=True)
            try:
                jsonsn = (o for o in jsonobjn)
                for jn in jsonsn:
                    listJnew.append(jn)
            except ijson.IncompleteJSONError:
                continue
    return listJnew

When this is tested:

path_json=r'C:\Users\u03132tk\.spyder-py3\antismashDB\GCF_010669165.1\GCF_010669165.1.json'

extractedJson=extract_json(path_json) #extracts the 'records' key value

loadedJson=json.load(open(path_json, 'r'))  #extracts entire json file
loadedJsonExtracted=loadedJson['records']   #the thing i am using to compare to the extractedJson item

bigJson=loadBigJson(path_json)  #a list whose single object is the same as loaded json. 

print (bigJson[0]==loadedJson)#True
print (bigJson[0]['records']==loadedJsonExtracted)#True
print (bigJson[0]['records']==extractedJson)#True

This is great, but it highlights that I don't really understand what's going on - why is the records prefix necessary for the extract_json function (I tried the other keys in the json dictionary, there were no hits) but counterproductive for loadBigJson? What is generating the errors, and why does an except IncompleteJSONError statement prevent a MemoryError?

As you can tell I'm pretty unfamiliar with working with JSONs, so any general tips/clarifications would also be great.

Thanks for reading the novel, even if you don't have an answer!
Tim

There are 2 answers

Rodrigo Tobar On BEST ANSWER

There are several questions posed, so I'll try to break them down a bit.

why is the records prefix necessary for the extract_json function ...?

ijson needs to know when objects should start being built. Remember that you give ijson a stream of data, so at no point does it know the full structure of your document. This means that without this hint ijson cannot possibly guess your intentions, or come up with a sensible choice on its own.

Say you have

{
  "a": [1, 2, 3],
  "b": ["A", "B", "C"],
  "c": [{"i": 10, "j": 20, "k": 30},
        {"i": 11, "j": 21, "k": 31},
        {"i": 12, "j": 22, "k": 32}]
}

If you gave this to ijson.items, what objects should it yield? Should it be:

  • 1, 2 and 3, or
  • A, B and C, or
  • {"i": 10, "j": 20, "k": 30}, {"i": 11, "j": 21, "k": 31} and {"i": 12, "j": 22, "k": 32}, or
  • 10, 20, 30, 11, 21, 31, 12, 22, and 32, or
  • [1, 2, 3], or ["A", "B", "C"], or [{"i": 10, "j": 20, "k": 30}, ...], or
  • The full object, or....

Which objects are built by items depends on the prefix you give it. If you are giving ijson a records.item prefix then it means you have a JSON document that looks like:

{
  ...
  "records": [.....],
  ...
}

and that you want to return the values of that list as individual objects.

If I'm correctly reading between the lines of your question, I think the underlying issue is that ijson.items operates on a single prefix, but you want to extract objects from different prefixes. This functionality is not yet in ijson, but it could be added (and shouldn't be too difficult, I think). A similar idea is support for "wildcards" (e.g., a prefix that looks like *.items), which could also be added, I think.

Having said that, please have a look at the kvitems function. It returns key, value pairs for a given prefix, which sounds more or less like what you need.

(I tried the other keys in the json dictionary, there were no hits)

If you could share an extract or simplified example of your JSON file this could be commented on.

... but counterproductive for loadBigJson?

Because both loadBigJsonBAD and loadBigJson are flawed.

To begin with, they both invoke ijson.items many times on a file object that is never reset. The first invocation will work, but it will exhaust the file object (i.e., read will return nothing). Further invocations use this exhausted file object and fail because there is nothing to read, and therefore they raise IncompleteJSONError.
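The exhaustion effect is not specific to ijson and can be seen with a plain read. A minimal sketch, using an in-memory BytesIO stream as a stand-in for the opened file:

```python
import io

# Stand-in for a file opened in binary mode.
stream = io.BytesIO(b'{"records": [1, 2, 3]}')

first_read = stream.read()     # consumes the whole stream
second_read = stream.read()    # nothing is left, so this is empty

stream.seek(0)                 # rewind to the beginning
after_seek = stream.read()     # the full content is available again
```

A second parser created on the same handle is in the position of the second read here, unless the handle is rewound with seek(0) or the file is reopened first.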

Secondly, ijson.parse generates a (prefix, event, value) tuple for every event in the JSON document: when an object starts, when an object ends, when an array starts and ends, and when atomic values (strings, numbers, bools) are found. Accumulating the prefixes from all of these in a list gives you way more entries than you need, and many of them will be repeated. You should at least put them into a set; otherwise you are repeating yourself.

What is generating the Error statements and why does an except IncompleteJSONError statement prevent a MemoryError?

Where do you get a MemoryError, and after how much time? The only possibility I can think of is that you are using 3.0.0 <= ijson < 3.1.2 with the yajl2_c backend, and that you are leaking memory by creating too many ijson.items objects (see this bug report). But that would probably not happen if you used a set to store the prefixes in the first place.

Thanks for reading the novel

You're welcome!

Additionally, note that instead of looping through the results of, say, items and appending values into a list (that is, if you really, really want to collect all objects in memory at once) you should be able to construct a list directly from the iterator. So instead of:

def extract_json(filename):
    listJ=[]
    with open(filename, 'rb') as input_file:
        jsonobj = ijson.items(input_file, 'records.item', use_float=True)
        jsons = (o for o in jsonobj)
        for j in jsons:
            listJ.append(j)
    return listJ

you should be able to do:

def extract_json(filename):
    with open(filename, 'rb') as input_file:
        return list(ijson.items(input_file, 'records.item', use_float=True))

Edit 1:

For the structure of the given JSON example you might want to go with kvitems using an empty prefix. So:

for key, value in ijson.kvitems(input_file, ''):
    # (key, value) will be:
    #  ("key1", "string")
    #  ("records", [list with lots of nested dictionary/list objects])
    #  ("key3", int)
    #  ("key4", string)
    #  ("key5", {dict with nested dictionary objects})
    #  ("key6", str)
    ...  # process each (key, value) pair here

This sounds exactly like what you are trying to achieve, and it is already done for you. Each iteration will give you a different (key, value) pair, and this will be done iteratively, using only as much memory as is needed to hold that particular pair. If you still want to put everything into one single list you can, but beware that with big files you might run out of memory (and avoiding that is the purpose of using ijson vs. json, right?)

So yes, your new code is generic, but: 1) it could be made simpler if you used kvitems I think, and 2) it kind of defeats the purpose of iterating through big files because you are still accumulating all the contents in memory.

Regarding the "invalid character" errors, yes, this is probably an encoding problem. I'm only venturing a guess here, but if you copy/pasted the JSON content from your JSON generator link into an editor and saved that file, chances are that it's encoded in your local default encoding rather than UTF-8, and that this is what's producing the error. I tried using the "Download JSON file" option and that worked well.
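One way to test that guess is to look at the first bytes of the saved file. The sketch below assumes the problem is a byte order mark or a UTF-16 save, which is only a guess; to_utf8 is a hypothetical helper, not part of ijson or json, and it does not handle UTF-32.

```python
import codecs

def to_utf8(raw):
    """Hypothetical helper: return the content as UTF-8 bytes without a BOM."""
    if raw.startswith(codecs.BOM_UTF8):
        # UTF-8 with a leading byte order mark: just drop the mark.
        return raw[len(codecs.BOM_UTF8):]
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # UTF-16 with a byte order mark: decode, then re-encode as UTF-8.
        return raw.decode('utf-16').encode('utf-8')
    # No recognised mark: assume the bytes are already usable as they are.
    return raw

# Simulate the two suspected ways an editor might have saved the file.
saved_with_bom = codecs.BOM_UTF8 + b'[{"_id": 1}]'
saved_as_utf16 = '[{"_id": 1}]'.encode('utf-16')

cleaned_bom = to_utf8(saved_with_bom)
cleaned_utf16 = to_utf8(saved_as_utf16)
```

If the cleaned bytes parse while the original file does not, the editor's encoding was the culprit.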

Tim Kirkwood On

(NB - answering as this is well over char limit </3 Should I edit the original answer instead?)

Ah, thanks for this, especially for the nuts and bolts of what ijson is doing – I’ve been banging my head on that for a while! Your comments on extract_json are spot on, much nicer code to read. Wrt the JSON structure, I can’t see an option to attach a file, but it’s formatted as:

{
"key1": "string",
"records": [list with lots of nested dictionary/list objects],
"key3": int,
"key4": string,
"key5": {dict with nested dictionary objects},
"key6": str
}

‘records’ has the bulk of the information I want, but I was treating this as a learning exercise so would like to get everything. You’re correct about the issues with the other 2 functions as well, which gave me the chance to revise how files work in Python! And yes, there were many (many) prefixes; even set() makes a very long list for ijson.items to work through. I had a look through the prefixes, and decided to cut them down to those that contain no ‘.’, i.e. the top-level keys, if I understand correctly. When incorporated into the code below it works fine with no errors.

def NewJsonLoad(filename):
    with open(filename, 'rb') as input_file:
        #get all prefixes in json file
        prefixes=[]
        parser = ijson.parse(input_file)
        for prefix , event, value in parser:
            prefixes.append(prefix)
        prefixes = list(set(prefixes))
        prefixes_filtered=[]
        
        #pull out prefixes that are the initial keys only
        for prefix in prefixes:
            if prefix.count('.')==0:
                prefixes_filtered.append(prefix)
        
        #pull out items for the filtered prefixes
        finalout=[]
        for prefix in prefixes_filtered:
            input_file.seek(0)#reset pointer - see https://stackoverflow.com/a/22590262/11357695
            jsonobjn = ijson.items(input_file, prefix, use_float=True)
            jsonsn = (o for o in jsonobjn)
            for jn in jsonsn:
                finalout.append(jn)
    return finalout[0]#feeding jn into a list object, not the original dict object - this is item [0]

Do you think this works and would be generalizable, or am I missing some more glaring errors?

I’ve tried to find some dummy JSON to play with (https://www.json-generator.com/), to test if it works on files that aren’t formatted by the program generating the JSON files I am working with. However, both my function and json.load don’t like it for some reason – it might be something to do with decoding, I’ve seen the term tossed around a bit :P

IncompleteJSONError: lexical error: invalid char in json text.
                                       [    {      "_id": "5f80c3b4
                     (right here) ------^ 

Cheers for the help/tutorial!