Python: ijson.parse(in_file) vs json.load(in_file)


I am trying to read a large JSON file (~2 GB) in Python.

The following code works well on small files but fails with a MemoryError on the second line for large ones:

import json
import sys

in_file = open(sys.argv[1], 'r')
posts = json.load(in_file)

I looked at similar posts and almost everyone suggested using ijson, so I decided to give it a try:

import ijson
import sys

in_file = open(sys.argv[1], 'r')
posts = list(ijson.parse(in_file))

This handled the large file, but ijson.parse doesn't return a JSON object the way json.load does, so the rest of my code failed:

TypeError: tuple indices must be integers or slices, not str

If I print out posts when using json.load, the output looks like normal JSON:

[{"Id": "23400089", "PostTypeId": "2", "ParentId": "23113726", "CreationDate": ... etc

If I print out posts after using ijson.parse, the output is a list of tuples instead:

[["", "start_array", null], ["item", "start_map", null], 
 ["item", "map_key", "Id"], ["item.Id", "string ... etc

My question: I don't want to change the rest of my code, so is there any way to convert the output of ijson.parse(in_file) back into a JSON object so that it's exactly the same as if I had used json.load(in_file)?

1 Answer

Answered by flashback:

Maybe this works for you:

import ijson
import sys

in_file = open(sys.argv[1], 'r')
posts = []
# 'item' selects each element of the top-level JSON array; ijson.items
# yields them one at a time as ordinary Python objects (dicts, lists, ...)
for post in ijson.items(in_file, 'item'):
    posts.append(post)