Using the ijson package to read a big JSON file (http.client.IncompleteRead error)


I'm trying to read a big JSON file (>1.5 GB) using the ijson package and process the results.

import ijson
import requests
from urllib.request import urlopen

response = requests.get("https://api.scryfall.com/bulk-data/all-cards")
with urlopen(response.json()["download_uri"]) as all_cards:
    for card_object in ijson.items(all_cards, "item"):
        do_something_with(card_object)

However, each time I run this I get the following error:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 555, in _get_chunk_left
    chunk_left = self._read_next_chunk_size()
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 522, in _read_next_chunk_size
    return int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 587, in _readinto_chunked
    chunk_left = self._get_chunk_left()
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 557, in _get_chunk_left
    raise IncompleteRead(b'')
http.client.IncompleteRead: IncompleteRead(0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/benjamin/PycharmProjects/octavin/venv/bin/flask", line 8, in <module>
    sys.exit(main())
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/flask/cli.py", line 985, in main
    cli.main()
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/flask/cli.py", line 579, in main
    return super().main(*args, **kwargs)
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/flask/cli.py", line 427, in decorator
    return __ctx.invoke(f, *args, **kwargs)
  File "/Users/benjamin/PycharmProjects/octavin/venv/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/benjamin/PycharmProjects/octavin/app/cli.py", line 65, in update
    for card_object in ijson.items(all_cards, "item"):
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 492, in readinto
    return self._readinto_chunked(b)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 603, in _readinto_chunked
    raise IncompleteRead(bytes(b[0:total_bytes]))
http.client.IncompleteRead: IncompleteRead(64016 bytes read)

Is this caused by a timeout, or is the file too big? Or is it something else?

Note that this works (all-cards-20220408091307.json being the locally downloaded file):

with open("all-cards-20220408091307.json") as all_cards:
    for card_object in ijson.items(all_cards, "item"):
        do_something_with(card_object)

There is 1 answer

Answered by Rodrigo Tobar

This seems to be a problem with http.client's HTTPResponse when reading data from a response with chunked encoding: https://bugs.python.org/issue39371.

Since you're already using requests, I'd suggest you use it to perform your second request as well and avoid this issue altogether. The response object in requests has an iter_content method that can be used to incrementally read binary data from the incoming stream. ijson, on the other hand, expects a file-like object. To bridge the gap, you can use a solution similar to the one suggested here: https://github.com/ICRAR/ijson/issues/58#issuecomment-917655522. Alternatively, you can use ijson's push mechanism, where you do the reading yourself and hand the data chunks over to ijson (this is a bit more complex; see ijson's documentation for more details).
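
As a rough sketch of the file-like bridge described above (not necessarily the exact code from the linked comment), something along these lines might work. The names IterStream and iter_cards are made up for illustration, and the chunk size is an arbitrary choice; only requests.get(..., stream=True), iter_content and ijson.items are real library calls.

```python
import io


class IterStream(io.RawIOBase):
    """Hypothetical adapter: exposes an iterator of bytes chunks as a
    readable file-like object, which is what ijson.items expects."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, buffer):
        # Pull chunks until we have data; empty chunks are skipped.
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # signals end of stream
        n = min(len(buffer), len(self._leftover))
        buffer[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


def iter_cards(download_uri, chunk_size=64 * 1024):
    """Yield top-level array items from the JSON document at download_uri.

    Assumption: the document is one big JSON array, as in the question.
    """
    # Third-party packages, imported here so that the adapter above
    # stays usable with only the standard library.
    import ijson
    import requests

    with requests.get(download_uri, stream=True) as response:
        response.raise_for_status()
        stream = IterStream(response.iter_content(chunk_size=chunk_size))
        yield from ijson.items(stream, "item")
```

With this in place, the loop from the question would become `for card_object in iter_cards(download_uri): do_something_with(card_object)`, where download_uri is the value obtained from the first request. Passing stream=True matters here: without it, requests downloads the whole body into memory before iter_content yields anything.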