Reading large lz4 compressed JSON data set in Python 2.7

Question

2.3k views Asked by SecurityGuy At 30 August 2017 at 17:31

I need to analyze a large data set that is distributed as an LZ4-compressed JSON file.

The compressed file is almost 1 TB. I'd prefer not to decompress it to disk because of the storage cost. Each "record" in the data set is very small, but it is obviously not feasible to read the entire data set into memory.

Any advice on how to iterate through records in this large lz4 compressed JSON file in Python 2.7?

There is 1 answer

**JonathanU** · Answer 1 · 2018-01-21T15:14:01+00:00

As of version 0.19.1, the Python lz4 bindings provide full support for buffered I/O. So, you should be able to do something like:

import lz4.frame

chunk_size = 128 * 1024 * 1024  # bytes of decompressed data per read
with lz4.frame.open('mybigfile.lz4', 'r') as f:
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break  # reached the end of the file
        # Do stuff with this chunk of data.

which will read the decompressed data from the file up to 128 MiB at a time, so the whole data set never has to be held in memory.
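Fixed-size chunks can split a record in the middle, so for per-record iteration it may be easier to read line by line. The following is only a sketch, and it assumes something the question does not state: that the file is newline-delimited JSON, with one complete record per line. The function names `iter_json_records` and `iter_lz4_json_records` are made up for illustration; `lz4.frame.open` is the same call used above.

```python
import json


def iter_json_records(stream):
    """Yield one decoded JSON value per non-empty line.

    `stream` is any iterable of byte strings, such as a file object
    opened in binary mode. Assumption: one JSON record per line.
    """
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line.decode('utf-8'))


def iter_lz4_json_records(path):
    """Stream records from an LZ4 frame file without decompressing to disk."""
    # Imported here so the line parser above can be used without lz4.
    import lz4.frame

    with lz4.frame.open(path, 'rb') as f:
        # The file object is buffered, so iterating over it yields
        # decompressed lines one at a time.
        for record in iter_json_records(f):
            yield record
```

A loop such as `for record in iter_lz4_json_records('mybigfile.lz4'): ...` would then hold only one record in memory at a time. If the file is instead a single large JSON array or object, line-based reading will not work and an incremental JSON parser (for example the third-party ijson package) would be needed.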

Aside: I am the maintainer of the python lz4 package - please do file issues on the project page if you have problems with the package, or if something is not clear in the documentation.
