Loading a very large jsonl in pandas returns ValueError


I'm trying to load a very large jsonl file (>50 GB) in chunks with pandas:

import pandas as pd

reader = pd.read_json("January.jsonl", lines=True, chunksize=10000)

for chunk in reader:
    df = chunk

This code starts, runs for a while, and then raises this error:

 self._parse_no_numpy()

  File "C:\Users\anaconda3\lib\site-packages\pandas\io\json\_json.py", line 1089, in _parse_no_numpy
    loads(json, precise_float=self.precise_float), dtype=None

ValueError: Expected object or value

Is there a problem with my file, or is something else causing this?

Answer by forgetso (best answer)

You seem to have malformed JSON data in your file. For example, try loading the following "JSON" data. Note that the line with id 77 is malformed: it has a stray i before the first key.

{"created_at": "2019-01-01 23:45:01", "id":1}
{"created_at": "2019-01-01 23:45:01", "id":2}
{"created_at": "2019-01-01 23:45:01", "id":3}
{"created_at": "2019-01-01 23:45:01", "id":4}
{"created_at": "2019-01-01 23:45:01", "id":5}
{"created_at": "2019-01-01 23:45:01", "id":6}
{"created_at": "2019-01-01 23:45:01", "id":7}
{"created_at": "2019-01-01 23:45:01", "id":8}
{"created_at": "2019-01-01 23:45:01", "id":11}
{"created_at": "2019-01-01 23:45:01", "id":22}
{"created_at": "2019-01-01 23:45:01", "id":33}
{"created_at": "2019-01-01 23:45:01", "id":44}
{"created_at": "2019-01-01 23:45:01", "id":55}
{"created_at": "2019-01-01 23:45:01", "id":66}
{i"created_at": "2019-01-01 23:45:01", "id":77}

{"created_at": "2019-01-01 23:45:01", "id":88}
{"created_at": "2019-01-01 23:45:01", "id":99}

Then run this code.

>>> import pandas as pd
>>> reader = pd.read_json("January.jsonl", lines=True, chunksize=1)
>>> for r in reader:
...     print(r)

And view the output (only the last few lines are shown here):

12 2019-01-01 23:45:01  55
            created_at  id
13 2019-01-01 23:45:01  66
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/anaconda3/envs/project/lib/python3.7/site-packages/pandas/io/json/_json.py", line 779, in __next__
    obj = self._get_object_parser(lines_json)
  File "/home/user/anaconda3/envs/project/lib/python3.7/site-packages/pandas/io/json/_json.py", line 753, in _get_object_parser
    obj = FrameParser(json, **kwargs).parse()
  File "/home/user/anaconda3/envs/project/lib/python3.7/site-packages/pandas/io/json/_json.py", line 857, in parse
    self._parse_no_numpy()
  File "/home/user/anaconda3/envs/project/lib/python3.7/site-packages/pandas/io/json/_json.py", line 1089, in _parse_no_numpy
    loads(json, precise_float=self.precise_float), dtype=None
ValueError: Expected object or value

The error is the same as the one you received. You will need to find the malformed data and fix it. One way is to parse the file line by line with the json module, which tells you where each error occurs and lets you print the offending lines for inspection.

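import json

# Iterate over the file lazily; readlines() would load the whole file into memory.
with open("January.jsonl") as f:
    for line_no, line in enumerate(f, start=1):  # 1-based line numbers
        if not line.strip():
            continue  # skip blank lines
        try:
            json.loads(line)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            print(line_no)
            print(line)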
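
If fixing the file by hand is not practical, another option is to skip the bad lines while loading. The following is only a minimal sketch, not something from the original answer: the function name iter_valid_batches, its batch_size and bad_lines parameters, and the UTF-8 encoding are all assumptions made for illustration. It parses each line with the standard json module, yields the valid records in batches, and optionally collects the malformed lines so they can be inspected later.

```python
import json


def iter_valid_batches(path, batch_size=10000, bad_lines=None):
    """Yield lists of parsed records, skipping lines that are not valid JSON.

    Hypothetical helper for illustration. If `bad_lines` is a list, every
    malformed line is appended to it as a (line_no, text) pair.
    """
    batch = []
    # Assumption: the file is UTF-8 encoded, as JSON files usually are.
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                batch.append(json.loads(line))
            except ValueError:  # json.JSONDecodeError subclasses ValueError
                if bad_lines is not None:
                    bad_lines.append((line_no, line.rstrip("\n")))
                continue
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # last, possibly smaller, batch
```

Each batch is a plain list of dicts, so it can be turned into a DataFrame with pd.DataFrame(batch). Unlike pd.read_json, this route does no automatic date conversion, so a column such as created_at would probably need pd.to_datetime afterwards.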