I know I should include some code, but I have nothing useful yet.
There is a ~300GB JSON file in my GCS bucket: gs://path/listings_all.json
Ultimately I'm trying to import it into BigQuery, but the data has the wrong structure for that (I produced the file with mongoexport from MongoDB), so the load fails with:
invalid field name "$date". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long
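For example, a single line of the export looks roughly like this (made-up values, but the nested "$oid"/"$date" wrappers are what mongoexport produces):

{"_id": {"$oid": "5c9f3a2e8f1b2c0001a1b2c3"}, "id": 1234, "name": "Some listing", "zipcode": "10115", "scraped_at": {"$date": "2019-03-01T00:00:00.000Z"}}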
So now my approach is to somehow read the source file line by line from GCS, process each line, and upload the processed rows to BigQuery using the Python API.
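For the "process" step I only have a rough idea so far, something along these lines (just a sketch; fix_keys and the strip-the-"$" renaming rule are my own placeholder, not tested against the real data):

import json

def fix_keys(obj):
    # Recursively rename keys BigQuery rejects, e.g. "$date" -> "date".
    if isinstance(obj, dict):
        return {key.lstrip('$'): fix_keys(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [fix_keys(item) for item in obj]
    return obj

row = fix_keys(json.loads('{"created": {"$date": "2019-03-01T00:00:00.000Z"}}'))
print(row)  # {'created': {'date': '2019-03-01T00:00:00.000Z'}}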
Below is the simple reader I have put together to test with a sample of 100 lines from the original huge file:
import json
from pprint import pprint

# Each line of the export is a standalone JSON document, so parse it line by line.
with open('schema_in_10.json') as f:
    for line in f:
        j_content = json.loads(line)
        # print(j_content['id'], j_content['city'], j_content['country'], j_content['state'], j_content['country_code'], j_content['smart_location'], j_content['address'], j_content['market'], j_content['neighborhood'])
        # plus a nested geo {lat, lng} object
        print('------')
        pprint(j_content['is_location_exact'])
        pprint(j_content['zipcode'])
        pprint(j_content['name'])
Can you please help me with how to read or stream a huge JSON file line by line from Google Cloud Storage with Python 3?
Reading the file line by line on your local machine and then streaming the rows into BigQuery won't scale to 300GB, and you'll struggle to get that working, to be honest.
There are a couple of scalable options: