I'm using EMR and PySpark to create a DataFrame: I load data from S3, transform it into a list of JSON strings, and then create the DataFrame with the code below:
df = self.spark.read.json(self.spark.sparkContext.parallelize(jsons))
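For context, here is a stripped-down version of what I'm doing, with the S3/protobuf part replaced by synthetic data (the app name and the generated JSON strings are just placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-load").getOrCreate()

# Stand-in for the real data: a large list of JSON strings built on the driver
jsons = ['{"id": %d, "payload": "%s"}' % (i, "x" * 1000) for i in range(1_000_000)]

# This is where it blows up: the whole list sits on the driver and is shipped out with the tasks
df = spark.read.json(spark.sparkContext.parallelize(jsons))
df.show(5)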
But I seem to be hitting the message size limit. I could of course raise maxSize, but that feels like a hack, and I suspect I'm actually doing something wrong.
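For reference, the workaround I mean is bumping spark.rpc.message.maxSize when building the session (I'm assuming that's the limit the error is referring to):

from pyspark.sql import SparkSession

# The "hack": raise the RPC message size limit (value in MiB; the default is 128)
spark = (
    SparkSession.builder
    .appName("json-load")
    .config("spark.rpc.message.maxSize", "512")
    .getOrCreate()
)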
What am I missing? I obviously want to support much larger data, so I tried chunking inside the Python code, like this:
import sys
from google.protobuf.json_format import MessageToJson

jsons = []
for msg in data:
    # Convert each protobuf message to a compact JSON string
    json_message = MessageToJson(msg, indent=None, use_integers_for_enums=True)
    jsons.append(json_message)
    if sys.getsizeof(jsons) > MAX_JSONS:  # flush a chunk once the list gets "too big"
        writer.write(parquet_path, jsons)
        jsons.clear()  # Clear the list
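For completeness, the idea was that after the loop (plus a final flush of whatever is left in jsons) I would read the written chunks back in one go, something like:

# Read all previously written parquet chunks back into a single DataFrame
df = spark.read.parquet(parquet_path)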
But I still get the same error, so I guess it's the entire task's memory that's the issue? How else can I fix this? What is wrong with my approach?