PySpark & EMR: Serialized task 466986024 bytes, which exceeds max allowed: spark.rpc.message.maxSize (134217728 bytes)


I'm using EMR & PySpark to create a DataFrame by loading data from S3, transforming it into a list of JSON strings, and then building the DataFrame with the code below:

  df = self.spark.read.json(self.spark.sparkContext.parallelize(jsons))

But I seem to be hitting the RPC message size limit. I could of course set spark.rpc.message.maxSize higher, but that feels like a hack, and I suspect I'm doing something wrong instead.
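For reference, a minimal sketch of that workaround, assuming the session is built in the application itself (the app name is made up; on EMR the property could equally be set via the spark-defaults classification or spark-submit --conf). The value is interpreted in MiB, with a default of 128:

  from pyspark.sql import SparkSession

  # Sketch only: raise the RPC message size limit (value in MiB, default 128).
  spark = (
      SparkSession.builder
      .appName("json-to-dataframe")  # hypothetical app name
      .config("spark.rpc.message.maxSize", "512")
      .getOrCreate()
  )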

What am I missing? I obviously want to support much larger data, so I tried chunking inside the Python code, like this:

  for msg in data:
      json_message = MessageToJson(msg, indent=None, use_integers_for_enums=True)
      jsons.append(json_message)

      if sys.getsizeof(jsons) > MAX_JSONS:
          writer.write(parquet_path, jsons)
          jsons.clear()  # Clear the list

But I still get the same error, so I guess it's the size of the entire serialized task that's the issue? How else can I fix this? What is wrong in my approach?
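One detail worth noting about the size check above: sys.getsizeof(jsons) returns the size of the list object itself (essentially its array of element pointers), not the combined size of the JSON strings it holds, so the flush may trigger far later than intended. A minimal sketch of counting the buffered payload in bytes instead, reusing the data, writer, parquet_path and MAX_JSONS names from the snippet above (those names come from the question, not from any known API):

  from google.protobuf.json_format import MessageToJson

  jsons = []
  bytes_buffered = 0

  for msg in data:
      json_message = MessageToJson(msg, indent=None, use_integers_for_enums=True)
      jsons.append(json_message)
      bytes_buffered += len(json_message.encode("utf-8"))

      # Flush when the buffered JSON strings themselves exceed the threshold,
      # instead of the list-object size reported by sys.getsizeof().
      if bytes_buffered > MAX_JSONS:
          writer.write(parquet_path, jsons)
          jsons.clear()
          bytes_buffered = 0

  if jsons:  # flush whatever is left over
      writer.write(parquet_path, jsons)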
