If my data is not partitioned can that be why I’m getting maxResultSize error for my PySpark job?


I have a PySpark job in production at my company that runs every day. It recently failed for the first time, even though it had run successfully every day since it was first deployed in January.

The error was related to maxResultSize, as described here: https://kb.databricks.com/jobs/job-fails-maxresultsize-exception#:~:text=This%20error%20occurs%20because%20the,the%20driver%20local%20file%20system.

I increased spark.driver.maxResultSize for this job from the default of 1g to 2g and it is working now. But I think this is only a temporary fix.
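For reference, this is roughly how I bumped the setting (a minimal sketch; the app name is a placeholder). The same value can also be passed to spark-submit with --conf spark.driver.maxResultSize=2g.

```python
from pyspark.sql import SparkSession

# Raise the driver result-size limit; this must be set before the
# SparkContext is created, e.g. when building the session.
spark = (
    SparkSession.builder
    .appName("daily-job")  # placeholder app name
    .config("spark.driver.maxResultSize", "2g")
    .getOrCreate()
)
```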

While debugging, I noticed that the number of tasks across my executors keeps increasing every day when I inspect the job, even though nothing has changed (input and output data volumes are steady). The metadata hasn't changed, and I don't see any bad data being pulled in with the recent data.
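As a quick check (a sketch, assuming the input has been read into a DataFrame called df), the partition count behind those tasks can be inspected directly:

```python
# Number of partitions of the input DataFrame; this roughly determines
# how many tasks Spark launches for the first stage of the job.
print(f"Input partitions: {df.rdd.getNumPartitions()}")
```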

The only thing I can think of is that the input data is not partitioned; new data gets appended to it daily, and it is currently sitting around 4.5 GB in Hadoop. My job only needs the most recent day's data, and I have a filter in place to fetch it.
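For context, this is roughly what a partitioned layout would look like (a sketch; the column name, path, and date are made up), so that the daily filter could prune to a single partition instead of touching the whole 4.5 GB:

```python
from pyspark.sql import functions as F

# Hypothetical example: write the dataset partitioned by a date column
# so that a daily filter only reads one partition (partition pruning).
(
    df.write
    .partitionBy("event_date")            # assumed date column name
    .mode("append")
    .parquet("hdfs:///data/my_dataset")   # assumed path
)

# Reading only the most recent day's data then touches a single partition.
latest = (
    spark.read.parquet("hdfs:///data/my_dataset")
    .filter(F.col("event_date") == "2024-06-01")  # placeholder date
)
```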

Could this be the deeper root cause?


1 Answer

Answered by MatrixOrigin

This error occurs when the total size of the results that the executors' tasks send back to the driver exceeds the limit set by spark.driver.maxResultSize.
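To make that concrete (an illustrative sketch, not code from the original job), actions that materialize results on the driver, such as collect() or toPandas(), are the usual way this limit gets hit:

```python
# Both of these ship every task's result to the driver, and the combined
# size counts against spark.driver.maxResultSize.
rows = large_df.collect()    # all rows as a Python list on the driver
pdf = large_df.toPandas()    # all rows as a pandas DataFrame on the driver
```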

Suggestions for handling:

- Print the amount of data the executors' tasks finally send to the driver, to see whether the daily data volume is continuously increasing.
- Optimize the task processing logic so that more data is filtered, aggregated, etc. in advance, reducing the volume of data returned to the driver.
- Partition the data and process the partitions separately, instead of processing all the data at once.
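As a minimal sketch of the second and third suggestions (the path, column names, and date are illustrative), push the daily filter and any aggregation into Spark and write the result to storage instead of collecting it on the driver:

```python
from pyspark.sql import functions as F

# Illustrative sketch: filter and aggregate inside Spark so that only a
# small summary (or nothing at all) ever travels back to the driver.
daily = spark.read.parquet("hdfs:///data/my_dataset")          # assumed path
summary = (
    daily
    .filter(F.col("event_date") == "2024-06-01")               # placeholder date
    .groupBy("customer_id")                                     # assumed key column
    .agg(F.sum("amount").alias("total_amount"))                 # assumed metric
)

# Write the result out instead of collect()-ing it; the driver never
# needs to hold the full result set in memory.
summary.write.mode("overwrite").parquet("hdfs:///data/daily_summary")
```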