Error and Partition Adjustment Issue when Saving Parquet Files in PySpark


I'm encountering an error when attempting to save parquet files using PySpark, and I suspect it's related to the number of partitions I'm working with.

While running my code, I receive a warning message that says: "WARN DAGScheduler: Broadcasting large task binary with size 15.9 MiB."

Later in the process, my terminal starts filling up with repetitions of the following stack-trace line: "at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)."

I'm seeking insights and suggestions on how to address this issue and optimize the partitioning for improved performance.

Your assistance in resolving this matter would be greatly appreciated. Thank you in advance for your valuable input!

The last time I checked, the partition count exceeded 400,000, but adjusting the number of partitions to match my computer's specifications results in unacceptably long execution times. As a temporary workaround, I've tried coalescing the data into 200 partitions using the following code: "concatenatedData2 = concatenatedData2.coalesce(200)."
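In case it helps to see it in context, here is a minimal sketch of what I'm doing around the write. The paths, the session setup, and the way concatenatedData2 is loaded are placeholders standing in for my actual pipeline, and the commented-out repartition() line is just an alternative I've been considering, not something I've confirmed helps.

```python
from pyspark.sql import SparkSession

# Placeholder session setup; my real job configures more options.
spark = (
    SparkSession.builder
    .appName("parquet-partition-workaround")
    .getOrCreate()
)

# Hypothetical source of the DataFrame; in my pipeline concatenatedData2
# is built by concatenating several intermediate DataFrames.
concatenatedData2 = spark.read.parquet("/path/to/input")

# This is where I saw the partition count exceed 400,000.
print("partitions before:", concatenatedData2.rdd.getNumPartitions())

# coalesce() only merges existing partitions (no shuffle), so it is cheap
# but can leave the resulting partitions unevenly sized; repartition()
# performs a full shuffle and balances the data at the cost of extra work.
concatenatedData2 = concatenatedData2.coalesce(200)
# concatenatedData2 = concatenatedData2.repartition(200)  # shuffle-based alternative

concatenatedData2.write.mode("overwrite").parquet("/path/to/output")
```

My understanding is that coalesce() avoids a shuffle while repartition() triggers one, which is why I reached for coalesce() first, but I'm not sure that's the right trade-off for this many partitions.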
