PySpark error while writing large DataFrame to file


I am trying to write my DataFrame df_trans (about 10 million records) to a file, and I want to compare write performance across Parquet, ORC, and CSV.

df_trans.write.mode('overwrite').parquet('path')
df_trans.write.mode('overwrite').orc('path')
df_trans.write.mode('overwrite').csv('path')
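
For context, this is roughly how I'd time each format separately; the /tmp output paths and the timing loop here are placeholders, not my actual job:

import time

# Placeholder output directories, one per format so the writes don't collide
paths = {
    'parquet': '/tmp/df_trans_parquet',
    'orc': '/tmp/df_trans_orc',
    'csv': '/tmp/df_trans_csv',
}

for fmt, path in paths.items():
    start = time.time()
    # format(fmt).save(path) is equivalent to the .parquet()/.orc()/.csv() shortcuts
    df_trans.write.mode('overwrite').format(fmt).save(path)
    print('%s took %.1f s' % (fmt, time.time() - start))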

When I run these three statements together from main, the job fails with:

An error occurred while calling o206.parquet. : java.lang.NoClassDefFoundError: org/apache/spark/sql/execution/datasources/FileFormatWriter$$anonfun$write$9
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$9

When I run them individually, each takes about 30 minutes to write. Note: I am running in local mode.

I am unable to gather much from the error logs, but my guess is that it runs out of memory and fails. Is there anything I can do to speed this up?
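
In case memory really is the issue, this is the kind of change I was considering: raising the driver heap, since in local mode the driver and the executors share one JVM (the 4g figure is just an example value, not a tested recommendation):

from pyspark.sql import SparkSession

# Example only: spark.driver.memory must be set before the JVM starts,
# so this config has to be applied before any other SparkSession exists
spark = (SparkSession.builder
         .master('local[*]')
         .config('spark.driver.memory', '4g')
         .getOrCreate())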

