How to fetch all rows of data from a Spark dataframe to a file using PySpark in Databricks


I'm trying to write all rows of data from a Spark dataframe to a file in Databricks, but only some of the rows end up in the file. For example, if the dataframe count is 100, the file only contains 50 rows, so data is being skipped. How can I load the complete data from the dataframe to a file without skipping any rows? I have created a UDF that opens the file and appends the data to it, and I call that UDF on the Spark SQL dataframe.

Can someone help me on this issue?


1 Answer

Andrew Corson

I would advise against using a UDF the way you are, for a few reasons:

  • UDFs run on the worker nodes, so multiple copies of your UDF would run in parallel, each writing a portion of your data to a local file on its own node.
  • Even if your UDF appends to a file in a shared location (like DBFS), you still have multiple nodes writing to one file concurrently, which can lead to errors or lost rows.
  • Spark already has a way to do this out of the box that you should take advantage of.

To write a Spark dataframe to a file in Databricks, use the DataFrame.write attribute (Databricks docs). There are plenty of options, so you should be able to do whatever you need (Spark docs (this one is for CSVs)).

Note on partitions: Spark writes each partition of the dataframe to its own file, so if you want a single output file you should use the coalesce(1) function (warning: this is very slow with extremely large dataframes, since Spark has to pull the whole dataframe into a single partition on one executor, which can also run out of memory).

Note on file locations: a plain local path refers to the cluster nodes' local disks, so unless you plan on reading the file back with another script on the same cluster, you should start your path with "/dbfs", which is mounted onto all of the nodes' file systems. This way, the output is saved on the Databricks File System, which is accessible from any cluster in your Databricks workspace. (It's also available to download using the Databricks CLI.)
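Downloading the result with the Databricks CLI might look like this (a sketch; the DBFS path matches the example below and the local destination is illustrative):

```shell
# Copy the output directory from DBFS to the local machine; --recursive is
# needed because Spark writes a directory of part files, not a single file.
databricks fs cp --recursive dbfs:/myFileDownloads/dataframeDownload.csv ./dataframeDownload
```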

Full Example:

df_to_write = my_df.select(<columns you want>)
# Note: Spark creates a directory at this path containing a single part-*.csv file
df_to_write.coalesce(1).write.csv("/dbfs/myFileDownloads/dataframeDownload.csv")