Saving a file locally in Databricks PySpark


I am sure there is documentation for this somewhere and/or the solution is obvious, but I've come up dry in all of my searching.

I have a dataframe that I want to export as a text file to my local machine. The dataframe contains strings with commas, so just using display -> download full results ends up with a distorted export. I'd like to export with a tab delimiter, but I cannot figure out for the life of me how to download the file locally.

I have

match1.write.format("com.databricks.spark.csv")
.option("delimiter", "\t")
.save("file:\\\C:\\Users\\user\\Desktop\\NewsArticle.txt")

but clearly this isn't right. I suspect it is writing somewhere else (somewhere I don't want it to be...) because running it again gives me the error that the path already exists. So... what is the correct way?


There are 2 answers

Prem

Check whether the output is at the location printed below. That folder should contain multiple part files.

import os
print(os.getcwd())  # current working directory of the driver

If you want to create a single file (not multiple part files), you can use coalesce(1). Note that this forces one worker to fetch all of the data and write it sequentially, so it's not advisable for huge datasets.

df.coalesce(1).write.format("csv").\
    option("delimiter", "\t").\
    save("<file path>")
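
Since the save path ends up as a folder rather than a single file, a small helper can locate the data file(s) inside it. This is only a sketch: find_part_files is a hypothetical name, and it assumes the output folder is reachable through the ordinary filesystem from where the code runs (on Databricks that would typically mean a /dbfs/... path on the driver, which is an assumption about your cluster).

```python
import glob
import os


def find_part_files(output_dir):
    """Return the data files Spark wrote under output_dir, sorted by name.

    Hypothetical helper for illustration, not part of any Spark API.
    """
    # Spark names its data files part-*; the _SUCCESS marker and the hidden
    # .crc checksum files do not match this pattern and are skipped.
    pattern = os.path.join(output_dir, "part-*")
    return sorted(glob.glob(pattern))
```

After coalesce(1) the returned list should hold exactly one path, which is the file you actually want to download.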

kodachrome

cricket_007 pointed me along the right path. Ultimately, I needed to save the file to the FileStore of Databricks (not just DBFS), and then download the resulting output through the xxxxx.databricks.com/files/[insert file path here] link.

My resulting code was:

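# Wrapping the chain in parentheses allows a comment on every line;
# a comment after a backslash continuation is a syntax error.
(df.repartition(1)                 # repartition so one collective file is written
    .write.format("csv")           # in CSV format
    .option("header", True)        # with a header row
    .option("quote", "")           # get rid of quote escaping
    .option("delimiter", "\t")     # delimiter of choice (key and value are separate arguments)
    .save("dbfs:/FileStore/df/"))  # saved to the FileStore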
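
To make the last step concrete, here is a rough sketch of how a FileStore path maps to a download link. It assumes the workspace serves everything under dbfs:/FileStore/ at /files/ on the workspace host, as the Databricks FileStore documentation describes. The function name filestore_download_url and the part-file name are made up for illustration; the real part file will have a longer generated name that you need to look up in the output folder.

```python
from urllib.parse import quote


def filestore_download_url(dbfs_path, workspace_host):
    """Map a dbfs:/FileStore/... path to a browser download URL.

    Hypothetical helper; assumes FileStore content is served under /files/.
    """
    prefix = "dbfs:/FileStore/"
    if not dbfs_path.startswith(prefix):
        raise ValueError("expected a path under dbfs:/FileStore/")
    relative = dbfs_path[len(prefix):]
    # Percent-encode anything that is not URL-safe, keeping the slashes.
    return "https://{}/files/{}".format(workspace_host, quote(relative))


# Placeholder host and file name, not real values.
print(filestore_download_url("dbfs:/FileStore/df/part-00000.csv",
                             "xxxxx.databricks.com"))
```

Opening the printed URL in a browser that is logged in to the workspace should download the file to your local machine.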