Converting PySpark DataFrame to Pandas using Apache Arrow


I'd like to convert a PySpark DataFrame (pyspark.sql.DataFrame) to a pandas DataFrame. There is a built-in method toPandas() which is very inefficient (please read Wes McKinney's article about this issue from February 2017 here and his calculations in this Jupyter notebook).

In the meantime, some efforts have been made to make such a conversion faster. An example would be Josh's function here. However, that did not help me: I am looking to transfer more than 1M rows from a pyspark.sql.DataFrame to pandas, and that solution did not work for me.

Luckily, as shown in this post from July 26th, 2017, thanks to the authors Wes, Li and Holden, the performance of toPandas() has been significantly improved through the implementation of Apache Arrow in Spark 2.3. That being said, I don't have access to Spark 2.3 (I am using Spark 2.1).

So, my question is: how can I use the Apache Arrow functionality to convert a PySpark DataFrame to pandas quickly, for Spark versions older than 2.3? I think a lot of people are stuck with older versions of Spark and can benefit from this.
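(For reference, even after upgrading to Spark 2.3, the Arrow-backed path is opt-in; a minimal sketch, assuming Spark 2.3+:)

# Requires Spark 2.3+; this config key enables Arrow-accelerated toPandas()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
df = spark.sql(sqlQuery).toPandas()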

Update 1: It has been suggested that I write the PySpark DataFrame to CSV files first and then read the CSV files with pandas' powerful read_csv method. I am really hopeful that I will find a way to avoid doing that!
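For completeness, a minimal sketch of that suggestion (the paths are hypothetical, and note that Spark writes a directory of part files rather than a single CSV):

import glob
import pandas as pd

# Write the Spark DataFrame out as CSV part files (hypothetical path)
dfSpark.write.csv("/tmp/export_csv", header=True, mode="overwrite")

# Read every part file back and concatenate into one pandas DataFrame
parts = glob.glob("/tmp/export_csv/part-*")
df = pd.concat((pd.read_csv(p) for p in parts), ignore_index=True)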

Update 2: The reasons behind the slowness of the toPandas() method, and the possible out-of-memory issues, are discussed in detail in this discussion:

dfSpark = spark.sql(sqlQuery)
df = dfSpark.toPandas() # Very slow / out of memory error
1 Answer

Answer by ab3:

Have you tried using an intermediate file?

You can save the DataFrame to Parquet from Spark, then read it in pandas.

# Write the Spark DataFrame out as Parquet
df.write.parquet("path/file.parquet")
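If you prefer a single output file rather than a directory of part files, you can coalesce first; a sketch, assuming the data fits comfortably in a single partition:

# Coalesce to one partition so Spark writes a single part file.
# Only do this if the data fits in a single executor's memory.
df.coalesce(1).write.parquet("path/file.parquet")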

See more: https://spark.apache.org/docs/2.1.0/sql-programming-guide.html#parquet-files

Check out PyArrow's documentation on reading Parquet files:

https://arrow.apache.org/docs/python/parquet.html

import pyarrow.parquet as pq

table = pq.read_table('example.parquet')

# or, if you want to read only some of the columns
table = pq.read_table('example.parquet', columns=['one', 'three'])

df = table.to_pandas()  # pandas DataFrame
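Note that df.write.parquet above actually creates a directory of part files rather than a single file; pq.read_table also accepts a directory path, so pq.read_table("path/file.parquet") should read all the parts back as one table.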

Also, if you are running out of memory, sample the DataFrame or filter it before writing it.
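A sketch of both options (the fraction and the column name are hypothetical; adjust to your data):

# Keep a 10% random sample (without replacement) before writing
sampled = df.sample(False, 0.1, seed=42)
sampled.write.parquet("path/sample.parquet")

# Or keep only the rows you actually need (hypothetical column name)
df.filter(df["some_column"] > 0).write.parquet("path/filtered.parquet")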