I'd like to convert a PySpark DataFrame (pyspark.sql.DataFrame) to a Pandas dataframe. There is a builtin method toPandas() which is very inefficient (please read Wes McKinney's article about this issue from February 2017 here and his calculation in this jupyter notebook).
In the meantime, some efforts have been made to make such a conversion faster. An example would be Josh's function here. However, that did not help me: I am looking to transfer more than 1M rows from a pyspark DataFrame to Pandas, and this solution did not work for me.
Luckily, as shown in this post from July 26th, 2017, thanks to the authors Wes, Li and Holden, the functionality of toPandas() has been significantly improved by the implementation of Apache Arrow in Spark 2.3. That being said, I don't have access to Spark 2.3 (I am using Spark 2.1).
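For reference, my understanding is that on Spark 2.3+ this is just a session configuration switch; I can't verify it myself since I'm on 2.1, but it should look roughly like this:

# Spark 2.3+ only (I can't test this on 2.1): enable the Arrow-based conversion
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

dfSpark = spark.sql(sqlQuery)
df = dfSpark.toPandas()  # uses Arrow under the hood once the config is set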
So, my question is: how can I use Apache Arrow functionality to convert a pyspark dataframe to Pandas quickly on Spark versions older than 2.3 (2.1 in my case)? I think a lot of people are stuck with older versions of Spark and can benefit from this.
Update 1: It has been suggested that I write the pyspark DataFrame out to CSV files first and then read those CSV files into Pandas with its powerful read_csv method (a sketch of that detour is below). I am really hopeful that I will find a way to avoid doing that!
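For completeness, that CSV detour would look roughly like this; the path /tmp/export_csv is just an example:

import glob
import pandas as pd

# Write the Spark DataFrame out as CSV part files (example path)
dfSpark = spark.sql(sqlQuery)
dfSpark.write.csv("/tmp/export_csv", header=True, mode="overwrite")

# Spark writes a directory of part files, so read and concatenate them
df = pd.concat(pd.read_csv(f) for f in glob.glob("/tmp/export_csv/part-*.csv"))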
Update 2: The reasons behind the slowness of the toPandas() method and the possible out of memory issues are discussed in detail in this discussion:
dfSpark = spark.sql(sqlQuery)
df = dfSpark.toPandas() # Very slow / out of memory error
Have you tried using an intermediate file?
You can save the DataFrame to Parquet from Spark, then read it into pandas (a sketch is below).
See more: https://spark.apache.org/docs/2.1.0/sql-programming-guide.html#parquet-files
Check out reading Parquet files with pyarrow:
https://arrow.apache.org/docs/python/parquet.html
Also, if you are running out of memory, sample the DataFrame or filter it before writing it.
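A minimal sketch of that round trip, assuming a SparkSession named spark and an example path /tmp/export_parquet (adjust both to your setup):

# Write the Spark DataFrame to Parquet (works on Spark 2.1)
dfSpark = spark.sql(sqlQuery)
dfSpark.write.parquet("/tmp/export_parquet", mode="overwrite")

# Read it back with pyarrow; read_table handles the directory of part files
import pyarrow.parquet as pq
df = pq.read_table("/tmp/export_parquet").to_pandas()

# If memory is tight, sample or filter before writing, e.g.:
# dfSpark.sample(False, 0.1).write.parquet("/tmp/export_parquet", mode="overwrite")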