I'd like to plot 200 GB of the NYC Taxi dataset. I managed to plot/visualize a pandas dataframe using datashader, but I couldn't manage the same with a PySpark dataframe (using a 4-node cluster with 8 GB of RAM on each node). What I can do, though, is use the .toPandas() method to convert the PySpark dataframe into a pandas dataframe. But this loads the entire dataframe into RAM on the driver node (which doesn't have enough RAM to fit the entire dataset), and therefore doesn't make use of the distributed power of Spark.
I also know that fetching only the pickup and dropoff longitudes and latitudes brings the dataframe down to around 30 GB. But that doesn't change the problem.
I've also opened an issue on the datashader GitHub repository about this: Datashader issue opened.
I have looked at Dask as an alternative, but it seems that converting a PySpark dataframe to a Dask dataframe is not supported yet.
Thank you for your suggestions!
There is, indeed, no direct way to convert a (distributed) PySpark dataframe to a Dask dataframe. However, Dask is its own execution engine, and you should be able to sidestep Spark completely if you wish. Dask can load CSV datasets from a remote data source such as S3 in a similar manner to Spark, which might look something like the sketch below.
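A minimal sketch, assuming the CSVs live in the public nyc-tlc S3 bucket with the standard yellow-taxi column names (adjust the path and columns to your copy of the data; S3 access also requires the s3fs package):

```python
import dask.dataframe as dd

# Hypothetical path and column names for the public NYC taxi data;
# replace with wherever your 200 GB of CSVs actually lives.
df = dd.read_csv(
    's3://nyc-tlc/trip data/yellow_tripdata_2015-*.csv',
    usecols=['pickup_longitude', 'pickup_latitude',
             'dropoff_longitude', 'dropoff_latitude'],
    storage_options={'anon': True},  # anonymous access for a public bucket
)
```

This builds a lazy, partitioned dataframe; nothing is read until a computation (such as a datashader aggregation) is triggered.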
This works particularly well with datashader, which knows how to compute its aggregations using Dask, so you can work with datasets larger than memory, potentially computed across a cluster - all without Spark.
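Continuing the sketch above (the canvas extents are assumptions that roughly cover New York City):

```python
import datashader as ds
import datashader.transfer_functions as tf

# Bin the dropoff points onto a 1000x1000 grid; given a Dask dataframe,
# datashader aggregates each partition in parallel and combines the results.
canvas = ds.Canvas(plot_width=1000, plot_height=1000,
                   x_range=(-74.1, -73.7), y_range=(40.6, 40.9))
agg = canvas.points(df, 'dropoff_longitude', 'dropoff_latitude')

# Shade the aggregate into an image, using a log scale to cope with
# the large dynamic range of trip counts per pixel.
img = tf.shade(agg, cmap=['lightblue', 'darkblue'], how='log')
```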
The datashader examples contain both Dask and NYC taxi examples (but not both together, unfortunately).