Using datashader with PySpark DataFrame


I'd like to plot 200 GB of the NYC taxi dataset. I managed to plot/visualize a pandas DataFrame using datashader, but I haven't managed to do the same with a PySpark DataFrame (using a 4-node cluster with 8 GB of RAM each). What I can do is call .toPandas() to convert the PySpark DataFrame into a pandas DataFrame, but that loads the entire DataFrame into RAM on the driver node (which does not have enough RAM to fit the whole dataset) and therefore throws away the distributed power of Spark.

I also know that fetching only the pickup and dropoff longitudes and latitudes would bring the DataFrame down to around 30 GB, but that doesn't change the underlying problem.

I've also created an issue about this on the datashader GitHub repository: Datashader issue opened

I have looked at Dask as an alternative, but it seems that converting a PySpark DataFrame to a Dask DataFrame is not supported yet.

Thank you for your suggestions!


2 Answers

mdurant (accepted answer):

There is, indeed, no direct way to convert a (distributed) PySpark DataFrame to a Dask DataFrame. However, Dask is its own execution engine, and you can sidestep Spark completely if you wish. Dask can load CSV datasets from a remote data source such as S3 in a similar manner to Spark, which might look something like:

import dask.dataframe
df = dask.dataframe.read_csv('s3://bucket/path/taxi*.csv')

This works particularly well with datashader, which knows how to compute its aggregations using Dask, so you can work with datasets larger than memory, potentially computed across a cluster, all without Spark.
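To make that concrete, here is a minimal sketch of the full Dask + datashader pipeline. The S3 path and the dropoff_x/dropoff_y column names are assumptions, standing in for whatever projected coordinate columns your CSVs actually contain:

import dask.dataframe as dd
import datashader as ds
import datashader.transfer_functions as tf

# Hypothetical path and column names; substitute your own dataset.
df = dd.read_csv('s3://bucket/path/taxi*.csv',
                 usecols=['dropoff_x', 'dropoff_y'])

# Canvas rasterizes points onto a fixed-size grid; datashader asks
# Dask to compute the aggregation partition by partition, so the
# full dataset never has to fit in memory at once.
canvas = ds.Canvas(plot_width=900, plot_height=600)
agg = canvas.points(df, 'dropoff_x', 'dropoff_y', agg=ds.count())

# Shade the per-pixel counts into an image on a log color scale.
img = tf.shade(agg, how='log')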

The datashader examples include both Dask and NYC taxi examples (though not both together, unfortunately).

Gayatri:

This is a different approach from Dask.

I would say that the best way to visualize such data with Spark is to use Apache Zeppelin. It is easy to install (https://zeppelin.apache.org/), and it ships with default visualizations you can use with Spark. Check it out.
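As a rough sketch (the file path and column name below are placeholders for your actual data), a %pyspark paragraph in a Zeppelin notebook could load the data with Spark, aggregate it down to something small, and hand the result to z.show (Zeppelin's built-in display helper), which renders a DataFrame with Zeppelin's interactive table and chart views:

%pyspark
from pyspark.sql import functions as F

# Read the taxi CSVs with Spark; path and column are placeholders.
df = spark.read.csv('s3://bucket/path/taxi*.csv',
                    header=True, inferSchema=True)

# Aggregate before displaying, then render with Zeppelin's
# built-in visualizations via the z context.
by_hour = df.groupBy(F.hour('pickup_datetime').alias('hour')).count()
z.show(by_hour)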