Using datashader with PySpark DataFrame


I'd like to plot 200 GB of the NYC taxi dataset. I managed to plot/visualize a pandas dataframe using datashader, but I didn't manage to get it done with a PySpark dataframe (using a 4-node cluster with 8 GB of RAM on each node). What I can do, though, is use the .toPandas() method to convert the PySpark dataframe into a pandas dataframe. But that loads the entire dataframe into RAM on the driver node (which doesn't have enough RAM to hold the whole dataset) and therefore doesn't make use of the distributed power of Spark.
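For reference, the pandas workflow that did work looks roughly like this (a minimal sketch, not my exact code; the file path and coordinate column names are assumptions):

import datashader as ds
import datashader.transfer_functions as tf
import pandas as pd

# Load a sample that fits in RAM (hypothetical local file)
df = pd.read_csv('taxi_sample.csv')

# Aggregate point counts per pixel, then shade the aggregate into an image
canvas = ds.Canvas(plot_width=900, plot_height=600)
agg = canvas.points(df, 'dropoff_longitude', 'dropoff_latitude')
img = tf.shade(agg)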

I also know that fetching only the pickup and dropoff longitudes and latitudes would reduce the dataframe to around 30 GB. But that doesn't change the problem.
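In PySpark, that column pruning would look something like this (a sketch; the column names follow the public taxi schema and are an assumption):

coords = df.select('pickup_longitude', 'pickup_latitude',
                   'dropoff_longitude', 'dropoff_latitude')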

I've opened an issue about this on the datashader GitHub repository (Datashader issue opened).

I have looked at Dask as an alternative, but it seems that converting a PySpark dataframe to a Dask dataframe is not supported yet.
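The only indirect route I can think of is going through shared storage: write the Spark dataframe out as Parquet, then read the same files back with Dask (a sketch with hypothetical paths, reusing the coords dataframe from the snippet above; both sides would need access to the same storage):

# On the Spark side: materialize the pruned dataframe to shared storage
coords.write.parquet('s3://bucket/taxi_coords.parquet')

# On the Dask side: read the same Parquet files back
import dask.dataframe as dd
ddf = dd.read_parquet('s3://bucket/taxi_coords.parquet')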

Thank you for your suggestions!


2 Answers

mdurant (Best Answer)

There is, indeed, no direct way to convert a (distributed) PySpark dataframe to a Dask dataframe. However, Dask is its own execution engine, and you should be able to sidestep Spark completely if you wish. Dask can load CSV datasets from a remote data source such as S3 in much the same way as Spark, which might look something like:

import dask.dataframe as dd

df = dd.read_csv('s3://bucket/path/taxi*.csv')

This works particularly well with datashader, which knows how to compute its aggregations using Dask, so you can work with datasets larger than memory, potentially computed across a cluster - all without Spark.
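For instance, continuing from the df read above, the aggregation step uses the same datashader API as with pandas (a sketch; the coordinate column names are placeholders):

import datashader as ds
import datashader.transfer_functions as tf

df = df.persist()  # optional: keep the dataframe in distributed (or local) memory
canvas = ds.Canvas(plot_width=900, plot_height=600)
agg = canvas.points(df, 'dropoff_longitude', 'dropoff_latitude')  # aggregated by Dask
img = tf.shade(agg)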

The datashader examples include both Dask and NYC taxi examples (but, unfortunately, not the two together).

Gayatri

This is a different approach from Dask.

I would say that the best way to visualize such data with Spark is to use Zeppelin. It is easy to install (https://zeppelin.apache.org/), and it comes with default visualizations you can use with Spark. Check it out.
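For example, in a %pyspark paragraph you can hand a reasonably sized result to Zeppelin's built-in display system (a sketch; df is whatever Spark dataframe you have loaded, and the column names are assumptions):

%pyspark
# z is the ZeppelinContext injected into Spark paragraphs
z.show(df.select('pickup_longitude', 'pickup_latitude').limit(10000))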