Python Datashader to plot large 2D arrays of points

I am looking for some help/advice on using datashader to plot a large 2D data array as a series of points, colored by amplitude. The data I deal with is stored in several 2D HDF5 datasets, with a time index stored in a separate dataset. The second dimension of the data is a spatial dimension (distance in m), which is a non-uniformly stepped series of floats. The datasets are typically very large (~1000 x >1000000), so I would like to use dask to handle the construction of an out-of-core dataframe, where the y-location of the data is stored as the column header, the x-location is the frame index, and the points are color-mapped to the data value.

The problem comes when I want to plot this in datashader from the dask dataframe. Currently, the only way I've found is to flatten the dataframe and create two corresponding 'x' and 'y' columns to hold the index and y-locations. Can anyone help me understand whether this plotting is possible without the step of flattening the data?

This is an example of what I have done thus far:

import datashader as ds
import datashader.transfer_functions as tf
import numpy as np
import pandas as pd
import dask.dataframe as dd
import dask.array as da

import bokeh.plotting as bk
from bokeh.palettes import viridis

from datashader.bokeh_ext import InteractiveImage

bk.output_notebook()

# ------------------------
# This is a proxy for a function, which creates a delayed frame from
# a series of delayed pandas dataframes, each reading from a separate 
# h5 dataset.
random_data = da.random.random((10000, 1000), chunks = (1000, 100))
frame = dd.from_array(random_data)
# ------------------------

# ------------------------
# Flatten the dataframe and create two additional arrays holding the x and y
# locations.
a = frame.compute()  # I want to avoid this call on the whole dataframe
n_rows, n_cols = a.shape
index = np.repeat(a.index.values, n_cols)    # each row label, repeated once per column
columns = np.tile(a.columns.values, n_rows)  # column labels, cycled once per row
data = a.values.flatten()                    # row-major, so it lines up with index/columns

# ------------------------
# Now creating an in-memory frame for the data
plot_frame = pd.DataFrame({
    'x': index,
    'y': columns[::-1],  # Reverse column order to plot
    'z': data,
})
# ------------------------

x_range = [a.index[0], a.index[-1]]
y_range = [a.columns[0], a.columns[-1]]

def create_image(x_range = x_range, y_range = y_range[::-1], w=500, h=500):
    cvs = ds.Canvas(x_range=x_range, y_range=y_range, plot_height=h, plot_width=w)
    agg = cvs.points(plot_frame, 'x', 'y', ds.mean('z'))
    return tf.shade(agg, cmap = viridis(256))

def base_plot(tools='pan,wheel_zoom,reset,box_zoom,save'):
    p = bk.figure(x_range=x_range, y_range=y_range, tools=tools,
                  plot_width=900, plot_height=500, outline_line_color=None,
                  min_border=0, min_border_left=0, min_border_right=0,
                  min_border_top=0, min_border_bottom=0, x_axis_type='datetime')
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
    return p

p = base_plot()
InteractiveImage(p, create_image)

Can anyone recommend a method for handling this more effectively through the datashader pipeline?

Thanks in advance!

There is 1 answer

James A. Bednar

I'm responding just so that you know your question has been seen by the datashader maintainers, but unfortunately I don't know the best way to do precisely what you are asking for. As you can see in the OSM example in datashader, out-of-core datasets already work well using dask+datashader, but in that case the data was first put into a suitable block- and column-addressable format (castra originally, but now I'd personally recommend fastparquet based on my benchmarking). Here it seems like you are trying to preserve the original organization on disk while also making it look like a flat dataframe, and I'm not sure how that can be achieved. You could consider asking the dask maintainers directly; it's definitely not something the datashader team is currently investigating.
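
For illustration only, here is one way the "flat" layout could be produced a block of rows at a time, rather than by computing the whole frame as in the question. This is a minimal sketch, not a datashader or dask API: the helper names `wide_block_to_long` and `iter_long_blocks` are hypothetical, the small NumPy array stands in for one HDF5 dataset, and it assumes each row block fits comfortably in memory.

```python
import numpy as np
import pandas as pd


def wide_block_to_long(block, x_coords, y_coords):
    """Flatten one (n_rows, n_cols) block into long-form x, y, z columns.

    block    : 2D array of amplitudes for a contiguous run of rows
    x_coords : 1D array of length n_rows (e.g. the time index of those rows)
    y_coords : 1D array of length n_cols (e.g. distance in m for each column)
    """
    block = np.asarray(block)
    n_rows, n_cols = block.shape
    return pd.DataFrame({
        "x": np.repeat(np.asarray(x_coords), n_cols),  # each row coordinate, once per column
        "y": np.tile(np.asarray(y_coords), n_rows),    # column coordinates, cycled once per row
        "z": block.ravel(),                            # row-major, matching x and y above
    })


def iter_long_blocks(data, x_coords, y_coords, rows_per_block):
    """Yield long-form frames for successive row blocks of a 2D array-like."""
    n_total = data.shape[0]
    for start in range(0, n_total, rows_per_block):
        stop = min(start + rows_per_block, n_total)
        yield wide_block_to_long(data[start:stop], x_coords[start:stop], y_coords)


# Hypothetical stand-in for one dataset: 6 time steps by 4 non-uniform distances.
data = np.arange(24, dtype=float).reshape(6, 4)
times = np.arange(6)
distances = np.array([0.0, 1.5, 4.0, 9.0])

blocks = list(iter_long_blocks(data, times, distances, rows_per_block=2))
long_frame = pd.concat(blocks, ignore_index=True)
```

Each yielded frame already has the 'x', 'y' and 'z' columns that `cvs.points(plot_frame, 'x', 'y', ds.mean('z'))` in the question expects, so the blocks could in principle be written out one by one to a columnar file, or built lazily in the same way as the delayed pandas frames the question describes. Whether that performs acceptably at ~1000 x >1000000 is not something this sketch establishes.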