Dask Delayed ignores name for dependent variables


When creating a graph of calculations using delayed I'm trying to assign names so that if I visualize the graph it's readable. However, for delayed variables that are dependent on functions the name parameter doesn't seem to affect the key. Here's a toy example:

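import numpy as np
import pandas as pd
from dask import delayed

def calc_avg(a, b):
    # Row-wise mean of the two series
    return pd.concat([a, b], axis=1).mean(axis=1)

def calc_ratio(a, b):
    return a / b

# Named inputs: these names do show up as keys
a = delayed(pd.Series(np.random.rand(10)), name='a')
b = delayed(pd.Series(np.random.rand(10)), name='b')
c = delayed(pd.Series(np.random.rand(10)), name='c')

# Named functions: these names do not show up on the results
x = delayed(calc_avg, name='avg_result')(a,b)
y = delayed(calc_ratio, name='ratio_result')(x,c)

y.visualize()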

You can see the visualization here (I can't embed images), but rather than seeing 'avg_result' I see 'calc_avg-#0', and rather than seeing 'ratio_result' I see 'calc_ratio-#1'. If I look at x.key or y.key, they do not match the names that I provided. Is this the expected behavior?

Answer by mdurant (best answer):

The key of a dask result needs to be unique for every combination of the delayed function and the inputs you give it. What you see above is the expected behaviour: the name argument names the delayed function itself, not the result of calling it. A call with different inputs would produce a different output, so each call must get a different key.
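
To make this concrete, here is a minimal sketch, assuming a reasonably recent dask. Plain floats stand in for the Series from the question, and the variable names avg, x1 and x2 are only illustrative. It suggests that name ends up as the key of the wrapped function, while each call gets its own generated key:

```python
from dask import delayed

def calc_avg(a, b):
    # Simplified stand-in for the question's calc_avg (floats, not Series)
    return (a + b) / 2

# name= becomes the key of the delayed function object itself
avg = delayed(calc_avg, name='avg_result')
print(avg.key)

# Each call returns a new Delayed with its own generated key
x1 = avg(1.0, 3.0)
x2 = avg(10.0, 30.0)
print(x1.key)
print(x2.key)
```

The two printed call keys should differ from each other and from 'avg_result'; their exact text depends on the dask version and on the pure setting.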

You can specify the key you'd like associated with a result, not when you define the delayed function, but when you call it, using the dask_key_name keyword:

x = delayed(calc_avg)(a, b, dask_key_name='avg_result')
y = delayed(calc_ratio)(x, c, dask_key_name='ratio_result')
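
As a quick check, the following self-contained sketch applies the same pattern and prints the resulting keys. It assumes plain floats in place of the Series from the question (so calc_avg is simplified accordingly) and is meant as an illustration rather than a drop-in replacement:

```python
from dask import delayed

def calc_avg(a, b):
    # Simplified stand-in for the question's calc_avg (floats, not Series)
    return (a + b) / 2

def calc_ratio(a, b):
    return a / b

# For plain data, name= is used as the key directly
a = delayed(2.0, name='a')
b = delayed(4.0, name='b')
c = delayed(6.0, name='c')

# dask_key_name is passed at call time and sets the key of the result
x = delayed(calc_avg)(a, b, dask_key_name='avg_result')
y = delayed(calc_ratio)(x, c, dask_key_name='ratio_result')

print(x.key)  # avg_result
print(y.key)  # ratio_result

result = y.compute()
print(result)  # 0.5
```

One caveat, as far as I understand the dask documentation: keys set this way are not made unique for you, and results sharing a key are assumed to be the same result, so choose a distinct dask_key_name for each distinct call.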