Is your feature request related to a problem?
Currently I have a workload which does something a bit like:
import xarray as xr

ds = xr.open_zarr(source)
(
    ds.assign(
        x=ds.foo * ds.bar,
        y=ds.foo + ds.bar,
    ).to_zarr(dest)
)
(the actual calc is a bit more complicated! And while I don't have an MVCE of the full calc, I pasted a task graph below)
Dask — while very impressive in many ways — handles this extremely badly, because it attempts to load the whole of ds
into memory before writing out any chunks. There are lots of issues on this in the dask repo; it seems like an intractable problem for dask.

Describe the solution you'd like
I was hoping to make the internals of this task opaque to dask, so that it becomes a much dumber task runner: just map over the blocks, running the function and writing the result, block by block. I thought I had some success with .map_blocks last week; the internals of the calc are now opaque at least. But the dask cluster is falling over again, I think because the write is seen as a separate task.
Is there any way to make the write more opaque too?
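For reference, this is roughly what the .map_blocks attempt looked like (a minimal sketch; calc here stands in for the real, more involved calculation):

import xarray as xr

ds = xr.open_zarr(source)

def calc(block):
    # stand-in for the real calculation, applied one block at a time
    return block.assign(x=block.foo * block.bar, y=block.foo + block.bar)

# the calc is now opaque to dask, but the write still shows up as separate tasks
ds.map_blocks(calc).to_zarr(dest)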
Describe alternatives you've considered
I've built a really hacky homegrown thing that does this on a custom scheduler: it just runs the functions and writes with region. I'd much prefer to use & contribute to the broader ecosystem...
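For concreteness, the hacky version does something roughly like this (a sketch only, assuming the data is chunked along a single "time" dimension; the real thing runs these iterations on a custom scheduler rather than a plain loop):

import xarray as xr

ds = xr.open_zarr(source)

# write the output store's metadata up front without computing anything
result = ds.assign(x=ds.foo * ds.bar, y=ds.foo + ds.bar)
result.to_zarr(dest, compute=False)

# then compute and write one chunk of "time" at a time, using region writes
start = 0
for size in ds.chunks["time"]:
    region = {"time": slice(start, start + size)}
    block = ds.isel(region).load()
    out = block.assign(x=block.foo * block.bar, y=block.foo + block.bar)
    # variables without a "time" dimension may need to be dropped before a region write
    out.to_zarr(dest, region=region)
    start += size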
Additional context
(It's also possible I'm making some basic error — and I do remember it working much better last week — so please feel free to direct me / ask me for more examples, if this doesn't ring true)