Skip to content

Is there any way of having .map_blocks be even more opaque to dask? #8414

@max-sixty

Description

@max-sixty

Is your feature request related to a problem?

Currently I have a workload which does something a bit like:

ds = open_zarr(source)
(
    ds.assign(
        x=ds.foo * ds.bar
        y=ds.foo + ds.bar
    ).to_zarr(dest)
)

(the actual calc is a bit more complicated! And while I don't have a MVCE of the full calc, I pasted a task graph below)

Dask — while very impressive in many ways — handles this extremely badly, because it attempts to load the whole of ds into memory before writing out any chunks. There are lots of issues on this in the dask repo; it seems like an intractable problem for dask.

image

Describe the solution you'd like

I was hoping to make the internals of this task opaque to dask, so it became a much dumber task runner — just map over the blocks, running the function and writing the result, block by block. I thought I had some success with .map_blocks last week — the internals of the calc are now opaque at least. But the dask cluster is falling over again, I think because the write is seen as a separate task.

Is there any way to make the write more opaque too?

Describe alternatives you've considered

I've built a homegrown thing which is really hacky which does this on a custom scheduler — just runs the functions and writes with region. I'd much prefer to use & contribute to the broader ecosystem...

Additional context

(It's also possible I'm making some basic error — and I do remember it working much better last week — so please feel free to direct me / ask me for more examples, if this doesn't ring true)

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions