[REVIEW] Formalization of Computation #4923
Conversation
@jcrist @TomAugspurger if either of you have time (I suspect not) it would be good to get the perspective of someone who is familiar with Dask broadly, who also isn't familiar with the work that Mads and team have been doing. A sanity check here would be welcome.
- `PickledObject` - An object that is serialized using `protocol.pickle`. This object isn't a computation by itself; instead, users can build pickled computations that contain pickled objects. This object is automatically de-serialized by the Worker before execution.
Should we have PickledObjects? Or should we use the general serialization path for this?
You mean if we have something like a list of objects? That should also just use a single `PickledObject` to serialize everything in one go. We use `typeset_computation()` to look through a computation and wrap individual task functions in `PickledCallable`.
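A quick sketch of the "one go" point above, using only stdlib `pickle` (this is illustrative; it is not the PR's `PickledObject` wrapper):

```python
import pickle

# Pickling a container serializes every element in a single pass,
# so a list of objects needs only one payload rather than one
# payload per element.
objects = ["a", 42, {"nested": (1, 2)}]

payload = pickle.dumps(objects)    # one serialization pass
restored = pickle.loads(payload)   # one deserialization pass

assert restored == objects
```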
I mean "why would we ever want to pickle an object, rather than give the rest of our serialization machinery a chance to work?" For example, what if the object was a cupy array.
Or maybe this is used very infrequently, and not where user-data is likely to occur?
This is basically the fine vs. coarse grained serialization discussion. This PR continues the existing coarse grained approach where (nested) tasks are pickled. We used to use `dumps_function()` and `warn_dumps()` to do this. Now, we use `PickledComputation` as the outermost wrapper and `PickledObject`s for already pickled objects. This makes it possible for `HLG.unpack()`, which runs on the Scheduler, to build new tasks of already pickled objects.
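The coarse grained approach described above can be sketched with plain stdlib `pickle`; the names here are illustrative, not the PR's actual API:

```python
import pickle

def add(x, y):
    # A module-level task function, so pickle can reference it by name.
    return x + y

# Coarse grained: the whole task (function plus arguments) is pickled
# as one opaque blob; the receiving side unpickles the blob and only
# then calls the function.
blob = pickle.dumps((add, 1, 2))

func, *args = pickle.loads(blob)
result = func(*args)
assert result == 3
```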
> Or maybe this is used very infrequently, and not where user-data is likely to occur?

Yes, including large data in the task directly will raise a warning, just like it used to.
Can one of the admins verify this patch?
@jrbourbeau if you can find time to review this PR, it would be great.
This PR implements graph computations based on the specification in Dask:

In order to support efficient and flexible task serialization, this PR introduces classes for computations, tasks, data, functions, etc.

This PR continues the existing coarse grained approach where (nested) tasks are pickled. We used to use `dumps_function()` and `warn_dumps()` to do this. Now, we use `PickledComputation` as the outermost wrapper and `PickledObject`s for already pickled objects. This makes it possible for `HLG.unpack()`, which runs on the Scheduler, to build new tasks of already pickled objects.

Notable Classes

- `PickledObject` - An object that is serialized using `protocol.pickle`. This object isn't a computation by itself; instead, users can build computations containing them. It is automatically de-serialized by the Worker before execution.
- `Computation` - A computation that the Worker can execute. The Scheduler sees this as a black box. A computation cannot contain pickled objects, but it may contain `Serialize` and/or `Serialized` objects, which will be de-serialized automatically when arriving on the Worker.
- `PickledComputation` - A computation that is serialized using `protocol.pickle`. The class is derived from `Computation` but can contain pickled objects. It and any contained pickled objects will be de-serialized by the Worker before execution.

Notable Functions

- `typeset_dask_graph()` - Used to typeset a Dask graph, wrapping computations in either the `Data` or `Task` class. This should be done before communicating the graph. Note, this replaces the old `tlz.valmap(dumps_task, dsk)` operation.

Related issues:

- `Serialize` objects within tasks #4673
- `tuple`s & `list`s in MsgPack serialization #4575
- `dask.dataframe.read_csv('./filepath/*.csv')` returning tuple dask#7777
- `dumps_task` in `SimpleShuffleLayer` and `BroadcastJoinLayer` unpack dask#7650
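To make the `PickledObject` idea described above concrete, here is a minimal sketch using only stdlib `pickle`. The class name mirrors the description, but none of this is the PR's actual implementation:

```python
import pickle

class PickledObject:
    """Minimal sketch of the idea: hold an object in its already-pickled
    form, and only deserialize it when needed (as the Worker would just
    before execution). Illustrative only, not the PR's code."""

    def __init__(self, obj):
        # Stored pickled, so the Scheduler can forward it as opaque bytes.
        self._data = pickle.dumps(obj)

    def deserialize(self):
        # The Worker-side step: recover the original object.
        return pickle.loads(self._data)

wrapped = PickledObject({"x": [1, 2, 3]})
unwrapped = wrapped.deserialize()
assert unwrapped == {"x": [1, 2, 3]}
```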