Add `scheduler` parameter #44

jcrist · 2017-04-11T16:37:53Z

Allow specifying the scheduler by name instead of passing in the get function directly. Scheduler can be one of:

None, to use the global scheduler, falls back to threading if not set (default)
A string specifying the scheduler name ("threading", "multiprocessing", or "sync")
A scheduler get function
Anything that can be passed to dask.distributed.Client.

Not fully set on this yet.

Pros:

Easier for less dask-savy users to specify different schedulers, feels a bit more like joblib.

Cons:

Doesn't match how schedulers are normally set in dask. Disconnect may feel foreign.

If we go this route, then I'd also add n_jobs as a parameter (matching scikit-learn), which would specify num_workers for the threading and multiprocessing schedulers, and be ignored by the others. Might also make n_jobs=1 for all but distributed result in the synchronous scheduler. Downside of supporting n_jobs here is we'd probably want the default to match what dask does (n_jobs = cpu_count()) instead of what scikit-learn does (n_jobs=1). I'm fine with this, but it is a difference.

If we don't go this route, then I might add a scheduler_kwargs parameter instead, which would be forwarded to the get call. Not sure if any of the other keyword arguments would prove useful for this library though.

Allow specifying the scheduler by name instead of passing in the `get` function directly

mrocklin · 2017-04-11T16:41:12Z

We might also consider adopting a convention like this within Dask. The term get is a pretty bad name. It actually comes all the way from the first commit

- Copy of fix for pickling masked arrays - Add distributed to travis

Still needs tests.

- Add tests for n_jobs - Add tests for scheduler parameter

jcrist · 2017-04-17T18:26:11Z

Ok, I've cleaned this up and added support for n_jobs, which has been asked for in the past.

A few potential issues:

This makes it a bit tricky to provide good error messages on bad parameters, since we just pass all unrecognized values to Client. I've tried to produce a nice error message from this, but there may be a better.
I'm also not 100% pleased with the scheduler names. Valid names are:
- threading
- multiprocessing
- sync
The first two are good, and match joblib but not the dask docs (we say threaded instead of threading). The third is an abbreviation, which may be undesired. Perhaps synchronous (a bit long though)? The equivalent in joblib is sequential. I don't think that's a good name here, as we don't enforce an ordering of execution (which sequential would seem to indicate).
We also have a different default (threading, n_jobs=cpu_count()) than scikit-learn (processes, n_jobs=1). I think this makes sense though.

mrocklin · 2017-04-18T13:41:24Z

I think that following joblib over dask names (like threading over threaded) makes sense for this interface. You could also support aliases so that all of sync, sequential, synchronous would map to the same value.

Actually, I like this idea because then we can establish these names in more parts of Dask where we might prefer threaded over threading.

scheduler_types = {
  'sync': 'synchronous',
  'sequential': 'synchronous',  # from joblib
  'threading': 'threaded',
  ...
}

scheduler = scheduler_types.get(scheduler, scheduler)

mrocklin · 2017-04-18T13:41:47Z

I would like to give people the option to provide names rather than get= functions.

jcrist · 2017-04-18T15:09:12Z

I liked the alias idea and added it. I think this is good to go. Merging.

Add scheduler parameter

a8b8268

Allow specifying the scheduler by name instead of passing in the `get` function directly

jcrist added 3 commits April 11, 2017 13:14

A few tweaks

bb764f9

- Copy of fix for pickling masked arrays - Add distributed to travis

Support n_jobs

4b41bd7

Still needs tests.

Support n_jobs parameter

127fb24

- Add tests for n_jobs - Add tests for scheduler parameter

Support scheduler aliases

176f0af

jcrist merged commit 36ef51b into dask:master Apr 18, 2017

jcrist deleted the scheduler-param branch April 18, 2017 15:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `scheduler` parameter #44

Add `scheduler` parameter #44

jcrist commented Apr 11, 2017

mrocklin commented Apr 11, 2017

jcrist commented Apr 17, 2017

mrocklin commented Apr 18, 2017

mrocklin commented Apr 18, 2017

jcrist commented Apr 18, 2017

Add scheduler parameter #44

Add scheduler parameter #44

Conversation

jcrist commented Apr 11, 2017

mrocklin commented Apr 11, 2017

jcrist commented Apr 17, 2017

mrocklin commented Apr 18, 2017

mrocklin commented Apr 18, 2017

jcrist commented Apr 18, 2017

Add `scheduler` parameter #44

Add `scheduler` parameter #44