
Parallel Processing vignette #21

Open · topepo opened this issue Jul 31, 2020 · 4 comments
@topepo (Collaborator) commented Jul 31, 2020

I would suggest recommending the standard foreach parallelism when using tune, and the model-specific threading methods when just parsnip is being used to fit.

Generally, parallelizing across the resamples is faster than parallelizing the individual models (see the xgboost example). We always try to parallelize the longest-running loop.
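
A minimal sketch of the two approaches for the vignette might look like this (xgboost is shown since it has a documented nthread argument; the worker and thread counts are just examples):

# (a) Tuning with tune: register a foreach backend so that tune_grid()
#     parallelizes across the resamples.
library(doParallel)
cl <- parallel::makePSOCKcluster(4)
registerDoParallel(cl)

# (b) Fitting with parsnip alone: use the engine's own threading.
library(parsnip)
spec <- boost_tree(trees = 500) %>%
  set_mode("regression") %>%
  set_engine("xgboost", nthread = 4)

With a backend registered, tune_grid() picks it up automatically; without one, setting nthread in set_engine() keeps a single fit() fast.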

@Athospd (Member) commented Aug 2, 2020

Hey Max, glad to see you here. I was writing about forking and then decided to run a benchmark to enrich the vignette. I expected to corroborate your findings, but I ended up with counter-intuitive results.

tl;dr: neither pure forking nor pure threading was best; 2 threads with 4 workers was the fastest setup.

see here https://curso-r.github.io/treesnip/articles/threading-forking-benchmark.html
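
For reference, a sketch of the kind of mixed setup I mean (assuming a Unix machine where forking is available, and an engine such as xgboost that accepts a thread count through set_engine()):

library(doParallel)
registerDoParallel(cores = 4)  # 4 forked workers for the resample loop

library(parsnip)
spec <- boost_tree(trees = 500) %>%
  set_mode("regression") %>%
  set_engine("xgboost", nthread = 2)  # 2 threads inside each worker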

Do you think it is worth considering these combinations, or is it better to stick with the simple rule of thumb (tune -> forking; fit -> threading)?

@topepo (Collaborator, Author) commented Aug 3, 2020

That's really interesting! TBH, I'm surprised that a combination like that works at all. Can you make a plot with the speed-up (sequential time / parallel time) on the x-axis?
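
Something along these lines, assuming the benchmark results live in a data frame with setup and time_sec columns (names made up here), with exactly one sequential row:

library(dplyr)
library(ggplot2)

seq_time <- results$time_sec[results$setup == "sequential"]

results %>%
  mutate(speed_up = seq_time / time_sec) %>%   # speed-up = seq time / par time
  ggplot(aes(x = speed_up, y = setup)) +
  geom_col()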

I might run some of these locally this weekend, too.

@Athospd (Member) commented Aug 9, 2020

@topepo I'm running more benchmarks and I think I've spotted a potential issue you might want to check yourself to confirm: when I set vfold_cv(v = 3), only 3 workers were used, even with tune_grid() set to fit lots of different models. When I set vfold_cv(v = 8), I watched all 8 of my cores running at 100%. My hypothesis is that tune_grid() is forking only over the folds loop.
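
In case it helps, a sketch of the kind of setup that shows the behaviour (the data, workflow, and grid size are placeholders):

library(tidymodels)

folds <- vfold_cv(mtcars, v = 3)              # only 3 resamples
grid  <- grid_regular(trees(), levels = 20)   # but 20 candidate models

# tune_grid(wf, resamples = folds, grid = grid)
# Observed: at most 3 workers busy, even though 3 x 20 fits are queued,
# which suggests the parallel loop is over the folds only.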

@gregleleu commented

Hi,
I'm using doFuture/doRNG parallel processing for my tidymodels workflows (for tuning) with other engines (apparently I need to load doFuture before using doRNG, but I'm still trying to confirm that):

library(doFuture)
registerDoFuture()   # use futures as the foreach backend
plan(multisession)   # background R sessions (no forking needed)

doRNG::registerDoRNG()  # reproducible parallel random numbers

It fails when using treesnip with catboost; I get an error: Error in pkg_list[[1]]: subscript out of bounds.
This is because catboost and treesnip are not loaded on the workers (I can't fork because of RStudio, and there is a consensus that you shouldn't fork from RStudio).
It works when I "register" the dependencies manually (see tidymodels/tune#205):

# set_dependency() is from parsnip; it tells tune which packages to load
# on the workers for this model/engine combination.
parsnip::set_dependency("boost_tree", eng = "catboost", "catboost")
parsnip::set_dependency("boost_tree", eng = "catboost", "treesnip")

It could be useful to document that somewhere, or maybe there is a place in the package where the set_dependency() calls could be included.
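
One possible place, sketched here on the assumption that treesnip registers its models at load time, would be the package's .onLoad() hook:

# Hypothetical: run the registration when treesnip is loaded, so freshly
# spawned workers pick up the dependencies automatically.
.onLoad <- function(libname, pkgname) {
  parsnip::set_dependency("boost_tree", eng = "catboost", "catboost")
  parsnip::set_dependency("boost_tree", eng = "catboost", "treesnip")
}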
