In my current project I have a use case for a foreach stage with a long list of parameters (3048, to be specific).
I'd love to be able to dvc repro this stage in parallel.
For instance, for a stage like this (which is roughly what I'm doing right now):
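(A minimal sketch of such a stage; the file names and the `upload.py` script are illustrative placeholders, not my actual project:)

```yaml
stages:
  upload:
    foreach:
      - file_0001.json
      - file_0002.json
      # ... 3048 items in total
    do:
      cmd: python upload.py ${item}
      deps:
        - ${item}
```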
I'd like to run something like `dvc repro -j 64` on this stage, to run 64 jobs iterating over disjoint sub-arrays of the items declared in the foreach list, effectively achieving a 64x speedup. The `-j` parameter would be required for my use case, as the API I'm working with is rate-limited (I cannot send all 3048 files at once).
I realize the concurrency of the repro and run commands has been addressed many times, and if I understand correctly this is something I should manage on my own as a user. However, for foreach stages, concurrency should be significantly easier for dvc to manage: variants of the same foreach stage cannot share outputs, and as long as the entire foreach stage (with all its variants) is processed before any downstream stages, there is no way for parallel jobs to conflict.
My temporary workaround will be to add a single stage with all of the files as dependencies, which uses Python's multiprocessing to handle the concurrency of uploads. This is a boilerplate piece of code that I have implemented for many other projects in the past, so it would be really nice to have this capability provided by a tool like dvc.
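That workaround could look roughly like the following sketch. The `upload_file` function is a hypothetical stand-in for the actual rate-limited API call, and the `jobs` cap is what keeps the request rate bounded:

```python
from multiprocessing.pool import ThreadPool


def upload_file(path):
    # Hypothetical upload call to the rate-limited API;
    # stubbed here to just return a marker for the path.
    return f"uploaded:{path}"


def upload_all(paths, jobs=64):
    # Cap concurrency at `jobs` workers so no more than
    # `jobs` uploads are in flight at any time.
    with ThreadPool(processes=jobs) as pool:
        # map() preserves the input order of `paths`.
        return pool.map(upload_file, paths)


if __name__ == "__main__":
    results = upload_all([f"file_{i:04d}.json" for i in range(8)], jobs=4)
    print(results)
```

A thread pool (rather than a process pool) is usually sufficient here, since uploads are I/O-bound and the workers spend most of their time waiting on the network.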
Sorry, DVC is not designed to allow running more than one stage at a time within a single repository instance/workspace, and there are existing feature requests to support this behavior, e.g. #755 and #3783.
There are also some discussions about foreach parallelization in #5440.