
Run all variants of a foreach stage in parallel #6035

Closed
kowaalczyk opened this issue May 18, 2021 · 3 comments
Comments

@kowaalczyk

In my current project I have a use case for a foreach stage with a long list of parameters (3048, to be specific).
I'd love to be able to `dvc repro` this stage in parallel.

For instance, for a stage like this (which is roughly what I'm doing right now):

```yaml
# stage definition in dvc.yaml
upload_file:
  foreach:
  - path/to/input/directory/1
  - path/to/input/directory/2
  (...)
  - path/to/input/directory/3048
  do:
    cmd: python upload_my_file_to_remote_server.py ${item}/data.json >${item}/response.log
    deps:
    - upload_my_file_to_remote_server.py
    - ${item}/data.json
    outs:
    - ${item}/response.log
```

I'd like to run something like:

```shell
dvc repro upload_file -j 64
```

This would run 64 jobs iterating over disjoint sub-arrays of the items declared in the foreach list, effectively achieving up to a 64x speedup. The `-j` parameter would be essential for my use case: the API I'm working with is rate-limited, so I cannot send all 3048 files at once.
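The scheduling described above can be sketched as partitioning the foreach list into disjoint sub-arrays, one per worker. This is purely illustrative; `partition` is a hypothetical helper, not part of DVC:

```python
def partition(items, jobs):
    """Split items into at most `jobs` disjoint, roughly equal chunks."""
    chunks = [items[i::jobs] for i in range(jobs)]
    return [c for c in chunks if c]


# 3048 foreach items split across 64 workers, as in the example above.
paths = [f"path/to/input/directory/{i}" for i in range(1, 3049)]
chunks = partition(paths, 64)
```

Because the chunks are disjoint and foreach variants cannot share outputs, each worker could process its chunk independently.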

I realize the concurrency of the `repro` and `run` commands has been addressed many times, and as far as I understand, it is something I should manage on my own as a user. However, for foreach stages, concurrency should be significantly easier for DVC to manage: variants of the same foreach stage cannot share outputs, and as long as the entire foreach stage (with all its variants) is processed before any downstream stages, there is no way for parallel jobs to conflict.

My temporary workaround will be to add a single stage with all of the files as dependencies and use Python's multiprocessing to handle the concurrency of uploads. That is boilerplate code I have written for many other projects in the past, so it would be really nice to have this capability provided by a tool like DVC.
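A minimal sketch of that workaround, assuming a per-item `upload` callable is supplied by the caller (all names here are hypothetical). A thread pool caps how many uploads are in flight at once, which respects the rate-limited API:

```python
import concurrent.futures


def upload_all(items, upload, max_workers=64):
    # Run `upload` over every item, with at most `max_workers` calls
    # in flight at a time; results come back in input order.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(upload, items))
```

For I/O-bound uploads a thread pool is usually sufficient; for CPU-bound work, `concurrent.futures.ProcessPoolExecutor` drops in with the same structure.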

@karajan1001
Contributor

karajan1001 commented May 19, 2021

Sorry, DVC is not designed to allow running more than one stage at a time within a single repository instance/workspace. There are existing feature requests to support this behavior, for example #755 and #3783.

There are also some discussions about foreach parallelization in #5440.

@kowaalczyk
Author

Glad to see this being discussed in #5440 - just wanted to flag my use case for parallel foreach. Closing this issue for now :)

@Antho2422

Hello,

Is there a proposed implementation for this particular use case yet? It would be really useful :)

Thank you,
Anthony
