
exp: set workers per stage #9363

Open
dberenbaum opened this issue Apr 24, 2023 · 10 comments
Labels: A: experiments (Related to dvc exp) · p2-medium (Medium priority, should be done, but less important) · triage (Needs to be triaged)

Comments

@dberenbaum
Collaborator

Related: #755. This is a narrower issue than #755, coming from conversations with users about how they can run more experiments at once.

dvc exp run -j runs experiments in parallel across multiple workers, but experiments may be too memory-intensive to run many experiments at once, even on large machines. One way to mitigate this issue is by identifying which stages are memory-intensive and setting quotas on the number of workers for those stages. For example, I may have many stages that read data in small batches but a single stage that must read in all the data at once to aggregate/combine it. If I can tell DVC that I want to use a maximum of 10 jobs for all other stages but only 2 jobs for my aggregation stage, then I can still take advantage of experiment parallelization without overloading the machine's memory.
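For illustration, per-stage quotas might look something like this in dvc.yaml (the `max_jobs` field is purely hypothetical and does not exist in DVC today; it only sketches the idea):

```yaml
# Hypothetical dvc.yaml sketch: `max_jobs` is NOT a real DVC field,
# it only illustrates the per-stage quota idea described above.
stages:
  preprocess:
    cmd: python preprocess.py   # reads data in small batches, cheap on memory
    max_jobs: 10                # up to 10 experiment workers may run this at once
  aggregate:
    cmd: python aggregate.py    # loads everything at once, memory-intensive
    max_jobs: 2                 # at most 2 workers may run this concurrently
```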

shcheklein added the A: experiments and triage labels Apr 24, 2023
@pmrowla
Contributor

pmrowla commented Apr 25, 2023

I don't think this is a different issue from #755. It's a slightly different high-level use case, but in terms of DVC functionality it's asking for the same thing. The exp run workers don't know anything about the pipeline; they just run the top-level dvc repro calls separately.

What we are talking about here (and in #755) is having a completely separate worker pool for executing individual stages in parallel, and dvc repro using those workers when it runs the stages in a pipeline.

(I think this issue is actually more complex than #755, since it also needs the ability to restrict workers to specific stages, and designate different levels of concurrency for individual stages, as opposed to a single level of concurrency for any/all stages)

@shcheklein
Member

@pmrowla I think the scope can be simpler, though. If we "somehow" pass that information to DVC, it can actively throttle / wait before running a stage when too many stages of that type are already running. In that case I don't think we need a separate pool, or the ability to run different stages in parallel within a single pipeline (e.g. parallelizing foreach, etc.).

Whether it makes sense to implement it that way is a different question. The point here is that, as far as I understand, the high-level scenario doesn't require any additional parallelism; it requires the ability to add quotas to the existing system (in the dvc.yaml file, not on the workers), and there can be different ways of implementing that.
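
To make the throttling idea concrete, here is a minimal sketch (not DVC code; the stage names, quota mapping, and counter-file location are all assumptions) of a runner that waits for a free slot before executing a quota-limited stage:

```python
import fcntl  # POSIX-only advisory file locking
import json
import os
import time
from contextlib import contextmanager

# Hypothetical quotas; in a real implementation these might come from dvc.yaml.
STAGE_LIMITS = {"aggregate": 2}

# Assumed location for the shared counter; not a real DVC file.
COUNTER_FILE = ".dvc/tmp/stage_slots.json"


@contextmanager
def locked_counts():
    """Yield the shared per-stage run counts while holding an exclusive lock."""
    os.makedirs(os.path.dirname(COUNTER_FILE), exist_ok=True)
    with open(COUNTER_FILE, "a+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            f.seek(0)
            raw = f.read()
            counts = json.loads(raw) if raw else {}
            yield counts
            # Persist whatever the caller changed before releasing the lock.
            f.seek(0)
            f.truncate()
            json.dump(counts, f)
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


def run_stage(name, run_cmd):
    """Run `run_cmd`, but first wait for a free slot if `name` has a quota."""
    limit = STAGE_LIMITS.get(name)
    if limit is not None:
        while True:
            with locked_counts() as counts:
                if counts.get(name, 0) < limit:
                    counts[name] = counts.get(name, 0) + 1
                    break
            time.sleep(5)  # all slots taken; poll again later
    try:
        run_cmd()
    finally:
        if limit is not None:
            with locked_counts() as counts:
                counts[name] = max(counts.get(name, 1) - 1, 0)
```

(A worker that crashes between incrementing and decrementing would leak a slot, so a real version would need some cleanup for that case.)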

WDYT?

@pmrowla
Contributor

pmrowla commented Apr 25, 2023

In this case, your other workers are just going to block when they get to a stage with quotas (since we can only run a pipeline sequentially right now). Using Dave's example scenario, the 8 other workers are going to spend the majority of their time blocking and waiting for the memory-intensive jobs to finish (since we can only do 2 of them at a time), in which case this is effectively the same thing as just doing exp run -j 2 with the existing behavior.

I guess I'm (maybe incorrectly) making the assumption that the memory-intensive stage is going to take the most execution time in the user's pipeline. In the event that the rest of the pipeline is actually slower than the memory-intensive stage, then I suppose there would still be some benefit to this?

@shcheklein
Member

your other workers are just going to block when they get to stage with quotas

Yes, that's true, but it fits certain scenarios quite well, e.g. when there are some relatively quick but very resource-intensive stages. It's fine that we don't execute the next stage until some of the first ones are done and there are free workers, even if that means some workers are sitting idle.

@pmrowla
Contributor

pmrowla commented Apr 25, 2023

In that case we could probably do this with a relatively naive solution, like keeping a count of the number of processes running each stage at a time somewhere in the main DVC repo (queue/temp runs would have to know to check the count in the main repo, not within the temp workspace's .dvc directory). But it would still have to be smart enough to account for situations where a worker or pipeline stage dies ungracefully (meaning exactly one of the blocking workers then has to decrement the count for the crashed worker).

This also only works when there is only one person running DVC experiments on the machine (in one DVC repo). If multiple users are running jobs on a particular machine (from their own separate clones of a DVC repo), this won't actually work unless the counter is system-wide.
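
For the crash-recovery part, a rough sketch, assuming each worker records its PID when it claims a slot and blocked workers reap entries whose process has died (the data layout is an assumption, not a DVC internal):

```python
import os


def reap_dead_workers(slots):
    """Drop slot entries whose recorded PID no longer exists.

    `slots` maps stage name -> list of worker PIDs currently running that
    stage (a hypothetical layout; DVC keeps no such file today).
    """
    for stage, pids in slots.items():
        alive = []
        for pid in pids:
            try:
                os.kill(pid, 0)  # signal 0: existence check only, sends nothing
            except ProcessLookupError:
                continue  # process died without decrementing; reclaim its slot
            alive.append(pid)
        slots[stage] = alive
    return slots
```

Reaping would have to happen while holding the same lock that guards the counter, so exactly one blocked worker removes a given stale entry; and keeping the file in a fixed system-wide location rather than inside one repo would be one way to address the multi-user / multi-clone case.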

@dberenbaum
Collaborator Author

@pmrowla This is related to a user request where the low-memory stage is a remote job that actually takes the bulk of the time but obviously is not memory-intensive on the local machine.

@JulianWgs

Hi all, nice seeing this discussion! This feature request came from my side.

This also only works when there is only one person running DVC experiments on the machine (in one DVC repo). If you have a scenario where multiple users are running jobs on a particular machine (from their own separate clones of a DVC repo), this won't actually work unless the counter was system-wide.

I would opt for a naive solution where the user is responsible for ensuring that there are no other users starting memory-intensive tasks (DVC or not) on the same machine. So a per-user, per-repo, or even per-dvc exp run counter would be sufficient.

@efiop
Contributor

efiop commented May 3, 2023

@JulianWgs We've been trying to reach you by email, but we are hitting "550 spam detected. transport denied" and don't have other means of reaching you 🙁

@Moynihan18

@JulianWgs Thanks for getting back to us. I got your email, but my reply got caught by the spam detector. I'll hear from you next week!

@Moynihan18

@JulianWgs Hey Julian - still blocked by spam. Didn't hear from you this week - checking back in here. Update?

dberenbaum added the p2-medium label May 20, 2023