exp: set workers per stage #9363
I don't think this is a different issue than #755. It's a slightly different high-level use case, but in terms of DVC functionality it's asking for the same thing. What we are talking about here (and in #755) is having a completely separate worker pool for executing individual stages in parallel. (I think this issue is actually more complex than #755, since it also needs the ability to restrict workers to specific stages and to designate different levels of concurrency for individual stages, as opposed to a single level of concurrency for any/all stages.)
@pmrowla I think it can be a simpler scope though. If "somehow" we pass information to DVC, it can actively throttle / wait before running a stage if multiple stages of that type are running already. I don't think we need a separate pool then, or the ability to run different stages in parallel within a single pipeline. It's a different question whether it makes sense to implement it that way or not. The point here is that, indeed, the high-level scenario doesn't require any additional parallelism as far as I understand; it requires the ability to add quotas into the existing system. WDYT?
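To make the throttling idea concrete, here is a minimal Python sketch (not DVC's actual code or API; the stage names and quota values are hypothetical). The point is only that a worker waits before entering a stage whose quota is full, without introducing any new worker pool:

```python
import threading

# Hypothetical per-stage quotas; stages not listed share the default.
STAGE_QUOTAS = {"aggregate": 2}  # at most 2 concurrent "aggregate" runs
DEFAULT_QUOTA = 10

_semaphores = {
    stage: threading.BoundedSemaphore(quota)
    for stage, quota in STAGE_QUOTAS.items()
}
_default_sem = threading.BoundedSemaphore(DEFAULT_QUOTA)

def run_stage_with_quota(stage_name, run_fn):
    """Block until a slot for this stage is free, then run it."""
    sem = _semaphores.get(stage_name, _default_sem)
    with sem:  # blocks while the quota is exhausted, releases on exit
        return run_fn()
```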
In this case, your other workers are just going to block when they get to a stage with quotas (since we can only run a pipeline sequentially right now). Using dave's example scenario, the 8 other workers are going to spend the majority of their time blocking and waiting for the memory-intensive jobs to finish (since we can only do 2 of them at a time), in which case this is effectively the same thing as just running with 2 workers in the first place. I guess I'm (maybe incorrectly) making the assumption that the memory-intensive stage is going to take the most execution time in the user's pipeline. In the event that the rest of the pipeline is actually slower than the memory-intensive stage, then I suppose there would still be some benefit to this?
Yes, that's true, but it fits certain scenarios quite well. E.g. there are some relatively quick but very resource-intensive stages. It's fine that we don't execute the next stage until some of the first ones are done and some workers are free, even if that means that some workers are idle for a while.
In that case we could probably do this with a relatively naive solution, like keeping a count of the number of processes running each stage at a time somewhere in the main DVC repo (queue/temp runs would have to know to check the count in the main repo, not within the temp workspace's own copy). This also only works when there is only one person running DVC experiments on the machine (in one DVC repo). If you have a scenario where multiple users are running jobs on a particular machine (from their own separate clones of a DVC repo), this won't actually work unless the counter is system-wide.
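As a rough sketch of what such a naive, repo-local counter could look like (the `.dvc/tmp/stage_quotas` location and all names here are hypothetical, not an existing DVC mechanism): each running stage atomically claims one "slot" file, and a worker waits when all slots are taken.

```python
import os
import time

def acquire_slot(repo_root, stage_name, quota, poll=1.0):
    """Claim one of `quota` slot files for a stage, waiting if all are busy."""
    quota_dir = os.path.join(repo_root, ".dvc", "tmp", "stage_quotas")
    os.makedirs(quota_dir, exist_ok=True)
    while True:
        for i in range(quota):
            slot = os.path.join(quota_dir, f"{stage_name}.{i}")
            try:
                # O_EXCL makes creation atomic: only one process can
                # claim a given slot file at a time.
                fd = os.open(slot, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
                os.close(fd)
                return slot
            except FileExistsError:
                continue  # slot taken, try the next one
        time.sleep(poll)  # all slots busy; wait and retry

def release_slot(slot):
    os.remove(slot)  # a crashed process would leave a stale slot behind
```

Note the same caveats as above: the counter is per-repo rather than system-wide, and a killed process leaves a stale slot file that would need cleanup.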
@pmrowla This is related to a user request where the low-memory stage is a remote job that actually takes the bulk of the time but obviously is not memory-intensive on the local machine.
Hi all, nice seeing this discussion! This feature request came from my side.
I would opt for a naive solution where the user is responsible for ensuring that there are no other users starting memory-intensive tasks (DVC or not) on the same machine. So a per-user, per-repo, or even per-machine counter would work for me.
@JulianWgs We've been trying to ping you through email, but we are hitting the spam filter.
@JulianWgs thanks for getting back to us. I got your email, but my reply got caught by the spam detector. Hope to hear from you next week!
@JulianWgs Hey Julian - still blocked by spam. Didn't hear from you this week - checking back in here. Update?
Related: #755. This is a narrower issue that comes from conversations with users about how they can run more experiments at once.
`dvc exp run -j` runs experiments in parallel across multiple workers, but experiments may be too memory-intensive to run many at once, even on large machines. One way to mitigate this issue is by identifying which stages are memory-intensive and setting quotas on the number of workers for those stages. For example, I may have many stages that read data in small batches but a single stage that must read in all the data at once to aggregate/combine it. If I can tell DVC that I want to use a maximum of 10 jobs for all other stages but only 2 jobs for my aggregation stage, then I can still take advantage of experiment parallelization without overloading the machine's memory.
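To make the requested behavior concrete, here is a toy Python simulation (not DVC code; the stage timings and the `aggregate_quota` name are made up): up to 10 experiment workers overall, but at most 2 of them inside the memory-hungry aggregation stage at any moment.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# At most 2 experiments may run the aggregation stage at once.
aggregate_quota = threading.BoundedSemaphore(2)

def run_experiment(exp_id):
    time.sleep(0.1)        # stand-in for the cheap, batched stages
    with aggregate_quota:  # gate only the memory-hungry stage
        time.sleep(0.5)    # stand-in for the aggregation stage

# 10 workers overall, analogous to running experiments with -j 10.
with ThreadPoolExecutor(max_workers=10) as pool:
    list(pool.map(run_experiment, range(20)))
```

The overall worker count stays at 10, but peak memory is bounded by the quota of 2 on the expensive stage.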