Slurm integration #1057

Closed
efiop opened this issue Aug 25, 2018 · 6 comments
Labels
feature request Requesting a new feature

Comments

@efiop
Contributor

efiop commented Aug 25, 2018

Kudos @Hong-Xiang

efiop added the "feature request" (Requesting a new feature) label Aug 25, 2018
@efiop
Contributor Author

efiop commented Jan 22, 2019

Closing because of the lack of interest from the users. Please feel free to reopen.

efiop closed this as completed Jan 22, 2019
@observingClouds

Hi, I'm fairly new to DVC, but I try to use it more and more in my daily work.

I'd be interested in the support of SLURM (or schedulers in general), because I submit most of my computations to a cluster with a queuing system.

My specific example

I have a script do_something.py which is computationally expensive and needs to be submitted to the queuing system. Without a DVC pipeline, `sbatch do_something.py` submits the job, and the computation starts when resources are available.

Issue

`dvc run -n do_something -d input.csv -o output.csv sbatch do_something.py` also submits the job correctly, and the calculations succeed. However, sbatch returns as soon as the job is submitted to the queue, so the command given to dvc run (here `sbatch do_something.py`) returns immediately, before the output file has been added or modified. The computation can happen much later, and DVC fails in this case:

Running stage 'do_something':
> sbatch do_something.py
Submitted batch job 31514191
ERROR: output 'output.csv' does not exist

Poor person's solution
One solution that I can think of, but which has some disadvantages, is to submit the dvc call as well:
sbatch <SLURM options> dvc run -n do_something -d input.csv -o output.csv do_something.py

However, this has side effects.

Negative:

  • <SLURM options> such as project, nodes, and time must be given on every call because dvc does not "remember" them; otherwise they need to be included in the header of do_something.py
  • dvc repro would not use the exact same resources

Positive:

  • dvc repro is not platform dependent and would run independently of the scheduler (because it would not use it)

My current visionary solution

`dvc run -n do_something -d input.csv -o output.csv sbatch do_something.py`

would work ;) So probably dvc would need to wait to do its magic until it is notified by SLURM that the run has been successful (!).

I'm happy to hear further thoughts.

@efiop
Contributor Author

efiop commented Aug 18, 2021

@observingClouds Hi. Are you familiar with https://dvc.org/doc/command-reference/exp ? We are starting to work on remote execution for it right now, maybe that would suit your scenario better?

@observingClouds

Thanks @efiop for your prompt reply. Is there already documentation, or an issue discussing the "remote execution", that you could point me to? So far I can't see how I could use dvc exp to my advantage here, as I assume it has the same problem with sbatch do_something.py.

@dberenbaum
Collaborator

@observingClouds Please take a look at #6440 to track progress on remote execution, which is still in its infancy right now.

More to the point, would it be possible to use sbatch -W (https://slurm.schedmd.com/sbatch.html#OPT_wait) to block the process until the job completes? Would this alleviate the problems you are facing?
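
A minimal sketch of that idea, assuming Slurm's sbatch is on the PATH. The helper name `submit_and_wait` is hypothetical and only there for illustration; the stage name and file names are the ones from the example earlier in this thread. With `--wait` (the long form of `-W`), sbatch does not exit until the submitted job terminates, and its exit code is that of the job, so dvc only checks for output.csv after the computation has finished.

```shell
# Hypothetical helper: submit a batch script and block until the job ends.
# --wait is the long form of -W; sbatch then exits with the job's exit code,
# which this function passes on to its caller.
submit_and_wait() {
    sbatch --wait "$@"
}

# Possible usage inside a DVC stage (not executed here):
#   dvc run -n do_something -d input.csv -o output.csv \
#       sbatch --wait do_something.py
```

Any extra sbatch options would still have to be written into the stage command or the script header, since dvc stores only the command line it is given.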

@observingClouds

observingClouds commented Aug 18, 2021

The sbatch -W suggestion is fantastic! Thanks @dberenbaum. This indeed alleviates the problem quite drastically.
Thank you both for your swift and helpful comments. Keep up the great work!


3 participants