Non-MPI backends (e.g. Slurm) #97
Perhaps you are looking for https://jobqueue.dask.org/en/latest/ ?
I'm actually looking for something that can run a dask cluster within an existing Slurm allocation. My understanding is that jobqueue submits Slurm jobs itself; my use-case is that I already have an allocation, and I'd like to start up dask within it. Taking a step back, I'd like something that can dynamically schedule generic Python tasks within a given allocation of resources. For the most part, I'm interested in completely independent jobs, although task graph capabilities might be useful in the future too. I think dask-mpi is the closest thing I've found so far, but I'm open to suggestions.
@lgarrison: Yes. Dask-MPI is just a convenience function. As you point out, MPI is not used for client-scheduler-worker communication. MPI is only used to communicate the scheduler address to the worker processes (and to the client process, if one is run as part of the job). It seems to me that the feature you are looking for would only require that Dask-MPI not depend on MPI for that one step. Does that accurately describe what you are looking for?
Yes, I think that's exactly right. I'll test that out.
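For concreteness, here is a minimal sketch of the idea discussed above (not dask-mpi's actual implementation): use Slurm's SLURM_PROCID in place of the MPI rank, and a scheduler file on a shared filesystem in place of an MPI broadcast. It assumes the script is launched with something like `srun python launch.py` inside an existing allocation, and it glosses over waiting for the scheduler file to appear and shutting the scheduler down when the client finishes (things dask-mpi handles explicitly).

```python
import asyncio
import os

from distributed import Client, Scheduler, Worker


async def main(scheduler_file="scheduler.json"):
    # srun sets SLURM_PROCID; it plays the role of the MPI rank.
    rank = int(os.environ["SLURM_PROCID"])

    if rank == 0:
        # Rank 0 runs the scheduler, which writes its contact info to the shared file.
        async with Scheduler(scheduler_file=scheduler_file) as scheduler:
            await scheduler.finished()
    elif rank == 1:
        # Rank 1 runs the client code, discovering the scheduler via the file.
        async with Client(scheduler_file=scheduler_file, asynchronous=True) as client:
            print(await client.submit(sum, range(10)))
    else:
        # All remaining ranks run workers, also pointed at the scheduler file.
        async with Worker(scheduler_file=scheduler_file) as worker:
            await worker.finished()


if __name__ == "__main__":
    asyncio.run(main())
```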
I attempted to pull a base class out of distributed in dask/distributed#4710. The core difference is that folks want to take an existing allocation and fill it with a Dask cluster, rather than have Dask request the allocation itself. However, that draft implementation met some friction from other maintainers and I didn't pursue it any further. But maybe this is a signal that there is demand for this and the distinction is valuable?
I did realize that one other piece is needed to get this to work, besides the scheduler file: each process needs a way to determine its rank without MPI. @jacobtomlinson, it looks like you've already worked on related abstractions. Are we on the right track here? Is it okay to implement "Slurm-awareness" at the dask-mpi level, or should it live elsewhere?
This sounds like a good track, but I don't think it belongs in dask-mpi itself. This is why I tried to abstract things out in distributed in dask/distributed#4710.
I agree it's a bit odd to have a package named dask-mpi handle the non-MPI case. Maybe the MPI-specific parts could sit behind a small bootstrapping interface, something like:

```python
import os


class SlurmBootstrapper(Bootstrapper):  # Bootstrapper is the proposed abstract base
    def get_rank(self, comm=None):
        # srun exports the process rank as SLURM_PROCID
        return int(os.environ['SLURM_PROCID'])

    def supports_bcast(self):
        '''Do we have MPI broadcast support, or do we have to use a scheduler file?'''
        return False


class MPIBootstrapper(Bootstrapper):
    def get_rank(self, comm=None):
        return comm.Get_rank()

    def supports_bcast(self):
        return True

    def bcast(self, comm, *args, **kwargs):
        return comm.bcast(*args, **kwargs)
```

This could also be an interesting use case for entry points, which would allow an external package to register a bootstrapping interface. But it still seems like a shame to split off another package, since the packaging overhead would be so much larger than the useful code itself. Plus, an important part of this (for my use case, at least) is to have the same launch interface work with or without MPI installed.
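As a rough illustration of the entry-points idea (the group name "dask_mpi.bootstrappers" and the package/module names here are made up for the example, not an existing convention):

```python
# In the external package's pyproject.toml it would register its class, e.g.:
#
#   [project.entry-points."dask_mpi.bootstrappers"]
#   slurm = "dask_slurm_bootstrap:SlurmBootstrapper"
#
# dask-mpi could then discover all registered bootstrappers at runtime.
from importlib.metadata import entry_points  # selectable API requires Python 3.10+


def load_bootstrappers():
    """Return a mapping of name -> bootstrapper class for all registered plugins."""
    return {ep.name: ep.load() for ep in entry_points(group="dask_mpi.bootstrappers")}
```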
I agree it's a shame for such a small amount of code to cause this discussion, but I'm really not sure this proposal makes sense. Given that this whole library is only ~200 lines of code, I don't think there would be much pushback to making a PR into it directly.
This seems like a pretty unusual ask, though: making MPI optional in the package whose whole purpose is launching Dask via MPI.
Perhaps more importantly, I can see many users, upon being directed to the dask-mpi package for a workflow that has nothing to do with MPI, being fairly confused.
I know this seems unusual, but I think it's more because the MPI dependency is an implementation detail of how dask-mpi bootstraps the cluster rather than something users actually care about. But if we are at the point of copy-pasting most of this code into a new package, it's worth asking whether that split is really the better outcome.
Hi everyone,

Yes, things like that have been discussed in the past in the HPC world. In any case, I think it would be really nice to have such a Dask cluster deployment method; a lot of HPC system admins (and users) would be glad of it.

The thing is, job schedulers are so specific in these matters that it seems complicated to me to find the correct abstraction and place for a new dask-hpc-or-multinode package. Slurm is really powerful in this multi-node domain; it would almost make sense to have a dask-slurm package that uses allocations and srun.

So in my opinion, you should start in a new repo. I would be happy to follow your development and see if things can be generalized in one way or another! Maybe in the end it could be merged into one of the existing packages.
That all sounds reasonable. I think the key difference between the existing cluster managers and the runner proposed here is that the managers request resources, while a runner fills resources that already exist.

I have concerns about orthogonality in all the deployment packages we have when we talk about introducing more runners. The SLURM runner you are proposing is similar in spirit to what dask-mpi does today, just bootstrapped by Slurm instead of MPI. I also want to have a runner for various cloud environments; I think Azure ML is the example I used in my original PR about this, and we already have a cluster manager for Azure in dask-cloudprovider. In that case, does it make sense for a SLURM runner and an Azure ML runner to live together in one new package?

I assume from @guillaumeeb's response that there may not be as much benefit in colocating the SLURM cluster manager and cluster runner in the same package. Maybe code reuse between them would be low? But I think there are other compelling cases for cloud, kubernetes, etc. where this colocation does have benefits, especially around package dependencies. I don't think putting everything in one standalone runners package captures those benefits. We also don't want to create a new package for every new runner; that would be a pain. That leaves colocating them with existing cluster managers that use the same resource schedulers and dependencies, even if that is a shift from those packages' current remit.
I hadn't thought about all that, given that I've not looked much into the cloud deployment packages. Back to the subject here: the proposal is really much closer to dask-mpi than to dask-jobqueue.
I agree the deployment paradigm is a big difference, but I think that's ok, even if you're worried about maintaining a SLURM runner alongside the existing SLURM cluster manager.
This is tempting: we could just rename this repo and then put the SLURM runner in here too. Packaging would get more complex because MPI would need to be made optional. However, I worry that users would get confused about how a renamed dask-mpi relates to the existing deployment packages. Also, what about cloud/kubernetes/hadoop/etc.? If we want runners for those, it doesn't make sense for them to live in an HPC-oriented package.
Not at all! You've seen over the years that I'm not always available to maintain things, so I won't feel the responsibility for this 😄. I just think I'm not seeing yet what a runner really is, or at least this one, and how it would integrate into dask-jobqueue. Do you think we could share part of the runner code and use it in both dask-mpi and dask-jobqueue? Maybe the development should still be started in another repository so that we can see how it fits later on in an existing repo? Maybe you're afraid in that case that we never merge it into an existing package...
This would fit well with the use case I had in mind. I agree the dependency management would be more complex, but maybe specifying optional dependencies (extras) would keep that manageable.
If I understand correctly, the concern is that you want to be able to reuse, e.g., cloud-specific utilities that live in dask-cloudprovider?
I think this is valid; understanding the difference between a cluster manager and a cluster runner is important. Using cloud as an example, there's a tension between encouraging code reuse (keeping all cloud things in one repo) and keeping the cluster-manager and cluster-runner workflows separate (two repos with cloud things). But I would argue that the code reuse problem has a decent technical solution, which is to keep them in separate packages and use optional dependencies. The manager vs runner "educational" problem benefits from having two different repos that clearly serve two different workflows, I think.
The way I see it, the cluster managers that we have inside dask-jobqueue, dask-cloudprovider, dask-kubernetes, etc. ask a resource manager for resources and then build the Dask cluster inside them. The runner discussed here starts from an already-granted allocation and turns the processes within it into a Dask cluster. It's sort of top-down cluster creation vs bottom-up. For integration, it's mostly just namespacing.

Yeah, I already proposed this in dask/distributed#4710 and we could revisit that again for sure. Part of the challenge is that a runner is a trivially small amount of code compared to a cluster manager; the runner in dask-mpi is only a couple of hundred lines.

Dask cloudprovider is a gigantic package with many optional dependency sets already. Making that a dependency of a small runner package would be pretty heavyweight.

This is only part of my concern. My other concern is that I think users would expect to find cloud-related tools in the dask-cloudprovider package.
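To make the top-down vs bottom-up distinction concrete, here is an illustrative sketch. The SLURMCluster usage reflects dask-jobqueue's existing API; the SlurmRunner class is the hypothetical runner being discussed in this thread, not an existing API at the time of this conversation.

```python
# Top-down (cluster manager): Python code asks Slurm for new resources.
from dask_jobqueue import SLURMCluster
from distributed import Client

cluster = SLURMCluster(cores=8, memory="16GB")  # resources requested per worker job
cluster.scale(jobs=4)                           # submits 4 Slurm jobs
client = Client(cluster)

# Bottom-up (cluster runner): the same script is launched across an existing
# allocation, e.g. `srun -n 16 python script.py`, and the processes organise
# themselves into one scheduler, one client, and N workers.
#
# from somewhere import SlurmRunner  # hypothetical class, does not exist yet
# with SlurmRunner(scheduler_file="scheduler.json") as runner:
#     client = Client(runner)
```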
This makes the most sense to me as the next course of action. Maybe @lgarrison could create a standalone prototype repo to start with, and we can work out where the code should eventually live. I think the key candidates would be the existing packages that already target the same resource managers.
But we can kick that down the road and maybe discuss it at a Dask community meeting with other maintainers.
Okay. I was unavailable the last couple of days to respond, so I recognize that I'm coming back to this conversation a bit late. I wanted to add some thoughts to the discussion, at the risk of throwing a spanner in the works.

As has been pointed out, Dask-MPI only uses MPI to determine the "rank" of each running process and to broadcast the scheduler's address to the workers. Also as pointed out already, Dask-MPI originally didn't even broadcast the scheduler address! In those early days, I always saw Dask-MPI really as a way of automatically launching the scheduler and workers across an MPI job with a single command. I vaguely remember (in those early days) a discussion around ways of reliably determining the "process rank" (using MPI terminology) without needing to rely on MPI itself.

Now, even if the "process-rank discovery problem" can be abstracted well, that still leaves the need for a robust mechanism for sharing the scheduler address with the workers without MPI or a shared filesystem. In general, this is all part of "service discovery," but in some sense Dask-MPI exists because "service discovery" is usually discouraged (and sometimes banned) on HPC systems.

So, finally getting back to the discussion at hand, I see the direction of this project being consistent with @lgarrison's initial suggestion (above). But I would say that Dask-MPI was an MPI-specific, early attempt at solving the general problem. I agree that any attempts that take a big step in that direction should probably be created in a new repo. As for the repo/project name... since what we are trying to discuss is just a "single-command launcher" for both the scheduler and the workers, maybe the name should reflect that rather than any particular resource manager.
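A sketch of what launcher-agnostic "process-rank discovery" could look like, checking environment variables set by a few common launchers (the list is only a sample, and the function name is just for illustration):

```python
import os

_RANK_VARS = [
    "SLURM_PROCID",          # set by Slurm's srun
    "OMPI_COMM_WORLD_RANK",  # set by Open MPI's mpirun
    "PMI_RANK",              # set by MPICH / PMI-based launchers
]


def discover_rank():
    """Return this process's rank without importing mpi4py."""
    for var in _RANK_VARS:
        value = os.environ.get(var)
        if value is not None:
            return int(value)
    raise RuntimeError("Could not determine process rank from the environment")
```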
Interesting points @kmpaul. To throw another spanner in the works: the Dask team at NVIDIA is exploring a more powerful "single-command launcher" for workers. I played with this a while ago in dask-agent, but we will likely start from scratch, and everything is up for grabs including the name. Maybe this would be an interesting place to also explore launching schedulers and handling this coordination.
Hello, is there any update on this? I'm also looking for a way to run Dask within an existing Slurm allocation.
@LTMeyer things are still a work in progress. I recommend you track dask/dask-jobqueue#638 which I will be working on in the near future. |
Thank you @jacobtomlinson. Actually, @lgarrison pointed me to the repo you both contributed to (https://github.com/jacobtomlinson/dask-runners), which is also the subject of the PR you mentioned. It perfectly answered my need, and I was able to use Dask on Slurm-allocated resources as I wanted.
Since a "major makeover" was mentioned for dask-mpi in dask/distributed#7192, I thought I would ask if support for non-MPI backends was a possibility. That is, use something like Slurm's srun to launch the ranks even if MPI is not available.
The reason I think this may be possible is because, as far as I can tell, MPI is only being used to broadcast the address of the scheduler to the workers. Afterwards, it seems to me that any communication is done via direct TCP. If this is correct, could the startup be achieved instead by, e.g., writing the address to a file in a shared location?
The benefit of this is the elimination of a fairly heavyweight MPI dependency. When users start mixing, e.g., a conda install of MPI with an Lmod install of MPI, their jobs often get confused and fail. It seems like dask-mpi has a chance to sidestep these issues completely.
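For reference, the "write the address to a file in a shared location" idea maps onto a mechanism Dask already has: the scheduler file. A tiny illustration (the path is just an example):

```python
from distributed import Client

# Once the scheduler has written its contact info to a file on a shared
# filesystem, any process in the allocation can connect without MPI.
client = Client(scheduler_file="/path/on/shared/fs/scheduler.json")
```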