Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated LSF scheduler to accept number of nodes #5153

Merged
merged 3 commits into from
May 30, 2022

Conversation

nkeilbart
Copy link
Contributor

I have the scheduler on Lassen at LLNL which is not able to accept the current settings. I have updated the scheduler so that if the option:

use_num_machines : True

is set then you can specify both num_machines and tot_num_mpiprocs as well.

@chrisjsewell
Copy link
Member

Thanks @nkeilbart!
Obviously, will have to look at why this is now failing the tests, but off-hand I think this seems reasonable

@nkeilbart
Copy link
Contributor Author

No problem at all. Let me know if there is some documentation somewhere that might need updated as well. Otherwise people might not know that option exists without looking at the source code. I'm still a bit of a novice with all the git repository setup but if you point me in the right direction I'll work on that as well.

@sphuber
Copy link
Contributor

sphuber commented Oct 6, 2021

Thanks @nkeilbart . The first question that sprang to my mind is: why did the original implementer say that num_machines is not supported? Do all LSF schedulers support defining the number of nodes, or might this be something specific to the installation/configuration at LLNL? If it is supported by default, then it would indeed make sense to add here. Once we answer that question, I have some small comments on the implementation, but I wanted to first get this question out of the way. Because if it is not general to all LSF schedulers, then creating a LLNL specific plugin might be better. This would be very easy to do through the plugin system and would prevent breaking the LsfScheduler for other users.

Edit: I looked at the LSF documentation, and it doesn't mention this -nnodes flag for the bsub command that you seem to be using: https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=started-quick-reference

@nkeilbart
Copy link
Contributor Author

Hi @sphuber, thanks for taking a look at this. I believe the answer to your question lies in some of the comments I came upon in the LSF scheduler that the original programmer left. They state that you need to check that PARALLEL_SCHED_BY_SLOT=Y is NOT defined in lsb.params. They then mention you can check this with bparams -l. It turns out here at the lab we do have this option enabled. I would imagine that other places would have it implemented in a similar fashion meaning it would be nice to have this feature. I don't think my changes would necessarily break the scheduler for other users that already have it setup to only use number of processors. I think it should still be able to write the job submission script correctly and run.

As for the -nnodes option, when I look at "man bsub" on the computer here and look for -nnodes it comes up as an option. The description says, "Specifies the number of compute nodes that are required for the CSM job." I am unable to simply specify number of processors on the server here but must specify the number of nodes instead and then the number of processors later when executing the binary.

I'm not sure if we have a slightly modified version for the lab and I could email the people at the server to get some more detailed answers if you'd like.

I hope that answers your questions. Let me know what else I can provide.

@nkeilbart
Copy link
Contributor Author

nkeilbart commented Oct 8, 2021

Also found some documentation here:

https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=options-nnodes

Looking at the page it seems to mention this option is available when easy mode LSF job submission is enabled which seems to be what settings I have.

@sphuber
Copy link
Contributor

sphuber commented May 19, 2022

@nkeilbart I took the liberty to fix the tests and add additional ones for the added functionality. Also cleaned the code up a bit for readability, hope you don't mind. Let me know if you are still happy with this as is, and then I will accept and merge it.

Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution @nkeilbart

@sphuber sphuber merged commit c751207 into aiidateam:main May 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants