flux-mini run: full featured version for wreck parity #2150
Thanks @garlick for opening this issue. I will soon be talking to some of the key SNL users on Sierra to collect their requirements.
Starting to get some feedback from SNL users: Anthony Agelastos at SNL is working directly with the SPARC (SNL ATDM code) team, helping them run their simulations on Sierra as part of their ATCC-7 campaign. He is interested in running SPARC under Flux to evaluate Flux as a means of testing multiple invocations within a single allocation.
The point of contact for SNL's SPARTA code is Stan Moore. On Trinity, his code uses the following srun options:
Stan also said that initially he had issues with unexpected affinity behavior of
Ross Bartlett for SNL's SPARC code:
It seems his use case can directly benefit from the current capability even without
Also from Ross:
I still need to figure out whether getting a node allocation with LSF first and then
Rich Drake for SNL's SIERRA code:
It seems MPMD support is a gap... any idea how easy or difficult it would be to match this srun option?
More from Rich:
@garlick and I just chatted briefly about that. We could probably mimic the exact behavior with a wreck plugin, but we could also leverage nested instances to achieve the co-scheduling. In the nested-instance case, we (or the user) would just need to make sure that the total amount of work submitted to the sub-instance does not exceed what the resources allocated to that instance can handle (i.e., ntasks cannot exceed ncores). We could probably wrap that logic up in a
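As a rough illustration of the nested-instance idea (a minimal sketch only: the `flux mini` command spellings, options, and executable names below are assumptions and may not match a particular flux-core release), the constraint is simply that the work placed in the sub-instance fits within the cores granted to it:

```sh
# Sketch: co-schedule two programs inside a nested Flux instance.
# The parent allocation grants the sub-instance 4 cores; the work submitted
# into it (2 + 2 tasks) must not exceed those 4 cores.
flux mini alloc -N1 -n4 /bin/sh -c '
  # Run both programs concurrently inside the nested instance; each
  # "flux mini run" blocks until its job completes, so backgrounding
  # them and waiting gives simple co-scheduling.
  flux mini run -n2 ./solver &      # hypothetical executable
  flux mini run -n2 ./analysis &    # hypothetical executable
  wait                              # wait for both co-scheduled jobs
'
```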
Yeah, it seems like Rich is suggesting that he could make use of a general co-scheduling capability. For a one-off option like this, it would be wise to invite users like Rich to help firm up our solution as a co-design effort. It looks like we will have to divide and conquer across the different SNL teams a bit for effective communication going forward.
We are accumulating great info here. It is somewhat difficult to decide which options to support; the goal is to provide a stable porting target, not an srun clone. My suggestion is to start with the options that are supported in
Possibly this will help us identify some missing plumbing in master for synchronization, I/O, etc., which would be good short-term work items.
Meeting discussion:
We should move any requirements gathered here into #2379, then close.
Sorry for the noise, didn't mean to close (stray mouse click) |
Opened a couple of issues to track outstanding items, and closing this one. |
Following up on a discussion with @dongahn:
It may be useful to add a `flux srun` command to both flux 0.11 (wreck) and master (new exec system) that superficially mimics SLURM's `srun`. We could keep it stable over the 0.11 to 0.12 transition, while the "real" porcelain `flux run` is being developed, so we can roll out 0.12 on corona and sierra without breaking the world. It may also be useful for users who have hardwired srun commands in test suites etc., easing a transition to Flux, which can run in more environments (thus making their test suites more portable).
The caveat is that srun has a ton of options, and building a full replica is not viable; nor are the returns likely worth going beyond the simplest, most commonly used options and behaviors.
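As a rough sketch of the kind of thin shim being proposed (assuming only the simplest options; the srun-to-flux option mapping, the `flux mini run` spelling, and the script name are illustrative, not a committed interface):

```sh
#!/bin/sh
# Hypothetical "flux-srun" shim: translate a handful of common srun options
# into a flux run invocation, and refuse anything it does not understand.
args=""
while [ $# -gt 0 ]; do
  case "$1" in
    -n|--ntasks)        args="$args -n $2"; shift 2 ;;   # task count
    -N|--nodes)         args="$args -N $2"; shift 2 ;;   # node count
    -c|--cpus-per-task) args="$args -c $2"; shift 2 ;;   # cores per task (assumed flux spelling)
    --)                 shift; break ;;
    -*)                 echo "flux-srun: unsupported srun option: $1" >&2; exit 2 ;;
    *)                  break ;;
  esac
done
# Everything remaining is the executable and its arguments.
exec flux mini run $args "$@"
```

Anything outside the supported subset fails loudly rather than silently misbehaving, which keeps the wrapper a stable porting target instead of a partial srun clone.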
Let's collect some design requirements for this thing here.