
use case: heterogeneous clusters #4143

Closed
ryanday36 opened this issue Feb 17, 2022 · 15 comments

Comments

@ryanday36 commented Feb 17, 2022

use case: We have a cluster with multiple generations of nodes that we would like to schedule using a Flux system instance. The cluster has 82 nodes with 4 AMD MI50 gpus per node and 82 nodes with 4 AMD MI60 gpus per node. Users should be able to specify whether they want their job / instance to go on nodes with mi60 gpus or mi50 gpus. Ideally, they should also be able to specify specific combinations or that they don't care which generation of gpu they get.

@grondo
Contributor

grondo commented Feb 17, 2022

Thanks @ryanday36!

There are (at least) three gaps here:

  1. Rv1 at least does not have a way to specify properties/tags/features for resources, though this is likely possible with JGF. How can we configure the system instance to most easily label the two types of GPUs?
  2. As discussed in today's meeting, the current version of jobspec also does not have a way to constrain a resource type by an arbitrary property.
  3. Finally, the front end flux mini tools, and the jobspec C and Python interfaces, do not have a way to associate a property requirement with a given resource request.

These gaps assume an approach using properties or tags associated with sets of resources. There are possibly other solutions here, including "partitioning" resources based on queues or similar (though we have no concept of partitions in Flux at this point).

One benefit of associating properties with resources is that "tagging" a resource with a given string could potentially be used to implement a kind of partition support (e.g. by tagging a set of resources with a partition name, or setting a queue or partition property). A job submitted to a given queue could then imply that resources assigned to jobs in that queue have a given tag or property value.
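As a toy sketch of that idea (all names and rank sets below are hypothetical, not Flux code): map each queue name to the ranks tagged with it, so that submitting to a queue implies the matching property.

```python
# Toy sketch of queue-as-property: tag broker ranks with a queue-name
# property, and let submission to a queue imply that property.
# Property names and rank sets are hypothetical.

properties = {
    "debug": {0, 1},               # property name -> ranks holding it
    "batch": {2, 3, 4, 5, 6, 7},
}

def ranks_for_queue(queue):
    """Ranks eligible for a job submitted to `queue`."""
    return properties.get(queue, set())

print(sorted(ranks_for_queue("debug")))  # expected: [0, 1]
```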

(Sorry if the above was not at all helpful, just thought I'd get the discussion started)

@grondo
Contributor

grondo commented Feb 24, 2022

Ok, in a series of offline chats with @garlick we've come up with some ideas for the 3 gaps above:

First, @garlick's suggestion to make this first cut as simple as possible is to introduce string "properties" at the execution target (broker rank, i.e. node) level only at this stage. This allows for the simple proposal introduced at a high level below.

  1. Rv1: An optional properties object will be introduced to Rv1, at the same level as nodelist. The properties object will be a dictionary of properties, with the corresponding value of each property being an idset of ranks that have the named property. This eases the configuration of properties while also avoiding the requirement for a new R revision, e.g.:
    $ flux R encode --ranks=0-7 -c 0-3 -g 0 -H host[0-7] --property=foo:0-1 --property=bar:7 | jq
    {
      "version": 1,
      "execution": {
        "R_lite": [
          {
            "rank": "0-7",
            "children": {
              "gpu": "0",
              "core": "0-3"
            }
          }
        ],
        "starttime": 0,
        "expiration": 0,
        "nodelist": [
          "host[0-7]"
        ],
        "properties": {
          "foo": "0-1",
          "bar": "7"
        }
      }
    }
  2. For jobspec constraints, we considered whether property constraints could be placed directly in the resources section, but due to the design of jobspec there will not always be a node resource present to which to tie the constraints. Therefore we thought it best to keep these property constraints separate in jobspec: e.g. attributes.system.constraints.properties or just attributes.system.properties could be an array of strings, each of which would be a required property of the resulting resource allocation (implied AND). We could also allow some way to specify NOT (e.g. preceding the property with a - or !).
  3. With this type of simple property constraint list, we could easily extend the C and Python Jobspec APIs to allow appending a property to the constraints, perhaps with a new flux mini option like --constraint or --requires.
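For illustration, the Rv1 properties object from point 1 could be consumed by expanding each idset into a per-rank view. The idset parser below is a toy stand-in for libidset (it handles only simple forms like "7" and "0-1"), and the data mirrors the flux R encode example above:

```python
# Expand the proposed Rv1 "properties" object ({name: idset-of-ranks})
# into a per-rank property map. parse_idset() is a toy stand-in for
# libidset, for illustration only.

def parse_idset(s):
    ranks = set()
    for part in s.split(","):
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-"))
            ranks.update(range(lo, hi + 1))
        else:
            ranks.add(int(part))
    return ranks

def properties_by_rank(properties):
    """Invert {property: idset string} into {rank: set of properties}."""
    result = {}
    for name, idset in properties.items():
        for rank in parse_idset(idset):
            result.setdefault(rank, set()).add(name)
    return result

props = properties_by_rank({"foo": "0-1", "bar": "7"})
print(props)  # expected: {0: {'foo'}, 1: {'foo'}, 7: {'bar'}}
```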

Long term, it would probably be preferable to encode resource constraints like required properties in the resources section of jobspec, but with V1 this just doesn't seem to be a possibility. For now, we could call these constraints "global" jobspec constraints, and open the door for improvements in the future that tie constraints to the exact resource spec to which they apply.
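A hypothetical jobspec fragment under this proposal might look like the following. The attribute path and the "!" NOT prefix follow the discussion above but are not settled syntax, and the property names are made up:

```python
# Hypothetical shape of "global" jobspec property constraints.
# "mi60" and "badnode" are made-up property names; "!" denotes NOT
# per the proposal above, with implied AND across the list.

jobspec_attributes = {
    "system": {
        "constraints": {
            "properties": ["mi60", "!badnode"],
        }
    }
}

required = jobspec_attributes["system"]["constraints"]["properties"]
print(required)  # expected: ['mi60', '!badnode']
```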

Another point: if we get this working, we could give all nodes an implicit property of their hostname. Then with support for NOT, we could get a way to exclude hosts by name (#2413).
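A quick illustration of that hostname idea, assuming the "!" NOT prefix suggested earlier (the hostnames and the matcher are made up for illustration):

```python
# Give each rank an implicit property equal to its hostname, so a NOT
# constraint can exclude hosts by name. Hostnames are hypothetical and
# the "!" prefix follows the earlier proposal.

hostlist = ["node0", "node1", "node2"]
node_props = {rank: {host} for rank, host in enumerate(hostlist)}

def matches(props, constraints):
    """True if `props` satisfies every constraint (implied AND)."""
    for c in constraints:
        if c.startswith("!"):
            if c[1:] in props:
                return False
        elif c not in props:
            return False
    return True

eligible = [r for r, p in node_props.items() if matches(p, ["!node1"])]
print(eligible)  # expected: [0, 2]
```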

@grondo
Contributor

grondo commented Feb 25, 2022

@dongahn, if you have a chance, could you briefly comment if the above strategy will work for Fluxion property matching? If not, happy to quickly iterate on a simple solution that will work for us near term.

@dongahn
Member

dongahn commented Feb 25, 2022

@grondo: this is a good start! A few things to consider that can potentially make our short-term effort a bit more future-proof.

  1. --property=foo:0-1 --property=bar:7: I think it would be best to use this feature to start exploring multi-queue support within Fluxion, beyond our current support level implemented within qmanager. So it would be useful to be able to specify distinct properties in an overlapping fashion across different nodes (e.g., --property=foo:0-7 --property=bar:7 should be supported). I assume the intent is to allow such overlap?

  2. We will need to augment flux ion-R to translate this into JGF. As part of that, I am wondering if we can add the properties not only at the node level but also at the node-local resource level, like gpu. What this buys us is something concrete to test when spec'ing out our canonical jobspec RFC with true property support. It seems like this can be done if we have an optional "node-local" resource type, as in --property=foo@gpu:0-1?

  3. Could --constraint take the augmented form like foo@gpu, if the scheduler supports it?

  4. Having said all that, I'd like to have early access to your flux R encode prototype to scope this effort a bit better.

@grondo
Contributor

grondo commented Feb 25, 2022

e.g., --property=foo:0-7 --property=bar:7 supported). I assume the intent is to allow for such overlapping?

Yes, if I understand the question correctly: in this proposal an execution target can have an arbitrary number of simple string properties.

It seems like this can be done if have an optional "node-local" resource type in --property=foo@gpu:0-1?

At the flux-core and Rv1 level, we would treat foo@gpu as a string property that happens to have a @ in it, but the property would be applied to the execution targets 0,1. Is that what you intend? If instead you mean that the specification should apply the property foo to child resources with type gpu and ids 0,1, then we can't easily support that in flux-core & Rv1 at this time. However, maybe it would work if the property name encoded the gpu ids in it? e.g. --property=foo@gpu[0-1]:0-1 would set a property foo@gpu[0-1] on execution targets 0,1. Fluxion could then further split this property name and apply foo to gpu 0,1?
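A sketch of the convention floated here: split a property name like foo@gpu[0-1] into a base name, a child resource type, and the child ids, leaving plain names untouched. This is purely illustrative, not Fluxion code:

```python
import re

# Split an encoded property name such as "foo@gpu[0-1]" into
# (base name, child resource type, child id string). Plain property
# names pass through unchanged. Illustrative only.

PROP_RE = re.compile(r"^(?P<name>[^@]+)@(?P<type>[a-z]+)\[(?P<ids>[\d,-]+)\]$")

def split_property(prop):
    m = PROP_RE.match(prop)
    if m is None:
        return prop, None, None        # plain property, no child scope
    return m.group("name"), m.group("type"), m.group("ids")

print(split_property("foo@gpu[0-1]"))  # expected: ('foo', 'gpu', '0-1')
print(split_property("bar"))           # expected: ('bar', None, None)
```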

Users could then use the constraint foo@gpu as you suggest -- though this would not work in flux-core only (no execution target has property foo@gpu) it could work for Fluxion since it could look for constraints with @ and apply them as needed.

--constraint can take the augmented form like foo@gpu? If the scheduler can support it?

Yes, a constraint could be any string, which opens the door to supporting more advanced logic down the road (e.g. I'm thinking that a ^ prefix could be used to support NOT).

Having said all that, I'd like to have early access to your flux R encode prototype to scope this effort a bit better.

Ok, you can check out my rlist-properties branch. Note that only flux R encode and flux R decode are fully supported, some of the other rlist set methods may not work properly (such as flux R intersect).

@dongahn
Member

dongahn commented Feb 25, 2022

At the flux-core and Rv1 level, we would treat foo@gpu as a string property that happens to have a @ in it, but the property would be applied to the execution targets 0,1. Is that what you intend?

Yes. 0-1 are the execution target ranks. With that @gpu suffix, I want to see if flux ion-R can add this property to the GPU resource type under ranks 0-1 for Fluxion-level tests.

--property=foo@gpu[0-1]:0-1

This would be even better! I don't see us having to support multiple different types of GPUs on a node anytime soon, though. Now that I think about it, in general we can treat the suffix string that comes after @ as scheduler-specific.

Users could then use the constraint foo@gpu as you suggest -- though this would not work in flux-core only (no execution target has property foo@gpu) it could work for Fluxion since it could look for constraints with @ and apply them as needed.

Exactly!

@dongahn
Member

dongahn commented Mar 5, 2022

@grondo: I just played with your branch to see if I can easily propagate the rank-level properties to the resource types that come after @. Just wanted to leave a comment that this format is pretty easy to deal with. I still need to work on Fluxion proper for property-based matching, but so far so good. We will ultimately want to document the format (e.g., reserve the use of the @ symbol for scheduler-specific use, etc.), though.

@dongahn
Member

dongahn commented Mar 5, 2022

WIP is here

@garlick
Member

garlick commented Mar 7, 2022

@ryanday36 mentioned in flux-framework/flux-accounting#207:

This is something that we should probably track elsewhere, but we are going to want flux jobs to have options to filter jobs by queue once we start using multiple queues.

I wanted to be sure to cross-reference that comment here, since up at the top of this issue @grondo suggested:

A job submitted to a given queue could imply that resources assigned to jobs in that queue would have a given tag or property value.

I don't think we've quite gotten to the tooling detail yet, but it would seem to make sense to give flux jobs the ability to list jobs by property.

@grondo
Contributor

grondo commented Mar 8, 2022

but it would seem to make sense to give flux jobs the ability to list jobs by property.

That might be tricky given the proposal for job constraints in flux-framework/rfc#314.

Also, if the jobspec contains a request for a "queue" in some scheduler specific space in the jobspec, the jobspec might only encode the queue name itself not the property constraints associated with that queue, which is only known by the scheduler. I think there was some plan for a conformant scheduler to annotate jobs with the queue name, since by default no jobs would have this encoded in jobspec for query by the job-list module. Maybe the scheduler could do something similar for properties? (sorry, that is getting quite off topic for this issue)

@dongahn
Member

dongahn commented Mar 16, 2022

I think there was some plan for a conformant scheduler to annotate jobs with the queue name, since by default no jobs would have this encoded in jobspec for query by the job-list module.

This has already been implemented in fluxion:
https://github.com/flux-framework/flux-sched/blob/master/qmanager/modules/qmanager_callbacks.cpp#L110

I think the plan was to have a way to sort the job listing by the scheduler-annotated string queue names. Other properties can be piggybacked.

@grondo
Contributor

grondo commented Mar 16, 2022

Ok, it does appear that sched.queue is part of the RFC 27 Alloc annotate definition. Therefore, in flux jobs or job-list we can add an option to specifically filter jobs by this annotation key.

However, any requested queue is currently encoded in jobspec under the opaque, scheduler-specific attributes.system.scheduler. namespace, and therefore in order to request a specific queue, users have to do something like

$ flux mini batch --setattr=system.scheduler.queue=NAME ...

which seems user-unfriendly.

Additionally, all jobs enqueued which have not yet had an alloc request sent to the scheduler will, by definition, not have any scheduler annotations, and thus the queue will be unknown, and flux jobs will not be able to filter by queue for these jobs.

Because of these two issues, perhaps we should elevate the requested queue to a defined RFC 14 and/or RFC 25 jobspec property, so that our core utilities can be made aware of the requested queue. Schedulers which do not support a queue (like sched-simple) would reject the job during feasibility checks if it was submitted with a queue name other than a defined default (which I propose be called "default"). Utilities that process jobspec, such as job-list and flux-jobs could then directly support the queue name in a straightforward manner, and the queue would be available before annotations are made to the job.
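To make the lookup problem concrete, here is a sketch of how a listing tool might resolve a job's queue under this proposal. The RFC 27 sched.queue annotation is real per the discussion above; the jobspec attribute path and "default" fallback follow the proposal here, not a final RFC:

```python
# Resolve a job's queue: prefer the scheduler's RFC 27 annotation
# (sched.queue), then fall back to a proposed jobspec attribute
# (attributes.system.queue), then to "default". The fallback path is
# the proposal under discussion, not a settled RFC.

def job_queue(jobspec, annotations):
    q = annotations.get("sched", {}).get("queue")
    if q is not None:
        return q
    return (jobspec.get("attributes", {})
                   .get("system", {})
                   .get("queue", "default"))

print(job_queue({}, {"sched": {"queue": "debug"}}))                   # expected: debug
print(job_queue({"attributes": {"system": {"queue": "batch"}}}, {}))  # expected: batch
print(job_queue({}, {}))                                              # expected: default
```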

@dongahn
Member

dongahn commented Mar 16, 2022

A correction on my part: it is actually flux mini submit -n 1 --setattr system.queue=debug that Fluxion has been using. system.scheduler.queue is a key in the RFC 20 resource set.

Regardless, formalizing queue in those RFCs makes sense to me.

Fluxion annotates (or can annotate, if not already) each queue-unspecified job with the queue name where it is scheduled. It is already default if no multi-queue has been configured, or the first queue name that appears in the config.

I like what you are proposing. We should formalize queue so that conforming schedulers can annotate a default queue name, or the flux-core front end can add one if the scheduler doesn't do it.

@garlick
Member

garlick commented Apr 5, 2022

Closing this issue with the landing of generic resource properties in flux-core 0.38.0 and the RFCs.

Do we need to open new issue(s) on flux-core tooling for multiple queues?

@garlick garlick closed this as completed Apr 5, 2022
@dongahn
Member

dongahn commented Apr 6, 2022

I believe so. It's been long enough that I can't remember if there was already an existing ticket, but it's good to have a new ticket to keep track of things.
