Add exclusive node scheduling policy for system-instance day 1 #833

Closed
dongahn opened this issue May 6, 2021 · 6 comments

dongahn commented May 6, 2021

flux-framework/flux-core#3143

@dongahn dongahn added this to the 2021 October release milestone Oct 19, 2021

dongahn commented Oct 20, 2021

We had a nice coffee-hour discussion, and the immediate course of action is:

Take the jobspec that flux mini batch -N k -n k emits with --dry-run and change the core count to min: 1, so that it requests k nodes with at least 1 core each. Then check whether Fluxion generates a reasonable R: k nodes, each with its node-local resources fully populated.

Based on the findings, we can discuss effective ways to enforce node exclusivity, as well as an alternative front-end command for submitting node-exclusive jobs to the system instance.
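
For illustration, the jobspec modification described above could be scripted roughly as follows. This is only a sketch: it works on a plain dictionary and does not call any Flux API, and the function name make_min_request and its parameters are hypothetical. The operator and operand values mirror the ones used in the test jobspec in this thread.

```python
import copy
import json


def make_min_request(jobspec, resource_type="core", min_count=1, max_count=None):
    """Return a copy of `jobspec` in which every `resource_type` entry
    requests "at least min_count" instead of a fixed count.

    Hypothetical helper for illustration; not part of flux-core or Fluxion.
    """
    count = {"min": min_count, "operator": "+", "operand": 1}
    if max_count is not None:
        count["max"] = max_count

    def rewrite(resources):
        # Walk the resource tree depth-first and patch matching entries.
        for res in resources:
            if res.get("type") == resource_type:
                res["count"] = dict(count)
            rewrite(res.get("with", []))

    out = copy.deepcopy(jobspec)  # leave the caller's jobspec untouched
    rewrite(out.get("resources", []))
    return out


# Resource section shaped like the one tested in this thread (k = 2).
jobspec = {
    "resources": [
        {"type": "node", "count": 2, "with": [
            {"type": "slot", "count": 1, "label": "task", "with": [
                {"type": "core", "count": 1}
            ]}
        ]}
    ]
}

print(json.dumps(make_min_request(jobspec), indent=2))
```

The node count stays fixed at k; only the core request under each node becomes open-ended.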


dongahn commented Oct 20, 2021

It looks like with min: 1, Fluxion only matches and emits one core per node.

  "resources": [
    {
      "type": "node",
      "count": 2,
      "with": [
        {
          "type": "slot",
          "count": 1,
          "with": [
            {
              "type": "core",
              "count": {
                  "min": 1,
                  "max": 100,
                  "operator": "+",
                  "operand": 1
              }
            }
          ],
          "label": "task"
        }
      ]
    }
  ]
ahn1@docker-desktop:/usr/src/NODE_X$ ../resource/utilities/resource-query -L tiny.graphml -F pretty_simple
INFO: Loading a matcher: CA
resource-query> match allocate batch.yaml
      ---tiny0[1:shared]
      ------node3[1:shared]
      ---------core17[1:exclusive]
      ------node2[1:shared]
      ---------core17[1:exclusive]
INFO: =============================
INFO: JOBID=1
INFO: RESOURCES=ALLOCATED
INFO: SCHEDULED AT=Now
INFO: =============================


dongahn commented Oct 20, 2021

Given that RFC 14 says the following, this behavior is non-compliant.

The default value for max SHALL be infinite, therefore a count which specifies only the min key SHALL be considered a request for at least that number of a resource, and the scheduler SHALL generate the R that contains the maximum number of the resource that is available and subject to the operator and operand. 
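
To make the expected behavior concrete, here is a minimal sketch of how a count could be resolved under my reading of the passage above: enumerate the values reachable from min via operator and operand, and pick the largest one that fits both max and what is available. The function names are hypothetical, and the fallback of "+" and 1 when operator or operand is omitted is an assumption of this sketch, not something quoted from the RFC.

```python
import math


def candidate_counts(count, limit):
    """Yield min, min <op> operand, ... while the value stays within
    both the requested max and `limit` (the amount available)."""
    value = count["min"]
    op = count.get("operator", "+")   # assumed fallback for this sketch
    operand = count.get("operand", 1)  # assumed fallback for this sketch
    upper = min(count.get("max", math.inf), limit)
    while value <= upper:
        yield value
        if op == "+":
            nxt = value + operand
        elif op == "*":
            nxt = value * operand
        elif op == "^":
            nxt = value ** operand
        else:
            raise ValueError(f"unsupported operator: {op!r}")
        if nxt <= value:
            break  # series no longer grows; stop instead of looping forever
        value = nxt


def resolve_count(count, available):
    """Return the largest satisfiable count, or None if even min does not fit."""
    best = None
    for value in candidate_counts(count, available):
        best = value
    return best


# The request from this thread against a node with 18 cores.
request = {"min": 1, "max": 100, "operator": "+", "operand": 1}
print(resolve_count(request, 18))
```

Under this reading, the request used in the test should resolve to all 18 cores of a node, not 1.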


dongahn commented Oct 21, 2021

Ah... there seems to be a bug in the first-match policy. With other 'global' policies such as low-ID-first and high-ID-first, Fluxion is compliant.

INFO: Loading a matcher: CA
resource-query> match allocate batch.yaml
      ---tiny0[1:shared]
      ------node1[1:shared]
      ---------core17[1:exclusive]
      ---------core16[1:exclusive]
      ---------core15[1:exclusive]
      ---------core14[1:exclusive]
      ---------core13[1:exclusive]
      ---------core12[1:exclusive]
      ---------core11[1:exclusive]
      ---------core10[1:exclusive]
      ---------core9[1:exclusive]
      ---------core8[1:exclusive]
      ---------core7[1:exclusive]
      ---------core6[1:exclusive]
      ---------core5[1:exclusive]
      ---------core4[1:exclusive]
      ---------core3[1:exclusive]
      ---------core2[1:exclusive]
      ---------core1[1:exclusive]
      ---------core0[1:exclusive]
      ------node0[1:shared]
      ---------core17[1:exclusive]
      ---------core16[1:exclusive]
      ---------core15[1:exclusive]
      ---------core14[1:exclusive]
      ---------core13[1:exclusive]
      ---------core12[1:exclusive]
      ---------core11[1:exclusive]
      ---------core10[1:exclusive]
      ---------core9[1:exclusive]
      ---------core8[1:exclusive]
      ---------core7[1:exclusive]
      ---------core6[1:exclusive]
      ---------core5[1:exclusive]
      ---------core4[1:exclusive]
      ---------core3[1:exclusive]
      ---------core2[1:exclusive]
      ---------core1[1:exclusive]
      ---------core0[1:exclusive]
INFO: =============================
INFO: JOBID=1
INFO: RESOURCES=ALLOCATED
INFO: SCHEDULED AT=Now
INFO: =============================
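
For comparing runs like the two above without counting lines by eye, a small script can tally cores per node from the pretty_simple output. This is a sketch that assumes the line shape seen in this thread (dashes, a resource name with a numeric suffix, then [size:shared|exclusive]); the function name cores_per_node is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as "      ---------core17[1:exclusive]".
LINE = re.compile(r"^\s*-+([A-Za-z_]+)(\d+)\[(\d+):(shared|exclusive)\]\s*$")


def cores_per_node(output):
    """Count the core lines listed under each node line."""
    counts = Counter()
    node = None
    for line in output.splitlines():
        match = LINE.match(line)
        if not match:
            continue  # skip INFO lines, prompts, and blanks
        rtype, rid = match.group(1), match.group(2)
        if rtype == "node":
            node = f"node{rid}"
            counts[node] = 0
        elif rtype == "core" and node is not None:
            counts[node] += 1
    return dict(counts)


# Output of the first-match run pasted from earlier in this thread.
first_match_output = """\
      ---tiny0[1:shared]
      ------node3[1:shared]
      ---------core17[1:exclusive]
      ------node2[1:shared]
      ---------core17[1:exclusive]
INFO: =============================
INFO: JOBID=1
"""

print(cores_per_node(first_match_output))
```

Feeding it the output directly above should report 18 cores for each of node1 and node0.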


dongahn commented Oct 25, 2021

PR #878 augmented our support for min/max count requests, which we plan to use for node-exclusive scheduling. Once that is merged, our next steps should be:

  1. A front-end interface that can generate a node-exclusive jobspec based on a min request.
  2. A way to enforce node exclusivity, e.g., via a simple job validation plugin.
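
As a rough idea of what the validation check in item 2 could look like, here is a standalone sketch. It does not use the flux-core validator plugin interface; it is just a function over the jobspec dictionary. The rule it applies (every core request must sit under a node and use a min-only count) is an assumption made for illustration, and node_exclusivity_errors is a hypothetical name.

```python
def node_exclusivity_errors(jobspec):
    """Return a list of human-readable problems; an empty list means the
    jobspec passes this (assumed) node-exclusivity rule."""
    errors = []

    def walk(resources, in_node, path):
        for i, res in enumerate(resources):
            rtype = res.get("type")
            here = f"{path}[{i}]({rtype})"
            if rtype == "core":
                count = res.get("count")
                min_only = (
                    isinstance(count, dict)
                    and "min" in count
                    and "max" not in count
                )
                if not in_node:
                    errors.append(f"{here}: core is not requested under a node")
                elif not min_only:
                    errors.append(f"{here}: core count must be a min-only request")
            walk(res.get("with", []), in_node or rtype == "node", here + ".with")

    walk(jobspec.get("resources", []), False, "resources")
    return errors


def example(core_count, with_node=True):
    """Build a tiny jobspec resource section for trying the check."""
    slot = {"type": "slot", "count": 1, "label": "task",
            "with": [{"type": "core", "count": core_count}]}
    if with_node:
        return {"resources": [{"type": "node", "count": 2, "with": [slot]}]}
    return {"resources": [slot]}


print(node_exclusivity_errors(example({"min": 1, "operator": "+", "operand": 1})))
print(node_exclusivity_errors(example(4)))
```

A real plugin would need to decide how to treat other node-local resource types such as GPUs; this sketch only looks at cores.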


dongahn commented Nov 10, 2021

With PR #875 and PR #878, this can be tested. I will work with @ryanday36 on an interim submit interface.

We still need flux-core flux-framework/flux-core#3944 and RFC flux-framework/rfc#302 to be merged, though.

@dongahn dongahn closed this as completed Nov 10, 2021