Match allocate_with_satisfiability success response contains empty R #679

Closed · SteVwonder opened this issue Jun 25, 2020 · 9 comments

@SteVwonder

While interactively debugging an issue with #677, @dongahn and I encountered a situation where the flux-ion-resource.py script was printing a successful allocation with an empty R value. We need to investigate further under what circumstances the response is empty.

@dongahn commented Jun 25, 2020

Thanks. Once @grondo has the Ubuntu Focal Docker image, let's see if this is still a problem under that environment and, if so, debug...

@grondo commented Jun 25, 2020

I pushed a fluxrm/flux-core:focal image by hand while I figure out what's wrong with the ubuntu 20.04 builder. Maybe give the image a quick sanity check and let me know if there are any issues.

@dongahn commented Jun 25, 2020

Awesome! Thank you SO much @grondo!

@SteVwonder

Turns out that this will happen when you use the rv1_nosched writer and then pass in a jobspec that isn't valid V1 jobspec but claims to be. The specific requests that this was breaking on asked for node->slot->socket with no core or gpu. I believe it is the lack of core and gpu that is causing the rv1 writer to break, since it expects a non-empty child set, and the only valid children in V1 are core and gpu. I tested this by adding socket as a valid child type for the rv1 writer, and the writer emitted a "modified" rv1 successfully after that.
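
For illustration, here's a minimal sketch of the kind of jobspec that triggers this (illustrative, not the exact test input): it claims version 1, but its slot bottoms out at socket, so the rv1 writer finds no core or gpu children to emit:

```yaml
version: 1            # claims V1, but a V1 slot must contain core (and optionally gpu)
resources:
  - type: node
    count: 1
    with:
      - type: slot
        count: 1
        label: default
        with:
          - type: socket    # leaf is socket; no core/gpu for the rv1 writer to emit
            count: 1
tasks:
  - command: ["hostname"]
    slot: default
    count:
      per_slot: 1
```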

This wasn't breaking in our tests with resource-query because the default writer for resource-query is `simple`, which can handle this situation just fine.

In terms of action items, here are the ones that come to mind:

  1. There are some error paths within the sched-fluxion-resource module that allow errors to pass silently and respond to the request RPC with a successful response. In particular, this happens when the writer fails to emit properly. We should make these errors loud and respond to the RPC with an error (see the sketch after this list).
    • I started working on this one and then ran into flux_respond_error: accept an errnum of 0 (flux-core#3036). So in the meantime, @grondo recommended we just pick an errno. I originally thought this wasn't due to bad input from the user, but since it turns out that it actually is, EINVAL might be a good one to use for now.
  2. After implementing the above, at least one test within t4004 will begin failing. In the short term, the easy lift will be to just update the failing tests to use a different writer that properly emits R on non-V1 jobspec.
  3. In the longer term, we need to properly version the jobspecs that we are testing with (e.g., t/data/resource/jobspecs/basics/test009.yaml is not valid V1 jobspec, but should be valid V2 jobspec)
    1. This will require some (minor?) changes in sched-fluxion-resource to accept jobspec > v1
    2. Bonus points for erroring out as soon as a V2+ jobspec is received when using a V1 writer
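
A minimal sketch of what the loud error path in item 1 could look like, using the flux-core C API (the helper name and error message here are hypothetical, not the actual sched-fluxion-resource code):

```c
#include <errno.h>
#include <flux/core.h>

/* Hypothetical helper: respond to a match request with EINVAL when the
 * writer fails to emit R, instead of sending a success response whose R
 * payload is empty.  flux_respond_error () needs a nonzero errnum (see
 * flux-core#3036), and EINVAL fits since the root cause is invalid
 * (non-V1) jobspec input from the user. */
static void respond_writer_error (flux_t *h, const flux_msg_t *msg)
{
    if (flux_respond_error (h, msg, EINVAL, "writer failed to emit R") < 0)
        flux_log_error (h, "%s: flux_respond_error", __FUNCTION__);
}
```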

Note: for the most part, I don't think this bug will affect a full production system since the job-ingest module will reject any non-V1 jobspecs masquerading as V1.

@dongahn commented Jul 8, 2020

> Turns out that this will happen when you use the rv1_nosched writer and then pass in a jobspec that isn't valid V1 jobspec but claims to be. The specific requests that this was breaking on asked for node->slot->socket with no core or gpu. I believe it is the lack of core and gpu that is causing the rv1 writer to break, since it expects a non-empty child set, and the only valid children in V1 are core and gpu. I tested this by adding socket as a valid child type for the rv1 writer, and the writer emitted a "modified" rv1 successfully after that.

This problem occurs both with hwloc v1 and v2, correct?

Yes, the R_lite portion of rv1 requires the leaf resources (core and gpu) to be selected/allocated.
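
For context, RV1 (RFC 20) records allocated leaf resources per rank in R_lite, roughly like this (values illustrative):

```json
{
  "version": 1,
  "execution": {
    "R_lite": [
      { "rank": "0", "children": { "core": "0-3", "gpu": "0" } }
    ]
  }
}
```

A socket-only allocation leaves children with nothing to record, which is what trips up the writer.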

Since jobspec will ultimately grow to be the full canonical specification, I think a future-proof solution will be to add support to the traverser such that we can emit the full subtree under an exclusive allocation. Then, rv1_nosched should work with no modification.

For example, currently, with jobspec=node->slot->socket, the allocation can be node1->socket0 since we only write "matched" resources. But we should add an option or similar to the traverser to generate node1->socket0->core[0-8].

We talked about this before (I think the topic was shadowed resource or something) but decided to kick the can down the road. Is this a good time to revisit this?

> There are some error paths within the sched-fluxion-resource module that allow errors to pass silently and respond to the request RPC with a successful response.

For #1, please see Issue #618. We do need to firm up error propagation. If you want, you can add #1 to that ticket as well.

@SteVwonder

> Since jobspec will ultimately grow to be the full canonical specification, I think a future-proof solution will be to add support to the traverser such that we can emit the full subtree under an exclusive allocation. Then, rv1_nosched should work with no modification.

Ah! That makes a lot of sense. I like the idea. Is there an open issue on that? I don't see one from a cursory search/glance. If not, maybe we split that out into a separate issue.

> Is this a good time to revisit this?

Good question. I'm not sure. Maybe something to synchronize on during the team meeting on Thursday?

> If you want, you can add #1 to that ticket as well.

Will do! I'll also open an issue about the versioning of jobspecs in the t/data directories. We can probably punt on that until we support jobspec v2 more thoroughly.

@dongahn commented Jul 8, 2020

> Ah! That makes a lot of sense. I like the idea. Is there an open issue on that? I don't see one from a cursory search/glance. If not, maybe we split that out into a separate issue.

Probably not on this particular issue. We should create a ticket on the semantics and handling for shadowed resources.

@SteVwonder

> Note: for the most part, I don't think this bug will affect a full production system since the job-ingest module will reject any non-V1 jobspecs masquerading as V1.

I was actually wrong about this. The current V1 validation is not robust enough to catch this type of invalid V1 jobspec. Opened an issue over in flux-core: flux-framework/flux-core#3039

> Probably not on this particular issue. We should create a ticket on the semantics and handling for shadowed resources.

Ok. I'll open a new issue.

@dongahn commented Jul 12, 2020

@SteVwonder: should we still keep this open with Issue #689?
