Match allocate_with_satisfiability success response contains empty R #679

Closed · SteVwonder opened this issue Jun 25, 2020 · 9 comments

@SteVwonder

While interactively debugging an issue with #677, @dongahn and I encountered a situation where the flux-ion-resource.py script was printing a successful allocation with an empty R value. We need to investigate further under what circumstances the response is empty.

@dongahn commented Jun 25, 2020

Thanks. Once @grondo has the Ubuntu Focal Docker image, let's see if this is still a problem under that environment and, if so, debug...

@grondo commented Jun 25, 2020

I pushed a fluxrm/flux-core:focal image by hand while I figure out what's wrong with the ubuntu 20.04 builder. Maybe give the image a quick sanity check and let me know if there are any issues.

@dongahn commented Jun 25, 2020

Awesome! Thank you SO much @grondo!

@SteVwonder

Turns out that this will happen when you use the rv1_nosched writer and then pass in a jobspec that isn't valid V1 jobspec but claims to be. The specific requests that this was breaking on asked for node->slot->socket with no core or gpu. I believe it is the lack of core and gpu that is causing the rv1 writer to break, since it expects a non-empty child set, and the only valid children in V1 are core and gpu. I tested this by adding socket as a valid child type for the rv1 writer, and the writer emitted a "modified" rv1 successfully after that.
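
For illustration, here's a minimal sketch of the kind of jobspec that triggers this (illustrative, not the exact test input): it claims version 1, but its slot bottoms out at socket, so the rv1 writer finds no core or gpu children to emit:

```yaml
version: 1            # claims V1, but a V1 slot must contain core (and optionally gpu)
resources:
  - type: node
    count: 1
    with:
      - type: slot
        count: 1
        label: default
        with:
          - type: socket    # leaf is socket; no core/gpu for the rv1 writer to emit
            count: 1
tasks:
  - command: ["hostname"]
    slot: default
    count:
      per_slot: 1
```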

This wasn't breaking in our tests with resource-query because the default writer for resource-query is `simple`, which can handle this situation just fine.

In terms of action items, here are the ones that come to mind:

  1. There are some error paths within the sched-fluxion-resource module that allow errors to pass silently and respond to the request RPC with a successful response. In particular, this happens when the writer fails to emit properly. We should make these errors loud and respond to the RPC with an error (see the sketch after this list).
    • I started working on this one and then ran into flux_respond_error: accept an errnum of 0 (flux-core#3036). So in the meantime, @grondo recommended we just pick an errno. I originally thought this wasn't due to bad input from the user, but since it turns out that it actually is, EINVAL might be a good one to use for now.
  2. After implementing the above, at least one test within t4004 will begin failing. In the short term, the easy lift will be to just update the failing tests to use a different writer that properly emits R on non-V1 jobspec.
  3. In the longer term, we need to properly version the jobspecs that we are testing with (e.g., t/data/resource/jobspecs/basics/test009.yaml is not valid V1 jobspec, but should be valid V2 jobspec)
    1. This will require some (minor?) changes in sched-fluxion-resource to accept jobspec > v1
    2. Bonus points for erroring out as soon as a V2+ jobspec is received when using a V1 writer
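
A minimal sketch of what the loud error path in item 1 could look like, using the flux-core C API (the helper name and error message here are hypothetical, not the actual sched-fluxion-resource code):

```c
#include <errno.h>
#include <flux/core.h>

/* Hypothetical helper: respond to a match request with EINVAL when the
 * writer fails to emit R, instead of sending a success response whose R
 * payload is empty.  flux_respond_error () needs a nonzero errnum (see
 * flux-core#3036), and EINVAL fits since the root cause is invalid
 * (non-V1) jobspec input from the user. */
static void respond_writer_error (flux_t *h, const flux_msg_t *msg)
{
    if (flux_respond_error (h, msg, EINVAL, "writer failed to emit R") < 0)
        flux_log_error (h, "%s: flux_respond_error", __FUNCTION__);
}
```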

Note: for the most part, I don't think this bug will affect a full production system since the job-ingest module will reject any non-V1 jobspecs masquerading as V1.

@dongahn commented Jul 8, 2020

> Turns out that this will happen when you use the rv1_nosched writer and then pass in a jobspec that isn't valid V1 jobspec but claims to be. The specific requests that this was breaking on asked for node->slot->socket with no core or gpu. I believe it is the lack of core and gpu that is causing the rv1 writer to break, since it expects a non-empty child set, and the only valid children in V1 are core and gpu. I tested this by adding socket as a valid child type for the rv1 writer, and the writer emitted a "modified" rv1 successfully after that.

This problem occurs both with hwloc v1 and v2, correct?

Yes, the R_lite portion of rv1 requires the leaf resources (core and gpu) to be selected/allocated.
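
For context, RV1 (RFC 20) records allocated leaf resources per rank in R_lite, roughly like this (values illustrative):

```json
{
  "version": 1,
  "execution": {
    "R_lite": [
      { "rank": "0", "children": { "core": "0-3", "gpu": "0" } }
    ]
  }
}
```

A socket-only allocation leaves children with nothing to record, which is what trips up the writer.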

Since jobspec will ultimately grow to be the full canonical specification, I think a future-proof solution will be to add support to the traverser such that we can emit the full subtree under an exclusive allocation. Then, rv1_nosched should work with no modification.

For example, currently, with jobspec=node->slot->socket, the allocation can be node1->socket0 since we only write "matched" resources. But we should add an option or similar to the traverser to generate node1->socket0->core[0-8].

We talked about this before (I think the topic was shadowed resource or something) but decided to kick the can down the road. Is this a good time to revisit this?

> There are some error paths within the sched-fluxion-resource module that allow errors to pass silently and respond to the request RPC with a successful response.

For #1, please see Issue #618. We do need to firm up error propagation. If you want, you can add #1 to that ticket as well.

@SteVwonder

> Since jobspec will ultimately grow to be the full canonical specification, I think a future-proof solution will be to add support to the traverser such that we can emit the full subtree under an exclusive allocation. Then, rv1_nosched should work with no modification.

Ah! That makes a lot of sense. I like the idea. Is there an open issue on that? I don't see one from a cursory search/glance. If not, maybe we split that out into a separate issue.

> Is this a good time to revisit this?

Good question. I'm not sure. Maybe something to synchronize on during the team meeting on Thursday?

> If you want, you can add #1 to that ticket as well.

Will do! I'll also open an issue about the versioning of jobspecs in the t/data directories. We can probably punt on that until we support jobspec v2 more thoroughly.

@dongahn commented Jul 8, 2020

> Ah! That makes a lot of sense. I like the idea. Is there an open issue on that? I don't see one from a cursory search/glance. If not, maybe we split that out into a separate issue.

Probably not on this particular issue. We should create a ticket on the semantics and handling for shadowed resources.

@SteVwonder

> Note: for the most part, I don't think this bug will affect a full production system since the job-ingest module will reject any non-V1 jobspecs masquerading as V1.

I was actually wrong about this. The current V1 validation is not robust enough to catch this type of invalid V1 jobspec. Opened an issue over in flux-core: flux-framework/flux-core#3039

> Probably not on this particular issue. We should create a ticket on the semantics and handling for shadowed resources.

Ok. I'll open a new issue.

@dongahn commented Jul 12, 2020

@SteVwonder: should we still keep this open with Issue #689?
