fix: add guard against NodeStatus #22

isubasinghe · 2023-05-20T05:14:34Z

Fixes #TODO

Please do not open a pull request until you have checked ALL of these:

Create the PR as draft.
Run make pre-commit -B to fix codegen and lint problems.
Sign-off your commits (otherwise the DCO check will fail).
Use a conventional commit message (otherwise the commit message check will fail).
"Fixes #" is in both the PR title (for release notes) and this description (to automatically link and close the issue).
Add unit or e2e tests. Say how you tested your changes. If you changed the UI, attach screenshots.
GitHub checks are green.
Once required tests have passed, mark your PR "Ready for review".

If changes were requested, and you've made them, dismiss the review to get it reviewed again.

Signed-off-by: isubasinghe <[email protected]>

isubasinghe · 2023-05-25T05:24:09Z

POST MORTEM

Nodes is defined as a map[string]NodeStatus. Because looking up a missing key in such a map returns a zero-value NodeStatus rather than an error, this resulted in the bug our customer encountered, where container sets caused random failures. This confirms my hunch (stated later): it is not the implementation of container sets that is causing issues, but unanticipated interactions between various components.
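As a minimal sketch of the failure mode described above: the NodeStatus type here is a simplified stand-in with two assumed fields, and lookup is a hypothetical helper, not code from the Workflows codebase. It shows that indexing a map[string]NodeStatus with a missing key yields a struct that looks the same as a legitimately empty one.

```go
package main

import "fmt"

// NodeStatus is a simplified stand-in for the real type (assumption:
// only these two fields are modelled for illustration).
type NodeStatus struct {
	ID         string
	BoundaryID string
}

// lookup indexes the map directly, so a missing key silently yields NodeStatus{}.
func lookup(nodes map[string]NodeStatus, id string) NodeStatus {
	return nodes[id]
}

func main() {
	nodes := map[string]NodeStatus{
		"a": {ID: "a", BoundaryID: ""}, // empty boundary set on purpose
	}

	present := lookup(nodes, "a")
	missing := lookup(nodes, "does-not-exist")

	fmt.Printf("present: %+v\n", present)
	fmt.Printf("missing: %+v\n", missing)

	// Both BoundaryID values are "", so a caller inspecting only that
	// field cannot tell a real node from a node that never existed.
	fmt.Println(present.BoundaryID == missing.BoundaryID)
}
```

No error or panic is raised for the missing key, which is why the problem can surface far from the lookup itself.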

Unfortunately this brings me to the point I had made to @JPZ13 earlier: making changes to Argo Workflows requires a near 1-to-1 mapping of the Workflows codebase into memory. At the current size of the codebase, that is an impossible task for most people, and almost certainly for everyone involved in the project.

To add an example: I encountered a case where one bug prevented another bug from surfacing in the first place, that is, they cancelled each other out. If you look here, there was no mention of what to do if the node was not present in the hashmap, but the invalid access resulted in an empty struct, which meant the boolean variables here still got changed correctly.

To me this is fairly terrifying, because it is impossible to distinguish between a bug and intentional behaviour without knowing what the original author intended. I think the code should be as obvious as possible: when an explicit check is made against a map, the author also asserts the intended behaviour explicitly.

To generalise the above comments, the core problem is that the codebase isn't as modular as it should be: different sections of the codebase assume behaviour from other sections. This is what makes the codebase hard to debug, hard to add features to and hard to change. Ultimately I believe the container set issues were only a symptom of the state of the codebase; the core issues lie within Argo Workflows itself. Putting time and effort into simplifying Workflows as much as possible will prevent future "whack-a-mole" situations.

What slowed me down

I ended up on something of a wild goose chase: I assumed the error originated from boundaryID being "". That is actually an intended value for some nodes, but because Go has zero values, it is impossible to distinguish between:

"Is this accidental, because the Nodes map returned a zero-value struct?"
"Is this accidental, because somewhere boundaryID was set to an empty string?"
"Is this intentional, because for some nodes the boundaryID can be an empty string?"

In reality, the precise type of boundaryID is Option[string], not string. This forms the basis of my argument against relying on zero values: it is impossible to tell what they really mean.
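Go has no built-in Option type, so the following is only a sketch of what Option[string] could look like using generics (Go 1.18 or later). The names Option, Some, None and describe are all hypothetical and do not come from the Workflows codebase.

```go
package main

import "fmt"

// Option is a hypothetical generic wrapper; it is not part of the
// Go standard library.
type Option[T any] struct {
	value T
	set   bool
}

// Some wraps a value that was deliberately provided, even if it is "".
func Some[T any](v T) Option[T] {
	return Option[T]{value: v, set: true}
}

// None represents the absence of a value.
func None[T any]() Option[T] {
	return Option[T]{}
}

// Get returns the wrapped value and whether one was set.
func (o Option[T]) Get() (T, bool) {
	return o.value, o.set
}

// describe separates the cases that a plain string collapses into "".
func describe(boundaryID Option[string]) string {
	id, ok := boundaryID.Get()
	switch {
	case !ok:
		return "no boundary recorded"
	case id == "":
		return "boundary explicitly empty"
	default:
		return "boundary " + id
	}
}

func main() {
	fmt.Println(describe(None[string]()))
	fmt.Println(describe(Some("")))
	fmt.Println(describe(Some("steps-1")))
}
```

The zero value of this Option is still valid, but it now has exactly one meaning: no value was set. A *string field is the more conventional Go way to get a similar distinction.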

Prevention of this particular type of issue

Use single Get/Set functions, so that issues relating to node access can be traced by debugging one function.
Avoid relying on zero values completely. This is somewhat anti-Go, but frankly I think zero values are a flaw in the language design. Relying on them doesn't scale well (with respect to features, code and number of engineers, not performance).
Prefer panicking over zero values. A panic is an easy bug to fix; chasing down code that other people wrote, which may or may not be the reason your workflow fails, really is not.
Write a simple static analysis tool that detects accesses to maps that have a non-pointer value type. Essentially, ensure that all maps are of the form map[A]*B, not map[A]B. A missing key then yields a nil pointer, and dereferencing it panics; this should be the preferred outcome.
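The static analysis idea in the last item can be sketched with the standard go/parser and go/ast packages. This is a rough, purely syntactic approximation: it flags map type expressions whose value type is not written as a pointer, rather than tracking individual map accesses, and it does not resolve named types or aliases. It will therefore also flag harmless maps such as map[string]int. The helper name findValueMaps and the sample source are hypothetical.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// findValueMaps parses src and returns the position of every map type
// whose value type is not written as a pointer (map[A]B, not map[A]*B).
func findValueMaps(filename, src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, src, 0)
	if err != nil {
		return nil, err
	}

	var found []string
	ast.Inspect(file, func(n ast.Node) bool {
		m, ok := n.(*ast.MapType)
		if !ok {
			return true // keep walking into child nodes
		}
		if _, isPtr := m.Value.(*ast.StarExpr); !isPtr {
			found = append(found, fset.Position(m.Pos()).String())
		}
		return true
	})
	return found, nil
}

// sample is illustrative input, not code from the Workflows repository.
const sample = `package sample

type NodeStatus struct{ ID string }

type byValue map[string]NodeStatus
type byPointer map[string]*NodeStatus
`

func main() {
	hits, err := findValueMaps("sample.go", sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, h := range hits {
		fmt.Println("map with non-pointer value type at", h)
	}
}
```

A production check would more likely use type information (for example via go/types) so that it could inspect the actual index expressions and skip value types where a zero value is harmless.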

Joibel

I haven't commented on every use of logrus, but they should all change.

server/artifacts/artifact_server.go

pkg/apis/workflow/v1alpha1/workflow_types.go

util/resource/updater.go

workflow/controller/pod_cleanup.go

workflow/controller/steps.go

workflow/controller/taskresult.go

isubasinghe added 7 commits May 20, 2023 15:13

fix: add guard against NodeStatus

1961283

Signed-off-by: isubasinghe <[email protected]>

fix: get rid of return for Set, remove logs

086dca0

Signed-off-by: isubasinghe <[email protected]>

fix: has returns correct value

690b4dc

Signed-off-by: isubasinghe <[email protected]>

fix: remove debug logs

8d92a3a

Signed-off-by: isubasinghe <[email protected]>

fix: ensure tests pass

38d6f8a

Signed-off-by: isubasinghe <[email protected]>

fix: restore back to /bin/bash

70264d3

Signed-off-by: isubasinghe <[email protected]>

fix: remove logging

3a574c5

Signed-off-by: isubasinghe <[email protected]>

isubasinghe marked this pull request as ready for review May 25, 2023 05:01

JPZ13 requested a review from Joibel May 25, 2023 17:30

Joibel requested changes May 25, 2023

isubasinghe added 2 commits May 26, 2023 10:13

fix: remove logrus and replace with log

840a924

Signed-off-by: isubasinghe <[email protected]>

fix: replace panic with errors

bb8596b

Signed-off-by: isubasinghe <[email protected]>

isubasinghe requested a review from Joibel May 26, 2023 00:59

JPZ13 reviewed May 26, 2023

pkg/apis/workflow/v1alpha1/workflow_types.go

JPZ13 reviewed May 26, 2023

pkg/apis/workflow/v1alpha1/workflow_types.go (outdated)

fix: add comments and use Get for helper fn

cbdd157

Signed-off-by: isubasinghe <[email protected]>

Joibel approved these changes May 28, 2023

JPZ13 merged commit 310f3e1 into pipekit:master Jun 1, 2023

JPZ13 mentioned this pull request Jul 4, 2023

cherry-pick: Reconcile master #28

Open
