Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: add guard against NodeStatus #22

Merged
merged 10 commits into from
Jun 1, 2023
Merged

Conversation

isubasinghe
Copy link
Collaborator

Fixes #TODO

Please do not open a pull request until you have checked ALL of these:

  • Create the PR as draft .
  • Run make pre-commit -B to fix codegen and lint problems.
  • Sign-off your commits (otherwise the DCO check will fail).
  • Use a conventional commit message (otherwise the commit message check will fail).
  • "Fixes #" is in both the PR title (for release notes) and this description (to automatically link and close the issue).
  • Add unit or e2e tests. Say how you tested your changes. If you changed the UI, attach screenshots.
  • Github checks are green.
  • Once required tests have passed, mark your PR "Ready for review".

If changes were requested, and you've made them, dismiss the review to get it reviewed again.

@isubasinghe isubasinghe marked this pull request as ready for review May 25, 2023 05:01
@isubasinghe
Copy link
Collaborator Author

isubasinghe commented May 25, 2023

POST MORTEM

Nodes are defined as a map[string]NodeStatus. This combined with zero values returns resulted in the bug our customer encountered of container sets causing random failure. This confirms my (later stated) hunch, it's actually not the implementation of container sets that is causing issues but interactions of various components that weren't anticipated.

Unfortunately this brings to the point I had made to @JPZ13 earlier, making changes to Argo Workflows does requires a near 1 to 1 mapping of the Workflows codebase into memory, at the size of the codebase right now, that is an impossible task for most (but almost certainly everyone involved in the project).

To add to this as an example, I encountered a case where a bug prevented a bug from going through in the first place, that is they cancelled each other out, if you look at here, there was no mention of what to do if node was not present in the hashmap, but an invalid access resulted in an empty struct, this meant the boolean variables here get changed correctly.

This is to me fairly terrifying because it is impossible to distinguish between bug and intentional behaviour without knowing what the original author intended. I think the code should be as obvious as possible, when an explicit check is made against a map, the author also asserts the intended behaviour explicitly.

To generalise on the above comments, the core problem is that the codebase isn't as modular as it really should be, different sections of the codebase assume behaviour from other sections of the codebase. This is what makes the codebase hard to debug, hard to add features to and hard to change, ultimately I believe the container set issues were only just a symptom of the state of the codebase, the core issues lie within Argo Workflows itself. Putting time and effort into simplifying Workflows as much as possible will prevent future "whack-a-mole" type situations.

What slowed me down

I ended up chasing a wild goose of sorts, I assumed the error originated from boundaryID being "". This is actually an intended value for some nodes, but given zero values are a thing in Go, it is impossible to distinguish between:

  • "is this accidental because the Nodes map returned a zero value struct?"
  • "is this accidental because somewhere boundaryID was set to an empty string?",
  • "is this intentional because for some nodes, the boundaryID can be an empty string?".

In reality the actual strongly typed type of boundaryID is really Option[string] not string. This forms the basis of my argument against using zero values, it is impossible to distinguish what they really mean.

Prevention of this particular type of issue

  1. Single Get/Set functions so that tracing issues relating to them can be done via debugging one function.
  2. Avoid relying on zero values completely. This is somewhat anti-go but frankly I think this is a flaw in the language design. Relying on zero values doesn't scale well (with respect to features/code/(number of engineers) not performance).
  3. Prefer panicking over zero values, a panic is an easy bug to fix, chasing down code that other people wrote that may or may not be the reason your workflow fails is really not.
  4. Write a simple static analysis tool that detects accessing of map that have a non pointer return type. Essentially ensure that all maps are of the form map[A]*B not map[A]B. An access on a nil pointer is a panic, this should be the preferred option.

@JPZ13 JPZ13 requested a review from Joibel May 25, 2023 17:30
Copy link
Collaborator

@Joibel Joibel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't commented on every use of logrus, but they should all change.

server/artifacts/artifact_server.go Outdated Show resolved Hide resolved
pkg/apis/workflow/v1alpha1/workflow_types.go Outdated Show resolved Hide resolved
util/resource/updater.go Outdated Show resolved Hide resolved
workflow/controller/pod_cleanup.go Outdated Show resolved Hide resolved
workflow/controller/steps.go Outdated Show resolved Hide resolved
workflow/controller/taskresult.go Outdated Show resolved Hide resolved
@isubasinghe isubasinghe requested a review from Joibel May 26, 2023 00:59
@JPZ13 JPZ13 merged commit 310f3e1 into pipekit:master Jun 1, 2023
JPZ13 pushed a commit that referenced this pull request Jun 5, 2023
Fixes #TODO

Please do not open a pull request until you have checked ALL of these:

* [ ] Create the PR as draft .
* [ ] Run `make pre-commit -B` to fix codegen and lint problems.
* [ ] Sign-off your commits (otherwise the DCO check will fail).
* [ ] Use [a conventional commit
message](https://www.conventionalcommits.org/en/v1.0.0/) (otherwise the
commit message check will fail).
* [ ] "Fixes #" is in both the PR title (for release notes) and this
description (to automatically link and close the issue).
* [ ] Add unit or e2e tests. Say how you tested your changes. If you
changed the UI, attach screenshots.
* [ ] Github checks are green.
* [ ] Once required tests have passed, mark your PR "Ready for review".

If changes were requested, and you've made them, dismiss the review to
get it reviewed again.

---------

Signed-off-by: isubasinghe <[email protected]>
JPZ13 pushed a commit that referenced this pull request Jul 5, 2023
Fixes #TODO

Please do not open a pull request until you have checked ALL of these:

* [ ] Create the PR as draft .
* [ ] Run `make pre-commit -B` to fix codegen and lint problems.
* [ ] Sign-off your commits (otherwise the DCO check will fail).
* [ ] Use [a conventional commit
message](https://www.conventionalcommits.org/en/v1.0.0/) (otherwise the
commit message check will fail).
* [ ] "Fixes #" is in both the PR title (for release notes) and this
description (to automatically link and close the issue).
* [ ] Add unit or e2e tests. Say how you tested your changes. If you
changed the UI, attach screenshots.
* [ ] Github checks are green.
* [ ] Once required tests have passed, mark your PR "Ready for review".

If changes were requested, and you've made them, dismiss the review to
get it reviewed again.

---------

Signed-off-by: isubasinghe <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants