Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfixes for #4589 and #4611 #4610

Merged
merged 4 commits into from
Dec 6, 2023
Merged

Bugfixes for #4589 and #4611 #4610

merged 4 commits into from
Dec 6, 2023

Conversation

internet-diglett
Copy link
Contributor

@internet-diglett internet-diglett commented Dec 5, 2023

Bugfix for issue #4589.

The root cause ensure_ipv4_nat_entry previously would match against any existing table entries with the matching parameters. We need it to only match against entries that are active, or in implementation terms, entries whose version_removed column is NULL.

The events triggering the bug is as follows:

  1. User creates a new instance, eventually triggering the creation of new ipv4 nat entries, which are reconciled by the downstream dendrite workflow.
  2. User stops the instance. This triggers the soft-deletion of the ipv4 nat entries, which are again reconciled by the downstream dendrite workflow.
  3. The user restarts the instance. In the event that Nexus places the instance back on the same sled as last time, the external_ip may have the same parameters used by the soft-deleted nat records. Since we previously were not filtering for version_removed = NULL in ensure_ipv4_nat_entry, the soft-deleted records would still be treated as "live" in our db query, causing Nexus to skip inserting new nat records when the instance restarts.

This PR should resolve this unwanted behavior. However, a second issue was noticed during verification of the bug fix. I noticed that when running swadm nat list, the entries did not re-appear in the output even though dendrite was indeed picking up the new additions and configuring the softnpu asic accordingly. I believe this was also something @askfongjojo reported in chat. This means that we could have live entries on the switch and external traffic flowing to an instance, even though the nat entry is not appearing in swadm nat list. This PR also includes an upgraded dendrite that resolves that bug.


TODO

ensure_ipv4_nat_entry previously would match against *any* existing
table entries with the matching parameters. We need it to only
match against entries that are *active*, or in implementation terms,
entries whose version_removed column is NULL.
@internet-diglett internet-diglett marked this pull request as draft December 5, 2023 01:51
* Test for correct behavior of idempotency properties
This is to prevent a corner case where instances are deleted without
being stopped due to entering a "failed" state first.
@internet-diglett internet-diglett marked this pull request as ready for review December 6, 2023 19:34
@internet-diglett
Copy link
Contributor Author

@askfongjojo primary work for resolving #4589 and #4611 are ready for review. Doing some additional testing locally to try to verify fix for #4611, but I'm not sure of the "correct" way to replicate this bug on my local dev machine. 😂

@internet-diglett internet-diglett changed the title Ensure ipv4 nat entry is active Bugfixes for #4589 and #4611 Dec 6, 2023
@internet-diglett
Copy link
Contributor Author

No misbehavior detected in local testing, merging.

@internet-diglett internet-diglett merged commit 2a0595c into main Dec 6, 2023
20 checks passed
@internet-diglett internet-diglett deleted the bugfix-issue-4589 branch December 6, 2023 23:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Instance external IP is sometimes inaccessible after a stop-start power cycle
2 participants