Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delete VM workflow should attempt to remove external addresses in ipv4_nat_entry table #4611

Closed
Tracked by #4610
askfongjojo opened this issue Dec 5, 2023 · 0 comments
Assignees
Milestone

Comments

@askfongjojo
Copy link

Currently external IP address entries are deleted during instance-stop operations. There are atypical lifecycle situations when instances are deleted without being explicitly stopped, e.g. when an instance is left in running state prior to a system update. The instance in this case transitions to failed state when user attempts to reboot it. If the user deletes the instance, the snat and external IP addresses in the ip_nat_entry table will remain as is while they are marked as deleted in the external_ip table.

I've run into such a situation on rack2. The end result was that a newly created VM could acquire the external ip address released from a failed VM after it's been deleted but would fail to start up. Here is the 500 error I saw in the Nexus log:

01:04:00.050Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (dropshot_external): request completed
    error_message_external = Internal Server Error
    error_message_internal = saga ACTION error at node "dpd_ensure": unexpected database error: duplicate key value violates unique constraint "overlapping_ipv4_nat_entry"
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/ff87a01/dropshot/src/server.rs:841
    latency_us = 413890
    local_addr = 172.30.2.5:443
    method = POST
    remote_addr = 172.20.17.42:50510
    req_id = dff3c7b1-5fea-4597-b5c2-2763a0bdcfb5
    response_code = 500
    uri = https://oxide.sys.rack2.eng.oxide.computer/v1/instances/failed-vm/start?project=try

The ipv4_nat_entry will likely not have any active IP records for a stopped VM in most cases so the suggestion here is to attempt deletion only if such data exist, but do not complain if there isn't anything to delete.

@morlandi7 morlandi7 added this to the 5 milestone Dec 5, 2023
internet-diglett added a commit that referenced this issue Dec 6, 2023
Bugfix for issue #4589.

The root cause `ensure_ipv4_nat_entry` previously would match against
*any* existing table entries with the matching parameters. We need it to
only match against entries that are *active*, or in implementation
terms, entries whose `version_removed` column is `NULL`.

The events triggering the bug is as follows:
1. User creates a new instance, eventually triggering the creation of
new ipv4 nat entries, which are reconciled by the downstream dendrite
workflow.
2. User stops the instance. This triggers the soft-deletion of the ipv4
nat entries, which are again reconciled by the downstream dendrite
workflow.
3. The user restarts the instance. In the event that Nexus places the
instance back on the same sled as last time, the `external_ip` may have
the same parameters used by the soft-deleted nat records. Since we
previously were not filtering for `version_removed = NULL` in
`ensure_ipv4_nat_entry`, the soft-deleted records would still be treated
as "live" in our db query, causing Nexus to skip inserting new nat
records when the instance restarts.

This PR should resolve this unwanted behavior. However, a second issue
was noticed during verification of the bug fix. I noticed that when
running `swadm nat list`, the entries did not re-appear in the output
even though `dendrite` was indeed picking up the new additions and
configuring the softnpu asic accordingly. I believe this was also
something @askfongjojo reported in chat. This means that we could have
live entries on the switch and external traffic flowing to an instance,
even though the nat entry is not appearing in `swadm nat list`. This PR
also includes an upgraded dendrite that resolves that bug.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants