Delete VM workflow should attempt to remove external addresses in ipv4_nat_entry table #4611

askfongjojo · 2023-12-05T01:59:00Z

Currently external IP address entries are deleted during instance-stop operations. There are atypical lifecycle situations when instances are deleted without being explicitly stopped, e.g. when an instance is left in running state prior to a system update. The instance in this case transitions to failed state when user attempts to reboot it. If the user deletes the instance, the snat and external IP addresses in the ip_nat_entry table will remain as is while they are marked as deleted in the external_ip table.

I've run into such a situation on rack2. The end result was that a newly created VM could acquire the external ip address released from a failed VM after it's been deleted but would fail to start up. Here is the 500 error I saw in the Nexus log:

01:04:00.050Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (dropshot_external): request completed
    error_message_external = Internal Server Error
    error_message_internal = saga ACTION error at node "dpd_ensure": unexpected database error: duplicate key value violates unique constraint "overlapping_ipv4_nat_entry"
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/ff87a01/dropshot/src/server.rs:841
    latency_us = 413890
    local_addr = 172.30.2.5:443
    method = POST
    remote_addr = 172.20.17.42:50510
    req_id = dff3c7b1-5fea-4597-b5c2-2763a0bdcfb5
    response_code = 500
    uri = https://oxide.sys.rack2.eng.oxide.computer/v1/instances/failed-vm/start?project=try

The ipv4_nat_entry will likely not have any active IP records for a stopped VM in most cases so the suggestion here is to attempt deletion only if such data exist, but do not complain if there isn't anything to delete.

The text was updated successfully, but these errors were encountered:

@askfongjojo

Bugfix for issue #4589. The root cause `ensure_ipv4_nat_entry` previously would match against *any* existing table entries with the matching parameters. We need it to only match against entries that are *active*, or in implementation terms, entries whose `version_removed` column is `NULL`. The events triggering the bug is as follows: 1. User creates a new instance, eventually triggering the creation of new ipv4 nat entries, which are reconciled by the downstream dendrite workflow. 2. User stops the instance. This triggers the soft-deletion of the ipv4 nat entries, which are again reconciled by the downstream dendrite workflow. 3. The user restarts the instance. In the event that Nexus places the instance back on the same sled as last time, the `external_ip` may have the same parameters used by the soft-deleted nat records. Since we previously were not filtering for `version_removed = NULL` in `ensure_ipv4_nat_entry`, the soft-deleted records would still be treated as "live" in our db query, causing Nexus to skip inserting new nat records when the instance restarts. This PR should resolve this unwanted behavior. However, a second issue was noticed during verification of the bug fix. I noticed that when running `swadm nat list`, the entries did not re-appear in the output even though `dendrite` was indeed picking up the new additions and configuring the softnpu asic accordingly. I believe this was also something @askfongjojo reported in chat. This means that we could have live entries on the switch and external traffic flowing to an instance, even though the nat entry is not appearing in `swadm nat list`. This PR also includes an upgraded dendrite that resolves that bug.

internet-diglett mentioned this issue Dec 5, 2023

Bugfixes for #4589 and #4611 #4610

Merged

4 tasks

morlandi7 added this to the 5 milestone Dec 5, 2023

askfongjojo assigned internet-diglett Dec 5, 2023

internet-diglett closed this as completed Dec 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Delete VM workflow should attempt to remove external addresses in ipv4_nat_entry table #4611

Delete VM workflow should attempt to remove external addresses in ipv4_nat_entry table #4611

askfongjojo commented Dec 5, 2023

Delete VM workflow should attempt to remove external addresses in ipv4_nat_entry table #4611

Delete VM workflow should attempt to remove external addresses in ipv4_nat_entry table #4611

Comments

askfongjojo commented Dec 5, 2023