You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Currently external IP address entries are deleted during instance-stop operations. There are atypical lifecycle situations when instances are deleted without being explicitly stopped, e.g. when an instance is left in running state prior to a system update. The instance in this case transitions to failed state when user attempts to reboot it. If the user deletes the instance, the snat and external IP addresses in the ip_nat_entry table will remain as is while they are marked as deleted in the external_ip table.
I've run into such a situation on rack2. The end result was that a newly created VM could acquire the external ip address released from a failed VM after it's been deleted but would fail to start up. Here is the 500 error I saw in the Nexus log:
01:04:00.050Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (dropshot_external): request completed
error_message_external = Internal Server Error
error_message_internal = saga ACTION error at node "dpd_ensure": unexpected database error: duplicate key value violates unique constraint "overlapping_ipv4_nat_entry"
file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/ff87a01/dropshot/src/server.rs:841
latency_us = 413890
local_addr = 172.30.2.5:443
method = POST
remote_addr = 172.20.17.42:50510
req_id = dff3c7b1-5fea-4597-b5c2-2763a0bdcfb5
response_code = 500
uri = https://oxide.sys.rack2.eng.oxide.computer/v1/instances/failed-vm/start?project=try
The ipv4_nat_entry will likely not have any active IP records for a stopped VM in most cases so the suggestion here is to attempt deletion only if such data exist, but do not complain if there isn't anything to delete.
The text was updated successfully, but these errors were encountered:
Bugfix for issue #4589.
The root cause `ensure_ipv4_nat_entry` previously would match against
*any* existing table entries with the matching parameters. We need it to
only match against entries that are *active*, or in implementation
terms, entries whose `version_removed` column is `NULL`.
The events triggering the bug is as follows:
1. User creates a new instance, eventually triggering the creation of
new ipv4 nat entries, which are reconciled by the downstream dendrite
workflow.
2. User stops the instance. This triggers the soft-deletion of the ipv4
nat entries, which are again reconciled by the downstream dendrite
workflow.
3. The user restarts the instance. In the event that Nexus places the
instance back on the same sled as last time, the `external_ip` may have
the same parameters used by the soft-deleted nat records. Since we
previously were not filtering for `version_removed = NULL` in
`ensure_ipv4_nat_entry`, the soft-deleted records would still be treated
as "live" in our db query, causing Nexus to skip inserting new nat
records when the instance restarts.
This PR should resolve this unwanted behavior. However, a second issue
was noticed during verification of the bug fix. I noticed that when
running `swadm nat list`, the entries did not re-appear in the output
even though `dendrite` was indeed picking up the new additions and
configuring the softnpu asic accordingly. I believe this was also
something @askfongjojo reported in chat. This means that we could have
live entries on the switch and external traffic flowing to an instance,
even though the nat entry is not appearing in `swadm nat list`. This PR
also includes an upgraded dendrite that resolves that bug.
Currently external IP address entries are deleted during instance-stop operations. There are atypical lifecycle situations when instances are deleted without being explicitly stopped, e.g. when an instance is left in
running
state prior to a system update. The instance in this case transitions tofailed
state when user attempts to reboot it. If the user deletes the instance, the snat and external IP addresses in theip_nat_entry
table will remain as is while they are marked as deleted in the external_ip table.I've run into such a situation on rack2. The end result was that a newly created VM could acquire the external ip address released from a failed VM after it's been deleted but would fail to start up. Here is the 500 error I saw in the Nexus log:
The
ipv4_nat_entry
will likely not have any active IP records for a stopped VM in most cases so the suggestion here is to attempt deletion only if such data exist, but do not complain if there isn't anything to delete.The text was updated successfully, but these errors were encountered: