Instance external IP is sometimes inaccessible after a stop-start power cycle #4589

Closed
askfongjojo opened this issue Nov 30, 2023 · 3 comments · Fixed by #4610

Comments

@askfongjojo

The issue was reproduced consistently for the instance with external IP 172.20.26.132 on rack2. Access was restored after I stopped and started the instance once or twice more. Since I provisioned the VM with local auth enabled, I was still able to log in via the serial console when I couldn't SSH to it, and confirmed that sshd was running and listening on port 22.

During the VM stops and starts, I also inspected the NAT entries in the switch zone. In all cases, including when I could SSH to the instance, I didn't see the .132 address included. Here are all the entries I saw the whole time:

root@oxz_switch:~# /opt/oxide/dendrite/bin/swadm nat list
External IP    Port low  Port high  Internal IP            Inner MAC          VNI
172.20.26.1    0         65535      fd00:1122:3344:106::1  a8:40:25:ff:b0:9c  100
172.20.26.2    0         65535      fd00:1122:3344:101::1  a8:40:25:ff:d0:b4  100
172.20.26.3    0         65535      fd00:1122:3344:103::1  a8:40:25:ff:a6:83  100
172.20.26.4    0         65535      fd00:1122:3344:104::1  a8:40:25:ff:b4:c1  100
172.20.26.5    0         65535      fd00:1122:3344:102::1  a8:40:25:ff:c6:59  100
172.20.26.6    0         16383      fd00:1122:3344:104::1  a8:40:25:ff:b2:52  100
172.20.26.7    16384     32767      fd00:1122:3344:102::1  a8:40:25:ff:a0:f9  100
172.20.26.46   0         16383      fd00:1122:3344:10b::1  a8:40:25:fb:56:f7  14363509
172.20.26.46   32768     49151      fd00:1122:3344:109::1  a8:40:25:fd:e8:46  10005096
172.20.26.46   49152     65535      fd00:1122:3344:109::1  a8:40:25:fe:48:d3  10005096
172.20.26.48   0         65535      fd00:1122:3344:102::1  a8:40:25:f2:0b:ef  1203680
172.20.26.54   0         16383      fd00:1122:3344:102::1  a8:40:25:f9:a0:b0  10005096
172.20.26.54   16384     32767      fd00:1122:3344:109::1  a8:40:25:f7:ce:7e  10005096
172.20.26.54   32768     49151      fd00:1122:3344:10b::1  a8:40:25:f6:b7:4e  10005096
172.20.26.54   49152     65535      fd00:1122:3344:109::1  a8:40:25:f9:83:8f  10005096
172.20.26.56   0         65535      fd00:1122:3344:10b::1  a8:40:25:fb:56:f7  14363509
172.20.26.70   0         16383      fd00:1122:3344:10b::1  a8:40:25:fb:44:63  8476972
172.20.26.71   0         65535      fd00:1122:3344:10b::1  a8:40:25:fb:44:63  8476972
172.20.26.72   16384     32767      fd00:1122:3344:102::1  a8:40:25:f4:f8:93  10005096
172.20.26.72   32768     49151      fd00:1122:3344:108::1  a8:40:25:f5:d7:f1  10005096
172.20.26.72   49152     65535      fd00:1122:3344:105::1  a8:40:25:fd:0f:3a  10005096
172.20.26.82   32768     49151      fd00:1122:3344:10b::1  a8:40:25:f1:28:b6  10005096
172.20.26.82   49152     65535      fd00:1122:3344:103::1  a8:40:25:fd:46:cd  14363509
172.20.26.94   0         16383      fd00:1122:3344:106::1  a8:40:25:f7:aa:fc  10005096
172.20.26.94   16384     32767      fd00:1122:3344:105::1  a8:40:25:f8:39:e6  10005096
172.20.26.95   0         65535      fd00:1122:3344:103::1  a8:40:25:fd:46:cd  14363509
172.20.26.105  0         16383      fd00:1122:3344:102::1  a8:40:25:fa:87:95  10005096
172.20.26.105  49152     65535      fd00:1122:3344:101::1  a8:40:25:f2:5e:1a  10005096
172.20.26.161  32768     49151      fd00:1122:3344:108::1  a8:40:25:f5:61:04  10005096
172.20.26.161  49152     65535      fd00:1122:3344:101::1  a8:40:25:f4:d9:6e  1508093
172.20.26.165  0         65535      fd00:1122:3344:101::1  a8:40:25:f4:d9:6e  1508093
172.20.26.166  0         16383      fd00:1122:3344:109::1  a8:40:25:f6:75:89  14363509
172.20.26.166  49152     65535      fd00:1122:3344:106::1  a8:40:25:fa:1a:99  10005096
172.20.26.167  0         65535      fd00:1122:3344:109::1  a8:40:25:f6:75:89  14363509
172.20.26.170  0         16383      fd00:1122:3344:105::1  a8:40:25:f2:0b:9d  10005096
172.20.26.170  16384     32767      fd00:1122:3344:109::1  a8:40:25:f3:bc:4c  10005096
172.20.26.170  32768     49151      fd00:1122:3344:102::1  a8:40:25:f2:d8:6e  10005096
172.20.26.170  49152     65535      fd00:1122:3344:104::1  a8:40:25:f7:34:4f  10005096
172.20.26.171  0         16383      fd00:1122:3344:104::1  a8:40:25:f1:45:2d  10005096
172.20.26.171  16384     32767      fd00:1122:3344:104::1  a8:40:25:fd:6d:58  10005096
172.20.26.171  32768     49151      fd00:1122:3344:108::1  a8:40:25:f1:d4:4a  10005096
172.20.26.171  49152     65535      fd00:1122:3344:105::1  a8:40:25:f5:be:35  10005096
172.20.26.172  0         16383      fd00:1122:3344:10b::1  a8:40:25:f2:99:1b  10005096

The instance's external connectivity never came back after 18:08:49. I stopped and started the instance many times after that, but it never regained its NAT entry.

The complete dendrite log file from switch 0 (sled 14) is located at catacomb:/data/staff/angela/oxide-dendrite_default.log.20231130

@askfongjojo added this to the 5 milestone Nov 30, 2023
@askfongjojo
Author

I retried the start-stop-start sequence on the same rack2 instance (instance id: b54a1c5b-68e2-43c1-8cf0-c5814e282992) after the rack was updated yesterday. After the first VM start, I was able to SSH to it, but the second start didn't bring back its external IP access. This time I left the VM in its current state. The nexus and dendrite logs are a lot smaller now, which makes it easier to track the events. They are located at catacomb:/data/staff/angela/oxz_dendrite_default.log.20231201 and oxz_nexus_default.log.20231201.

Here are the last three VMM db records, which show the sled locations of the propolis zones:

root@[fd00:1122:3344:105::3]:32221/omicron> select * from vmm where instance_id = 'b54a1c5b-68e2-43c1-8cf0-c5814e282992' order by time_created desc limit 3;
                   id                  |         time_created          |         time_deleted          |             instance_id              |   state   |      time_state_updated       | state_generation |               sled_id                |       propolis_ip
---------------------------------------+-------------------------------+-------------------------------+--------------------------------------+-----------+-------------------------------+------------------+--------------------------------------+--------------------------
  01a8f6a8-6c09-4959-b39f-fb5101da430d | 2023-12-01 05:25:40.168758+00 | NULL                          | b54a1c5b-68e2-43c1-8cf0-c5814e282992 | running   | 2023-12-01 05:26:03.880349+00 |                3 | 71def415-55ad-46b4-ba88-3ca55d7fb287 | fd00:1122:3344:10b::1f8
  fcf84fe8-f470-4c93-8360-0124f983da39 | 2023-12-01 05:08:54.110705+00 | 2023-12-01 05:24:36.643649+00 | b54a1c5b-68e2-43c1-8cf0-c5814e282992 | destroyed | 2023-12-01 05:24:32.388234+00 |                7 | 71def415-55ad-46b4-ba88-3ca55d7fb287 | fd00:1122:3344:10b::1f7
  0122be38-0519-4708-ba3e-886f48e8bca0 | 2023-11-30 18:28:14.558614+00 | 2023-11-30 18:46:45.110641+00 | b54a1c5b-68e2-43c1-8cf0-c5814e282992 | destroyed | 2023-11-30 18:46:40.847732+00 |                7 | 5f6720b8-8a31-45f8-8c94-8e699218f28b | fd00:1122:3344:109::1f9
(3 rows)

The nexus log shows that in both instance-start operations, the DpdEnsure saga step's POST requests to the /rpw/nat/ipv4/trigger endpoints were made successfully (each returning a 200 response).

root@oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7:~# grep DpdClient /var/svc/log/oxide-nexus\:default.log | grep POST 
{"msg":"client request","v":0,"name":"nexus","level":20,"time":"2023-12-01T05:08:54.358499033Z","hostname":"oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7","pid":1613,"component":"DpdClient","component":"nexus","component":"ServerContext","name":"65a11c18-7f59-41ac-b9e7-680627f996e7","body":"None","uri":"http://[fd00:1122:3344:108::2]:12224/rpw/nat/ipv4/trigger","method":"POST"}
{"msg":"client request","v":0,"name":"nexus","level":20,"time":"2023-12-01T05:08:54.452159419Z","hostname":"oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7","pid":1613,"component":"DpdClient","component":"nexus","component":"ServerContext","name":"65a11c18-7f59-41ac-b9e7-680627f996e7","body":"None","uri":"http://[fd00:1122:3344:10b::2]:12224/rpw/nat/ipv4/trigger","method":"POST"}
{"msg":"client request","v":0,"name":"nexus","level":20,"time":"2023-12-01T05:25:40.421153609Z","hostname":"oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7","pid":1613,"component":"DpdClient","component":"nexus","component":"ServerContext","name":"65a11c18-7f59-41ac-b9e7-680627f996e7","body":"None","uri":"http://[fd00:1122:3344:108::2]:12224/rpw/nat/ipv4/trigger","method":"POST"}
{"msg":"client request","v":0,"name":"nexus","level":20,"time":"2023-12-01T05:25:40.517480445Z","hostname":"oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7","pid":1613,"component":"DpdClient","component":"nexus","component":"ServerContext","name":"65a11c18-7f59-41ac-b9e7-680627f996e7","body":"None","uri":"http://[fd00:1122:3344:10b::2]:12224/rpw/nat/ipv4/trigger","method":"POST"}

What seems odd to me is the sled IP addresses being referenced: they don't entirely match the propolis zones or the ipv4_nat_entry table.

root@[fd00:1122:3344:105::3]:32221/omicron> select * from ipv4_nat_entry where external_address in ('172.20.26.132', '172.20.26.105') order by time_created desc limit 4;
                   id                  | external_address | first_port | last_port |     sled_address      |   vni   |       mac       | version_added | version_removed |         time_created          |         time_deleted
---------------------------------------+------------------+------------+-----------+-----------------------+---------+-----------------+---------------+-----------------+-------------------------------+--------------------------------
  acedd9bd-6013-4bac-a283-f01dd65dbdcd | 172.20.26.105    |      16384 |     32767 | fd00:1122:3344:10b::1 | 1508093 | 184993468032486 |          5200 |            5202 | 2023-12-01 05:08:54.354703+00 | 2023-12-01 05:24:36.539218+00
  135f456c-53d8-4c4e-8903-335980a2ed19 | 172.20.26.132    |          0 |     65535 | fd00:1122:3344:10b::1 | 1508093 | 184993468032486 |          5199 |            5201 | 2023-12-01 05:08:54.349866+00 | 2023-12-01 05:24:36.532224+00
  d8c5bbe0-b07e-49a6-b2cb-3d7f6c4b4b44 | 172.20.26.105    |      16384 |     32767 | fd00:1122:3344:109::1 | 1508093 | 184993468032486 |          5178 |            5180 | 2023-11-30 18:07:37.793733+00 | 2023-11-30 18:08:48.171501+00
  96113946-7ca9-45aa-a0f9-63e8041a7f37 | 172.20.26.132    |          0 |     65535 | fd00:1122:3344:109::1 | 1508093 | 184993468032486 |          5177 |            5179 | 2023-11-30 18:07:37.788729+00 | 2023-11-30 18:08:48.164714+00

@internet-diglett - Feel free to stop/start the instance. I'm going to leave it untouched for your debugging.

@askfongjojo
Author

@internet-diglett - There seems to be a pattern to this issue. The problem happens whenever the propolis zone restarts on the same sled as before. If it is placed on a different sled, the problem does not occur, based on the limited number of tests I've done so far.

@internet-diglett
Contributor

@askfongjojo That's actually a very useful observation! There may be an edge case in our instance start / stop sagas or something similar that doesn't issue the necessary requests? 🤔

@morlandi7 linked a pull request Dec 6, 2023 that will close this issue
internet-diglett added a commit that referenced this issue Dec 6, 2023
Bugfix for issue #4589.

The root cause is that `ensure_ipv4_nat_entry` previously would match against
*any* existing table entries with the matching parameters. We need it to
match only against entries that are *active*, or in implementation terms,
entries whose `version_removed` column is `NULL`.

The sequence of events triggering the bug is as follows:
1. User creates a new instance, eventually triggering the creation of
new ipv4 nat entries, which are reconciled by the downstream dendrite
workflow.
2. User stops the instance. This triggers the soft-deletion of the ipv4
nat entries, which are again reconciled by the downstream dendrite
workflow.
3. The user restarts the instance. In the event that Nexus places the
instance back on the same sled as last time, the `external_ip` may have
the same parameters used by the soft-deleted nat records. Since we
previously were not filtering for `version_removed = NULL` in
`ensure_ipv4_nat_entry`, the soft-deleted records would still be treated
as "live" in our db query, causing Nexus to skip inserting new nat
records when the instance restarts.
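
In SQL terms (a rough sketch against the `ipv4_nat_entry` table queried above, not the exact statement the code issues), the lookup in `ensure_ipv4_nat_entry` needs to exclude soft-deleted rows before concluding that an entry already exists:

-- Hypothetical illustration only, using values from the records shown earlier.
-- Without the final predicate, a soft-deleted row with identical parameters
-- still matches, and Nexus skips inserting a fresh NAT entry on restart.
SELECT id
  FROM ipv4_nat_entry
 WHERE external_address = '172.20.26.132'
   AND first_port       = 0
   AND last_port        = 65535
   AND sled_address     = 'fd00:1122:3344:10b::1'
   AND vni              = 1508093
   AND version_removed IS NULL;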

This PR should resolve this unwanted behavior. However, a second issue
surfaced during verification of the bug fix: when running `swadm nat list`,
the entries did not re-appear in the output even though `dendrite` was indeed
picking up the new additions and configuring the softnpu asic accordingly. I
believe this is also something @askfongjojo reported in chat. This means that
we could have live entries on the switch and external traffic flowing to an
instance even though the nat entry does not appear in `swadm nat list`. This
PR also includes an upgraded dendrite that resolves that bug.