Instance external IP is sometimes inaccessible after a stop-start power cycle #4589

Closed
askfongjojo opened this issue Nov 30, 2023 · 3 comments · Fixed by #4610

Comments

@askfongjojo

The issue was reproduced consistently for the instance with external IP 172.20.26.132 on rack2. Access was restored after I stopped and started the instance once or twice more. Since I provisioned the VM with local auth enabled, I was still able to log in via the serial console when I couldn't SSH to it, and confirmed that sshd was running and listening on port 22.

During the VM stops and starts, I also inspected the NAT entries in the switch zone. In all cases, including when I could SSH to the instance, I didn't see the .132 address included. Here are all the entries I saw the whole time:

root@oxz_switch:~# /opt/oxide/dendrite/bin/swadm nat list
External IP    Port low  Port high  Internal IP            Inner MAC          VNI
172.20.26.1    0         65535      fd00:1122:3344:106::1  a8:40:25:ff:b0:9c  100
172.20.26.2    0         65535      fd00:1122:3344:101::1  a8:40:25:ff:d0:b4  100
172.20.26.3    0         65535      fd00:1122:3344:103::1  a8:40:25:ff:a6:83  100
172.20.26.4    0         65535      fd00:1122:3344:104::1  a8:40:25:ff:b4:c1  100
172.20.26.5    0         65535      fd00:1122:3344:102::1  a8:40:25:ff:c6:59  100
172.20.26.6    0         16383      fd00:1122:3344:104::1  a8:40:25:ff:b2:52  100
172.20.26.7    16384     32767      fd00:1122:3344:102::1  a8:40:25:ff:a0:f9  100
172.20.26.46   0         16383      fd00:1122:3344:10b::1  a8:40:25:fb:56:f7  14363509
172.20.26.46   32768     49151      fd00:1122:3344:109::1  a8:40:25:fd:e8:46  10005096
172.20.26.46   49152     65535      fd00:1122:3344:109::1  a8:40:25:fe:48:d3  10005096
172.20.26.48   0         65535      fd00:1122:3344:102::1  a8:40:25:f2:0b:ef  1203680
172.20.26.54   0         16383      fd00:1122:3344:102::1  a8:40:25:f9:a0:b0  10005096
172.20.26.54   16384     32767      fd00:1122:3344:109::1  a8:40:25:f7:ce:7e  10005096
172.20.26.54   32768     49151      fd00:1122:3344:10b::1  a8:40:25:f6:b7:4e  10005096
172.20.26.54   49152     65535      fd00:1122:3344:109::1  a8:40:25:f9:83:8f  10005096
172.20.26.56   0         65535      fd00:1122:3344:10b::1  a8:40:25:fb:56:f7  14363509
172.20.26.70   0         16383      fd00:1122:3344:10b::1  a8:40:25:fb:44:63  8476972
172.20.26.71   0         65535      fd00:1122:3344:10b::1  a8:40:25:fb:44:63  8476972
172.20.26.72   16384     32767      fd00:1122:3344:102::1  a8:40:25:f4:f8:93  10005096
172.20.26.72   32768     49151      fd00:1122:3344:108::1  a8:40:25:f5:d7:f1  10005096
172.20.26.72   49152     65535      fd00:1122:3344:105::1  a8:40:25:fd:0f:3a  10005096
172.20.26.82   32768     49151      fd00:1122:3344:10b::1  a8:40:25:f1:28:b6  10005096
172.20.26.82   49152     65535      fd00:1122:3344:103::1  a8:40:25:fd:46:cd  14363509
172.20.26.94   0         16383      fd00:1122:3344:106::1  a8:40:25:f7:aa:fc  10005096
172.20.26.94   16384     32767      fd00:1122:3344:105::1  a8:40:25:f8:39:e6  10005096
172.20.26.95   0         65535      fd00:1122:3344:103::1  a8:40:25:fd:46:cd  14363509
172.20.26.105  0         16383      fd00:1122:3344:102::1  a8:40:25:fa:87:95  10005096
172.20.26.105  49152     65535      fd00:1122:3344:101::1  a8:40:25:f2:5e:1a  10005096
172.20.26.161  32768     49151      fd00:1122:3344:108::1  a8:40:25:f5:61:04  10005096
172.20.26.161  49152     65535      fd00:1122:3344:101::1  a8:40:25:f4:d9:6e  1508093
172.20.26.165  0         65535      fd00:1122:3344:101::1  a8:40:25:f4:d9:6e  1508093
172.20.26.166  0         16383      fd00:1122:3344:109::1  a8:40:25:f6:75:89  14363509
172.20.26.166  49152     65535      fd00:1122:3344:106::1  a8:40:25:fa:1a:99  10005096
172.20.26.167  0         65535      fd00:1122:3344:109::1  a8:40:25:f6:75:89  14363509
172.20.26.170  0         16383      fd00:1122:3344:105::1  a8:40:25:f2:0b:9d  10005096
172.20.26.170  16384     32767      fd00:1122:3344:109::1  a8:40:25:f3:bc:4c  10005096
172.20.26.170  32768     49151      fd00:1122:3344:102::1  a8:40:25:f2:d8:6e  10005096
172.20.26.170  49152     65535      fd00:1122:3344:104::1  a8:40:25:f7:34:4f  10005096
172.20.26.171  0         16383      fd00:1122:3344:104::1  a8:40:25:f1:45:2d  10005096
172.20.26.171  16384     32767      fd00:1122:3344:104::1  a8:40:25:fd:6d:58  10005096
172.20.26.171  32768     49151      fd00:1122:3344:108::1  a8:40:25:f1:d4:4a  10005096
172.20.26.171  49152     65535      fd00:1122:3344:105::1  a8:40:25:f5:be:35  10005096
172.20.26.172  0         16383      fd00:1122:3344:10b::1  a8:40:25:f2:99:1b  10005096

The instance's external connectivity never came back after 18:08:49. I stopped and started the instance many times after that, but it never regained its NAT entry.

The complete dendrite log file from switch 0 (sled 14) is located at catacomb:/data/staff/angela/oxide-dendrite_default.log.20231130

@askfongjojo added this to the 5 milestone Nov 30, 2023
@askfongjojo
Author

I retried the start-stop-start sequence on the same rack2 instance (instance id: b54a1c5b-68e2-43c1-8cf0-c5814e282992) after the rack was updated yesterday. After the first VM start, I was able to SSH to it, but the second start didn't bring back its external IP access. This time I left the VM in its current state. The nexus and dendrite logs are a lot smaller now, which makes it easier to track the events. They are located at catacomb:/data/staff/angela/oxz_dendrite_default.log.20231201 and oxz_nexus_default.log.20231201.

Here are the last three VMM db records, which show the sled locations of the propolis zones:

root@[fd00:1122:3344:105::3]:32221/omicron> select * from vmm where instance_id = 'b54a1c5b-68e2-43c1-8cf0-c5814e282992' order by time_created desc limit 3;
                   id                  |         time_created          |         time_deleted          |             instance_id              |   state   |      time_state_updated       | state_generation |               sled_id                |       propolis_ip
---------------------------------------+-------------------------------+-------------------------------+--------------------------------------+-----------+-------------------------------+------------------+--------------------------------------+--------------------------
  01a8f6a8-6c09-4959-b39f-fb5101da430d | 2023-12-01 05:25:40.168758+00 | NULL                          | b54a1c5b-68e2-43c1-8cf0-c5814e282992 | running   | 2023-12-01 05:26:03.880349+00 |                3 | 71def415-55ad-46b4-ba88-3ca55d7fb287 | fd00:1122:3344:10b::1f8
  fcf84fe8-f470-4c93-8360-0124f983da39 | 2023-12-01 05:08:54.110705+00 | 2023-12-01 05:24:36.643649+00 | b54a1c5b-68e2-43c1-8cf0-c5814e282992 | destroyed | 2023-12-01 05:24:32.388234+00 |                7 | 71def415-55ad-46b4-ba88-3ca55d7fb287 | fd00:1122:3344:10b::1f7
  0122be38-0519-4708-ba3e-886f48e8bca0 | 2023-11-30 18:28:14.558614+00 | 2023-11-30 18:46:45.110641+00 | b54a1c5b-68e2-43c1-8cf0-c5814e282992 | destroyed | 2023-11-30 18:46:40.847732+00 |                7 | 5f6720b8-8a31-45f8-8c94-8e699218f28b | fd00:1122:3344:109::1f9
(3 rows)

The nexus log shows that in both instance-start operations, the DpdEnsure saga step's POST requests to the /rpw/nat/ipv4/trigger endpoints were made successfully (each returning a 200 response).

root@oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7:~# grep DpdClient /var/svc/log/oxide-nexus\:default.log | grep POST 
{"msg":"client request","v":0,"name":"nexus","level":20,"time":"2023-12-01T05:08:54.358499033Z","hostname":"oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7","pid":1613,"component":"DpdClient","component":"nexus","component":"ServerContext","name":"65a11c18-7f59-41ac-b9e7-680627f996e7","body":"None","uri":"http://[fd00:1122:3344:108::2]:12224/rpw/nat/ipv4/trigger","method":"POST"}
{"msg":"client request","v":0,"name":"nexus","level":20,"time":"2023-12-01T05:08:54.452159419Z","hostname":"oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7","pid":1613,"component":"DpdClient","component":"nexus","component":"ServerContext","name":"65a11c18-7f59-41ac-b9e7-680627f996e7","body":"None","uri":"http://[fd00:1122:3344:10b::2]:12224/rpw/nat/ipv4/trigger","method":"POST"}
{"msg":"client request","v":0,"name":"nexus","level":20,"time":"2023-12-01T05:25:40.421153609Z","hostname":"oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7","pid":1613,"component":"DpdClient","component":"nexus","component":"ServerContext","name":"65a11c18-7f59-41ac-b9e7-680627f996e7","body":"None","uri":"http://[fd00:1122:3344:108::2]:12224/rpw/nat/ipv4/trigger","method":"POST"}
{"msg":"client request","v":0,"name":"nexus","level":20,"time":"2023-12-01T05:25:40.517480445Z","hostname":"oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7","pid":1613,"component":"DpdClient","component":"nexus","component":"ServerContext","name":"65a11c18-7f59-41ac-b9e7-680627f996e7","body":"None","uri":"http://[fd00:1122:3344:10b::2]:12224/rpw/nat/ipv4/trigger","method":"POST"}

What seems odd to me is the sled IP addresses being referenced: they don't entirely match the propolis zones or the ipv4_nat_entry table.

root@[fd00:1122:3344:105::3]:32221/omicron> select * from ipv4_nat_entry where external_address in ('172.20.26.132', '172.20.26.105') order by time_created desc limit 4;
                   id                  | external_address | first_port | last_port |     sled_address      |   vni   |       mac       | version_added | version_removed |         time_created          |         time_deleted
---------------------------------------+------------------+------------+-----------+-----------------------+---------+-----------------+---------------+-----------------+-------------------------------+--------------------------------
  acedd9bd-6013-4bac-a283-f01dd65dbdcd | 172.20.26.105    |      16384 |     32767 | fd00:1122:3344:10b::1 | 1508093 | 184993468032486 |          5200 |            5202 | 2023-12-01 05:08:54.354703+00 | 2023-12-01 05:24:36.539218+00
  135f456c-53d8-4c4e-8903-335980a2ed19 | 172.20.26.132    |          0 |     65535 | fd00:1122:3344:10b::1 | 1508093 | 184993468032486 |          5199 |            5201 | 2023-12-01 05:08:54.349866+00 | 2023-12-01 05:24:36.532224+00
  d8c5bbe0-b07e-49a6-b2cb-3d7f6c4b4b44 | 172.20.26.105    |      16384 |     32767 | fd00:1122:3344:109::1 | 1508093 | 184993468032486 |          5178 |            5180 | 2023-11-30 18:07:37.793733+00 | 2023-11-30 18:08:48.171501+00
  96113946-7ca9-45aa-a0f9-63e8041a7f37 | 172.20.26.132    |          0 |     65535 | fd00:1122:3344:109::1 | 1508093 | 184993468032486 |          5177 |            5179 | 2023-11-30 18:07:37.788729+00 | 2023-11-30 18:08:48.164714+00

@internet-diglett - Feel free to stop/start the instance. I'm going to leave it untouched for your debugging.

@askfongjojo
Author

@internet-diglett - There seems to be a pattern to this issue. The problem happens whenever the propolis zone restarts on the same sled as before. If it is placed on a different sled, the problem does not occur, based on the limited number of tests I've done so far.

@internet-diglett
Contributor

@askfongjojo That's actually a very useful observation! There may be an edge case in our instance start / stop sagas or something similar that doesn't issue the necessary requests? 🤔

@morlandi7 linked a pull request Dec 6, 2023 that will close this issue
internet-diglett added a commit that referenced this issue Dec 6, 2023
Bugfix for issue #4589.

The root cause is that `ensure_ipv4_nat_entry` previously would match against
*any* existing table entries with the matching parameters. We need it to
match only against entries that are *active*, or in implementation terms,
entries whose `version_removed` column is `NULL`.

The sequence of events triggering the bug is as follows:
1. User creates a new instance, eventually triggering the creation of
new ipv4 nat entries, which are reconciled by the downstream dendrite
workflow.
2. User stops the instance. This triggers the soft-deletion of the ipv4
nat entries, which are again reconciled by the downstream dendrite
workflow.
3. The user restarts the instance. In the event that Nexus places the
instance back on the same sled as last time, the `external_ip` may have
the same parameters used by the soft-deleted nat records. Since we
previously were not filtering for `version_removed = NULL` in
`ensure_ipv4_nat_entry`, the soft-deleted records would still be treated
as "live" in our db query, causing Nexus to skip inserting new nat
records when the instance restarts.
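
In SQL terms (a rough sketch against the `ipv4_nat_entry` table queried above, not the exact statement the code issues), the lookup in `ensure_ipv4_nat_entry` needs to exclude soft-deleted rows before concluding that an entry already exists:

-- Hypothetical illustration only, using values from the records shown earlier.
-- Without the final predicate, a soft-deleted row with identical parameters
-- still matches, and Nexus skips inserting a fresh NAT entry on restart.
SELECT id
  FROM ipv4_nat_entry
 WHERE external_address = '172.20.26.132'
   AND first_port       = 0
   AND last_port        = 65535
   AND sled_address     = 'fd00:1122:3344:10b::1'
   AND vni              = 1508093
   AND version_removed IS NULL;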

This PR should resolve this unwanted behavior. However, a second issue
surfaced during verification of the bug fix: when running `swadm nat list`,
the entries did not re-appear in the output even though `dendrite` was indeed
picking up the new additions and configuring the softnpu asic accordingly. I
believe this is also something @askfongjojo reported in chat. This means that
we could have live entries on the switch and external traffic flowing to an
instance even though the nat entry does not appear in `swadm nat list`. This
PR also includes an upgraded dendrite that resolves that bug.