Instance external IP is sometimes inaccessible after a stop-start power cycle #4589
I retried a start-stop-start cycle on the instance in question on rack2 (instance id: b54a1c5b-68e2-43c1-8cf0-c5814e282992) after the rack was updated yesterday. After the first VM start, I was able to ssh to it, but the second start didn't bring back its external IP access. This time I left the VM in its current state. The nexus and dendrite logs are much smaller now, making it easier to track the events; they are located at catacomb:/data/staff/angela/oxz_dendrite_default.log.20231201 and oxz_nexus_default.log.20231201. Here are the last three VMM db records that show the sled locations of the propolis zone:
The nexus log shows that in both instance-start operations, the
The thing that seems odd to me is the sled IP addresses being referenced: They don't match entirely with the propolis zones and the
@internet-diglett - Feel free to stop/start the instance. I'm going to leave it untouched for your debugging.
@internet-diglett - There seems to be a pattern to this issue: the problem happens whenever the restarted propolis zone lands on the same sled as before. If it is placed on a different sled, the problem doesn't happen, based on the limited number of tests I've done so far.
@askfongjojo That's actually a very useful observation! There may be an edge case in our instance start / stop sagas or something similar that doesn't issue the necessary requests? 🤔
Bugfix for issue #4589. The root cause is that `ensure_ipv4_nat_entry` previously would match against *any* existing table entries with the matching parameters. We need it to match only against entries that are *active*, or in implementation terms, entries whose `version_removed` column is `NULL`. The sequence of events triggering the bug is as follows:

1. A user creates a new instance, eventually triggering the creation of new ipv4 nat entries, which are reconciled by the downstream dendrite workflow.
2. The user stops the instance. This triggers the soft-deletion of the ipv4 nat entries, which are again reconciled by the downstream dendrite workflow.
3. The user restarts the instance. In the event that Nexus places the instance back on the same sled as last time, the `external_ip` may have the same parameters used by the soft-deleted nat records. Since we previously were not filtering for `version_removed = NULL` in `ensure_ipv4_nat_entry`, the soft-deleted records would still be treated as "live" in our db query, causing Nexus to skip inserting new nat records when the instance restarts.

This PR should resolve this unwanted behavior (a sketch of the corrected check is below). However, a second issue was noticed during verification of the bug fix: when running `swadm nat list`, the entries did not re-appear in the output even though `dendrite` was indeed picking up the new additions and configuring the softnpu asic accordingly. I believe this was also something @askfongjojo reported in chat. This means that we could have live entries on the switch and external traffic flowing to an instance even though the nat entry does not appear in `swadm nat list`. This PR also includes an upgraded dendrite that resolves that bug.
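To make the fix concrete, here is a minimal, hypothetical Diesel sketch of the "does a live entry already exist?" check described above. The table definition, column names, and function signature are illustrative assumptions rather than omicron's actual schema or datastore code (which runs against CockroachDB through async wrappers); the point is the added `version_removed IS NULL` filter that excludes soft-deleted rows:

```rust
use diesel::pg::PgConnection;
use diesel::prelude::*;

// Illustrative schema only -- not omicron's real `ipv4_nat_entry` definition.
diesel::table! {
    ipv4_nat_entry (id) {
        id -> Int8,
        external_address -> Text,
        first_port -> Int4,
        last_port -> Int4,
        sled_address -> Text,
        version_added -> Int8,
        version_removed -> Nullable<Int8>,
    }
}

/// Returns true if an *active* NAT entry with these parameters already exists.
/// The bug described above amounts to omitting the `version_removed IS NULL`
/// filter, so soft-deleted rows satisfied the lookup and Nexus skipped
/// inserting a fresh entry when the instance restarted on the same sled.
fn active_nat_entry_exists(
    conn: &mut PgConnection,
    external: &str,
    sled: &str,
    ports: (i32, i32),
) -> QueryResult<bool> {
    use self::ipv4_nat_entry::dsl;

    diesel::select(diesel::dsl::exists(
        dsl::ipv4_nat_entry
            .filter(dsl::external_address.eq(external))
            .filter(dsl::sled_address.eq(sled))
            .filter(dsl::first_port.eq(ports.0))
            .filter(dsl::last_port.eq(ports.1))
            // The fix: only rows that have not been soft-deleted count as live.
            .filter(dsl::version_removed.is_null()),
    ))
    .get_result(conn)
}
```

With this filter in place, a restart that reuses the same sled and external IP no longer finds a stale soft-deleted row, so a new NAT record is inserted and reconciled by dendrite as expected.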
The issue was reproduced consistently for the instance with external ip 172.20.26.132 on rack2. Access was re-enabled after I stopped and started the instance once or twice more. Since I provisioned the VM with local auth enabled, when I couldn't SSH to it, I was still able to log in via the serial console and confirm that sshd was running and listening on port 22.
During the VM start/stop cycles, I also inspected the nat entries in the switch zone. In all cases, including when I could ssh to the instance, I didn't see the `.132` address being included. Here are all the entries I saw the whole time:

The instance's external connectivity never came back after 18:08:49. I stopped and started the instance many times after that, but it never regained the NAT entry.
The complete dendrite log file from switch 0 (sled 14) is located at catacomb:/data/staff/angela/oxide-dendrite_default.log.20231130