server: assertion failed: server_context.services.oximeter_stats.lock().await.is_none() due to race in teardown #432

Closed
gjcolombo opened this issue Jun 8, 2023 · 0 comments · Fixed by #715

@gjcolombo (Contributor)

Repro steps: seen in Omicron instance stress.

The basic problem here is:

  • register_oximeter asserts that the server context's oximeter_stats is None
  • register_oximeter gets called from instance_ensure_common
    • the call is tied to instance ensure because the Oximeter registration uses the VM ID as a series identifier, and that's not known until instance ensure time
  • meanwhile, during instance shutdown, ServiceProviders::stop calls take_controller before oximeter_stats.take()

The net result is that during shutdown there's a small window in which a new instance can be ensured (because the controller has already been taken) while the old Oximeter registration is still in place; if instance ensure lands in that window, the assertion fails.
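
For illustration, here's a minimal Rust sketch of the ordering that opens the window. The type and method names are assumptions for the sketch, not the actual Propolis code:

```rust
use tokio::sync::Mutex;

struct OximeterStats; // stand-in for the registered stats/producer
struct VmController;  // stand-in for the VM controller

#[derive(Default)]
struct ServiceProviders {
    vm_controller: Mutex<Option<VmController>>,
    oximeter_stats: Mutex<Option<OximeterStats>>,
}

impl ServiceProviders {
    /// Teardown: the controller is released before the stats slot.
    async fn stop(&self) {
        let _old_vm = self.vm_controller.lock().await.take();
        // <-- window: a new instance ensure can observe "no controller"
        //     here and proceed to register_oximeter() while the old
        //     registration is still in place.
        let _old_stats = self.oximeter_stats.lock().await.take();
    }

    /// Called on the instance-ensure path once the VM ID is known.
    async fn register_oximeter(&self, stats: OximeterStats) {
        let mut slot = self.oximeter_stats.lock().await;
        assert!(slot.is_none()); // the assertion from the issue title
        *slot = Some(stats);
    }
}
```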

There are several possible ways to fix this:

  • use a bigger lock to protect ServiceProviders as a whole instead of letting each service be locked individually
  • reorder the teardown steps so that the VM controller is taken last (but I don't like this very much; anything that requires this sort of strict ordering forces us to remember what that ordering is, and we'd also have to be sure it's safe to drop oximeter_stats first)
  • push oximeter_stats into the VM controller itself (this might be OK; the only place we use it is to count instance reboot requests, and that could easily be pushed down to the controller; a sketch of this option follows the list)
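
A rough sketch of the third option, again with hypothetical names rather than the real Propolis structures: if the controller owns the Oximeter registration, a single take tears both down, and there is no separately-locked slot left to race on.

```rust
use tokio::sync::Mutex;

struct OximeterStats; // hypothetical; stands in for the reboot-count stats

struct VmController {
    // Registered at instance-ensure time (when the VM ID is known) and
    // dropped together with the controller. The only consumer today is
    // the reboot-request counter, which could live on the controller too.
    oximeter_stats: Option<OximeterStats>,
}

#[derive(Default)]
struct ServiceProviders {
    vm_controller: Mutex<Option<VmController>>,
}

impl ServiceProviders {
    async fn stop(&self) {
        // One take removes the controller and its registration together;
        // a concurrent instance ensure either finds the old controller
        // still present (and fails) or finds a fully empty slot.
        let _old = self.vm_controller.lock().await.take();
    }
}
```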

Triage: marking as MVP. The reasons this comes up in Omicron stress in the first place appear to be that (a) the control plane reuses sled IDs and Propolis IDs when stopping and starting VMs, and (b) sled agent will reuse Propolis zones when possible (i.e., there seems to be a race that allows sled agent to reuse a zone whose Propolis previously hosted a stopped VM). These behaviors appear to cause lots of other problems (e.g., sled agent crashes due to zones being in unexpected states), so we probably want to address them in the control plane, which will also mitigate this issue.

gjcolombo added the bug label Jun 8, 2023
gjcolombo added this to the MVP milestone Jun 8, 2023