The basic problem here is:
- `register_oximeter` asserts that the server context's `oximeter_stats` is `None`
- `register_oximeter` gets called from `instance_ensure_common`
  - the call is tied to instance ensure because the Oximeter registration uses the VM ID as a series identifier, and that's not known until instance ensure time
- meanwhile, during instance shutdown, `ServiceProviders::stop` calls `take_controller` before `oximeter_stats.take()`
The net result of this is that during shutdown, there's a small window in which a new instance can be ensured (because the controller was taken) but the Oximeter registration hasn't been taken yet, which causes the assertion to fail.
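To make the window concrete, here is a minimal single-threaded sketch that replays the interleaving described above. Everything in it is a hypothetical stand-in rather than the actual Propolis code: the types are empty placeholders, `stop` is split into two helper methods so the gap between the two `take` calls can be shown, and `ensure` returns an `Err` at the point where the real `register_oximeter` would assert.

```rust
use std::sync::Mutex;

// Hypothetical stand-ins for the real types; names are illustrative only.
struct VmController;
struct OximeterStats;

struct ServiceProviders {
    // Each field has its own lock, so the two `take` calls made during
    // shutdown are not atomic with respect to a concurrent ensure.
    controller: Mutex<Option<VmController>>,
    oximeter_stats: Mutex<Option<OximeterStats>>,
}

impl ServiceProviders {
    // A server that is currently hosting a VM.
    fn running() -> Self {
        Self {
            controller: Mutex::new(Some(VmController)),
            oximeter_stats: Mutex::new(Some(OximeterStats)),
        }
    }

    // First half of shutdown: the controller is taken.
    fn stop_take_controller(&self) {
        self.controller.lock().unwrap().take();
    }

    // Second half of shutdown: the Oximeter registration is taken.
    fn stop_take_oximeter(&self) {
        self.oximeter_stats.lock().unwrap().take();
    }

    // Simplified ensure path (assumed shape). Returns Err at the point
    // where the real code would hit its assertion.
    fn ensure(&self) -> Result<(), &'static str> {
        let mut controller = self.controller.lock().unwrap();
        if controller.is_some() {
            return Err("instance already exists");
        }
        let mut stats = self.oximeter_stats.lock().unwrap();
        if stats.is_some() {
            return Err("oximeter_stats is not None");
        }
        *stats = Some(OximeterStats);
        *controller = Some(VmController);
        Ok(())
    }
}

// Replays the problematic ordering: ensure runs between the two takes.
fn replay_race() -> Result<(), &'static str> {
    let services = ServiceProviders::running();
    services.stop_take_controller();
    let result = services.ensure();
    services.stop_take_oximeter();
    result
}
```

In this model, `replay_race()` fails on the `oximeter_stats` check, while an ensure issued after both halves of the shutdown have run succeeds.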
There are several possible ways to fix this:
- use a bigger lock to protect the `ServiceProviders` instead of letting them be locked individually
- reorder the teardown steps so that the VM controller is taken last (but I don't like this very much; anything that requires this sort of strict ordering forces us to remember what that ordering is, and we also have to be sure it's safe to drop `oximeter_stats` first)
- push `oximeter_stats` into the VM controller itself (might be OK, the only place we use it is to count instance reboot requests, and that could easily be pushed down to the controller)
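As a rough illustration of the first option, the sketch below puts both services behind a single mutex so that shutdown and ensure can never interleave. The `Services` struct, the `inner` field, and the simplified `ensure` signature are assumptions made for this example, not the existing Propolis layout.

```rust
use std::sync::Mutex;

// Hypothetical stand-ins for the real types; names are illustrative only.
struct VmController;
struct OximeterStats;

// Hypothetical grouping: both services sit behind one lock.
#[derive(Default)]
struct Services {
    controller: Option<VmController>,
    oximeter_stats: Option<OximeterStats>,
}

#[derive(Default)]
struct ServiceProviders {
    inner: Mutex<Services>,
}

impl ServiceProviders {
    // Both takes happen under the same guard, so no other thread can
    // observe "controller gone, registration still present".
    fn stop(&self) {
        let mut inner = self.inner.lock().unwrap();
        inner.controller.take();
        inner.oximeter_stats.take();
    }

    // Simplified ensure path (assumed shape). The assertion mirrors the
    // one described for register_oximeter and cannot fire in this model,
    // because the two fields only ever change together under the lock.
    fn ensure(&self) -> Result<(), &'static str> {
        let mut inner = self.inner.lock().unwrap();
        if inner.controller.is_some() {
            return Err("instance already exists");
        }
        assert!(inner.oximeter_stats.is_none());
        inner.oximeter_stats = Some(OximeterStats);
        inner.controller = Some(VmController);
        Ok(())
    }
}
```

The trade-off is coarser locking: anything that only needs one of the services now contends on the shared mutex, which is presumably acceptable if these fields are only touched on ensure, reboot, and shutdown paths.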
Triage: marking as MVP. The reasons this comes up in Omicron stress in the first place appear to be that (a) the control plane reuses sled IDs and Propolis IDs when stopping and starting VMs, and (b) sled agent will reuse Propolis zones if possible (i.e. there seems to be a race that will allow sled agent to reuse a zone with a Propolis that previously hosted a stopped VM). These behaviors appear to cause lots of other problems (e.g. sled agent crashes due to zones being in unexpected states), so we probably want to address this behavior in the control plane, which will mitigate this issue.
Repro steps: seen in Omicron instance stress.