server: assertion failed: server_context.services.oximeter_stats.lock().await.is_none() due to race in teardown #432

Closed
gjcolombo opened this issue Jun 8, 2023 · 0 comments · Fixed by #715

@gjcolombo (Contributor)

Repro steps: seen in Omicron instance stress.

The basic problem here is:

  • register_oximeter asserts that the server context's oximeter_stats is None
  • register_oximeter gets called from instance_ensure_common
    • the call is tied to instance ensure because the Oximeter registration uses the VM ID as a series identifier, and that's not known until instance ensure time
  • meanwhile, during instance shutdown, ServiceProviders::stop calls take_controller before oximeter_stats.take()

The net result is that during shutdown there's a small window in which a new instance can be ensured (because the controller has already been taken) while the old Oximeter registration is still in place; if instance ensure lands in that window, the assertion fails.
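
For illustration, here's a minimal Rust sketch of the ordering that opens the window. The type and method names are assumptions for the sketch, not the actual Propolis code:

```rust
use tokio::sync::Mutex;

struct OximeterStats; // stand-in for the registered stats/producer
struct VmController;  // stand-in for the VM controller

#[derive(Default)]
struct ServiceProviders {
    vm_controller: Mutex<Option<VmController>>,
    oximeter_stats: Mutex<Option<OximeterStats>>,
}

impl ServiceProviders {
    /// Teardown: the controller is released before the stats slot.
    async fn stop(&self) {
        let _old_vm = self.vm_controller.lock().await.take();
        // <-- window: a new instance ensure can observe "no controller"
        //     here and proceed to register_oximeter() while the old
        //     registration is still in place.
        let _old_stats = self.oximeter_stats.lock().await.take();
    }

    /// Called on the instance-ensure path once the VM ID is known.
    async fn register_oximeter(&self, stats: OximeterStats) {
        let mut slot = self.oximeter_stats.lock().await;
        assert!(slot.is_none()); // the assertion from the issue title
        *slot = Some(stats);
    }
}
```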

There are several possible ways to fix this:

  • use a bigger lock to protect ServiceProviders as a whole instead of letting each service be locked individually
  • reorder the teardown steps so that the VM controller is taken last (but I don't like this very much; anything that requires this sort of strict ordering forces us to remember what that ordering is, and we'd also have to be sure it's safe to drop oximeter_stats first)
  • push oximeter_stats into the VM controller itself (this might be OK; the only place we use it is to count instance reboot requests, and that could easily be pushed down to the controller; a sketch of this option follows the list)
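
A rough sketch of the third option, again with hypothetical names rather than the real Propolis structures: if the controller owns the Oximeter registration, a single take tears both down, and there is no separately-locked slot left to race on.

```rust
use tokio::sync::Mutex;

struct OximeterStats; // hypothetical; stands in for the reboot-count stats

struct VmController {
    // Registered at instance-ensure time (when the VM ID is known) and
    // dropped together with the controller. The only consumer today is
    // the reboot-request counter, which could live on the controller too.
    oximeter_stats: Option<OximeterStats>,
}

#[derive(Default)]
struct ServiceProviders {
    vm_controller: Mutex<Option<VmController>>,
}

impl ServiceProviders {
    async fn stop(&self) {
        // One take removes the controller and its registration together;
        // a concurrent instance ensure either finds the old controller
        // still present (and fails) or finds a fully empty slot.
        let _old = self.vm_controller.lock().await.take();
    }
}
```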

Triage: marking as MVP. The reasons this comes up in Omicron stress in the first place appear to be that (a) the control plane reuses sled IDs and Propolis IDs when stopping and starting VMs, and (b) sled agent will reuse Propolis zones when possible (i.e., there seems to be a race that allows sled agent to reuse a zone whose Propolis previously hosted a stopped VM). These behaviors appear to cause lots of other problems (e.g., sled agent crashes due to zones being in unexpected states), so we probably want to address them in the control plane, which will also mitigate this issue.

gjcolombo added the bug label Jun 8, 2023
gjcolombo added this to the MVP milestone Jun 8, 2023