crucible agent failed to start a downstairs #704

Closed
leftwo opened this issue Apr 21, 2023 · 3 comments · Fixed by #728

leftwo commented Apr 21, 2023

On rack2 (dogfood) we uploaded an image, then created an instance based on this image.

Something went wrong in the crucible agent for one of the downstairs, and the agent failed to start it.
The propolis server was stuck waiting for the third downstairs.


leftwo commented Apr 21, 2023

Sled agent log:
sled-agent-log-when-agent-failed.log


leftwo commented Apr 21, 2023

nexus log (too big to attach) is here: /net/catacomb/data/staff/core/rack2/BRM42220070/crucible-704


jmpesp commented Apr 24, 2023

We saw Nexus successfully tell the agent to start three Downstairs for a snapshot:

{
  "msg": "crucible running snapshot RunningSnapshot { id: RegionId(\"5dfcd3b5-fec0-4d3b-a383-121fa6a5e916\"), name: \"d83d29ce-b16f-45bd-848a-bbae5b002693\", port_number: 19001, state: Created }",
  "v": 0,
  "name": "74403a62-fffd-45bb-aeb7-de0af88df320",
  "level": 30,
  "time": "2023-04-21T13:46:21.074502015-07:00",
  "hostname": "oxz_nexus",
  "pid": 16238,
  "saga_id": "aba0627b-b11e-43ee-821a-2ef8ccd216c0",
  "saga_name": "finalize-disk",
  "component": "ServerContext"
}
{
  "msg": "crucible running snapshot RunningSnapshot { id: RegionId(\"68cb66a7-7a9f-4c9e-95f9-9617f27289b0\"), name: \"d83d29ce-b16f-45bd-848a-bbae5b002693\", port_number: 19001, state: Created }",
  "v": 0,
  "name": "74403a62-fffd-45bb-aeb7-de0af88df320",
  "level": 30,
  "time": "2023-04-21T13:46:21.136658265-07:00",
  "hostname": "oxz_nexus",
  "pid": 16238,
  "saga_id": "aba0627b-b11e-43ee-821a-2ef8ccd216c0",
  "saga_name": "finalize-disk",
  "component": "ServerContext"
}
{
  "msg": "crucible running snapshot RunningSnapshot { id: RegionId(\"6bbc4067-b739-47ee-9472-d336a7f5c286\"), name: \"d83d29ce-b16f-45bd-848a-bbae5b002693\", port_number: 19001, state: Created }",
  "v": 0,
  "name": "74403a62-fffd-45bb-aeb7-de0af88df320",
  "level": 30,
  "time": "2023-04-21T13:46:21.198667256-07:00",
  "hostname": "oxz_nexus",
  "pid": 16238,
  "saga_id": "aba0627b-b11e-43ee-821a-2ef8ccd216c0",
  "saga_name": "finalize-disk",
  "component": "ServerContext"
}

We saw each respective crucible agent record in /data/crucible.json that a running snapshot had been requested, but only two of those actually spun up the read-only downstairs. When the third agent was bounced with svcadm, it read crucible.json and brought up the missing read-only downstairs.

Something got stuck in the agent - it received (and enqueued) the "running snapshot" request but didn't carry it out.
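The workaround described above (bouncing the agent so it re-reads crucible.json and starts any missing downstairs) would look roughly like this on an illumos sled. The exact SMF FMRI is an assumption for illustration, not taken from the report:

```shell
# Find the crucible agent's SMF instance (FMRI below is hypothetical)
svcs | grep crucible/agent

# Restart it; on startup the agent re-reads /data/crucible.json and
# brings up any read-only downstairs that were requested but never started
svcadm restart svc:/oxide/crucible/agent:default

# Confirm the service came back online and check for any faults
svcs -xv crucible/agent
```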

@jmpesp jmpesp added this to the MVP milestone May 8, 2023
@jmpesp jmpesp self-assigned this May 8, 2023
jmpesp added a commit to jmpesp/crucible that referenced this issue May 9, 2023
My normal flow for booting an instance is:

- create an image
- create a disk with that image as a source
- attach that disk to an instance, and boot

This masked the problem that this PR fixes: the crucible agent does not
call `apply_smf` when creating or destroying "running snapshots" (aka
read-only downstairs for snapshots), **only** when creating or
destroying regions.

If a user creates an image, the read-only downstairs are not
provisioned. If a user then creates a new disk with that image as a
source, `apply_smf` is called as part of creating the region for that
new disk, and this will also provision the read-only downstairs. If a
user only created a disk from a bulk import (like
oxidecomputer/omicron#3034) then the read-only downstairs would not be
started. If that disk is then attached to an instance, it will not boot
because the Upstairs cannot connect to the non-existent read-only
downstairs:

    May 07 06:23:09.145 INFO [1] connecting to [fd00:1122:3344:102::a]:19001, looper: 1
    May 07 06:23:09.146 INFO [0] connecting to [fd00:1122:3344:105::5]:19001, looper: 0
    May 07 06:23:09.146 INFO [2] connecting to [fd00:1122:3344:10b::b]:19001, looper: 2
    May 07 06:23:19.155 INFO [1] connecting to [fd00:1122:3344:102::a]:19001, looper: 1
    May 07 06:23:19.158 INFO [0] connecting to [fd00:1122:3344:105::5]:19001, looper: 0
    May 07 06:23:19.158 INFO [2] connecting to [fd00:1122:3344:10b::b]:19001, looper: 2

Fixes oxidecomputer/omicron#3034
Fixes oxidecomputer#704
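The shape of the fix described above (calling `apply_smf` after running-snapshot changes, not only after region changes) can be sketched as a toy reconcile loop. This is a hypothetical simplified model; `Agent`, `Resource`, and the instance-naming scheme here are illustrative and are not the real crucible agent code:

```rust
// Hypothetical simplified model of the agent. In the bug, requests for
// "running snapshots" were recorded in the datafile but apply_smf was
// only invoked on region create/destroy, so no SMF instance was started.
#[derive(Debug)]
enum Resource {
    Region(String),
    RunningSnapshot(String),
}

struct Agent {
    datafile: Vec<Resource>,    // persisted desired state (crucible.json)
    smf_instances: Vec<String>, // SMF instances actually running
}

impl Agent {
    // The fix: reconcile after *any* change to the datafile, so a
    // running-snapshot request also gets its SMF instance created.
    fn request(&mut self, r: Resource) {
        self.datafile.push(r);
        self.apply_smf();
    }

    // Idempotent reconcile: ensure one SMF instance per desired resource.
    fn apply_smf(&mut self) {
        for r in &self.datafile {
            let name = match r {
                Resource::Region(id) => format!("region-{id}"),
                Resource::RunningSnapshot(id) => format!("snapshot-{id}"),
            };
            if !self.smf_instances.contains(&name) {
                self.smf_instances.push(name);
            }
        }
    }
}

fn main() {
    let mut agent = Agent { datafile: vec![], smf_instances: vec![] };
    agent.request(Resource::RunningSnapshot("5dfcd3b5".into()));
    assert!(agent
        .smf_instances
        .contains(&"snapshot-5dfcd3b5".to_string()));
    println!("running snapshot reconciled");
}
```

Because the reconcile is idempotent and driven from the persisted datafile, restarting the agent (as was done with svcadm) performs the same repair as calling it inline.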
@jmpesp jmpesp closed this as completed in 9f69dea May 10, 2023