crucible agent failed to start a downstairs #704

Closed
leftwo opened this issue Apr 21, 2023 · 3 comments · Fixed by #728

leftwo commented Apr 21, 2023

On rack2 (dogfood) we uploaded an image, then created an instance based on this image.

Something went wrong in the crucible agent for one of the downstairs, and the agent failed to start it.
The propolis server was stuck waiting for the third downstairs.


leftwo commented Apr 21, 2023

Sled agent log:
sled-agent-log-when-agent-failed.log


leftwo commented Apr 21, 2023

nexus log (too big to attach) is here: /net/catacomb/data/staff/core/rack2/BRM42220070/crucible-704


jmpesp commented Apr 24, 2023

We saw Nexus successfully tell the agent to start three Downstairs for a snapshot:

{
  "msg": "crucible running snapshot RunningSnapshot { id: RegionId(\"5dfcd3b5-fec0-4d3b-a383-121fa6a5e916\"), name: \"d83d29ce-b16f-45bd-848a-bbae5b002693\", port_number: 19001, state: Created }",
  "v": 0,
  "name": "74403a62-fffd-45bb-aeb7-de0af88df320",
  "level": 30,
  "time": "2023-04-21T13:46:21.074502015-07:00",
  "hostname": "oxz_nexus",
  "pid": 16238,
  "saga_id": "aba0627b-b11e-43ee-821a-2ef8ccd216c0",
  "saga_name": "finalize-disk",
  "component": "ServerContext"
}
{
  "msg": "crucible running snapshot RunningSnapshot { id: RegionId(\"68cb66a7-7a9f-4c9e-95f9-9617f27289b0\"), name: \"d83d29ce-b16f-45bd-848a-bbae5b002693\", port_number: 19001, state: Created }",
  "v": 0,
  "name": "74403a62-fffd-45bb-aeb7-de0af88df320",
  "level": 30,
  "time": "2023-04-21T13:46:21.136658265-07:00",
  "hostname": "oxz_nexus",
  "pid": 16238,
  "saga_id": "aba0627b-b11e-43ee-821a-2ef8ccd216c0",
  "saga_name": "finalize-disk",
  "component": "ServerContext"
}
{
  "msg": "crucible running snapshot RunningSnapshot { id: RegionId(\"6bbc4067-b739-47ee-9472-d336a7f5c286\"), name: \"d83d29ce-b16f-45bd-848a-bbae5b002693\", port_number: 19001, state: Created }",
  "v": 0,
  "name": "74403a62-fffd-45bb-aeb7-de0af88df320",
  "level": 30,
  "time": "2023-04-21T13:46:21.198667256-07:00",
  "hostname": "oxz_nexus",
  "pid": 16238,
  "saga_id": "aba0627b-b11e-43ee-821a-2ef8ccd216c0",
  "saga_name": "finalize-disk",
  "component": "ServerContext"
}

We saw each respective crucible agent record in /data/crucible.json that a running snapshot had been requested, but only two of those actually spun up the read-only downstairs. When the third agent was bounced with svcadm, it read crucible.json and brought up the missing read-only downstairs.

Something got stuck in the agent - it received (and enqueued) the "running snapshot" request but didn't carry it out.
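The workaround described above (bouncing the agent so it re-reads crucible.json and starts any missing downstairs) would look roughly like this on an illumos sled. The exact SMF FMRI is an assumption for illustration, not taken from the report:

```shell
# Find the crucible agent's SMF instance (FMRI below is hypothetical)
svcs | grep crucible/agent

# Restart it; on startup the agent re-reads /data/crucible.json and
# brings up any read-only downstairs that were requested but never started
svcadm restart svc:/oxide/crucible/agent:default

# Confirm the service came back online and check for any faults
svcs -xv crucible/agent
```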

@jmpesp jmpesp added this to the MVP milestone May 8, 2023
@jmpesp jmpesp self-assigned this May 8, 2023
jmpesp added a commit to jmpesp/crucible that referenced this issue May 9, 2023
My normal flow for booting an instance is:

- create an image
- create a disk with that image as a source
- attach that disk to an instance, and boot

This masked the problem that this PR fixes: the crucible agent does not
call `apply_smf` when creating or destroying "running snapshots" (aka
read-only downstairs for snapshots), **only** when creating or
destroying regions.

If a user creates an image, the read-only downstairs are not
provisioned. If a user then creates a new disk with that image as a
source, `apply_smf` is called as part of creating the region for that
new disk, and this will also provision the read-only downstairs. If a
user only created a disk from a bulk import (like
oxidecomputer/omicron#3034) then the read-only downstairs would not be
started. If that disk is then attached to an instance, it will not boot
because the Upstairs cannot connect to the non-existent read-only
downstairs:

    May 07 06:23:09.145 INFO [1] connecting to [fd00:1122:3344:102::a]:19001, looper: 1
    May 07 06:23:09.146 INFO [0] connecting to [fd00:1122:3344:105::5]:19001, looper: 0
    May 07 06:23:09.146 INFO [2] connecting to [fd00:1122:3344:10b::b]:19001, looper: 2
    May 07 06:23:19.155 INFO [1] connecting to [fd00:1122:3344:102::a]:19001, looper: 1
    May 07 06:23:19.158 INFO [0] connecting to [fd00:1122:3344:105::5]:19001, looper: 0
    May 07 06:23:19.158 INFO [2] connecting to [fd00:1122:3344:10b::b]:19001, looper: 2

Fixes oxidecomputer/omicron#3034
Fixes oxidecomputer#704
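The shape of the fix described above (calling `apply_smf` after running-snapshot changes, not only after region changes) can be sketched as a toy reconcile loop. This is a hypothetical simplified model; `Agent`, `Resource`, and the instance-naming scheme here are illustrative and are not the real crucible agent code:

```rust
// Hypothetical simplified model of the agent. In the bug, requests for
// "running snapshots" were recorded in the datafile but apply_smf was
// only invoked on region create/destroy, so no SMF instance was started.
#[derive(Debug)]
enum Resource {
    Region(String),
    RunningSnapshot(String),
}

struct Agent {
    datafile: Vec<Resource>,    // persisted desired state (crucible.json)
    smf_instances: Vec<String>, // SMF instances actually running
}

impl Agent {
    // The fix: reconcile after *any* change to the datafile, so a
    // running-snapshot request also gets its SMF instance created.
    fn request(&mut self, r: Resource) {
        self.datafile.push(r);
        self.apply_smf();
    }

    // Idempotent reconcile: ensure one SMF instance per desired resource.
    fn apply_smf(&mut self) {
        for r in &self.datafile {
            let name = match r {
                Resource::Region(id) => format!("region-{id}"),
                Resource::RunningSnapshot(id) => format!("snapshot-{id}"),
            };
            if !self.smf_instances.contains(&name) {
                self.smf_instances.push(name);
            }
        }
    }
}

fn main() {
    let mut agent = Agent { datafile: vec![], smf_instances: vec![] };
    agent.request(Resource::RunningSnapshot("5dfcd3b5".into()));
    assert!(agent
        .smf_instances
        .contains(&"snapshot-5dfcd3b5".to_string()));
    println!("running snapshot reconciled");
}
```

Because the reconcile is idempotent and driven from the persisted datafile, restarting the agent (as was done with svcadm) performs the same repair as calling it inline.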
@jmpesp jmpesp closed this as completed in 9f69dea May 10, 2023