crucible agent failed to start a downstairs #704
Comments
Sled agent log:
nexus log (too big to attach) is here: /net/catacomb/data/staff/core/rack2/BRM42220070/crucible-704
We saw Nexus successfully tell the agent to start three Downstairs for a snapshot:
{
"msg": "crucible running snapshot RunningSnapshot { id: RegionId(\"5dfcd3b5-fec0-4d3b-a383-121fa6a5e916\"), name: \"d83d29ce-b16f-45bd-848a-bbae5b002693\", port_number: 19001, state: Created }",
"v": 0,
"name": "74403a62-fffd-45bb-aeb7-de0af88df320",
"level": 30,
"time": "2023-04-21T13:46:21.074502015-07:00",
"hostname": "oxz_nexus",
"pid": 16238,
"saga_id": "aba0627b-b11e-43ee-821a-2ef8ccd216c0",
"saga_name": "finalize-disk",
"component": "ServerContext"
}
{
"msg": "crucible running snapshot RunningSnapshot { id: RegionId(\"68cb66a7-7a9f-4c9e-95f9-9617f27289b0\"), name: \"d83d29ce-b16f-45bd-848a-bbae5b002693\", port_number: 19001, state: Created }",
"v": 0,
"name": "74403a62-fffd-45bb-aeb7-de0af88df320",
"level": 30,
"time": "2023-04-21T13:46:21.136658265-07:00",
"hostname": "oxz_nexus",
"pid": 16238,
"saga_id": "aba0627b-b11e-43ee-821a-2ef8ccd216c0",
"saga_name": "finalize-disk",
"component": "ServerContext"
}
{
"msg": "crucible running snapshot RunningSnapshot { id: RegionId(\"6bbc4067-b739-47ee-9472-d336a7f5c286\"), name: \"d83d29ce-b16f-45bd-848a-bbae5b002693\", port_number: 19001, state: Created }",
"v": 0,
"name": "74403a62-fffd-45bb-aeb7-de0af88df320",
"level": 30,
"time": "2023-04-21T13:46:21.198667256-07:00",
"hostname": "oxz_nexus",
"pid": 16238,
"saga_id": "aba0627b-b11e-43ee-821a-2ef8ccd216c0",
"saga_name": "finalize-disk",
"component": "ServerContext"
}

We saw each respective crucible agent record in

Something got stuck in the agent - it received (and enqueued) the "running snapshot" request but didn't carry it out.
My normal flow for booting an instance is:

- create an image
- create a disk with that image as a source
- attach that disk to an instance, and boot

This masked the problem that this PR fixes: the crucible agent does not call `apply_smf` when creating or destroying "running snapshots" (aka read-only downstairs for snapshots), **only** when creating or destroying regions.

If a user creates an image, the read-only downstairs are not provisioned. If a user then creates a new disk with that image as a source, `apply_smf` is called as part of creating the region for that new disk, and this will also provision the read-only downstairs.

If a user only creates a disk from a bulk import (like oxidecomputer/omicron#3034), the read-only downstairs are never started. If that disk is then attached to an instance, it will not boot because the Upstairs cannot connect to the non-existent read-only downstairs:

May 07 06:23:09.145 INFO [1] connecting to [fd00:1122:3344:102::a]:19001, looper: 1
May 07 06:23:09.146 INFO [0] connecting to [fd00:1122:3344:105::5]:19001, looper: 0
May 07 06:23:09.146 INFO [2] connecting to [fd00:1122:3344:10b::b]:19001, looper: 2
May 07 06:23:19.155 INFO [1] connecting to [fd00:1122:3344:102::a]:19001, looper: 1
May 07 06:23:19.158 INFO [0] connecting to [fd00:1122:3344:105::5]:19001, looper: 0
May 07 06:23:19.158 INFO [2] connecting to [fd00:1122:3344:10b::b]:19001, looper: 2

Fixes oxidecomputer/omicron#3034
Fixes oxidecomputer#704
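To make the masking effect described above easier to follow, here is a minimal toy model of it in Rust. It is only a sketch under stated assumptions: the `Agent` struct, its fields, the `apply_on_snapshot_change` switch, and the region port 19000 are all hypothetical and are not taken from the crucible agent source. Only the name `apply_smf` and port 19001 come from the description and logs above. The model treats `apply_smf` as "start a downstairs for everything currently recorded", which is how the description characterizes its effect.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the agent's state; NOT the real crucible agent types.
struct Agent {
    /// Recorded regions: region id -> port (hypothetical layout).
    regions: BTreeMap<String, u16>,
    /// Recorded running snapshots: snapshot name -> port (hypothetical layout).
    running_snapshots: BTreeMap<String, u16>,
    /// Ports that actually have a downstairs started.
    started_ports: Vec<u16>,
    /// false models the behaviour before the fix, true models it after.
    apply_on_snapshot_change: bool,
}

impl Agent {
    fn new(apply_on_snapshot_change: bool) -> Self {
        Agent {
            regions: BTreeMap::new(),
            running_snapshots: BTreeMap::new(),
            started_ports: Vec::new(),
            apply_on_snapshot_change,
        }
    }

    /// Toy reconcile step: start a downstairs for every recorded region
    /// and every recorded running snapshot.
    fn apply_smf(&mut self) {
        let mut ports: Vec<u16> = self
            .regions
            .values()
            .chain(self.running_snapshots.values())
            .copied()
            .collect();
        ports.sort_unstable();
        ports.dedup();
        self.started_ports = ports;
    }

    /// Region creation always reconciles.
    fn create_region(&mut self, id: &str, port: u16) {
        self.regions.insert(id.to_string(), port);
        self.apply_smf();
    }

    /// Running snapshot creation records the request, but only reconciles
    /// when the (hypothetical) switch is on.
    fn create_running_snapshot(&mut self, name: &str, port: u16) {
        self.running_snapshots.insert(name.to_string(), port);
        if self.apply_on_snapshot_change {
            self.apply_smf();
        }
    }

    /// Would an Upstairs connecting to this port find a downstairs?
    fn is_listening(&self, port: u16) -> bool {
        self.started_ports.contains(&port)
    }
}
```

In this model, `Agent::new(false)` followed by `create_running_snapshot(..., 19001)` leaves nothing listening on 19001 until some later `create_region` call happens to run `apply_smf`, which mirrors the image-then-disk flow hiding the bug. With `Agent::new(true)` the read-only downstairs is started as soon as the running snapshot is recorded.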
On rack2 (dogfood) we uploaded an image, then created an instance based on this image.
Something happened in the crucible agent for one of the downstairs and it failed to start that downstairs.
The propolis server was stuck waiting for the third downstairs.