Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failed to fully add a new sled on madrid #5111

Closed
jgallagher opened this issue Feb 21, 2024 · 14 comments
Closed

Failed to fully add a new sled on madrid #5111

jgallagher opened this issue Feb 21, 2024 · 14 comments

Comments

@jgallagher
Copy link
Contributor

I tried going through the "add sled" process today on madrid, to mixed results. I'm dropping my notes on problems I hit in this issue, and then may spawn additional issues / PRs for specific problems.

  1. The initial attempt to add the sled from the Oxide CLI claimed to fail with a 500:
% oxide system hardware sled add --part 913-0000019 --serial BRM42220004
error
Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "044ca7fb-4a39-4232-abf5-d4c65a9a7a4c", "content-length": "124", "date": "Wed, 21 Feb 2024 18:16:02 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "044ca7fb-4a39-4232-abf5-d4c65a9a7a4c" }

However, after another minute or two, the sled did successfully get added; it no longer appeared in sled list-uninitialized, and the sled-agent logs indicated successful self-reporting to Nexus. I'm not sure if we want to raise the timeout on this endpoint, or maybe split it into a start / poll for progress pair?


  1. Upon receiving the request from Nexus to join the cluster, sled-agent only used 8 of the 10 U.2s.

On sled-agent startup, we saw 10 U.2s and 2 M.2s as expected:

00:01:39.280Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded(UnparsedDisk { paths: DiskPaths { devfs_path: "/devices/pci@0,0/pci1de,fff9@3,3/pci1b96,0@0/blkdev@w0014EE81000BC4F1,0", dev_path: Some("/dev/dsk/c3t0014EE81000BC4F1d0") }, slot: 5, variant: U2, identity: DiskIdentity { vendor: "1b96", serial: "A079E53D", model: "WUS4C6432DSP3X3" }, is_boot_disk: false })
00:01:39.280Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded(UnparsedDisk { paths: DiskPaths { devfs_path: "/devices/pci@0,0/pci1de,fff9@3,2/pci1b96,0@0/blkdev@w0014EE81000BC13A,0", dev_path: Some("/dev/dsk/c2t0014EE81000BC13Ad0") }, slot: 6, variant: U2, identity: DiskIdentity { vendor: "1b96", serial: "A079DEA6", model: "WUS4C6432DSP3X3" }, is_boot_disk: false })
00:01:39.280Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded(UnparsedDisk { paths: DiskPaths { devfs_path: "/devices/pci@ab,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC440,0", dev_path: Some("/dev/dsk/c10t0014EE81000BC440d0") }, slot: 1, variant: U2, identity: DiskIdentity { vendor: "1b96", serial: "A079E3F8", model: "WUS4C6432DSP3X3" }, is_boot_disk: false })
00:01:39.280Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded(UnparsedDisk { paths: DiskPaths { devfs_path: "/devices/pci@38,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC129,0", dev_path: Some("/dev/dsk/c6t0014EE81000BC129d0") }, slot: 8, variant: U2, identity: DiskIdentity { vendor: "1b96", serial: "A079DE8D", model: "WUS4C6432DSP3X3" }, is_boot_disk: false })
00:01:39.281Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded(UnparsedDisk { paths: DiskPaths { devfs_path: "/devices/pci@ab,0/pci1de,fff9@1,3/pci1b96,0@0/blkdev@w0014EE81000BC43B,0", dev_path: Some("/dev/dsk/c11t0014EE81000BC43Bd0") }, slot: 2, variant: U2, identity: DiskIdentity { vendor: "1b96", serial: "A079E3F3", model: "WUS4C6432DSP3X3" }, is_boot_disk: false })
00:01:39.281Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded(UnparsedDisk { paths: DiskPaths { devfs_path: "/devices/pci@38,0/pci1de,fff9@1,1/pci1b96,0@0/blkdev@w0014EE81000BC486,0", dev_path: Some("/dev/dsk/c5t0014EE81000BC486d0") }, slot: 9, variant: U2, identity: DiskIdentity { vendor: "1b96", serial: "A079E4AA", model: "WUS4C6432DSP3X3" }, is_boot_disk: false })
00:01:39.281Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded(UnparsedDisk { paths: DiskPaths { devfs_path: "/devices/pci@ab,0/pci1de,fff9@1,4/pci1b96,0@0/blkdev@w0014EE81000BC323,0", dev_path: Some("/dev/dsk/c12t0014EE81000BC323d0") }, slot: 3, variant: U2, identity: DiskIdentity { vendor: "1b96", serial: "A079E1B3", model: "WUS4C6432DSP3X3" }, is_boot_disk: false })
00:01:39.281Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded(UnparsedDisk { paths: DiskPaths { devfs_path: "/devices/pci@38,0/pci1de,fff9@3,3/pci1344,3100@0/blkdev@w00A0750132753630,0", dev_path: Some("/dev/dsk/c8t00A0750132753630d0") }, slot: 18, variant: M2, identity: DiskIdentity { vendor: "1344", serial: "214132753630", model: "Micron_7300_MTFDHBG1T9TDF" }, is_boot_disk: false })
00:01:39.281Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded(UnparsedDisk { paths: DiskPaths { devfs_path: "/devices/pci@ab,0/pci1de,fff9@1,1/pci1b96,0@0/blkdev@w0014EE81000D2F21,0", dev_path: Some("/dev/dsk/c9t0014EE81000D2F21d0") }, slot: 0, variant: U2, identity: DiskIdentity { vendor: "1b96", serial: "A084A733", model: "WUS4C6432DSP3X3" }, is_boot_disk: false })
00:01:39.281Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded(UnparsedDisk { paths: DiskPaths { devfs_path: "/devices/pci@38,0/pci1de,fff9@1,3/pci1b96,0@0/blkdev@w0014EE81000BC4E6,0", dev_path: Some("/dev/dsk/c7t0014EE81000BC4E6d0") }, slot: 7, variant: U2, identity: DiskIdentity { vendor: "1b96", serial: "A079E576", model: "WUS4C6432DSP3X3" }, is_boot_disk: false })
00:01:39.281Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded(UnparsedDisk { paths: DiskPaths { devfs_path: "/devices/pci@0,0/pci1de,fff9@1,3/pci1344,3100@0/blkdev@w00A075013274B311,0", dev_path: Some("/dev/dsk/c1t00A075013274B311d0") }, slot: 17, variant: M2, identity: DiskIdentity { vendor: "1344", serial: "21413274B311", model: "Micron_7300_MTFDHBG1T9TDF" }, is_boot_disk: true })
00:01:39.281Z INFO SledAgent (HardwareManager): Update from polling device tree: DiskAdded(UnparsedDisk { paths: DiskPaths { devfs_path: "/devices/pci@0,0/pci1de,fff9@3,4/pci1b96,0@0/blkdev@w0014EE81000BC49E,0", dev_path: Some("/dev/dsk/c4t0014EE81000BC49Ed0") }, slot: 4, variant: U2, identity: DiskIdentity { vendor: "1b96", serial: "A079E4C6", model: "WUS4C6432DSP3X3" }, is_boot_disk: false })

However, two of these U.2s induced "persistent errors" from sled-agent's storage manager, for different reasons:

03:28:49.303Z INFO SledAgent (StorageManager): Disk at /devices/pci@ab,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC440,0 already has a GPT
    file = sled-hardware/src/illumos/partitions.rs:103
03:28:49.303Z ERRO SledAgent (StorageManager): Persistent error:not queueing disk
    disk_id = DiskIdentity { vendor: "1b96", serial: "A079E3F8", model: "WUS4C6432DSP3X3" }
    err = PooledDisk(BadPartitionLayout { path: "/devices/pci@ab,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC440,0:wd,raw", why: "Expected 0 partitions, only saw 1" })
    file = sled-storage/src/manager.rs:472
03:28:50.842Z INFO SledAgent (StorageManager): Disk at /devices/pci@38,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC129,0 already has a GPT
    file = sled-hardware/src/illumos/partitions.rs:103
03:28:50.852Z INFO SledAgent (StorageManager): GPT exists without Zpool: formatting zpool at /devices/pci@38,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC129,0:a
    file = sled-hardware/src/disk.rs:247
    disk_id = DiskIdentity { vendor: "1b96", serial: "A079DE8D", model: "WUS4C6432DSP3X3" }
    err = PooledDisk(ZpoolCreate(CreateError { err: Execution(CommandFailure(CommandFailureInfo { command: "/usr/sbin/zpool create oxp_3cfbacb7-204f-4af7-9c92-7e112b906304 /devices/pci@38,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC129,0:a", status: ExitStatus(unix_wait_status(256)), stdout: "", stderr: "cannot open '/devices/pci@38,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC129,0:a': No such device or address\\n" })) }))
    file = sled-storage/src/manager.rs:472

Each of those errors appears in the sled-agent log twice (log attached).

The second of those errors seems particularly odd because when I looked at the sled later, the path it's complaining about did exist:

BRM42220004 # ls -l /devices/pci@38,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC129,0:a
brw-r-----   1 root     root      18, 128 Dec 28  1986 /devices/pci@38,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC129,0:a

  1. Attempting to generate a blueprint from the latest collection failed with a 500

The specific error:

18:37:58.000Z INFO 2be3b4a1-c7ed-49ab-a198-9ca4bd90b32f (dropshot_internal): request completed
    error_message_external = Internal Server Error
    error_message_internal = operation not allowed from API handlers
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/711a749/dropshot/src/server.rs:837
    latency_us = 25198
    local_addr = [fd00:1122:3344:101::5]:12221
    method = POST
    remote_addr = [fd00:1122:3344:101::2]:37662
    req_id = 29424657-0533-4ec5-9153-520649ec59b5
    response_code = 500
    uri = /deployment/blueprints/generate-from-collection

This check was added in #4989, to prevent non-saga/RPW endpoints from performing expensive database operations. Blueprint work will eventually be automated, but for now we're driving it via manual API calls. Dave suggests we change this check to only apply to the external API, for now. I hot patched the check out altogether to make continue the experiment:

--- a/nexus/db-queries/src/db/datastore/sled.rs
+++ b/nexus/db-queries/src/db/datastore/sled.rs
@@ -90,7 +90,7 @@ impl DataStore {
         opctx: &OpContext,
     ) -> ListResultVec<Sled> {
         opctx.authorize(authz::Action::ListChildren, &authz::FLEET).await?;
-        opctx.check_complex_operations_allowed()?;
+        //opctx.check_complex_operations_allowed()?;

         let mut all_sleds = Vec::new();
         let mut paginator = Paginator::new(SQL_BATCH_SIZE);

  1. After regenerating a blueprint with an NTP zone for the new sled, making it the target, and waiting for the sled to timesync, regenerate another new blueprint didn't add any crucible zones.

This was a matter of patience: the reconfigurator planner waits until the latest inventory collection shows the NTP zone, so even though the sled had already timesynced, I had to wait ~10 minutes for an inventory collection pass. I wonder if we should be able to trigger an inventory collection manually from omdb? Or perhaps the blueprint executor should trigger it if new zones were added? (That might require information from sled-agent about whether the PUT actually changed anything?)


  1. While investigating the previous issue, I noticed there were 5 inventory collections, all from approximately the same time.

I don't know why we got 5 here when there are only 3 Nexuses. Maybe we're triggering collections too aggressively, as even the latest one still didn't show the NTP zone's presence.

root@oxz_switch1:~# omdb db inventory collections list
ID                                   STARTED              TOOK    NSPS NERRORS
ed31afc8-8172-4b71-8de7-338e5dacb382 2024-02-21T18:57:32Z 5400 ms 6    8
3565e26f-6e97-41c5-8712-c76ceafb3f76 2024-02-21T18:57:32Z 6325 ms 6    8
a306fef2-f03e-4951-bd94-5348d1d68009 2024-02-21T18:57:32Z 8325 ms 6    8
43db6182-d5e7-4e02-85ad-fab82258692e 2024-02-21T18:57:37Z 4433 ms 6    8
ddfa90f4-6cf3-470f-ae8c-ad959d6b7561 2024-02-21T18:57:38Z 3826 ms 6    8
932bb314-ba2f-4526-a8b9-de2944d4efbd 2024-02-21T18:57:41Z 4901 ms 6    8

Roughly 10 minutes later, 3 new collections showed up, and the 2 oldest were purged (as expected).


  1. Regenerating a blueprint to add crucible zones only added 8.

I believe this is covered by item 2 above; the storage manager failed to bring in 2 disks, so sled-agent presumably only reported 8 to Nexus.

  sled c2cfbfdd-df01-4788-abc1-c900226194f3
-     zone config generation 2
+     zone config generation 3
         zone 68ade02c-fd7a-4794-974b-457bdfa8fafc type internal_ntp underlay IP fd00:1122:3344:121::21 (unchanged)
+        zone 02335712-66f5-4de5-ac81-43d999baccc1 type crucible underlay IP fd00:1122:3344:121::26 (added)
+        zone 06a9ad2b-42d2-4b98-a8c5-a8d350027d42 type crucible underlay IP fd00:1122:3344:121::22 (added)
+        zone 46042106-79d0-4540-a316-3b9a04ae6356 type crucible underlay IP fd00:1122:3344:121::28 (added)
+        zone 6fcdb962-a31a-4799-b6b2-b16288ef1d3a type crucible underlay IP fd00:1122:3344:121::24 (added)
+        zone 7e6ac4f6-aff2-4dd0-b61f-1e750c90ec74 type crucible underlay IP fd00:1122:3344:121::27 (added)
+        zone 9548b841-c5db-4a0f-906f-9bcf2c4071ed type crucible underlay IP fd00:1122:3344:121::25 (added)
+        zone 9b32d4ef-500f-4456-9320-661619b3170d type crucible underlay IP fd00:1122:3344:121::23 (added)
+        zone e197ec22-9981-460c-ae31-324c550f4a61 type crucible underlay IP fd00:1122:3344:121::29 (added)

  1. Launching new instances was a mixed success

I started 10 instances after the new crucible zones were online. 3 of those instances landed on the new sled, and they all seemed to be healthy: I was able to SSH in, they had booted fine, I was able to create files, etc. However, none of the 10 new instances had disks backed on the new sled. The disks for the 10 new instances were:

HOST_SERIAL DISK_NAME            INSTANCE_NAME   PROPOLIS_ZONE                                            VOLUME_ID                            DISK_STATE
BRM42220081 john-add-sled-0-disk john-add-sled-0 oxz_propolis-server_14790254-e479-46d7-a2af-d39353e4c671 d9376cae-5ffe-4e60-a62c-10d9c9bc3951 attached
HOST_SERIAL REGION                               ZONE                                              PHYSICAL_DISK
BRM42220046 ba5c2310-84ac-4f66-a3a4-318595f579bf oxz_crucible_6613c542-004f-4db0-aef9-e06437fefca6 715ba6e2-bf87-48d0-8880-a0e87b736af7
BRM42220081 3d090ae2-6180-4f22-ba03-74dfef4d23d9 oxz_crucible_74c4cdc5-869b-48a0-99f2-f8abbaf4f1ec a9d80546-5cf5-4e5e-99c9-c0cc90cda93e
BRM44220001 17a6748a-ed85-41ed-b06f-86b310341cc1 oxz_crucible_af05714c-8f44-4302-ae1a-a8c76d1f8405 199e999b-b4d7-4c72-bf32-48c1a5d8b69e
HOST_SERIAL DISK_NAME            INSTANCE_NAME   PROPOLIS_ZONE                                            VOLUME_ID                            DISK_STATE
BRM44220001 john-add-sled-9-disk john-add-sled-9 oxz_propolis-server_0a68687b-1170-4167-ab1b-5dcb12fb1336 52821d6e-3aa8-4a66-bae6-207e46bfe1a2 attached
HOST_SERIAL REGION                               ZONE                                              PHYSICAL_DISK
BRM42220081 5c051726-4a00-44e9-a572-52cfdc80eb23 oxz_crucible_20b4e92e-a3b1-4729-bb63-85a546b081e8 12345639-c16c-48e8-bf51-bfb94da76060
BRM42220046 8a6159aa-5292-43e0-9659-46408fd41bf2 oxz_crucible_2fef647e-4827-4f83-a2fd-566459d34c24 d33704d7-01ef-4f9f-8aed-1cee2319fa82
BRM44220001 a81fbe91-79ab-4a6e-a272-2bb6688e3d95 oxz_crucible_cd29643e-f080-4a7c-8f4f-74f30c842df9 b1cbf7d9-0b35-41bf-be8d-65fe16ac2e7b
HOST_SERIAL DISK_NAME            INSTANCE_NAME   PROPOLIS_ZONE                                            VOLUME_ID                            DISK_STATE
BRM42220081 john-add-sled-7-disk john-add-sled-7 oxz_propolis-server_f097df68-edb1-4934-95e7-931addcc047b 4ac49673-9343-40ad-a84c-7eaa11fb0cc4 attached
HOST_SERIAL REGION                               ZONE                                              PHYSICAL_DISK
BRM42220046 778b452d-4b2f-40cf-b1c7-683eb31a42b4 oxz_crucible_2fef647e-4827-4f83-a2fd-566459d34c24 d33704d7-01ef-4f9f-8aed-1cee2319fa82
BRM42220081 ac5c3a82-b855-4b65-be00-f90e2d38f3a6 oxz_crucible_558b375b-75e7-4d74-9958-bf93bc418b7d 8c9eca81-f54e-4dc0-b535-2c2eca63f1f5
BRM44220001 d92e8987-d897-4ee5-a76d-1b03000d849a oxz_crucible_d3c2e5b5-cdc3-42e6-82c2-68357a4d13df 974348d1-87fc-46e3-8355-847c6f36d62d
HOST_SERIAL DISK_NAME            INSTANCE_NAME   PROPOLIS_ZONE                                            VOLUME_ID                            DISK_STATE
BRM42220004 john-add-sled-5-disk john-add-sled-5 oxz_propolis-server_a0ba8feb-6024-49a0-943b-1dbb9e1a6a5e 5e8cefc2-6094-49e4-8e26-750473e1fc02 attached
HOST_SERIAL REGION                               ZONE                                              PHYSICAL_DISK
BRM42220081 5dee8d02-20c9-4b64-b3ce-8e5013cfcb28 oxz_crucible_2e5dba4d-6907-40cb-b32f-130a7fef8471 60dd88da-c3bd-4120-8442-f4318a224546
BRM42220046 5ab65461-f7e6-4c71-8f92-b4e78965db4f oxz_crucible_6613c542-004f-4db0-aef9-e06437fefca6 715ba6e2-bf87-48d0-8880-a0e87b736af7
BRM44220001 c22a5921-0ef9-4691-adc2-63220f8e3bfc oxz_crucible_af05714c-8f44-4302-ae1a-a8c76d1f8405 199e999b-b4d7-4c72-bf32-48c1a5d8b69e
HOST_SERIAL DISK_NAME            INSTANCE_NAME   PROPOLIS_ZONE                                            VOLUME_ID                            DISK_STATE
BRM42220004 john-add-sled-1-disk john-add-sled-1 oxz_propolis-server_8ff86349-8a07-4de5-8477-afaa51935fda 6b085bd4-4b60-4bbb-af69-301b68adc483 attached
HOST_SERIAL REGION                               ZONE                                              PHYSICAL_DISK
BRM42220081 0f8d562d-33e0-4cb5-86e3-d2c2d0a74cde oxz_crucible_2e5dba4d-6907-40cb-b32f-130a7fef8471 60dd88da-c3bd-4120-8442-f4318a224546
BRM42220046 2a65e1fb-2c4a-4b95-950b-7d3c83fd7ca2 oxz_crucible_64272a80-2f5f-48f0-bf2e-48db198cd383 8905aca2-b226-4b58-8fb6-55c2d7987bd1
BRM44220001 a7b4d37b-9fe7-4835-9bef-25c08ee6f333 oxz_crucible_90b32bb5-8859-40d5-ac68-6b332b15a9e4 e8bcd929-bd69-4b36-9962-b60893381dec
HOST_SERIAL DISK_NAME            INSTANCE_NAME   PROPOLIS_ZONE                                            VOLUME_ID                            DISK_STATE
BRM42220004 john-add-sled-3-disk john-add-sled-3 oxz_propolis-server_28e93d58-d061-49ce-9676-15fff9065ad6 f1f97fa6-f259-4189-9b74-8425e0f5735f attached
HOST_SERIAL REGION                               ZONE                                              PHYSICAL_DISK
BRM42220081 f6bf8832-cf4b-497b-b617-20212473a8f3 oxz_crucible_74c4cdc5-869b-48a0-99f2-f8abbaf4f1ec a9d80546-5cf5-4e5e-99c9-c0cc90cda93e
BRM44220001 51a5881b-97f6-4233-bfdf-35b336488ef2 oxz_crucible_77183adc-9286-4c1b-bf58-f0224e13b0f8 0d0e0d7b-723c-44d4-8dd8-6334b571dd01
BRM42220046 76a38dce-33e7-49e8-921e-090a95bec0f4 oxz_crucible_eb26cd8b-0d79-41b8-926a-52b28991de8d 4d6f9cf8-e846-4c50-a683-c9cc0f8c3003
HOST_SERIAL DISK_NAME            INSTANCE_NAME   PROPOLIS_ZONE                                            VOLUME_ID                            DISK_STATE
BRM44220001 john-add-sled-4-disk john-add-sled-4 oxz_propolis-server_f1ba0f7b-b969-4eb1-840c-1b7d393982de b1f94a76-85d2-4a01-83f2-43af1e24c3c0 attached
HOST_SERIAL REGION                               ZONE                                              PHYSICAL_DISK
BRM42220081 174179ce-9076-4eef-baf5-d595f26c7f99 oxz_crucible_2e5dba4d-6907-40cb-b32f-130a7fef8471 60dd88da-c3bd-4120-8442-f4318a224546
BRM44220001 32a29c28-4427-44ff-99ce-7309bce035b5 oxz_crucible_4a2836f7-56bd-4e3e-9ba9-8bfe28578054 a7d2a93c-e66e-4fa4-93a8-78a4c8a07c9d
BRM42220046 74e823ff-7be9-4c10-8d5f-f10eecd78d61 oxz_crucible_75f3a7a2-5699-4d51-b961-577b28dba29d a239d27c-5ad2-4697-a0ab-47688548b412
HOST_SERIAL DISK_NAME            INSTANCE_NAME   PROPOLIS_ZONE                                            VOLUME_ID                            DISK_STATE
BRM42220081 john-add-sled-8-disk john-add-sled-8 oxz_propolis-server_e8d0cb58-c7b0-486a-b458-2a89b1c1c8ea 7b51e91e-7f5d-451c-a9bd-12a564456719 attached
HOST_SERIAL REGION                               ZONE                                              PHYSICAL_DISK
BRM42220046 401e8c19-f5fa-46ee-af51-621e281c9c57 oxz_crucible_50e1e228-0fee-4a24-b73f-3aaa29e01208 973de75d-2b52-408c-9092-bd67154be0ca
BRM42220081 f8a24b9e-0cbe-4f38-86f0-7f163e85ee81 oxz_crucible_74c4cdc5-869b-48a0-99f2-f8abbaf4f1ec a9d80546-5cf5-4e5e-99c9-c0cc90cda93e
BRM44220001 360da5cf-e778-4e72-beb2-1ac107984e29 oxz_crucible_90b32bb5-8859-40d5-ac68-6b332b15a9e4 e8bcd929-bd69-4b36-9962-b60893381dec
HOST_SERIAL DISK_NAME            INSTANCE_NAME   PROPOLIS_ZONE                                            VOLUME_ID                            DISK_STATE
BRM44220001 john-add-sled-2-disk john-add-sled-2 oxz_propolis-server_c5e38b5c-2062-40d4-9a38-3843acbe1650 7a4b0b7d-7dde-49a3-a76a-09d87cc632fb attached
HOST_SERIAL REGION                               ZONE                                              PHYSICAL_DISK
BRM42220046 dea61400-fd33-4d56-8501-00b3d07fa280 oxz_crucible_50e1e228-0fee-4a24-b73f-3aaa29e01208 973de75d-2b52-408c-9092-bd67154be0ca
BRM44220001 ec56bcb3-52a3-4c25-a7eb-6e92adf11e0d oxz_crucible_6dffef67-2ce3-469e-8d9d-f84a71c64273 f913ceb5-ae6a-4590-95f9-2853b1f97d4a
BRM42220081 bff468ff-4503-4e12-bc73-7b156b74d669 oxz_crucible_f12618be-a796-4c87-8fb2-2a97ae0bc8a3 892c1865-e111-426e-8743-06def0822495
HOST_SERIAL DISK_NAME            INSTANCE_NAME   PROPOLIS_ZONE                                            VOLUME_ID                            DISK_STATE
BRM42220046 john-add-sled-6-disk john-add-sled-6 oxz_propolis-server_ef3afa4e-1f9b-461a-bd10-effeb5818979 a4477901-f39f-478e-bd8c-68420514cefb attached
HOST_SERIAL REGION                               ZONE                                              PHYSICAL_DISK
BRM42220081 52116ab9-c7ae-4086-8059-d45d4978b1f6 oxz_crucible_9a8d2f85-b023-4724-9a7e-ff56c213352c f829761d-8d5b-4fea-bc2e-7ed838656998
BRM42220046 954d3a17-b3cb-451d-b207-6eadcd7f97d3 oxz_crucible_9d18078c-d61a-4be2-8eeb-e3f24333e1d2 75b5d562-0c52-4711-8626-95bdcb287485
BRM44220001 bab19e99-aa66-4e32-93aa-7da1377432a5 oxz_crucible_d3c2e5b5-cdc3-42e6-82c2-68357a4d13df 974348d1-87fc-46e3-8355-847c6f36d62d

BRM42220004 is the serial of the new sled. It shows up as the owner of the propolis zone for 3 of the disks (john-add-sled-{1,3,5}-disk), but none of the crucible zones for any of the above. Poking around the code and CRDB, I believe this is because disk allocation looks at the dataset table, and there were no rows for the new crucible zones in that table:

root@[fd00:1122:3344:101::3]:32221/omicron  OPEN> select * from dataset where kind='crucible' order by time_created;
                   id                  |         time_created          |         time_modified         | time_deleted | rcgen |               pool_id                |           ip           | port  |   kind   |  size_used
---------------------------------------+-------------------------------+-------------------------------+--------------+-------+--------------------------------------+------------------------+-------+----------+---------------
  6dffef67-2ce3-469e-8d9d-f84a71c64273 | 2024-02-21 15:05:44.407924+00 | 2024-02-21 15:05:44.407924+00 | NULL         |     1 | f2298af4-6fed-4a53-83b3-8f645fbf5db9 | fd00:1122:3344:101::b  | 32345 | crucible | 216895848448
  d3c2e5b5-cdc3-42e6-82c2-68357a4d13df | 2024-02-21 15:05:44.407924+00 | 2024-02-21 15:05:44.407924+00 | NULL         |     1 | d0d7b830-a7f2-45e3-b87f-de85b98f08e1 | fd00:1122:3344:101::a  | 32345 | crucible | 644245094400
  af05714c-8f44-4302-ae1a-a8c76d1f8405 | 2024-02-21 15:05:44.407924+00 | 2024-02-21 15:05:44.407924+00 | NULL         |     1 | e014097e-eef9-4b73-bd24-337f610cd0c4 | fd00:1122:3344:101::d  | 32345 | crucible | 858993459200
  90b32bb5-8859-40d5-ac68-6b332b15a9e4 | 2024-02-21 15:05:44.407924+00 | 2024-02-21 15:05:44.407924+00 | NULL         |     1 | 6e749ef3-d960-4534-8b93-e47205972f60 | fd00:1122:3344:101::8  | 32345 | crucible | 644245094400
  77183adc-9286-4c1b-bf58-f0224e13b0f8 | 2024-02-21 15:05:44.407924+00 | 2024-02-21 15:05:44.407924+00 | NULL         |     1 | fc2669a0-ea73-4ded-8107-85e82776ae92 | fd00:1122:3344:101::9  | 32345 | crucible | 429496729600
  4a2836f7-56bd-4e3e-9ba9-8bfe28578054 | 2024-02-21 15:05:44.407924+00 | 2024-02-21 15:05:44.407924+00 | NULL         |     1 | 84b2ee25-13d0-44f6-8158-e2409e736a49 | fd00:1122:3344:101::c  | 32345 | crucible | 429496729600
  cd29643e-f080-4a7c-8f4f-74f30c842df9 | 2024-02-21 15:05:44.407925+00 | 2024-02-21 15:05:44.407925+00 | NULL         |     1 | 3cd58305-77df-4f74-8174-76eeccc29f94 | fd00:1122:3344:101::10 | 32345 | crucible | 429496729600
  6d5929a4-3cf1-4cfd-b75f-e8b0ab83ef98 | 2024-02-21 15:05:44.407925+00 | 2024-02-21 15:05:44.407925+00 | NULL         |     1 | e60a21f4-f73e-4257-8148-79896b785b2f | fd00:1122:3344:101::f  | 32345 | crucible | 216895848448
  0babc1fd-7354-4505-b8af-7a6abf4c4a40 | 2024-02-21 15:05:44.407925+00 | 2024-02-21 15:05:44.407925+00 | NULL         |     1 | 695081ce-7377-4fde-9d53-178c46ea91eb | fd00:1122:3344:101::e  | 32345 | crucible | 644245094400
  75c8b60d-b690-49a8-a7fd-9aaccbe96586 | 2024-02-21 15:05:44.407926+00 | 2024-02-21 15:05:44.407926+00 | NULL         |     1 | 00c56f76-a5e5-45b2-950b-fd189d1560de | fd00:1122:3344:102::e  | 32345 | crucible |            0
  9d18078c-d61a-4be2-8eeb-e3f24333e1d2 | 2024-02-21 15:05:44.407926+00 | 2024-02-21 15:05:44.407926+00 | NULL         |     1 | 440ad8f7-4867-4352-bca2-891ce6569044 | fd00:1122:3344:102::8  | 32345 | crucible | 429496729600
  2fef647e-4827-4f83-a2fd-566459d34c24 | 2024-02-21 15:05:44.407926+00 | 2024-02-21 15:05:44.407926+00 | NULL         |     1 | 9cec1462-2d30-4344-b8ea-688392b2a34f | fd00:1122:3344:102::b  | 32345 | crucible | 858993459200
  6613c542-004f-4db0-aef9-e06437fefca6 | 2024-02-21 15:05:44.407926+00 | 2024-02-21 15:05:44.407926+00 | NULL         |     1 | 66787f85-4631-4b65-a1ba-9f761f2f79f7 | fd00:1122:3344:102::c  | 32345 | crucible | 431644213248
  4f3212c5-8d35-4ba5-a001-3b186ee636e1 | 2024-02-21 15:05:44.407926+00 | 2024-02-21 15:05:44.407926+00 | NULL         |     1 | e3ff512a-8225-4ed7-b653-54c134de7eb8 | fd00:1122:3344:102::d  | 32345 | crucible | 214748364800
  eb26cd8b-0d79-41b8-926a-52b28991de8d | 2024-02-21 15:05:44.407926+00 | 2024-02-21 15:05:44.407926+00 | NULL         |     1 | dc645b70-e0ee-45a8-8fd0-c23cae5b4f86 | fd00:1122:3344:102::9  | 32345 | crucible | 646392578048
  a33d4be0-ab8f-4640-95c2-f828ff5917cf | 2024-02-21 15:05:44.407926+00 | 2024-02-21 15:05:44.407926+00 | NULL         |     1 | 9e50d9ed-1793-4e9f-8df3-7c9649c0670f | fd00:1122:3344:102::a  | 32345 | crucible | 429496729600
  75f3a7a2-5699-4d51-b961-577b28dba29d | 2024-02-21 15:05:44.407927+00 | 2024-02-21 15:05:44.407927+00 | NULL         |     1 | 14fb88c8-f611-44c9-a015-e305584fcb93 | fd00:1122:3344:102::11 | 32345 | crucible | 429496729600
  64272a80-2f5f-48f0-bf2e-48db198cd383 | 2024-02-21 15:05:44.407927+00 | 2024-02-21 15:05:44.407927+00 | NULL         |     1 | da964e55-686b-4005-8cca-a3602f02e8c5 | fd00:1122:3344:102::10 | 32345 | crucible | 644245094400
  50e1e228-0fee-4a24-b73f-3aaa29e01208 | 2024-02-21 15:05:44.407927+00 | 2024-02-21 15:05:44.407927+00 | NULL         |     1 | 20fbb557-88b5-4027-8d65-02bcce25fc53 | fd00:1122:3344:102::f  | 32345 | crucible | 429496729600
  74c4cdc5-869b-48a0-99f2-f8abbaf4f1ec | 2024-02-21 15:05:44.407928+00 | 2024-02-21 15:05:44.407928+00 | NULL         |     1 | d0344ef8-ffec-4f20-8b14-d6749d87cf05 | fd00:1122:3344:103::9  | 32345 | crucible | 858993459200
  1d37887f-aa4b-45dc-b973-d0c221785098 | 2024-02-21 15:05:44.407928+00 | 2024-02-21 15:05:44.407928+00 | NULL         |     1 | 3b0f7b76-1d71-41aa-a1bc-a26ada040085 | fd00:1122:3344:103::c  | 32345 | crucible | 219043332096
  f12618be-a796-4c87-8fb2-2a97ae0bc8a3 | 2024-02-21 15:05:44.407928+00 | 2024-02-21 15:05:44.407928+00 | NULL         |     1 | 6fca83f1-f134-4b1c-9b68-3cbfa853a83b | fd00:1122:3344:103::a  | 32345 | crucible | 429496729600
  20b4e92e-a3b1-4729-bb63-85a546b081e8 | 2024-02-21 15:05:44.407928+00 | 2024-02-21 15:05:44.407928+00 | NULL         |     1 | d05fa466-6fbf-49d2-a581-528e734ab56d | fd00:1122:3344:103::b  | 32345 | crucible | 429496729600
  2e5dba4d-6907-40cb-b32f-130a7fef8471 | 2024-02-21 15:05:44.407928+00 | 2024-02-21 15:05:44.407928+00 | NULL         |     1 | 2737d855-5b88-4ef5-ab5a-e9d75ad8ca7c | fd00:1122:3344:103::7  | 32345 | crucible | 858993459200
  f8e984ce-10ac-4ffd-b305-bb83d4cdbba7 | 2024-02-21 15:05:44.407928+00 | 2024-02-21 15:05:44.407928+00 | NULL         |     1 | 4aadf142-686f-44b3-8e47-71f032944cd1 | fd00:1122:3344:103::8  | 32345 | crucible | 429496729600
  9a8d2f85-b023-4724-9a7e-ff56c213352c | 2024-02-21 15:05:44.407929+00 | 2024-02-21 15:05:44.407929+00 | NULL         |     1 | 4d846f5d-7b41-4345-af76-6e09d96d9cfd | fd00:1122:3344:103::d  | 32345 | crucible | 214748364800
  1b5fa9e4-6ae6-4eb3-961b-9ffbb28bbaa6 | 2024-02-21 15:05:44.407929+00 | 2024-02-21 15:05:44.407929+00 | NULL         |     1 | 806a9118-d6da-452b-b353-9c1df93eb0da | fd00:1122:3344:103::10 | 32345 | crucible | 214748364800
  3b01693a-e977-4426-9fe1-a1c151c64ca1 | 2024-02-21 15:05:44.407929+00 | 2024-02-21 15:05:44.407929+00 | NULL         |     1 | cd572df9-f4b0-40cf-9220-5ea5c0e63586 | fd00:1122:3344:103::f  | 32345 | crucible | 214748364800
  558b375b-75e7-4d74-9958-bf93bc418b7d | 2024-02-21 15:05:44.407929+00 | 2024-02-21 15:05:44.407929+00 | NULL         |     1 | 1282a769-bfb2-4f6a-9218-32b1560a399b | fd00:1122:3344:103::e  | 32345 | crucible | 644245094400
(29 rows)

These 29 rows match the 29 zpools from the 3 sleds used at RSS time. But in the zpool table, there are 37 rows for U.2s; the 29 above plus 8 from the new sled:

root@[fd00:1122:3344:101::3]:32221/omicron  OPEN> select * from zpool where total_size=3195455668224 order by time_created;
                   id                  |         time_created          |         time_modified         | time_deleted | rcgen |               sled_id                |           physical_disk_id           |  total_size
---------------------------------------+-------------------------------+-------------------------------+--------------+-------+--------------------------------------+--------------------------------------+----------------
  cd572df9-f4b0-40cf-9220-5ea5c0e63586 | 2024-02-21 15:04:14.102923+00 | 2024-02-21 15:04:14.102923+00 | NULL         |     2 | dc22b82b-2e48-4ed6-abad-2c4429e952a4 | 34e14340-2d7c-4b55-8559-3b4be7775833 | 3195455668224
  d05fa466-6fbf-49d2-a581-528e734ab56d | 2024-02-21 15:04:14.124683+00 | 2024-02-21 15:04:14.124683+00 | NULL         |     2 | dc22b82b-2e48-4ed6-abad-2c4429e952a4 | 12345639-c16c-48e8-bf51-bfb94da76060 | 3195455668224
  2737d855-5b88-4ef5-ab5a-e9d75ad8ca7c | 2024-02-21 15:04:29.088724+00 | 2024-02-21 15:04:29.088724+00 | NULL         |     5 | dc22b82b-2e48-4ed6-abad-2c4429e952a4 | 60dd88da-c3bd-4120-8442-f4318a224546 | 3195455668224
  3cd58305-77df-4f74-8174-76eeccc29f94 | 2024-02-21 15:04:33.538257+00 | 2024-02-21 15:04:33.538257+00 | NULL         |     2 | 7d7d3ff5-357f-4756-be5b-0cc459ca3811 | b1cbf7d9-0b35-41bf-be8d-65fe16ac2e7b | 3195455668224
  4d846f5d-7b41-4345-af76-6e09d96d9cfd | 2024-02-21 15:04:35.986221+00 | 2024-02-21 15:04:35.986221+00 | NULL         |     2 | dc22b82b-2e48-4ed6-abad-2c4429e952a4 | f829761d-8d5b-4fea-bc2e-7ed838656998 | 3195455668224
  d0d7b830-a7f2-45e3-b87f-de85b98f08e1 | 2024-02-21 15:04:36.077428+00 | 2024-02-21 15:04:36.077428+00 | NULL         |     2 | 7d7d3ff5-357f-4756-be5b-0cc459ca3811 | 974348d1-87fc-46e3-8355-847c6f36d62d | 3195455668224
  e014097e-eef9-4b73-bd24-337f610cd0c4 | 2024-02-21 15:04:36.942167+00 | 2024-02-21 15:04:36.942167+00 | NULL         |     2 | 7d7d3ff5-357f-4756-be5b-0cc459ca3811 | 199e999b-b4d7-4c72-bf32-48c1a5d8b69e | 3195455668224
  1282a769-bfb2-4f6a-9218-32b1560a399b | 2024-02-21 15:04:42.293528+00 | 2024-02-21 15:04:42.293528+00 | NULL         |     2 | dc22b82b-2e48-4ed6-abad-2c4429e952a4 | 8c9eca81-f54e-4dc0-b535-2c2eca63f1f5 | 3195455668224
  da964e55-686b-4005-8cca-a3602f02e8c5 | 2024-02-21 15:04:46.188105+00 | 2024-02-21 15:04:46.188105+00 | NULL         |     2 | fba3ebcd-c4d9-43b4-8214-01cfc14bee99 | 8905aca2-b226-4b58-8fb6-55c2d7987bd1 | 3195455668224
  806a9118-d6da-452b-b353-9c1df93eb0da | 2024-02-21 15:04:47.654995+00 | 2024-02-21 15:04:47.654995+00 | NULL         |     2 | dc22b82b-2e48-4ed6-abad-2c4429e952a4 | 18eea110-a109-43f8-bc6d-fcc5c50e1eee | 3195455668224
  84b2ee25-13d0-44f6-8158-e2409e736a49 | 2024-02-21 15:04:49.24759+00  | 2024-02-21 15:04:49.24759+00  | NULL         |     2 | 7d7d3ff5-357f-4756-be5b-0cc459ca3811 | a7d2a93c-e66e-4fa4-93a8-78a4c8a07c9d | 3195455668224
  00c56f76-a5e5-45b2-950b-fd189d1560de | 2024-02-21 15:04:49.761461+00 | 2024-02-21 15:04:49.761461+00 | NULL         |     2 | fba3ebcd-c4d9-43b4-8214-01cfc14bee99 | 5a4d1226-04c8-46ba-a1f9-86099e5d8654 | 3195455668224
  20fbb557-88b5-4027-8d65-02bcce25fc53 | 2024-02-21 15:04:52.856451+00 | 2024-02-21 15:04:52.856451+00 | NULL         |     2 | fba3ebcd-c4d9-43b4-8214-01cfc14bee99 | 973de75d-2b52-408c-9092-bd67154be0ca | 3195455668224
  9cec1462-2d30-4344-b8ea-688392b2a34f | 2024-02-21 15:04:53.585515+00 | 2024-02-21 15:04:53.585515+00 | NULL         |     2 | fba3ebcd-c4d9-43b4-8214-01cfc14bee99 | d33704d7-01ef-4f9f-8aed-1cee2319fa82 | 3195455668224
  fc2669a0-ea73-4ded-8107-85e82776ae92 | 2024-02-21 15:04:53.706475+00 | 2024-02-21 15:04:53.706475+00 | NULL         |     3 | 7d7d3ff5-357f-4756-be5b-0cc459ca3811 | 0d0e0d7b-723c-44d4-8dd8-6334b571dd01 | 3195455668224
  6fca83f1-f134-4b1c-9b68-3cbfa853a83b | 2024-02-21 15:04:54.172907+00 | 2024-02-21 15:04:54.172907+00 | NULL         |     2 | dc22b82b-2e48-4ed6-abad-2c4429e952a4 | 892c1865-e111-426e-8743-06def0822495 | 3195455668224
  3b0f7b76-1d71-41aa-a1bc-a26ada040085 | 2024-02-21 15:04:54.332733+00 | 2024-02-21 15:04:54.332733+00 | NULL         |     2 | dc22b82b-2e48-4ed6-abad-2c4429e952a4 | 1cef0d30-f930-4871-a9c8-f7d27ec61e36 | 3195455668224
  6e749ef3-d960-4534-8b93-e47205972f60 | 2024-02-21 15:04:54.458113+00 | 2024-02-21 15:04:54.458113+00 | NULL         |     4 | 7d7d3ff5-357f-4756-be5b-0cc459ca3811 | e8bcd929-bd69-4b36-9962-b60893381dec | 3195455668224
  14fb88c8-f611-44c9-a015-e305584fcb93 | 2024-02-21 15:04:56.254718+00 | 2024-02-21 15:04:56.254718+00 | NULL         |     2 | fba3ebcd-c4d9-43b4-8214-01cfc14bee99 | a239d27c-5ad2-4697-a0ab-47688548b412 | 3195455668224
  9e50d9ed-1793-4e9f-8df3-7c9649c0670f | 2024-02-21 15:04:58.743848+00 | 2024-02-21 15:04:58.743848+00 | NULL         |     2 | fba3ebcd-c4d9-43b4-8214-01cfc14bee99 | 8417cd16-226d-46cd-8abd-a038ccfcd49b | 3195455668224
  66787f85-4631-4b65-a1ba-9f761f2f79f7 | 2024-02-21 15:05:00.609673+00 | 2024-02-21 15:05:00.609673+00 | NULL         |     2 | fba3ebcd-c4d9-43b4-8214-01cfc14bee99 | 715ba6e2-bf87-48d0-8880-a0e87b736af7 | 3195455668224
  e60a21f4-f73e-4257-8148-79896b785b2f | 2024-02-21 15:05:00.994313+00 | 2024-02-21 15:05:00.994313+00 | NULL         |     2 | 7d7d3ff5-357f-4756-be5b-0cc459ca3811 | 4884bb10-c9bb-43c5-8fc9-c8aa363d97a7 | 3195455668224
  440ad8f7-4867-4352-bca2-891ce6569044 | 2024-02-21 15:05:01.578578+00 | 2024-02-21 15:05:01.578578+00 | NULL         |     5 | fba3ebcd-c4d9-43b4-8214-01cfc14bee99 | 75b5d562-0c52-4711-8626-95bdcb287485 | 3195455668224
  d0344ef8-ffec-4f20-8b14-d6749d87cf05 | 2024-02-21 15:05:04.267426+00 | 2024-02-21 15:05:04.267426+00 | NULL         |     2 | dc22b82b-2e48-4ed6-abad-2c4429e952a4 | a9d80546-5cf5-4e5e-99c9-c0cc90cda93e | 3195455668224
  dc645b70-e0ee-45a8-8fd0-c23cae5b4f86 | 2024-02-21 15:05:13.584497+00 | 2024-02-21 15:05:13.584497+00 | NULL         |     3 | fba3ebcd-c4d9-43b4-8214-01cfc14bee99 | 4d6f9cf8-e846-4c50-a683-c9cc0f8c3003 | 3195455668224
  4aadf142-686f-44b3-8e47-71f032944cd1 | 2024-02-21 15:05:15.556598+00 | 2024-02-21 15:05:15.556598+00 | NULL         |     2 | dc22b82b-2e48-4ed6-abad-2c4429e952a4 | 20d7b38a-1cc7-407c-a8bc-284df6cd498a | 3195455668224
  695081ce-7377-4fde-9d53-178c46ea91eb | 2024-02-21 15:05:29.868917+00 | 2024-02-21 15:05:29.868917+00 | NULL         |     2 | 7d7d3ff5-357f-4756-be5b-0cc459ca3811 | f934c48c-9970-48e0-8da0-26fae92ed697 | 3195455668224
  f2298af4-6fed-4a53-83b3-8f645fbf5db9 | 2024-02-21 15:05:38.148935+00 | 2024-02-21 15:05:38.148935+00 | NULL         |     2 | 7d7d3ff5-357f-4756-be5b-0cc459ca3811 | f913ceb5-ae6a-4590-95f9-2853b1f97d4a | 3195455668224
  e3ff512a-8225-4ed7-b653-54c134de7eb8 | 2024-02-21 15:05:40.272901+00 | 2024-02-21 15:05:40.272901+00 | NULL         |     2 | fba3ebcd-c4d9-43b4-8214-01cfc14bee99 | 58757d4b-be3b-4cc0-925c-7dd27613a9b3 | 3195455668224
  80d267e0-afbb-42ad-8745-fde2b8348399 | 2024-02-21 18:17:57.909938+00 | 2024-02-21 18:17:57.909938+00 | NULL         |     1 | c2cfbfdd-df01-4788-abc1-c900226194f3 | 2bb134e5-025f-45c6-a462-97813dd5dccc | 3195455668224
  561d204d-b989-48fa-b4a0-cfd07b116121 | 2024-02-21 18:18:09.401871+00 | 2024-02-21 18:18:09.401871+00 | NULL         |     1 | c2cfbfdd-df01-4788-abc1-c900226194f3 | 924863d1-b31b-4bd8-809e-0374085c8c87 | 3195455668224
  6cd4979d-7548-4ebb-90fd-396aed02aaf4 | 2024-02-21 18:18:09.796106+00 | 2024-02-21 18:18:09.796106+00 | NULL         |     1 | c2cfbfdd-df01-4788-abc1-c900226194f3 | 7a8ee047-9c4d-44c3-9c44-5ed4fe9a5d40 | 3195455668224
  a186b2e4-b6bf-4cbc-ac18-dda121b11816 | 2024-02-21 18:18:15.801725+00 | 2024-02-21 18:18:15.801725+00 | NULL         |     1 | c2cfbfdd-df01-4788-abc1-c900226194f3 | f9e5b42e-52d0-4b2f-bac0-c601f8da6408 | 3195455668224
  3ee8f2d0-e53d-4b69-82ce-518bbd65e7b8 | 2024-02-21 18:18:16.365083+00 | 2024-02-21 18:18:16.365083+00 | NULL         |     1 | c2cfbfdd-df01-4788-abc1-c900226194f3 | 8977ee53-d011-4ea9-8d21-3c194f906ea6 | 3195455668224
  2d91b0f7-d40c-4dc2-ba63-dd3cfc4c6349 | 2024-02-21 18:18:16.566012+00 | 2024-02-21 18:18:16.566012+00 | NULL         |     1 | c2cfbfdd-df01-4788-abc1-c900226194f3 | 27637d23-f170-4fd3-9d0e-4b6ab59f2d34 | 3195455668224
  0819800f-7651-43d3-aa0e-bd45f300f536 | 2024-02-21 18:18:39.744575+00 | 2024-02-21 18:18:39.744575+00 | NULL         |     1 | c2cfbfdd-df01-4788-abc1-c900226194f3 | a76fb342-586b-43d5-ba6c-0e8cdc432e40 | 3195455668224
  2ac0554c-0f30-4eae-a846-1d8e7ec04e4a | 2024-02-21 18:18:52.193842+00 | 2024-02-21 18:18:52.193842+00 | NULL         |     1 | c2cfbfdd-df01-4788-abc1-c900226194f3 | 74867cd4-e5fe-43f3-861e-7ec2431000da | 3195455668224
(37 rows)

(sled-id c2cfbfdd-df01-4788-abc1-c900226194f3 is the ID of BRM42220004, as expected by the gap in time_created.)

@jgallagher
Copy link
Contributor Author

The sled-agent log:
BRM42220004-sled-agent.log

@jgallagher
Copy link
Contributor Author

Nexus log from problem 1 is too big to attach to github; it's at /staff/lab/madrid/john-update-2024-02-21/nexus-add-sled-fail.log. It is indeed a timeout 15sec after the request started:

18:16:02.685Z INFO 6c7bc66a-1e69-4b09-a576-93e1fd48aca4 (dropshot_external): request completed
    error_message_external = Internal Server Error
    error_message_internal = failed to add sled with baseboard UninitializedSledId { serial: "BRM42220004", part: "913-0000019" } to rack 054a8a4c-9ebb-4622-bfb8-5af8c30694c0: Communication Error: error sending request for url (http://[fd00:1122:3344:101::1]:12345/sleds): operation timed out
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/711a749/dropshot/src/server.rs:837
    latency_us = 15126784
    local_addr = 172.30.2.6:443
    method = POST
    remote_addr = 172.20.16.242:54324
    req_id = 044ca7fb-4a39-4232-abf5-d4c65a9a7a4c
    response_code = 500
    uri = //v1/system/hardware/sleds

@jgallagher
Copy link
Contributor Author

Contents of /pool/int/*/config/*.json on the newly-added sled (both M.2s had identical files, as expected):

omicron-zones.json
sled-agent-request.json

@jgallagher
Copy link
Contributor Author

This is the sled-agent log from BRM44220001, which was the member of the cluster that handled the request from Nexus to add a new sled (problem 1): BRM44220001-sled-agent.log

Grepping for the Nexus source port, it looks like the request took just over 2 minutes:

% grep :59386 BRM44220001-sled-agent.log | looker
18:15:47.679Z INFO SledAgent (dropshot (SledAgent)): accepted connection
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/711a749/dropshot/src/server.rs:765
    local_addr = [fd00:1122:3344:101::1]:12345
    remote_addr = [fd00:1122:3344:102::5]:59386
18:17:50.861Z WARN SledAgent (dropshot (SledAgent)): client disconnected before response returned
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/711a749/dropshot/src/server.rs:923
    local_addr = [fd00:1122:3344:101::1]:12345
    method = PUT
    remote_addr = [fd00:1122:3344:102::5]:59386
    req_id = dc8a5c33-eb99-472c-b9dc-357b82b14760
    uri = /sleds

@davepacheco
Copy link
Collaborator

I took a look at why we have so many inventory collections for only three Nexus instances. I think we're getting somewhat unlucky on timing. When I walked up I saw these collections:

root@[fd00:1122:3344:101::3]:32221/omicron> select c.*, count(e.inv_collection_id) from inv_collection c, inv_collection_error e WHERE c.id = e.inv_collection_id GROUP BY id;
                   id                  |         time_started          |           time_done           |              collector               | count
---------------------------------------+-------------------------------+-------------------------------+--------------------------------------+--------
  a8e7e17f-45f2-4bc6-b577-4a7659a5cd57 | 2024-02-21 21:37:32.942262+00 | 2024-02-21 21:37:38.218844+00 | 2be3b4a1-c7ed-49ab-a198-9ca4bd90b32f |     8
  e6b1169f-312a-44af-accf-7fbd41096eea | 2024-02-21 21:37:32.730654+00 | 2024-02-21 21:37:37.639467+00 | 6c7bc66a-1e69-4b09-a576-93e1fd48aca4 |     8
  24128f90-60ea-450e-98d6-6a96653a0177 | 2024-02-21 21:47:32.297056+00 | 2024-02-21 21:47:36.374437+00 | 7e787ab2-fd83-4c27-a9be-ca3b284067d7 |     8
  274ab7ad-affc-4d30-a330-8cfd4f8ee348 | 2024-02-21 21:37:32.407992+00 | 2024-02-21 21:37:37.011132+00 | 7e787ab2-fd83-4c27-a9be-ca3b284067d7 |     8
  7ab4d21b-84a0-4af1-8534-d36a66f5a3ff | 2024-02-21 21:47:32.914722+00 | 2024-02-21 21:47:38.537935+00 | 6c7bc66a-1e69-4b09-a576-93e1fd48aca4 |     8
  839e4e9f-db92-4200-967d-60a0965a4b86 | 2024-02-21 21:47:32.958048+00 | 2024-02-21 21:47:38.541542+00 | 2be3b4a1-c7ed-49ab-a198-9ca4bd90b32f |     8
(6 rows)

(All collections have 8 errors related to their SPs; I believe this is known and expected.)

I found that two Nexus instances always found few enough collections that there was nothing to prune. Their spitting out messages like this each time:

21:37:32.543Z DEBG 6c7bc66a-1e69-4b09-a576-93e1fd48aca4 (ServerContext): inventory_prune_one: nothing eligible for removal (too few)
    background_task = inventory_collection
    candidates = [(045df132-299d-414c-beb6-44d6067d61a7, 8), (067d8f47-6641-4d04-a736-13f49e701139, 8), (168e28f9-7bcd-461d-a7a7-2ca767449802, 8)] 
21:47:32.578Z DEBG 6c7bc66a-1e69-4b09-a576-93e1fd48aca4 (ServerContext): inventory_prune_one: nothing eligible for removal (too few)
    background_task = inventory_collection
    candidates = [(274ab7ad-affc-4d30-a330-8cfd4f8ee348, 8), (e6b1169f-312a-44af-accf-7fbd41096eea, 8), (a8e7e17f-45f2-4bc6-b577-4a7659a5cd57, 8)] 

The last one is doing the pruning, spitting out messages like this:

19:27:31.804Z DEBG 7e787ab2-fd83-4c27-a9be-ca3b284067d7 (ServerContext): inventory_prune_one: eligible for removal
    background_task = inventory_collection
    candidates = [(b848ba1c-319b-4738-b84b-1b91c2c14569, 8), (b6341ff2-d1b5-43db-b2a0-ccb22643e8dd, 8), (9e8c0339-66c1-4cd6-b718-e248c444a5a5, 8), (d2a9cac3-6519-41de-8e1d-82cf3649b781, 8)] 
    collection_id = b848ba1c-319b-4738-b84b-1b91c2c14569
19:27:31.835Z DEBG 7e787ab2-fd83-4c27-a9be-ca3b284067d7 (ServerContext): inventory_prune_one: eligible for removal
    background_task = inventory_collection
    candidates = [(b6341ff2-d1b5-43db-b2a0-ccb22643e8dd, 8), (9e8c0339-66c1-4cd6-b718-e248c444a5a5, 8), (d2a9cac3-6519-41de-8e1d-82cf3649b781, 8), (4b141df1-6322-46e3-9a6d-4f31af051f06, 8)] 

I happened to catch a moment when there were only three:

root@[fd00:1122:3344:101::3]:32221/omicron> select c.*, count(e.inv_collection_id) from inv_collection c, inv_collection_error e WHERE c.id = e.inv_collection_id GROUP BY id; 
                   id                  |         time_started          |           time_done           |              collector               | count
---------------------------------------+-------------------------------+-------------------------------+--------------------------------------+--------
  24128f90-60ea-450e-98d6-6a96653a0177 | 2024-02-21 21:47:32.297056+00 | 2024-02-21 21:47:36.374437+00 | 7e787ab2-fd83-4c27-a9be-ca3b284067d7 |     8   
  7ab4d21b-84a0-4af1-8534-d36a66f5a3ff | 2024-02-21 21:47:32.914722+00 | 2024-02-21 21:47:38.537935+00 | 6c7bc66a-1e69-4b09-a576-93e1fd48aca4 |     8   
  839e4e9f-db92-4200-967d-60a0965a4b86 | 2024-02-21 21:47:32.958048+00 | 2024-02-21 21:47:38.541542+00 | 2be3b4a1-c7ed-49ab-a198-9ca4bd90b32f |     8   
(3 rows)


Time: 6ms total (execution 6ms / network 0ms)

root@[fd00:1122:3344:101::3]:32221/omicron> SELECT NOW();
               now 
---------------------------------
  2024-02-21 21:57:36.583124+00
(1 row)

Time: 1ms total (execution 0ms / network 0ms)

By the time I'd typed the above query to print the time and ran the first one again, there were six again:

root@[fd00:1122:3344:101::3]:32221/omicron> select c.*, count(e.inv_collection_id) from inv_collection c, inv_collection_error e WHERE c.id = e.inv_collection_id GROUP BY id; 
                   id                  |         time_started          |           time_done           |              collector               | count
---------------------------------------+-------------------------------+-------------------------------+--------------------------------------+--------
  1c4a31a4-373f-48f8-8ccc-e1c046ce9cf8 | 2024-02-21 21:57:32.991799+00 | 2024-02-21 21:57:38.289313+00 | 2be3b4a1-c7ed-49ab-a198-9ca4bd90b32f |     8   
  24128f90-60ea-450e-98d6-6a96653a0177 | 2024-02-21 21:47:32.297056+00 | 2024-02-21 21:47:36.374437+00 | 7e787ab2-fd83-4c27-a9be-ca3b284067d7 |     8   
  71507fd0-36c8-4192-816f-9952523a6ae7 | 2024-02-21 21:57:32.342208+00 | 2024-02-21 21:57:37.331842+00 | 7e787ab2-fd83-4c27-a9be-ca3b284067d7 |     8   
  7ab4d21b-84a0-4af1-8534-d36a66f5a3ff | 2024-02-21 21:47:32.914722+00 | 2024-02-21 21:47:38.537935+00 | 6c7bc66a-1e69-4b09-a576-93e1fd48aca4 |     8   
  839e4e9f-db92-4200-967d-60a0965a4b86 | 2024-02-21 21:47:32.958048+00 | 2024-02-21 21:47:38.541542+00 | 2be3b4a1-c7ed-49ab-a198-9ca4bd90b32f |     8   
  c5eefe38-34a1-4fa0-af52-b682d65f0ec8 | 2024-02-21 21:57:32.819036+00 | 2024-02-21 21:57:37.748662+00 | 6c7bc66a-1e69-4b09-a576-93e1fd48aca4 |     8   
(6 rows)

The timestamps reflect that (1) the Nexus instances are all starting and finishing their collections within a second or so of each other, and (2) my query above happened to be right when they were finishing a collection, which is when they go to insert it, which is when they prune. So I think it's simply that all three are ready to insert around the same time, one prunes it down to the expected 3, the other two check the count during this window, and then all three wind up inserting theirs. I think this is fine and not a bug. It's a little harder to be sure why there were six right away without doing more digging into the historical Nexus logs to see why they were activated.

@davepacheco
Copy link
Collaborator

Small problem to add to the list: I don't see a "request completed" log entry for the sled agent request where Nexus gave up. Looking at dropshot's server.rs, it looks like that (along with USDT probe firing) happens in the context that got cancelled. That seems not great. I feel like we should move that stuff into the async context so that it always happens.

@davepacheco
Copy link
Collaborator

It seems like a big part of the problem here was the 2+ minute request to add the sled.

  • unknown issue causes it to take 2 minutes to add the sled
  • bug: Nexus gives up after only 15s, resulting in a 500. This timeout should be much longer or removed altogether. This is way too short, and I'm not sure it ever makes sense to give up if TCP keepalive is configured because in that case we know the other side is alive and doing stuff. (But this also means maybe we shouldn't be doing this synchronously from an external API context because that's a long time for an HTTP request to sit pending, especially over an external network.)
  • as a result, Nexus doesn't trigger the usual inventory collection after adding a sled (recently added), which is why @jgallagher had to wait 10 minutes for a new enough collection to show up

So regarding that 2-minute window: here's a timeline of events from the Nexus log and the Sled Agent BRM44220001 log (not the new Sled):

18:15:47Z     (inferred) Nexus 6c7bc66a-1e69-4b09-a576-93e1fd48aca4 starts the "add sled" request

18:15:47.679Z sled agent BRM44220001 accepts connection from Nexus
(sled agent address: [fd00:1122:3344:101::1]:12345)

18:15:47.943Z - 18:15:51.937Z Sled Agent BRM44220001 sees a bunch of GetShare requests received and responses sent out

18:16:02.685Z Nexus 6c7bc66a-1e69-4b09-a576-93e1fd48aca4 reports 500 error for 
POST /v1/system/hardware/sleds because the sled agent request to
fd00:1122:3344:101::1 timed out.  Latency is just over 15s.
req_id 044ca7fb-4a39-4232-abf5-d4c65a9a7a4c

18:17:50.852Z: Sled Agent BRM44220001 reports "Peer agent initialized" for:
    peer_bootstrap_addr = [fdb0:a840:2504:157::1]:12346
    peer_id = gimlet-BRM42220004-913-0000019-6
    sled_id = 7d7d3ff5-357f-4756-be5b-0cc459ca3811

18:17:50.861Z Sled Agent BRM44220001 notices that Nexus had given up already

I'm currently putting together a timeline from the new sled's log for the two-minute window to understand why that took so long. So far I've been able to determine the bulk of the gap is between the call to illumos_utils::opte::initialize_xde_driver(&log, &underlay_nics)?; and the call to services.sled_agent_started, both inside SledAgent::new. These log entries are about 1m55s apart so it's the bulk of the gap.

@davepacheco
Copy link
Collaborator

For the new sled (BRM42220004), the times are all on 1986-12-28 because it hasn't set up NTP yet.

Here are some relevant log entries:

03:28:48.734Z BRM42220004 "Loading Sled Agent: StartSledAgentRequest" { generation: 0, schema_version: 1, body: StartSledAgentRequestBody { id: c2cfbfdd-df0
1-4788-abc1-c900226194f3, rack_id: 054a8a4c-9ebb-4622-bfb8-5af8c30694c0, use_trust_quorum: true, is_lrtq_learner: true, subnet: Ipv6Subnet { net: Ipv6Net(Ipv6Network { addr: fd00:1122:3344:121::, prefix: 
64 }) } } } 
03:28:48.740Z BRM42220004: "setting up sled agent server"
03:28:48.743Z BRM42220004: "new DNS resolver" ("DnsResolver" log tag)
03:28:48.803Z BRM42220004: "SledAgent::new"(..) starting
03:28:53.016Z BRM42220004: "Requested swap device of size 256 GiB" 
03:28:54.243Z BRM42220004: "Mounting backing filesystems"
03:28:55.642Z using ... "as data links for xde driver"
03:28:55.660Z "Setting reservoir size"

Then we have a gap of 1m55.8s. Then:

03:30:51.502Z BRM42220004: "sled agent started"
03:30:51.502Z BRM42220004: "new DNS resolver" ("ServiceManager" log tag)
03:30:51.638Z BRM42220004: "Sled Agent loaded; recording configuration"
03:30:51.638Z BRM42220004: Sled Agent is listening
03:30:51.639Z contacting server nexus, registering sled

Inside SledAgent::start, the "Setting reservoir size" log message comes from this call:
https://github.com/oxidecomputer/omicron/blob/main/sled-agent/src/sled_agent.rs#L448

Still inside SledAgent::start, the "sled agent started" log message comes from this call in the same function:
https://github.com/oxidecomputer/omicron/blob/main/sled-agent/src/sled_agent.rs#L512

So our gap is between those two lines of code. The only thing of substance that happens there is the get_network_config loop. During the gap, the only log messages I see are:

  • "accepted connection" / "request completed" for GET /baseboard (runs every 12s or so and appears unrelated)
  • "failed to notify" Nexus about pools and physical disks because their parents don't exist (I think this is expected and relates to Omicron took 22 minutes to come online after receiving rack-initialize #3082, but I assume it's unrelated here because we don't seem to block on this completing

I don't see any of the "failed to get network config from bootstore" message. So it looks like we spent all this time in this one call:

let serialized_config = long_running_task_handles
.bootstore
.get_network_config()
.await

That calls out to the bootstore task, but the implementation of that is synchronous, so I can only guess that the bootstore task was busy (or more likely blocked) doing something else for all that time? But I'm not sure what.

@andrewjstone do you have any ideas here?

@andrewjstone
Copy link
Contributor

So, I don't think this is a bootstore issue. The bootstore itself, for better or worse receives the network config well before reconfiguration starts. The entire gossip process is shown below.

00:07:54.198Z INFO SledAgent (bootstore): Received network config from gimlet-BRM44220001-913-0000019-6 with generation: 1, current generation: 0
     file = bootstore/src/schemes/v0/peer.rs:956
     peer_id = gimlet-BRM42220004-913-0000019-6
 00:07:54.198Z INFO SledAgent (bootstore): Writing ledger to /pool/int/ad41e0a6-120f-4475-ac8b-9b11b20b4a10/cluster/.bootstore-network-config.json.tmp
     file = common/src/ledger.rs:198
     peer_id = gimlet-BRM42220004-913-0000019-6
 00:07:54.199Z INFO SledAgent (bootstore): Writing ledger to /pool/int/723cf5be-f05d-4894-a5bf-66451062ab03/cluster/.bootstore-network-config.json.tmp
     file = common/src/ledger.rs:198
     peer_id = gimlet-BRM42220004-913-0000019-6
 00:07:54.199Z INFO SledAgent (bootstore): Broadcasting network config with generation 1
     file = bootstore/src/schemes/v0/peer.rs:648
     peer_id = gimlet-BRM42220004-913-0000019-6
 00:07:54.199Z INFO SledAgent (bootstore): Sending network config with generation 1 to gimlet-BRM42220046-913-0000019-6
     file = bootstore/src/schemes/v0/peer.rs:658
     peer_id = gimlet-BRM42220004-913-0000019-6
 00:07:54.199Z INFO SledAgent (bootstore): Sending network config with generation 1 to gimlet-BRM42220081-913-0000019-6
     file = bootstore/src/schemes/v0/peer.rs:658
     peer_id = gimlet-BRM42220004-913-0000019-6
 00:07:54.199Z INFO SledAgent (bootstore): Received network config from gimlet-BRM42220081-913-0000019-6 with generation: 1, current generation: 1
     file = bootstore/src/schemes/v0/peer.rs:956
     peer_id = gimlet-BRM42220004-913-0000019-6
 00:07:54.199Z INFO SledAgent (bootstore): Received network config from gimlet-BRM42220046-913-0000019-6 with generation: 1, current generation: 1
     file = bootstore/src/schemes/v0/peer.rs:956

The call to bootstore.get_network_config() is message passing based, but synchronous. It just returns what the bootstore has in memory. We know the bootstore already has the V1 version, and we don't see any of the warning messages due to backoff which confirms it.

Then, the bootstore node becomes a learner so it can participate in lrtq, retrieves its keyshare, and persists its fsm state:

 03:28:48.734Z INFO SledAgent (SledAgentSprocketsServer): Accepted connection
     file = sled-agent/src/bootstrap/sprockets_server.rs:66
     remote_addr = [fdb0:a840:2504:355::1]:56721
 03:28:48.734Z INFO SledAgent (SledAgentMain): Loading Sled Agent: StartSledAgentRequest { generation: 0, schema_version: 1, body: StartSledAgentRequestBody { id: c2cfbfdd-df01-4788-abc1-c900226194f3, rack_id: 054a8a4c-9ebb-4622-bfb8-5af8c30694c0, use_trust_quorum: true, is_lrtq_learner: true, subnet: Ipv6Subnet { net: Ipv6Net(I
     file = sled-agent/src/bootstrap/server.rs:364
 03:28:48.734Z INFO SledAgent (SledAgentMain): KeyManager: using lrtq secret retriever
     file = sled-agent/src/bootstrap/server.rs:372
 03:28:48.736Z INFO SledAgent (SledAgentMain): Initializing sled as learner
     file = sled-agent/src/bootstrap/server.rs:384
 03:28:48.736Z INFO SledAgent (bootstore): InitLearner started
     file = bootstore/src/schemes/v0/peer.rs:484
     peer_id = gimlet-BRM42220004-913-0000019-6
 03:28:48.736Z INFO SledAgent (bootstore): Sending Req(Request { id: b1bfb28f-46b3-43a3-aed3-f93b12551a49, type_: Learn }) to gimlet-BRM42220046-913-0000019-6
     file = bootstore/src/schemes/v0/peer.rs:695
     peer_id = gimlet-BRM42220004-913-0000019-6
 03:28:48.736Z INFO SledAgent (bootstore): Received Req(Request { id: b1bfb28f-46b3-43a3-aed3-f93b12551a49, type_: GetShare { rack_uuid: RackUuid(054a8a4c-9ebb-4622-bfb8-5af8c30694c0) } }) from gimlet-BRM42220046-913-0000019-6
     file = bootstore/src/schemes/v0/peer.rs:935
     peer_id = gimlet-BRM42220004-913-0000019-6
 03:28:48.736Z INFO SledAgent (bootstore): Sending Rsp(Response { request_id: b1bfb28f-46b3-43a3-aed3-f93b12551a49, type_: Error(StillLearning) }) to gimlet-BRM42220046-913-0000019-6
     file = bootstore/src/schemes/v0/peer.rs:695
     peer_id = gimlet-BRM42220004-913-0000019-6
 03:28:48.739Z INFO SledAgent (bootstore): Received Rsp(Response { request_id: b1bfb28f-46b3-43a3-aed3-f93b12551a49, type_: LearnPkg(LearnedSharePkg { common: SharePkgCommon { rack_uuid: 054a8a4c-9ebb-4622-bfb8-5af8c30694c0, epoch: 0, threshold: 2, share: "Share", share_digests: "Digests" } }) }) from gimlet-BRM42220046-913-0000
     file = bootstore/src/schemes/v0/peer.rs:935
     peer_id = gimlet-BRM42220004-913-0000019-6
 03:28:48.739Z INFO SledAgent (bootstore): Fsm output = LearningCompleted
     file = bootstore/src/schemes/v0/peer.rs:716
     peer_id = gimlet-BRM42220004-913-0000019-6
 03:28:48.739Z INFO SledAgent (bootstore): Writing ledger to /pool/int/ad41e0a6-120f-4475-ac8b-9b11b20b4a10/cluster/.bootstore-fsm-state.json.tmp
     file = common/src/ledger.rs:198
     peer_id = gimlet-BRM42220004-913-0000019-6
 03:28:48.740Z INFO SledAgent: setting up sled agent server
     file = sled-agent/src/server.rs:47
 03:28:48.740Z INFO SledAgent (StorageManager): Received KeyManagerReady
     file = sled-storage/src/manager.rs:346
 03:28:48.740Z INFO SledAgent (StorageManager): Attempting to add queued disks
     file = sled-storage/src/manager.rs:394
     num_disks = 10
 03:28:48.740Z INFO SledAgent (bootstore): Writing ledger to /pool/int/723cf5be-f05d-4894-a5bf-66451062ab03/cluster/.bootstore-fsm-state.json.tmp

Then immediately, the node starts to retrieve key shares so that it can configure the U.2 drives. It does this for each drive. It only takes ~2ms to load the rack secret, but I'm thinking about caching it for up to a minute since all drives typically get loaded at once in the real trust quorum.

03:28:49.019Z INFO SledAgent (bootstore): LoadRackSecret started
     file = bootstore/src/schemes/v0/peer.rs:498
     peer_id = gimlet-BRM42220004-913-0000019-6
 03:28:49.019Z INFO SledAgent (bootstore): Sending Req(Request { id: 8b10e635-7962-47c4-a2e2-52804c712b5b, type_: GetShare { rack_uuid: RackUuid(054a8a4c-9ebb-4622-bfb8-5af8c30694c0) } }) to gimlet-BRM42220046-913-0000019-6
     file = bootstore/src/schemes/v0/peer.rs:695
     peer_id = gimlet-BRM42220004-913-0000019-6
 03:28:49.019Z INFO SledAgent (bootstore): Sending Req(Request { id: 8b10e635-7962-47c4-a2e2-52804c712b5b, type_: GetShare { rack_uuid: RackUuid(054a8a4c-9ebb-4622-bfb8-5af8c30694c0) } }) to gimlet-BRM42220081-913-0000019-6
     file = bootstore/src/schemes/v0/peer.rs:695
     peer_id = gimlet-BRM42220004-913-0000019-6
 03:28:49.019Z INFO SledAgent (bootstore): Sending Req(Request { id: 8b10e635-7962-47c4-a2e2-52804c712b5b, type_: GetShare { rack_uuid: RackUuid(054a8a4c-9ebb-4622-bfb8-5af8c30694c0) } }) to gimlet-BRM44220001-913-0000019-6
     file = bootstore/src/schemes/v0/peer.rs:695
     peer_id = gimlet-BRM42220004-913-0000019-6
 03:28:49.019Z INFO SledAgent (bootstore): Received Rsp(Response { request_id: 8b10e635-7962-47c4-a2e2-52804c712b5b, type_: Share(Share) }) from gimlet-BRM42220081-913-0000019-6
     file = bootstore/src/schemes/v0/peer.rs:935
     peer_id = gimlet-BRM42220004-913-0000019-6
 03:28:49.021Z INFO SledAgent (bootstore): Fsm output = RackSecret { request_id: 8b10e635-7962-47c4-a2e2-52804c712b5b, secret: [REDACTED bootstore::trust_quorum::rack_secret::RackSecret] }
     file = bootstore/src/schemes/v0/peer.rs:716
     peer_id = gimlet-BRM42220004-913-0000019-6
 03:28:49.021Z INFO SledAgent (bootstore): Received Rsp(Response { request_id: 8b10e635-7962-47c4-a2e2-52804c712b5b, type_: Share(Share) }) from gimlet-BRM42220046-913-0000019-6
     file = bootstore/src/schemes/v0/peer.rs:935
     peer_id = gimlet-BRM42220004-913-0000019-6
 03:28:49.021Z INFO SledAgent (bootstore): Received Rsp(Response { request_id: 8b10e635-7962-47c4-a2e2-52804c712b5b, type_: Share(Share) }) from gimlet-BRM44220001-913-0000019-6
     file = bootstore/src/schemes/v0/peer.rs:935
     peer_id = gimlet-BRM42220004-913-0000019-6
 03:28:49.021Z INFO SledAgent (StorageManager): Loaded latest secret
     disk_id = DiskIdentity {\n    vendor: "1b96",\n    serial: "A079E4C6",\n    model: "WUS4C6432DSP3X3",\n}
     epoch = 1
     file = sled-storage/src/dataset.rs:329
 03:28:49.021Z INFO SledAgent (StorageManager): Retrieving key
     disk_id = DiskIdentity {\n    vendor: "1b96",\n    serial: "A079E4C6",\n    model: "WUS4C6432DSP3X3",\n}
     epoch = 1
     file = sled-storage/src/dataset.rs:333
 03:28:49.021Z INFO SledAgent (StorageManager): Got key
     disk_id = DiskIdentity {\n    vendor: "1b96",\n    serial: "A079E4C6",\n    model: "WUS4C6432DSP3X3",\n}
     epoch = 1
     file = sled-storage/src/dataset.rs:335
 03:28:49.021Z INFO SledAgent (StorageManager): Created keyfile /var/run/oxide/1b96-A079E4C6-WUS4C6432DSP3X3-zfs-aes-256-gcm.key
     file = sled-storage/src/keyfile.rs:37
 03:28:49.021Z INFO SledAgent (StorageManager): Ensuring encrypted filesystem: crypt for epoch 1
     file = sled-storage/src/dataset.rs:347
 03:28:49.073Z INFO SledAgent (StorageManager): Zeroed and unlinked keyfile /var/run/oxide/1b96-A079E4C6-WUS4C6432DSP3X3-zfs-aes-256-gcm.key
     file = sled-storage/src/keyfile.rs:56
 03:28:49.088Z INFO SledAgent (StorageManager): Automatically destroying dataset: oxp_6cd4979d-7548-4ebb-90fd-396aed02aaf4/crypt/zone
     file = sled-storage/src/dataset.rs:389
 03:28:49.273Z INFO SledAgent (StorageManager): Looking for unencrypted datasets in oxp_6cd4979d-7548-4ebb-90fd-396aed02aaf4
     file = sled-storage/src/dataset.rs:474
 03:28:49.303Z INFO SledAgent (StorageManager): Disk at /devices/pci@ab,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC440,0 already has a GPT
     file = sled-hardware/src/illumos/partitions.rs:103
 03:28:49.303Z ERRO SledAgent (StorageManager): Persistent error:not queueing disk
     disk_id = DiskIdentity { vendor: "1b96", serial: "A079E3F8", model: "WUS4C6432DSP3X3" }
     err = PooledDisk(BadPartitionLayout { path: "/devices/pci@ab,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC440,0:wd,raw", why: "Expected 0 partitions, only saw 1" })
     file = sled-storage/src/manager.rs:472

The first disk, that I copied here just happened to be the one with the problem that John reported.

As far as I can tell, the last thing related to U.2 disk management and encryption is logged until the sled-agent starts.

03:28:52.986Z INFO SledAgent (StorageManager): Looking for unencrypted datasets in oxp_80d267e0-afbb-42ad-8745-fde2b8348399. 

I think it's still unclear why things are taking this long and there doesn't appear to be any more info in the sled-agent log on the added node.

@davepacheco
Copy link
Collaborator

@andrewjstone pointed out in chat that the actual call to set the reservoir size is also in the gap and could also be what got stuck for so long.

@davepacheco
Copy link
Collaborator

We spent some time looking through the code to find other events that might tell us post hoc how long the reservoir operation took but we did not find anything. Instead, we decided to try an experiment: we bounced BRM42220004, quickly logged in on the console before sled agent started, and used DTrace to watch how long the reservoir operation took. (thanks to @rmustacc for the D here). The thinking here was: if it took a long time to do the reservoir operation when we added this sled, there's a good chance it might also take that long on a fresh reboot, since the sled presumably hadn't done all that much work prior to the first "add sled" operation. At the very least: if it did take a long time to do the reservoir operation after a fresh boot, it's pretty plausible that we saw the same behavior here.

$ pfexec dtrace -n 'fbt::vmmr_set_target:entry{ self->t = timestamp; }' -n 'fbt::vmmr_set_target:return/self->t/{ printf("%d\n", timestamp - self->t); self->t = 0; }'
dtrace: description 'fbt::vmmr_set_target:entry' matched 1 probe
dtrace: description 'fbt::vmmr_set_target:return' matched 1 probe
...
CPU     ID                    FUNCTION:NAME
 60  70118           vmmr_set_target:return 115707459829

Bingo -- that's 115s to do this operation. That's exactly the length of our gap.

I also happened to catch it in the act:

BRM42220004 # grep -i reservoir $(svcs -L sled-agent)
{"msg":"Setting reservoir size to 829016 MiB bytes","v":0,"name":"SledAgent","level":30,"time":"1986-12-28T00:01:45.342790155Z","hostname":"BRM42220004","pid":660,"component":"InstanceManager","file":"sled-agent/src/instance_manager.rs:214"}
BRM42220004 # svcs -p sled-agent
STATE          STIME    FMRI
online          0:01:33 svc:/oxide/sled-agent:default
                0:01:33      659 ctrun
BRM42220004 # ptree 659
659    ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/o
  660    /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml
BRM42220004 # pstack 660/1
660:    /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.
--------------------- thread# 1 / lwp# 1 ---------------------
 fffff5ffef0ea24a ioctl () + a
 000000000261c568 _ZN13illumos_utils13vmm_reservoir16ReservoirControl3set17h348107624f648e6eE () + 78
 00000000018dd3e6 _ZN18omicron_sled_agent16instance_manager15InstanceManager18set_reservoir_size17hd469f9d7669fcb5cE () + 346
 0000000001570f17 _ZN18omicron_sled_agent6server6Server5start28_$u7b$$u7b$closure$u7d$$u7d$17ha0c2a2db21f4cc6bE () + 1317
 000000000157c0e2 _ZN18omicron_sled_agent9bootstrap6server6Server5start28_$u7b$$u7b$closure$u7d$$u7d$17hdaf5255d3e9cb2c0E () + 2692
 000000000159b2fe _ZN10sled_agent4main28_$u7b$$u7b$closure$u7d$$u7d$17h0527f9dfd5720989E.llvm.8908515661725214071 () + 30e
 00000000015957a5 _ZN5tokio7runtime7context7runtime13enter_runtime17h9ca28641368ac4abE () + 1e5
 00000000015da50d _ZN5tokio7runtime7runtime7Runtime8block_on17h081c890b11486607E () + 5d
 0000000001724e53 _ZN10sled_agent4main17h4542634fe9430657E () + 93
 00000000016d5526 _ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17h4f71fac1628fb39fE () + 6
 000000000167459c _ZN3std2rt10lang_start28_$u7b$$u7b$closure$u7d$$u7d$17hf4604fe8cb90ad4fE.llvm.509618699310468040 () + c
 00000000032bc4a3 _ZN3std2rt19lang_start_internal17h286240f3fc8b359fE () + 2c3
 000000000172502c main () + 2c
 00000000013a8bb7 _start_crt () + 87
 00000000013a8b18 _start () + 18
BRM42220004 # truss -d -p 660/1
Base time stamp:  536112163.8969  [ Sun Dec 28 00:02:43 UTC 1986 ]
/1:     57.1533 ioctl(17, (('V'<<16)|('M'<<8)|17), 0xFFFFF5FFFFDF3548) = 0

Putting all that together, we have:

  • it started the syscall after the log entry, which is dated 00:01:45.342790155Z
  • truss started at 00:02:43
  • the syscall completed 57.1533s after truss started
  • so the syscall completed at 00:03:40
  • so the syscall took 1m55s

which of course matches the DTrace output.

I think there are a few action items here:

  • As I mentioned earlier, I think we should remove the Nexus timeout altogether or bump it to something like 300s. We know this is likely to take upwards of 115s on real hardware.
  • Should the reservoir operation be moved out of the sled agent startup path? I'm not sure about the difficulty or impact of this. Even if Sled Agent kept track of when this finished and pitched 503s until it did finish, that would seem like an improvement here because other activities could proceed.
  • We could make this whole flow async. That is, when Nexus hits PUT /sleds to a sled agent, that doesn't block until the corresponding sled agent starts. Nexus would presumably poll/long-poll on the status instead. This seems like more work in multiple components. It may be a good idea anyway, if we think it's possible other operations could cause this to take a while. This is not exclusive with moving the reservoir operation out of the startup path.

@jgallagher
Copy link
Contributor Author

  • as a result, Nexus doesn't trigger the usual inventory collection after adding a sled (recently added), which is why @jgallagher had to wait 10 minutes for a new enough collection to show up

FWIW, there were 2 periods where I ended up waiting on a collection:

  • After add sled, as discussed above and we now understand
  • In between making the blueprint that added an NTP zone to the new sled and regenerating the next blueprint

In the second case, even though the sled had already started the NTP zone and successfully timesync'd, I was still waiting for an inventory collection because the planner won't add anything else to the sled until the latest collection shows the NTP zone. I'm not sure where we could trigger a collection here, though - if sled-agent told us in the response whether the PUT /omicron-zones actually started new zones, could the reconfigurator task trigger a collection if new zones were added?

@jgallagher
Copy link
Contributor Author

9. While investigating the previous issue, I noticed there were 5 inventory collections, all from approximately the same time.

I don't know why we got 5 here when there are only 3 Nexuses. Maybe we're triggering collections too aggressively, as even the latest one still didn't show the NTP zone's presence.

root@oxz_switch1:~# omdb db inventory collections list
ID                                   STARTED              TOOK    NSPS NERRORS
ed31afc8-8172-4b71-8de7-338e5dacb382 2024-02-21T18:57:32Z 5400 ms 6    8
3565e26f-6e97-41c5-8712-c76ceafb3f76 2024-02-21T18:57:32Z 6325 ms 6    8
a306fef2-f03e-4951-bd94-5348d1d68009 2024-02-21T18:57:32Z 8325 ms 6    8
43db6182-d5e7-4e02-85ad-fab82258692e 2024-02-21T18:57:37Z 4433 ms 6    8
ddfa90f4-6cf3-470f-ae8c-ad959d6b7561 2024-02-21T18:57:38Z 3826 ms 6    8
932bb314-ba2f-4526-a8b9-de2944d4efbd 2024-02-21T18:57:41Z 4901 ms 6    8

Looked through the Nexus logs, and I think this is a side effect of my workaround for problem 3 (which has already been fixed by #5113). I hot patched the auth check out and restarted each Nexus instance. Looking at just one (but similar logs are in all three), we see the SMF restart here:

[ Feb 21 18:57:31 Stopping because service restarting. ]
[ Feb 21 18:57:31 Executing stop method (:kill). ]
[ Feb 21 18:57:31 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/nexus/bin/nexus /var/svc/manifest/site/nexus/config.toml &"). ]
[ Feb 21 18:57:31 Method "start" exited with status 0. ]

Snipping down to just the relevant log lines, we then see the initial timeout-triggered (because tokio::time::interval triggers immediately) inventory collection:

18:57:32.284Z DEBG 2be3b4a1-c7ed-49ab-a198-9ca4bd90b32f (ServerContext): activating
    background_task = inventory_collection
    iteration = 1
    reason = Timeout
18:57:41.102Z DEBG 2be3b4a1-c7ed-49ab-a198-9ca4bd90b32f (ServerContext): inventory collection complete
    background_task = inventory_collection
    collection_id = a306fef2-f03e-4951-bd94-5348d1d68009
    time_started = 2024-02-21 18:57:32.585501 UTC
18:57:41.102Z DEBG 2be3b4a1-c7ed-49ab-a198-9ca4bd90b32f (ServerContext): activation complete
    background_task = inventory_collection
    elapsed = 8.816738012s
    iteration = 1

and then immediately afterwards we see the initial Nexus-triggered inventory collection:

18:57:41.102Z DEBG 2be3b4a1-c7ed-49ab-a198-9ca4bd90b32f (ServerContext): activating
    background_task = inventory_collection
    iteration = 2
    reason = Signaled
18:57:46.243Z DEBG 2be3b4a1-c7ed-49ab-a198-9ca4bd90b32f (ServerContext): inventory collection complete
    background_task = inventory_collection
    collection_id = 932bb314-ba2f-4526-a8b9-de2944d4efbd
    time_started = 2024-02-21 18:57:41.206119 UTC
18:57:46.243Z DEBG 2be3b4a1-c7ed-49ab-a198-9ca4bd90b32f (ServerContext): activation complete
    background_task = inventory_collection
    elapsed = 5.141069215s
    iteration = 2

These two collection IDs are two of the six that confused me, and each of the other two Nexuses did this same thing.

andrewjstone pushed a commit that referenced this issue Feb 22, 2024
In #5111:

- the "add sled" Nexus external API call invokes `PUT /sleds` to some
sled agent
- `PUT /sleds` itself blocks until the new sled's sled agent has started
- sled agent startup blocks on setting the reservoir
- on production hardware, setting the reservoir took 115s
- the default Progenitor (reqwest) timeout is only 15s

So as a result, the "add sled" request failed, even though the operation
ultimately succeeded.

In this PR, I bump the timeout to 5 minutes. I do wonder if we should
remove it altogether, or if we should consider the other changes
mentioned in #5111 (like not blocking sled agent startup on this, or not
blocking these API calls in this way). But for now, this seems like a
low-risk way to improve this situation.
@jgallagher
Copy link
Contributor Author

Closing this as each problem in the top-level issue is either understood, fixed, or has its own separate issue:

  1. Immediate issue addressed by "add sled" needs a longer timeout #5116, longer term followups: Setting VMM Reservoir Takes a While - How do we cope? #5121, "sled add" could be more asynchronous #5132
  2. Ongoing discussion in system using only 8 of 10 U.2s on madrid BRM42220004 #5128
  3. Fixed by "expensive operation" check was too aggressive #5113
  4. (Will be) fixed by trigger inventory collection after blueprint execution #5130
  5. Not an issue; explained by my manual mucking around (explained in Failed to fully add a new sled on madrid #5111 (comment))
  6. Explained by problem 2, also covered by system using only 8 of 10 U.2s on madrid BRM42220004 #5128
  7. Tracked in Reconfigurator needs to create datasets for new crucible zones #5118

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants