disk create called too soon after disk delete will fail #1972

Closed
leftwo opened this issue Nov 22, 2022 · 9 comments

leftwo commented Nov 22, 2022

If I try to create a disk too soon after deleting one, the creation will fail.

I have this script to create a disk (assuming images have been populated):

#!/bin/bash
image_path=/system/images/ubuntu

imgsz=$(oxide api "$image_path" | jq -r .size)
if [[ -z "$imgsz" || "$imgsz" == "null" ]]; then
    echo "Can't find ubuntu"
    exit 1
fi
# Pad the image size up to the next 1 GiB multiple, then add 30 GiB of headroom.
((rem = imgsz % 1073741824))
((sz = imgsz + 1073741824 - rem))
((newsz = sz + 32212254720))

oxide api /organizations/myorg/projects/myproj/disks/ --method POST --input - <<EOF
{
  "name": "focaldisk",
  "description": "fdisk.raw blob",
  "block_size": 512,
  "size": $newsz,
  "disk_source": {
      "type": "global_image",
      "image_id": "$(oxide api ${image_path} | jq -r .id)"
  }
}
EOF

I run the above, then:

oxide disk delete --confirm -o myorg -p myproj focaldisk

Then, if I run the create above a second time (right away), the creation will eventually fail, the disk
create saga will fail, and nexus will dump core.

The disk delete only needs about four or five seconds to wrap up its work; a create
that comes along after that will work.

The disks having the same name (as they do above) is not required for the failure.
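
Since the delete only needs a few seconds, a fixed delay between the two calls works around it for now. A minimal sketch, assuming the create script above is saved as create-focaldisk.sh (a made-up name):

oxide disk delete --confirm -o myorg -p myproj focaldisk
# give the crucible agent a few seconds to finish tearing down the region
sleep 10
./create-focaldisk.sh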

leftwo commented Nov 23, 2022

An even simpler disk can be used to reproduce this; no global image is required:

oxide api /organizations/myorg/projects/myproj/disks/ --method POST --input - <<EOF
{
  "name": "zpool",
  "description": "replace",
  "block_size": 4096,
  "size": 32212254720,
  "disk_source": {
      "type": "blank",
      "block_size": 4096
  }
}
EOF

In the crucible agent log, I see these messages (on pretty much all disk deletions):

Nov 23 01:16:55.892 INFO deleting zfs dataset "oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03/crucible/regions/2ec5040a-cc71-4176-a266-b3a64f11b3a3", region: 2ec5040a-cc71-4176-a266-b3a64f11b3a3, component: worker
Nov 23 01:16:55.936 ERRO zfs dataset oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03/crucible/regions/2ec5040a-cc71-4176-a266-b3a64f11b3a3 delete attempt 0 failed: out: err:cannot unmount '/data/regions/2ec5040a-cc71-4176-a266-b3a64f11b3a3': Device busy, region: 2ec5040a-cc71-4176-a266-b3a64f11b3a3, component: worker

Then, right afterward, this message:

Nov 23 01:16:58.013 INFO region 2ec5040a-cc71-4176-a266-b3a64f11b3a3 state: Tombstoned -> Destroyed, component: datafile                            

If I create another volume only after I see that "Tombstoned -> Destroyed" transition, things work.
If I try to create before the region reaches Destroyed, that's when we see the problem.
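
A minimal sketch of waiting for that transition before the next create, assuming the crucible agent log is readable at some path (AGENT_LOG is a placeholder, and the region id has to be known up front):

AGENT_LOG=/path/to/crucible-agent.log   # placeholder; location varies by deployment
REGION=2ec5040a-cc71-4176-a266-b3a64f11b3a3
until grep -q "region $REGION state: Tombstoned -> Destroyed" "$AGENT_LOG"; do
    sleep 1
done
# safe to issue the next disk create here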

leftwo commented Nov 23, 2022

An even easier reproduction: create a disk.

oxide disk create -o myorg -p myproj --size 32212254720 --disk-source blank=4096 -D "to be deleted" deleteme

Then, run the delete and the create all in one line:

oxide disk delete --confirm -o myorg -p myproj deleteme; oxide disk create -o myorg -p myproj --size 32212254720 --disk-source blank=4096 -D "to be deleted" deleteme

smklein commented Nov 23, 2022

What's the user-visible error being propagated back during the creation failure?

smklein commented Nov 23, 2022

Oh, you're saying nexus is crashing, not returning an error?

leftwo commented Nov 23, 2022

I'm pretty sure this problem is down in the storage agent (crucible side).
I can see from the logs that the request for a new region arrives at the
agent while a worker thread is in the middle of processing the delete request.
It appears that the worker thread does not notice any doorbell messages it receives
while it is in the middle of work. (Debugging is ongoing.)

leftwo commented Nov 24, 2022

The crucible-side issue is oxidecomputer/crucible#531.

I suspect that will solve what we see from Omicron, but I'll leave this issue open until the fix is verified.

leftwo self-assigned this Nov 30, 2022
leftwo commented Nov 30, 2022

This is fixed with crucible rev 04ba0cb56f93396e115ea04591caaa1de8167a18 or later.
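
One way to check whether a pinned crucible rev contains the fix, sketched against a local crucible checkout (the path and $PINNED_REV are placeholders):

git -C /path/to/crucible merge-base --is-ancestor \
    04ba0cb56f93396e115ea04591caaa1de8167a18 "$PINNED_REV" \
  && echo "fix included" || echo "fix missing"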

leftwo closed this as completed Nov 30, 2022

davepacheco commented:

Is there a separate issue of Nexus crashing here when Crucible does something unexpected?

leftwo commented Nov 30, 2022

I don't think specifically for this, no.
I believe what I saw the first time I did this was something like what is reported in oxidecomputer/steno#26.
