
Read-only region clone attempting to contact expunged downstairs #7209

Closed
jmpesp opened this issue Dec 5, 2024 · 3 comments · Fixed by #7283

@jmpesp
Contributor

jmpesp commented Dec 5, 2024

The following region snapshot replacement is not making forward progress:

262313d2-72ef-482d-b750-1c9115d74d59:

                    started: 2024-12-05 01:33:05.919898 UTC
                      state: Allocating
            region snapshot: 3ec0e848-39de-495e-be0e-88241e11d0fb 2bf499e2-a4f4-4e2e-9977-875f4914a7ce 39618f4c-0040-4a8b-9faf-0391a560d960
              new region id: None
     in-progress steps left: 0

Looking at a corresponding region-snapshot-replacement-start saga invocation shows that the new_region_ensure node is failing:

d56b625d-1ada-4cd5-bde1-8318abbd951c | 2024-12-05T06:42:01.865Z |          |   5: region_snapshot_replacement_start.find_new_region         | succeeded     | "new_dataset_and_region" => [{"compression":null,"identity":{"id":"46d1afcc-cc3f-4b17-aafc-054dd4862d15","time_created":"2023-08-30T18:59:10.758515Z","time_modified":"2023-08-30T18:59:10.758515Z"},"ip":"fd00:1122:3344:106::5","kind":"Crucible","pool_id":"7f778610-7328-4554-98f6-b17f74f551c7","port":32345,"quota":null,"rcgen":1,"reservation":null,"size_used":341449900032,"time_deleted":null,"zone_name":null},{"block_size":512,"blocks_per_extent":131072,"dataset_id":"46d1afcc-cc3f-4b17-aafc-054dd4862d15","deleting":false,"extent_count":16,"identity":{"id":"b9a2d268-de4d-43d0-a8b8-69e1b2f47791","time_created":"2024-12-05T06:42:01.228607Z","time_modified":"2024-12-05T06:42:01.228607Z"},"port":null,"read_only":true,"volume_id":"9d6ff72d-fb95-42e2-9765-0ff3eb103636"}]
d56b625d-1ada-4cd5-bde1-8318abbd951c | 2024-12-05T06:42:01.871Z |          |   6: region_snapshot_replacement_start.new_region_ensure       | started       | 
d56b625d-1ada-4cd5-bde1-8318abbd951c | 2024-12-05T06:42:17.866Z |          |   6: region_snapshot_replacement_start.new_region_ensure       | failed        | "ensured_dataset_and_region" => {"ActionFailed":{"source_error":{"InternalError":{"internal_message":"Failed to create region, unexpected state: Failed"}}}}

This step is responsible for creating a new cloned read-only region that will be used to replace the region snapshot. It uses the region snapshot's region as the clone source, but if that region is on an expunged dataset the clone will never succeed. Tracing through the Crucible agent logs shows this:

BRM42220006 # cat /pool/ext/cf14d1b9-b4db-4594-b3ab-a9957e770ce9/crypt/debug/oxz_crucible_46d1afcc-cc3f-4b17-aafc-054dd4862d15/oxide-crucible-agent:default.log.1733391340 | looker -o long | less

06:42:02.123Z INFO crucible: Connecting to [fd00:1122:3344:124::22]:23001 to obtain our extent files.
    task = clone

Error: Failed to get source region definition: Communication Error: error sending request for url (http://[fd00:1122:3344:124::22]:23001/region-info)

BRM42220006 # ping fd00:1122:3344:124::22
ping: sendto No route to host

This sled is gone:

root@[fd00:1122:3344:105::3]:32221/omicron> select serial_number,sled_policy from sled where ip::TEXT like '%fd00:1122:3344:124%';
  serial_number | sled_policy
----------------+--------------
  BRM23230018   | expunged
(1 row)

The fix for this is to choose any other region that shares the snapshot id and is not expunged: every region snapshot sharing that snapshot id is exactly the same and can be used as a clone source.
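Roughly, the selection could look something like this (sketch only; `CloneCandidate` and `DiskPolicy` are illustrative stand-ins here, not the real Nexus datastore types):

use std::net::SocketAddrV6;

// Sketch only, not the actual Nexus implementation: given the candidate
// region snapshots that share a snapshot id, plus the policy of the
// physical disk backing each candidate's dataset, pick one that is still
// in service to use as the clone source.
#[derive(Clone, Copy, PartialEq, Eq)]
enum DiskPolicy {
    InService,
    Expunged,
}

struct CloneCandidate {
    snapshot_addr: SocketAddrV6,
    dataset_policy: DiskPolicy,
}

fn choose_clone_source(
    candidates: &[CloneCandidate],
) -> Option<SocketAddrV6> {
    // Every region snapshot sharing the snapshot id holds identical data,
    // so any candidate backed by an in-service dataset will do.
    candidates
        .iter()
        .find(|c| c.dataset_policy == DiskPolicy::InService)
        .map(|c| c.snapshot_addr)
}

In the real saga the candidates would come from the database, and could include the read-only regions for that snapshot as well, but the selection rule is the same.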

@jmpesp
Contributor Author

jmpesp commented Dec 5, 2024

Clone source is selected here:

let region_snapshot = osagactx
    .datastore()
    .region_snapshot_get(
        params.request.old_dataset_id.into(),
        params.request.old_region_id,
        params.request.old_snapshot_id,
    )
    .await
    .map_err(ActionError::action_failed)?;

let Some(region_snapshot) = region_snapshot else {
    return Err(ActionError::action_failed(format!(
        "region snapshot {} {} {} deleted!",
        params.request.old_dataset_id,
        params.request.old_region_id,
        params.request.old_snapshot_id,
    )));
};

let (new_dataset, new_region) = new_dataset_and_region;

// Currently, the repair port is set using a fixed offset above the
// downstairs port. Once this goes away, Nexus will require a way to query
// for the repair port!

let mut source_repair_addr: SocketAddrV6 =
    match region_snapshot.snapshot_addr.parse() {
        Ok(addr) => addr,
        Err(e) => {
            return Err(ActionError::action_failed(format!(
                "error parsing region_snapshot.snapshot_addr: {e}"
            )));
        }
    };

source_repair_addr.set_port(
    source_repair_addr.port() + crucible_common::REPAIR_PORT_OFFSET,
);

@morlandi7 morlandi7 added this to the 12 milestone Dec 6, 2024
jmpesp added a commit to jmpesp/omicron that referenced this issue Dec 10, 2024
When performing region snapshot replacement, the associated start saga
chose the request's region snapshot as the clone source, but if that
region snapshot was backed by an expunged dataset then it may be gone.

This commit adds logic to choose another clone source, either another
region snapshot from the same snapshot, or one of the read-only regions
for that snapshot.

Basic sanity tests were added for ensuring that region replacements and
region snapshot replacements resulting from expungement can occur. It
was an oversight not to originally include these!

Fixes oxidecomputer#7209
@askfongjojo

Is it possible that the region replacements that failed are using up disk space in crucible on rack2? We've noticed that some of the crucible used bytes have been getting higher in recent weeks, e.g. on sled 11:

BRM42220006 # zfs list -o used,avail,name | grep '/crucible/regions$'
2.33T   485G  oxp_025725fa-9e40-4b46-b018-c420408394ef/crucible/regions
2.80T  1.19G  oxp_32b5303f-f667-4345-84d2-c7eec63b91b2/crucible/regions
1.71T  1.10T  oxp_4366f80d-3902-4b93-8f2d-380008e805fc/crucible/regions
1.17T  1.63T  oxp_4d2f5aaf-eb14-4b1e-aa99-ae38ec844605/crucible/regions
1.57T  1.23T  oxp_534bcd4b-502f-4109-af6e-4b28a22c20f1/crucible/regions
1.40T  1.40T  oxp_7f778610-7328-4554-98f6-b17f74f551c7/crucible/regions
2.24T   575G  oxp_834c9aad-c53b-4357-bc3f-f422efa63848/crucible/regions
1.66T  1.15T  oxp_cf14d1b9-b4db-4594-b3ab-a9957e770ce9/crucible/regions
1.13T  1.68T  oxp_cf5f8849-0c5a-475b-8683-6d17da88d1d1/crucible/regions
2.01T   811G  oxp_d7665e0d-9354-4341-a76f-965d7c49f277/crucible/regions

Looking at the crucible zone that corresponds to oxp_32b5303f-f667-4345-84d2-c7eec63b91b2, I see many regions in failed state:

root@oxz_crucible_65b3db59:~# cat /data/crucible.json | jq -r '.regions | .. | objects | .id + "," + .state + "," + (.block_size|tostring) + "," + (.extent_size|tostring) + "," + (.extent_count|tostring)' | grep -v destroyed
,,null,null,null
006bca50-7241-41a2-a296-7fbaaf172e06,failed,512,131072,64
0084632a-a5ce-43a3-b9a0-489aaa7ca069,failed,512,131072,64
00bb7bcf-c31b-4be3-9f14-2c6ece1179e2,failed,512,131072,16
00dcbad7-6053-4ca6-bb11-cd802aa8939e,failed,512,131072,16
01382c30-e3bd-4c46-8868-f7bf4db635d4,failed,2048,32768,16
01832e4c-957e-44d1-bb01-70b770630acf,failed,512,131072,16
01897309-d5f1-47a6-9eb6-4d0ec1a929be,failed,512,131072,16
048233b2-22d5-46e6-b272-2423b8c3133c,failed,512,131072,16
048ded0b-fdb4-47a4-85e3-3595cfa8cb3a,failed,512,131072,16
05611463-cc99-4345-8995-fc38f7c25ce9,created,512,131072,144
05bcac31-5a48-48d7-932b-90888d170304,failed,2048,32768,16
05fd8535-3445-41c3-ac89-b0f05a118238,failed,512,131072,64
083a707f-d41a-440b-816a-97caeeca0689,failed,512,131072,64
090f4aba-273a-48d3-8941-63daace9bd98,failed,2048,32768,16
0971af29-1da5-49d0-aa35-a496af6ed6cd,failed,512,131072,64
09d05fc8-8cf2-408f-80ca-c7af55bbbc33,failed,2048,32768,16
0a98f029-058b-473b-bc7c-183bfaea6330,created,512,131072,160
0ae6abf3-47d9-4e2e-8ab4-d92c13e57b30,failed,512,131072,64
0b4b089e-61d4-4ea6-a58d-8ede71eb573f,failed,2048,32768,16
0c6241f6-4674-4b51-8b4c-6f8579546342,failed,2048,32768,16
0c9dd400-0b8e-47bf-af00-148ea3450e9b,failed,512,131072,16
0ce729e1-c846-437f-9cc5-827e81c2c85d,failed,512,131072,64
... 

root@oxz_crucible_65b3db59:~# cat /data/crucible.json | jq -r '.regions | .. | objects | .state' | grep failed | wc -l
     397

The actual usage of non-bogus regions should be around 1.6 TiB if I sum up all of the entries in "created" state (quick arithmetic below the listing):

root@oxz_crucible_65b3db59:~# cat /data/crucible.json | jq -r '.regions | .. | objects | .id + "," + .state + "," + (.block_size|tostring) + "," + (.extent_size|tostring) + "," + (.extent_count|tostring)' | grep created        
05611463-cc99-4345-8995-fc38f7c25ce9,created,512,131072,144
0a98f029-058b-473b-bc7c-183bfaea6330,created,512,131072,160
16cbaedc-bf94-4d3b-a17a-764cffbf4fbd,created,512,131072,1600
33137857-d328-4102-a60c-ea9a57b93a9a,created,512,131072,160
4c11b3a1-652e-470f-b835-58a607cb55da,created,4096,16384,800
5704fb34-35e4-472d-ab97-ee25b9411dfd,created,512,131072,160
6efe0cf2-0d8c-474a-979c-144399c0f801,created,512,131072,640
8920bb85-d96c-424c-b837-d9f82c7883cd,created,2048,32768,16
98db14d2-035a-4558-8344-f2e600a20717,created,512,131072,64
a8f988d9-e0ea-4ff1-b45d-dfd45dc6ea91,created,512,131072,144
c21148af-f6ea-441b-a6c1-e3dc13b735de,created,512,131072,48
c7b4bcb7-37cf-41f5-99bb-5e34389950d6,created,512,131072,160
c8ba31fc-7b34-4508-91ab-47249d8eae91,created,512,131072,160
e727ef98-e0a1-4572-9688-dc0d93f67f9a,created,512,131072,480
f3c087b8-aac7-4725-b3c6-5148eed820c5,created,512,131072,3200
ffaea240-f349-4375-bb82-965b8e8c3859,created,512,131072,2048
ffde7910-5830-4b26-9054-82937ae78656,created,512,131072,16000
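(Assuming extent_size is in blocks, each region's nominal size is block_size × extent_size × extent_count, which works out to 64 MiB per extent for every entry above. The largest created region is 512 × 131072 × 16000 ≈ 1000 GiB, and summing all seventeen created entries gives about 1,624 GiB ≈ 1.6 TiB.)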

If the failed regions are indeed eating up disk space, we may need to find a way to clean them up. (This issue affects rack2 only.)

@jmpesp
Contributor Author

jmpesp commented Dec 12, 2024

I think you're on to something. Looking at the same gimlet in rack2:

root@oxz_crucible_65b3db59:~# curl -sqSL 'http://[fd00:1122:3344:106::7]:32345/crucible/0/regions/fd86cf4b-cc6d-4e1c-a849-2b8669792f4e' | jq -r .state
failed

root@oxz_crucible_65b3db59:~# du -hs /data/regions/fd86cf4b-cc6d-4e1c-a849-2b8669792f4e/
8.16M   /data/regions/fd86cf4b-cc6d-4e1c-a849-2b8669792f4e

jmpesp added a commit to jmpesp/omicron that referenced this issue Dec 12, 2024
One of the common sharp edges of sagas is that the compensating action
of a node does _not_ run if the forward action fails. Said another way,
for this node:

    EXAMPLE -> "output" {
      + forward_action
      - forward_action_undo
    }

If `forward_action` fails, `forward_action_undo` is never executed.
Forward actions are therefore required to be atomic, in that they either
fully apply or don't apply at all.

Sagas with nodes that ensure multiple regions exist cannot be atomic
because they can partially fail (for example: what if only 2 out of 3
ensures succeed?). In order for the compensating action to be run, it
must exist as a separate node that has a no-op forward action:

    EXAMPLE_UNDO -> "not_used" {
      + noop
      - forward_action_undo
    }
    EXAMPLE -> "output" {
      + forward_action
    }

The region snapshot replacement start saga will only ever ensure that a
single region exists, so one might think they could get away with a
single node that combines the forward and compensating action - you'd be
mistaken! The Crucible agent's region ensure is not atomic in all cases:
if the region fails to create, it enters the `failed` state, but is not
deleted. Nexus must clean these up.

Fixes an issue that Angela saw where failed regions were taking up disk
space in rack2 (oxidecomputer#7209). Separate work will be needed to clean those
up; this commit simply stops the accumulation.
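To make that ordering concrete, here is a toy, self-contained illustration of the unwind rule (this is not steno or the Nexus saga machinery; the `Node` struct and executor are invented for the example):

    // Toy illustration only: a made-up executor showing that a failed node's
    // own undo never runs, while undos of previously completed nodes do.
    struct Node {
        name: &'static str,
        forward: fn() -> Result<(), String>,
        undo: Option<fn()>,
    }

    fn run(nodes: &[Node]) {
        let mut completed: Vec<&Node> = Vec::new();
        for node in nodes {
            match (node.forward)() {
                Ok(()) => completed.push(node),
                Err(e) => {
                    println!("{} failed: {e}; unwinding", node.name);
                    // Only nodes that completed get their undo run; the
                    // failed node's own undo (if it had one) is never invoked.
                    for done in completed.iter().rev() {
                        if let Some(undo) = done.undo {
                            undo();
                        }
                    }
                    return;
                }
            }
        }
    }

    fn main() {
        // The pattern from above: register the cleanup on a no-op node
        // *before* the fallible ensure, so it still runs if the ensure
        // fails partway (e.g. the agent leaves a region in `failed`).
        let nodes = [
            Node {
                name: "EXAMPLE_UNDO",
                forward: || Ok(()), // no-op forward action
                undo: Some(|| println!("cleaning up partially-ensured regions")),
            },
            Node {
                name: "EXAMPLE",
                forward: || Err("only 2 of 3 region ensures succeeded".to_string()),
                undo: None,
            },
        ];
        run(&nodes);
    }

Run as-is, it prints the failure and then the cleanup message, mirroring how the separate no-op node guarantees the compensating action still runs.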
jmpesp added a commit that referenced this issue Dec 17, 2024
One of the common sharp edges of sagas is that the compensating action
of a node does _not_ run if the forward action fails. Said another way,
for this node:

    EXAMPLE -> "output" {
      + forward_action
      - forward_action_undo
    }

If `forward_action` fails, `forward_action_undo` is never executed.
Forward actions are therefore required to be atomic, in that they either
fully apply or don't apply at all.

Sagas with nodes that ensure multiple regions exist cannot be atomic
because they can partially fail (for example: what if only 2 out of 3
ensures succeed?). In order for the compensating action to be run, it
must exist as a separate node that has a no-op forward action:

    EXAMPLE_UNDO -> "not_used" {
      + noop
      - forward_action_undo
    }
    EXAMPLE -> "output" {
      + forward_action
    }

The region snapshot replacement start saga will only ever ensure that a
single region exists, so one might think they could get away with a
single node that combines the forward and compensating action - you'd be
mistaken! The Crucible agent's region ensure is not atomic in all cases:
if the region fails to create, it enters the `failed` state, but is not
deleted. Nexus must clean these up.

Fixes an issue that Angela saw where failed regions were taking up disk
space in rack2 (#7209). This commit also includes an omdb command for
finding these orphaned regions and optionally cleaning them up.
jmpesp added a commit to jmpesp/omicron that referenced this issue Dec 19, 2024
When performing region snapshot replacement, the associated start saga
chose the request's region snapshot as the clone source, but if that
region snapshot was backed by an expunged dataset then it may be gone.

This commit adds logic to choose another clone source, either another
region snapshot from the same snapshot, or one of the read-only regions
for that snapshot.

Basic sanity tests were added for ensuring that region replacements and
region snapshot replacements resulting from expungement can occur. It
was an oversight not to originally include these! In order to support
these new sanity tests, the simulated pantry has to fake activating
volumes in the background. This commit also refactors the simulated
Pantry to have one Mutex around an "inner" struct instead of many
Mutexes.

Fixes oxidecomputer#7209
@jmpesp jmpesp closed this as completed in 460f038 Dec 19, 2024