
Read-only region clone attempting to contact expunged downstairs #7209

Closed
jmpesp opened this issue Dec 5, 2024 · 3 comments · Fixed by #7283

@jmpesp
Contributor

jmpesp commented Dec 5, 2024

The following region snapshot replacement is not making forward progress:

262313d2-72ef-482d-b750-1c9115d74d59:

                    started: 2024-12-05 01:33:05.919898 UTC
                      state: Allocating
            region snapshot: 3ec0e848-39de-495e-be0e-88241e11d0fb 2bf499e2-a4f4-4e2e-9977-875f4914a7ce 39618f4c-0040-4a8b-9faf-0391a560d960
              new region id: None
     in-progress steps left: 0

Looking at a corresponding region-snapshot-replacement-start saga invocation shows that the new_region_ensure node is failing:

d56b625d-1ada-4cd5-bde1-8318abbd951c | 2024-12-05T06:42:01.865Z |          |   5: region_snapshot_replacement_start.find_new_region         | succeeded     | "new_dataset_and_region" => [{"compression":null,"identity":{"id":"46d1afcc-cc3f-4b17-aafc-054dd4862d15","time_created":"2023-08-30T18:59:10.758515Z","time_modified":"2023-08-30T18:59:10.758515Z"},"ip":"fd00:1122:3344:106::5","kind":"Crucible","pool_id":"7f778610-7328-4554-98f6-b17f74f551c7","port":32345,"quota":null,"rcgen":1,"reservation":null,"size_used":341449900032,"time_deleted":null,"zone_name":null},{"block_size":512,"blocks_per_extent":131072,"dataset_id":"46d1afcc-cc3f-4b17-aafc-054dd4862d15","deleting":false,"extent_count":16,"identity":{"id":"b9a2d268-de4d-43d0-a8b8-69e1b2f47791","time_created":"2024-12-05T06:42:01.228607Z","time_modified":"2024-12-05T06:42:01.228607Z"},"port":null,"read_only":true,"volume_id":"9d6ff72d-fb95-42e2-9765-0ff3eb103636"}]
d56b625d-1ada-4cd5-bde1-8318abbd951c | 2024-12-05T06:42:01.871Z |          |   6: region_snapshot_replacement_start.new_region_ensure       | started       | 
d56b625d-1ada-4cd5-bde1-8318abbd951c | 2024-12-05T06:42:17.866Z |          |   6: region_snapshot_replacement_start.new_region_ensure       | failed        | "ensured_dataset_and_region" => {"ActionFailed":{"source_error":{"InternalError":{"internal_message":"Failed to create region, unexpected state: Failed"}}}}

This step is responsible for creating a new cloned read-only region that will be used to replace the region snapshot. It uses the region snapshot's region as the clone source, but if that region is on an expunged dataset the clone will never succeed. Tracing through the Crucible agent logs shows this:

BRM42220006 # cat /pool/ext/cf14d1b9-b4db-4594-b3ab-a9957e770ce9/crypt/debug/oxz_crucible_46d1afcc-cc3f-4b17-aafc-054dd4862d15/oxide-crucible-agent:default.log.1733391340 | looker -o long | less

06:42:02.123Z INFO crucible: Connecting to [fd00:1122:3344:124::22]:23001 to obtain our extent files.
    task = clone

Error: Failed to get source region definition: Communication Error: error sending request for url (http://[fd00:1122:3344:124::22]:23001/region-info)

BRM42220006 # ping fd00:1122:3344:124::22
ping: sendto No route to host

This sled is gone:

root@[fd00:1122:3344:105::3]:32221/omicron> select serial_number,sled_policy from sled where ip::TEXT like '%fd00:1122:3344:124%';
  serial_number | sled_policy
----------------+--------------
  BRM23230018   | expunged
(1 row)

The fix for this is to choose any other region that shares the snapshot id and is not expunged: every region snapshot sharing that snapshot id is exactly the same and can be used as a clone source.
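Roughly, the selection could look something like this (sketch only; `CloneCandidate` and `DiskPolicy` are illustrative stand-ins here, not the real Nexus datastore types):

use std::net::SocketAddrV6;

// Sketch only, not the actual Nexus implementation: given the candidate
// region snapshots that share a snapshot id, plus the policy of the
// physical disk backing each candidate's dataset, pick one that is still
// in service to use as the clone source.
#[derive(Clone, Copy, PartialEq, Eq)]
enum DiskPolicy {
    InService,
    Expunged,
}

struct CloneCandidate {
    snapshot_addr: SocketAddrV6,
    dataset_policy: DiskPolicy,
}

fn choose_clone_source(
    candidates: &[CloneCandidate],
) -> Option<SocketAddrV6> {
    // Every region snapshot sharing the snapshot id holds identical data,
    // so any candidate backed by an in-service dataset will do.
    candidates
        .iter()
        .find(|c| c.dataset_policy == DiskPolicy::InService)
        .map(|c| c.snapshot_addr)
}

In the real saga the candidates would come from the database, and could include the read-only regions for that snapshot as well, but the selection rule is the same.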

@jmpesp
Contributor Author

jmpesp commented Dec 5, 2024

Clone source is selected here:

let region_snapshot = osagactx
    .datastore()
    .region_snapshot_get(
        params.request.old_dataset_id.into(),
        params.request.old_region_id,
        params.request.old_snapshot_id,
    )
    .await
    .map_err(ActionError::action_failed)?;

let Some(region_snapshot) = region_snapshot else {
    return Err(ActionError::action_failed(format!(
        "region snapshot {} {} {} deleted!",
        params.request.old_dataset_id,
        params.request.old_region_id,
        params.request.old_snapshot_id,
    )));
};

let (new_dataset, new_region) = new_dataset_and_region;

// Currently, the repair port is set using a fixed offset above the
// downstairs port. Once this goes away, Nexus will require a way to query
// for the repair port!

let mut source_repair_addr: SocketAddrV6 =
    match region_snapshot.snapshot_addr.parse() {
        Ok(addr) => addr,
        Err(e) => {
            return Err(ActionError::action_failed(format!(
                "error parsing region_snapshot.snapshot_addr: {e}"
            )));
        }
    };

source_repair_addr.set_port(
    source_repair_addr.port() + crucible_common::REPAIR_PORT_OFFSET,
);

@morlandi7 morlandi7 added this to the 12 milestone Dec 6, 2024
jmpesp added a commit to jmpesp/omicron that referenced this issue Dec 10, 2024
When performing region snapshot replacement, the associated start saga
chose the request's region snapshot as the clone source, but if that
region snapshot was backed by an expunged dataset then it may be gone.

This commit adds logic to choose another clone source, either another
region snapshot from the same snapshot, or one of the read-only regions
for that snapshot.

Basic sanity tests were added for ensuring that region replacements and
region snapshot replacements resulting from expungement can occur. It
was an oversight not to originally include these!

Fixes oxidecomputer#7209
@askfongjojo

Is it possible that the region replacements that failed are using up disk space in crucible on rack2? We've noticed that some of the crucible used bytes have been getting higher in recent weeks, e.g. on sled 11:

BRM42220006 # zfs list -o used,avail,name | grep '/crucible/regions$'
2.33T   485G  oxp_025725fa-9e40-4b46-b018-c420408394ef/crucible/regions
2.80T  1.19G  oxp_32b5303f-f667-4345-84d2-c7eec63b91b2/crucible/regions
1.71T  1.10T  oxp_4366f80d-3902-4b93-8f2d-380008e805fc/crucible/regions
1.17T  1.63T  oxp_4d2f5aaf-eb14-4b1e-aa99-ae38ec844605/crucible/regions
1.57T  1.23T  oxp_534bcd4b-502f-4109-af6e-4b28a22c20f1/crucible/regions
1.40T  1.40T  oxp_7f778610-7328-4554-98f6-b17f74f551c7/crucible/regions
2.24T   575G  oxp_834c9aad-c53b-4357-bc3f-f422efa63848/crucible/regions
1.66T  1.15T  oxp_cf14d1b9-b4db-4594-b3ab-a9957e770ce9/crucible/regions
1.13T  1.68T  oxp_cf5f8849-0c5a-475b-8683-6d17da88d1d1/crucible/regions
2.01T   811G  oxp_d7665e0d-9354-4341-a76f-965d7c49f277/crucible/regions

Looking at the crucible zone that corresponds to oxp_32b5303f-f667-4345-84d2-c7eec63b91b2, I see many regions in failed state:

root@oxz_crucible_65b3db59:~# cat /data/crucible.json | jq -r '.regions | .. | objects | .id + "," + .state + "," + (.block_size|tostring) + "," + (.extent_size|tostring) + "," + (.extent_count|tostring)' | grep -v destroyed
,,null,null,null
006bca50-7241-41a2-a296-7fbaaf172e06,failed,512,131072,64
0084632a-a5ce-43a3-b9a0-489aaa7ca069,failed,512,131072,64
00bb7bcf-c31b-4be3-9f14-2c6ece1179e2,failed,512,131072,16
00dcbad7-6053-4ca6-bb11-cd802aa8939e,failed,512,131072,16
01382c30-e3bd-4c46-8868-f7bf4db635d4,failed,2048,32768,16
01832e4c-957e-44d1-bb01-70b770630acf,failed,512,131072,16
01897309-d5f1-47a6-9eb6-4d0ec1a929be,failed,512,131072,16
048233b2-22d5-46e6-b272-2423b8c3133c,failed,512,131072,16
048ded0b-fdb4-47a4-85e3-3595cfa8cb3a,failed,512,131072,16
05611463-cc99-4345-8995-fc38f7c25ce9,created,512,131072,144
05bcac31-5a48-48d7-932b-90888d170304,failed,2048,32768,16
05fd8535-3445-41c3-ac89-b0f05a118238,failed,512,131072,64
083a707f-d41a-440b-816a-97caeeca0689,failed,512,131072,64
090f4aba-273a-48d3-8941-63daace9bd98,failed,2048,32768,16
0971af29-1da5-49d0-aa35-a496af6ed6cd,failed,512,131072,64
09d05fc8-8cf2-408f-80ca-c7af55bbbc33,failed,2048,32768,16
0a98f029-058b-473b-bc7c-183bfaea6330,created,512,131072,160
0ae6abf3-47d9-4e2e-8ab4-d92c13e57b30,failed,512,131072,64
0b4b089e-61d4-4ea6-a58d-8ede71eb573f,failed,2048,32768,16
0c6241f6-4674-4b51-8b4c-6f8579546342,failed,2048,32768,16
0c9dd400-0b8e-47bf-af00-148ea3450e9b,failed,512,131072,16
0ce729e1-c846-437f-9cc5-827e81c2c85d,failed,512,131072,64
... 

root@oxz_crucible_65b3db59:~# cat /data/crucible.json | jq -r '.regions | .. | objects | .state' | grep failed | wc -l
     397

The actual usage of non-bogus regions should be around 1.6 TiB if I sum up all of the entries in "created" state (quick arithmetic below the listing):

root@oxz_crucible_65b3db59:~# cat /data/crucible.json | jq -r '.regions | .. | objects | .id + "," + .state + "," + (.block_size|tostring) + "," + (.extent_size|tostring) + "," + (.extent_count|tostring)' | grep created        
05611463-cc99-4345-8995-fc38f7c25ce9,created,512,131072,144
0a98f029-058b-473b-bc7c-183bfaea6330,created,512,131072,160
16cbaedc-bf94-4d3b-a17a-764cffbf4fbd,created,512,131072,1600
33137857-d328-4102-a60c-ea9a57b93a9a,created,512,131072,160
4c11b3a1-652e-470f-b835-58a607cb55da,created,4096,16384,800
5704fb34-35e4-472d-ab97-ee25b9411dfd,created,512,131072,160
6efe0cf2-0d8c-474a-979c-144399c0f801,created,512,131072,640
8920bb85-d96c-424c-b837-d9f82c7883cd,created,2048,32768,16
98db14d2-035a-4558-8344-f2e600a20717,created,512,131072,64
a8f988d9-e0ea-4ff1-b45d-dfd45dc6ea91,created,512,131072,144
c21148af-f6ea-441b-a6c1-e3dc13b735de,created,512,131072,48
c7b4bcb7-37cf-41f5-99bb-5e34389950d6,created,512,131072,160
c8ba31fc-7b34-4508-91ab-47249d8eae91,created,512,131072,160
e727ef98-e0a1-4572-9688-dc0d93f67f9a,created,512,131072,480
f3c087b8-aac7-4725-b3c6-5148eed820c5,created,512,131072,3200
ffaea240-f349-4375-bb82-965b8e8c3859,created,512,131072,2048
ffde7910-5830-4b26-9054-82937ae78656,created,512,131072,16000
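(Assuming extent_size is in blocks, each region's nominal size is block_size × extent_size × extent_count, which works out to 64 MiB per extent for every entry above. The largest created region is 512 × 131072 × 16000 ≈ 1000 GiB, and summing all seventeen created entries gives about 1,624 GiB ≈ 1.6 TiB.)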

If the failed regions are indeed eating up disk space, we may need to find a way to clean them up. (This issue affects rack2 only.)

@jmpesp
Contributor Author

jmpesp commented Dec 12, 2024

I think you're on to something. Looking at the same gimlet in rack2:

root@oxz_crucible_65b3db59:~# curl -sqSL 'http://[fd00:1122:3344:106::7]:32345/crucible/0/regions/fd86cf4b-cc6d-4e1c-a849-2b8669792f4e' | jq -r .state
failed

root@oxz_crucible_65b3db59:~# du -hs /data/regions/fd86cf4b-cc6d-4e1c-a849-2b8669792f4e/
8.16M   /data/regions/fd86cf4b-cc6d-4e1c-a849-2b8669792f4e

jmpesp added a commit to jmpesp/omicron that referenced this issue Dec 12, 2024
One of the common sharp edges of sagas is that the compensating action
of a node does _not_ run if the forward action fails. Said another way,
for this node:

    EXAMPLE -> "output" {
      + forward_action
      - forward_action_undo
    }

If `forward_action` fails, `forward_action_undo` is never executed.
Forward actions are therefore required to be atomic, in that they either
fully apply or don't apply at all.

Sagas with nodes that ensure multiple regions exist cannot be atomic
because they can partially fail (for example: what if only 2 out of 3
ensures succeed?). In order for the compensating action to be run, it
must exist as a separate node that has a no-op forward action:

    EXAMPLE_UNDO -> "not_used" {
      + noop
      - forward_action_undo
    }
    EXAMPLE -> "output" {
      + forward_action
    }

The region snapshot replacement start saga will only ever ensure that a
single region exists, so one might think they could get away with a
single node that combines the forward and compensating action - you'd be
mistaken! The Crucible agent's region ensure is not atomic in all cases:
if the region fails to create, it enters the `failed` state, but is not
deleted. Nexus must clean these up.

Fixes an issue that Angela saw where failed regions were taking up disk
space in rack2 (oxidecomputer#7209). Separate work will be needed to clean those
up; this commit simply stops the accumulation.
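To make that ordering concrete, here is a toy, self-contained illustration of the unwind rule (this is not steno or the Nexus saga machinery; the `Node` struct and executor are invented for the example):

    // Toy illustration only: a made-up executor showing that a failed node's
    // own undo never runs, while undos of previously completed nodes do.
    struct Node {
        name: &'static str,
        forward: fn() -> Result<(), String>,
        undo: Option<fn()>,
    }

    fn run(nodes: &[Node]) {
        let mut completed: Vec<&Node> = Vec::new();
        for node in nodes {
            match (node.forward)() {
                Ok(()) => completed.push(node),
                Err(e) => {
                    println!("{} failed: {e}; unwinding", node.name);
                    // Only nodes that completed get their undo run; the
                    // failed node's own undo (if it had one) is never invoked.
                    for done in completed.iter().rev() {
                        if let Some(undo) = done.undo {
                            undo();
                        }
                    }
                    return;
                }
            }
        }
    }

    fn main() {
        // The pattern from above: register the cleanup on a no-op node
        // *before* the fallible ensure, so it still runs if the ensure
        // fails partway (e.g. the agent leaves a region in `failed`).
        let nodes = [
            Node {
                name: "EXAMPLE_UNDO",
                forward: || Ok(()), // no-op forward action
                undo: Some(|| println!("cleaning up partially-ensured regions")),
            },
            Node {
                name: "EXAMPLE",
                forward: || Err("only 2 of 3 region ensures succeeded".to_string()),
                undo: None,
            },
        ];
        run(&nodes);
    }

Run as-is, it prints the failure and then the cleanup message, mirroring how the separate no-op node guarantees the compensating action still runs.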
jmpesp added a commit that referenced this issue Dec 17, 2024
One of the common sharp edges of sagas is that the compensating action
of a node does _not_ run if the forward action fails. Said another way,
for this node:

    EXAMPLE -> "output" {
      + forward_action
      - forward_action_undo
    }

If `forward_action` fails, `forward_action_undo` is never executed.
Forward actions are therefore required to be atomic, in that they either
fully apply or don't apply at all.

Sagas with nodes that ensure multiple regions exist cannot be atomic
because they can partially fail (for example: what if only 2 out of 3
ensures succeed?). In order for the compensating action to be run, it
must exist as a separate node that has a no-op forward action:

    EXAMPLE_UNDO -> "not_used" {
      + noop
      - forward_action_undo
    }
    EXAMPLE -> "output" {
      + forward_action
    }

The region snapshot replacement start saga will only ever ensure that a
single region exists, so one might think they could get away with a
single node that combines the forward and compensating action - you'd be
mistaken! The Crucible agent's region ensure is not atomic in all cases:
if the region fails to create, it enters the `failed` state, but is not
deleted. Nexus must clean these up.

Fixes an issue that Angela saw where failed regions were taking up disk
space in rack2 (#7209). This commit also includes an omdb command for
finding these orphaned regions and optionally cleaning them up.
jmpesp added a commit to jmpesp/omicron that referenced this issue Dec 19, 2024
When performing region snapshot replacement, the associated start saga
chose the request's region snapshot as the clone source, but if that
region snapshot was backed by an expunged dataset then it may be gone.

This commit adds logic to choose another clone source, either another
region snapshot from the same snapshot, or one of the read-only regions
for that snapshot.

Basic sanity tests were added for ensuring that region replacements and
region snapshot replacements resulting from expungement can occur. It
was an oversight not to originally include these! In order to support
these new sanity tests, the simulated pantry has to fake activating
volumes in the background. This commit also refactors the simulated
Pantry to have one Mutex around an "inner" struct instead of many
Mutexes.

Fixes oxidecomputer#7209
@jmpesp jmpesp closed this as completed in 460f038 Dec 19, 2024