Implement record based Crucible reference counting #6805

Merged

Conversation

@jmpesp (Contributor) commented Oct 9, 2024:

Crucible volumes are created by layering read-write regions over a hierarchy of read-only resources. Originally only a region snapshot could be used as a read-only resource for a volume. With the introduction of read-only regions (created during the region snapshot replacement process) this is no longer true!

Read-only resources can be used by many volumes, and because of this they need a reference count so they can be deleted when they're no longer referenced. The region_snapshot table uses a volume_references column, which counts how many uses there are. The region table does not have this column, and moreover a simple integer works for reference counting but does not tell you which volume each use comes from. This can be determined (see omdb's validate volume references command) but it's information that is tossed out, as Nexus knows which volumes use which resources! Instead, record what read-only resources a volume uses in a new table.
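
In sketch form, the bookkeeping described above looks something like the following (a rough sketch based on the model code quoted later in this review; the real definitions live in nexus/db-model/src/volume_resource_usage.rs):

    // Sketch only: one usage record per (volume, read-only resource) pair.
    use uuid::Uuid;

    /// A read-only resource that a volume references.
    pub enum VolumeResourceUsage {
        /// The volume references a read-only region directly.
        ReadOnlyRegion { region_id: Uuid },

        /// The volume references a region snapshot, identified by the
        /// (dataset, region, snapshot) triple.
        RegionSnapshot {
            dataset_id: Uuid,
            region_id: Uuid,
            snapshot_id: Uuid,
        },
    }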

As part of the schema change to add the new volume_resource_usage table, a migration is included that will create the appropriate records for all region snapshots.

In testing, a few bugs were found: the worst being that read-only regions did not have their read_only column set to true. This would be a problem if read-only regions are created, but they're currently only created during region snapshot replacement. To detect if any of these regions were created, find all regions that were allocated for a snapshot volume:

SELECT id FROM region
WHERE volume_id IN (SELECT volume_id FROM snapshot);

A similar bug was found in the simulated Crucible agent.

This commit also reverts #6728: region snapshot replacement was disabled there due to a lack of read-only region reference counting, so with that reference counting in place it can be enabled once again.

@jmpesp requested review from smklein and leftwo October 9, 2024 00:29

@leftwo (Contributor) left a comment:

I may have some more questions after I finish going through nexus/tests/integration_tests/volume_management.rs but I thought I should give you what I have so far.

Review thread on nexus/db-model/src/volume_resource_usage.rs (resolved, outdated).

    VolumeResourceUsage::ReadOnlyRegion {
        region_id: record
            .region_id
            .expect("valid read-only region usage record"),

Contributor:

For this and the .expects below, this message is printed when we panic, right? If so, should it be saying that we did not find a valid region usage record?

Collaborator:

Conservatively, this should maybe be a TryFrom implementation. As it exists today, someone could modify a column in the database and cause Nexus to panic with these .expects
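
A minimal sketch of that suggestion, building on the enum sketched near the top of this page (VolumeResourceUsageRecord and VolumeResourceUsageType are assumed names here, not the PR's actual identifiers):

    // Hypothetical: fallible conversion from the flat DB record to the
    // usage enum, so inconsistent rows surface as errors instead of panics.
    impl TryFrom<VolumeResourceUsageRecord> for VolumeResourceUsage {
        type Error = String;

        fn try_from(record: VolumeResourceUsageRecord) -> Result<Self, Self::Error> {
            match record.usage_type {
                VolumeResourceUsageType::ReadOnlyRegion => {
                    let region_id = record
                        .region_id
                        .ok_or("read-only region usage record has NULL region_id")?;
                    Ok(VolumeResourceUsage::ReadOnlyRegion { region_id })
                }

                VolumeResourceUsageType::RegionSnapshot => {
                    Ok(VolumeResourceUsage::RegionSnapshot {
                        dataset_id: record
                            .region_snapshot_dataset_id
                            .ok_or("region snapshot usage record has NULL dataset id")?,
                        region_id: record
                            .region_snapshot_region_id
                            .ok_or("region snapshot usage record has NULL region id")?,
                        snapshot_id: record
                            .region_snapshot_snapshot_id
                            .ok_or("region snapshot usage record has NULL snapshot id")?,
                    })
                }
            }
        }
    }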

Contributor Author:

agreed, that's in 0a45763

Review threads on nexus/db-queries/src/db/datastore/volume.rs (seven) and nexus/src/app/sagas/disk_create.rs (one), all resolved.

        'region_snapshot'
    );

    CREATE TABLE IF NOT EXISTS omicron.public.volume_resource_usage (

Collaborator:

I know it exists elsewhere in this PR, but I think this table would benefit from some text explaining what it is.

Contributor Author:

you're right, done in bd2ef84

Comment on lines +54 to +58
    pub region_id: Option<Uuid>,

    pub region_snapshot_dataset_id: Option<Uuid>,
    pub region_snapshot_region_id: Option<Uuid>,
    pub region_snapshot_snapshot_id: Option<Uuid>,

Collaborator:

I had no idea reading the DB schema that these were two groups of columns, and "either one or the other is non-null".

If that's the intent -- and it seems to be, based on VolumeResourceUsage as an enum -- maybe we could add a CHECK on the table validating this?

Something like:

    CONSTRAINT exactly_one_usage_source CHECK (
      (
        (usage_type = 'readonlyregion') AND
        (region_id IS NOT NULL) AND
        (region_snapshot_dataset_id IS NULL AND region_snapshot_region_id IS NULL AND region_snapshot_snapshot_id IS NULL)
      ) OR
      (
        (usage_type = 'regionsnapshot') AND
        (region_id IS NULL) AND
        (region_snapshot_dataset_id IS NOT NULL AND region_snapshot_region_id IS NOT NULL AND region_snapshot_snapshot_id IS NOT NULL)
      )
    )

Contributor Author:

nice, done in bd2ef84

    @@ -264,7 +264,7 @@ impl DataStore {
                block_size,
                blocks_per_extent,
                extent_count,
    -           read_only: false,
    +           read_only: maybe_snapshot_id.is_some(),

Collaborator:

Was this a bug before this PR?

Contributor Author:

Unfortunately yes

Contributor Author:

Note though that currently the only thing in Nexus that creates read-only regions is region snapshot replacement, so this wasn't a bug that was hit anywhere.

    enum VolumeCreationError {
        #[error("Error from Volume creation: {0}")]
        Public(Error),

    let maybe_volume: Option<Volume> = dsl::volume

Collaborator:

Volume has a time_deleted column -- do we not care about that here?

Contributor Author:

Not here, no - if the caller is trying to call volume_create with a volume that has the same ID as one that already exists or is soft-deleted, then we should disallow that.

Comment on lines 3060 to 3064
    // This function may be called with a replacement volume
    // that is completely blank, to be filled in later by this
    // function. `volume_create` will have been called but will
    // not have added any volume resource usage records, because
    // it was blank!

Collaborator:

I'm confused about this comment, in this location - namely, I'm not sure who is calling volume_create in this situation where we're saying "volume_create will have been called". Is this something we're doing within the body of this function, and I'm not seeing it? Or is this something someone else could concurrently be doing?

Contributor Author:

No, this was bad wording! Changed in 58169fe, let me know if that reads better

    // not have added any volume resource usage records, because
    // it was blank!
    //
    // The indention leaving this transaction is that the

Collaborator:

Suggested change:

    - // The indention leaving this transaction is that the
    + // The intention leaving this transaction is that the

Contributor Author:

also 58169fe

    // We don't support a pure Region VCR at the volume
    // level in the database, so this choice should
    // never be encountered.
    panic!("Region not supported as a top level volume");

Collaborator:

This seems like a footgun to me, to have a pub fn with an undocumented panic?

Contributor Author:

yeah, removed in 3849fe8

    OVERLAY(
        OVERLAY(
            MD5(volume.id::TEXT || dataset_id::TEXT || region_id::TEXT || snapshot_id::TEXT || snapshot_addr || volume_references::TEXT)
            PLACING '4' from 13

Collaborator:

What is this doing? Why '4'? Why from 13?

Contributor Author:

This code creates a deterministic V4 UUID from the other columns, which has the shape of

    xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx
    12345678 9111 1111 1112 222222222333
              012 3456 7890 123456789012

where the first hexadecimal digit of the third group is always 4 (that is, M = 4), and I put a "random" value (read: volume references) in for N (the variant field).

But from https://datatracker.ietf.org/doc/html/rfc9562#name-uuid-version-4:

Alternatively, an implementation MAY choose to randomly generate the exact required number of bits for random_a, random_b, and random_c (122 bits total) and then concatenate the version and variant in the required position.

I don't think I'm doing this right - position 17 shouldn't be "random":

var:
The 2-bit variant field as defined by Section 4.1, set to 0b10.

I can fix this tomorrow

Contributor Author:

done in 58169fe (plus a comment)
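
For reference, the resulting scheme in standalone form (a sketch, not the actual migration SQL; per the fix quoted further down, it forces the version nibble to '4' and the variant nibble to '8', i.e. top variant bits 0b10):

    // Sketch: build a deterministic, well-formed v4 UUID string from a
    // 32-character hex digest (e.g. an MD5 of the row's columns).
    fn deterministic_v4(hash_hex: &str) -> String {
        assert_eq!(hash_hex.len(), 32, "expected a 32-char hex digest");
        let mut digits: Vec<char> = hash_hex.chars().collect();
        digits[12] = '4'; // position 13 (1-indexed): the version field, M = 4
        digits[16] = '8'; // position 17: the variant field, top bits 0b10
        let s: String = digits.into_iter().collect();
        // xxxxxxxx-xxxx-4xxx-8xxx-xxxxxxxxxxxx
        format!(
            "{}-{}-{}-{}-{}",
            &s[0..8], &s[8..12], &s[12..16], &s[16..20], &s[20..32]
        )
    }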

Contributor Author:

I am leaving a comment so that Github stops bringing me to this text box. Please ignore!

    MD5(volume.id::TEXT || dataset_id::TEXT || region_id::TEXT || snapshot_id::TEXT || snapshot_addr || volume_references::TEXT)
    PLACING '4' from 13
    )
    PLACING TO_HEX(volume_references) from 17

Collaborator:

Why from 17?

Contributor Author:

See above comment.

The garbage collection of read/write regions must be separate from
read-only regions:

- read/write regions are garbage collected by either being deleted
  during a volume soft delete, or by appearing later during the "find
  deleted volume regions" section of the volume delete saga

- read-only regions are garbage collected only in the volume soft delete
  code, when there are no more references to them

`find_deleted_volume_regions` was changed to only operate on read/write
regions, and no longer returns the optional RegionSnapshot object: that
check was moved from the volume delete saga into the function, as it
didn't make sense that it was separated.

This commit also adds checks to validate that invariants related to
volumes are not violated during tests. One invalid test was deleted
(regions will never be deleted when they're in use!)

In order to properly test the separate region deletion routines, the
first part of the fixes for dealing with deleted volumes during region
snapshot replacement were brought in from that branch: these are the
changes to region_snapshot_replacement_step.rs and
region_snapshot_replacement_start.rs.

@jmpesp (Contributor Author) commented Oct 16, 2024:

@smklein @leftwo have a look at 2ff34d5, which contains a fix for the region deletion bug I've been talking about.

@jmpesp (Contributor Author) commented Nov 1, 2024:

@leftwo @smklein I'm all done responding to comments, please have another look!

@leftwo (Contributor) left a comment:

Some questions, but I like it all

Review thread on nexus/db-queries/src/db/datastore/volume.rs (resolved).

    return Ok(VolumeReplaceResult::ExistingVolumeDeleted);
    };

    if !old_region_in_vcr && new_region_in_vcr {
        // It does seem like the replacement happened

Contributor:

If we have just one of: !old_region_in_vcr or new_region_in_vcr, will those cases be captured below?

Contributor Author:

You mean

        if !old_region_in_vcr && new_region_in_vcr {
            // It does seem like the replacement happened
            return Ok(VolumeReplaceResult::AlreadyHappened);
        } else if old_region_in_vcr && !new_region_in_vcr {
            // Replacement hasn't happened yet
        } else if old_region_in_vcr && new_region_in_vcr {
            // XXX wat
        } else if !old_region_in_vcr && !new_region_in_vcr {
            // XXX wat
        }

?

Contributor:

Yes.

It seems like old_region_in_vcr && !new_region_in_vcr is what we expect, and the other two are things that should not happen?

Contributor Author:

Right, yeah. old_region_in_vcr && new_region_in_vcr means a replacement was performed (because new is there) but old is still there, so the replacement couldn't actually have taken place. !old_region_in_vcr && !new_region_in_vcr means the thing we're being asked to replace isn't there to begin with, and a replacement hasn't taken place (because new isn't there either).

Do you think I should have those cases in the code?

Contributor Author:

We chatted and I did put some comments in the code in f52fa63
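
For reference, a standalone sketch of that four-way case analysis (illustrative only, not the exact code from f52fa63):

    /// Sketch: classify a replacement's state given whether the old and
    /// new targets appear in the volume construction request (VCR).
    fn replacement_state(
        old_region_in_vcr: bool,
        new_region_in_vcr: bool,
    ) -> Result<&'static str, &'static str> {
        match (old_region_in_vcr, new_region_in_vcr) {
            // New is present and old is gone: the replacement already
            // happened.
            (false, true) => Ok("already happened"),
            // Old is present and new is not: the replacement hasn't
            // happened yet, so proceed with it.
            (true, false) => Ok("proceed with replacement"),
            // Both present: new was inserted but old never left, which
            // should be impossible.
            (true, true) => Err("invalid state: both old and new targets present"),
            // Neither present: the target to replace was never there, and
            // no replacement took place either.
            (false, false) => Err("invalid state: neither old nor new target present"),
        }
    }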

    volume_id.0,
    replacement_usage,
    );

    /// Replace a read-only target in a Volume with a new region

Contributor:

While it's true that we use "target" and "region" to mean the same thing, we should be consistent in a single sentence with the same name :)

Contributor Author:

This is kinda foreshadowing! Eventually "target" here could mean either a region snapshot or a read-only region, because I think the read-only region replacement is going to live along side the region snapshot replacement logic.

Review thread on nexus/db-queries/src/db/datastore/volume.rs (resolved).

    MD5(volume.id::TEXT || dataset_id::TEXT || region_id::TEXT || snapshot_id::TEXT || snapshot_addr || volume_references::TEXT)
    PLACING '4' from 13
    )
    PLACING '8' from 17

Collaborator:

This still feels somewhat dangerous to me, and I'm not sure we have tests for the validity nor uniqueness of the UUIDs being produced by this transition.

The goal here is to produce a volume_resource_usage record for each volume INNER JOIN region_snapshot that isn't being deleted, right? Couldn't you do this by doing:

    -- End goal: Find all volume/region_snapshot combos that **don't** have a volume_resource_usage record
    (volume INNER JOIN region_snapshot) LEFT JOIN (volume_resource_usage)
      ON
    (
      volume_resource_usage.volume = volume.id AND
      volume_resource_usage.region_snapshot_region_id = region_snapshot.region_id
    )

And then write a WHERE clause to ensure that volume_resource_usage.id is NULL for these cases?

Basically:

    INSERT INTO volume_resource_usage
    -- <New records, using normal UUID generation>
    FROM
    -- <The clause above - `volume JOIN region_snapshot`, but also `JOIN`-ing with `volume_resource_usage` to check a record doesn't already exist>
    WHERE
      volume.time_deleted IS NULL AND
      region_snapshot.deleting = false AND
      volume_resource_usage.id IS NULL

This would still be idempotent! Re-running it should be a no-op on the second iteration

Contributor Author:

This does work, see 2b511b1, thanks :) The test_migrate_.* tests confirm that it works.

@jmpesp merged commit 7cf688c into oxidecomputer:main Nov 12, 2024
16 checks passed
@jmpesp deleted the read_only_regions_need_ref_counting_too branch November 12, 2024 20:12