Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explicitly manage schema upgrade transactions #4096

Merged
merged 3 commits into from
Sep 15, 2023
Merged

Conversation

smklein
Copy link
Collaborator

@smklein smklein commented Sep 15, 2023

  • Previously, schema updates encouraged authors on each change to add their own transactions, validating that the "current" and "target" versions are correct.
  • This unfortunately is not handled particularly well in scripted SQL. I incorrectly thought that failing a transaction while batch_execute-ing it (e.g., via a CAST error) would cause the transaction to fail, and rollback. This is not true. In CockroachDB, an error is thrown, but the transaction is not closed. This was the cause of CI test failure on "main" in concurrent_nexus_instances_only_move_forward #4093 , where connections stuck in this mangled ongoing transaction state were placed back into the connection pool.
  • To fix this: Nexus now explicitly wraps each schema change in a transaction using Diesel, which ensures that "on success, they're committed, and on failure, they're rolled back").
  • Additionally, this PR upgrades all existing schema changes to conform to this "implied transaction from Nexus" policy, and makes it possible to upgrade using multiple transactions in a single version change.

Fixes #4093

@smklein
Copy link
Collaborator Author

smklein commented Sep 15, 2023

TL;DR:

  connection.batch_execute("BEGIN; <Perform any transaction that can conditionally fail>; COMMIT")

Cannot work, because Cockroachdb will not automatically issue ROLLBACK for a failed transaction. Some databases support this behavior behind a toggle named XACT_ABORT, but CockroachDB is not one of them. Even Cockroach docs basically suggest "retry if you can, then explicitly issue a rollback in application-level code".

ALTER TABLE omicron.public.ip_pool
DROP COLUMN IF EXISTS internal;

COMMIT;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so much better for the migration author anyway. The version check was clearly something to automate.

previous version of CockroachDB to this version.
previous version of CockroachDB to this version. These migrations will always be placed
within one transaction per file.
** If more than one change is needed per version, any number of files starting with `up`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"More than one change" is easily misinterpreted — might be good to link to something that would tell you what needs to be in its own transaction. I guess it would be that very complicated CRDB doc. It might be cool if we had our own version of that doc that tried to distill the relevant bits into something reasonable — so far that might be stuck in PR comments.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expanded on this section, and added a link to the CRDB docs. I do still think it's difficult to get right, but hopefully this clarifies things more than before.

** CockroachDB documentation recommends the following: "Execute schema changes... in a single
explicit transaction consisting of the single schema change statement".
Practically this means: If you want to change multiple tables, columns,
types, indices, or constraints, do so in separate files.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems too strong — you can do a bunch of column changes in a single alter table. I wouldn't want people to think they have to do one drop column per file. Maybe we can make the rule of thumb one statement per file?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you find the source for this? I saw that there was an exception if you're creating a table, but where are we clear to do multiple column changes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see https://www.cockroachlabs.com/docs/v23.1/alter-table#add-and-rename-columns-atomically , but I think that the "DROP COLUMN" case is still an exception, and any modifications to constraints and indices also probably need to be separate?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... All this is to say: I think there may be exceptions, but I'd rather folks stick to "one DDL per file" until they have a citation from the CRDB docs telling them it's safe to do something else. This stuff is super hard to get right, and if we have to lean in any direction, I'd rather we be conservative.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/oxidecomputer/omicron/blob/03b7588bc365a5afa8f0069347dadc80f8636454/schema/crdb/3.0.1/up1.sql

ALTER TABLE omicron.public.ip_pool
    DROP COLUMN IF EXISTS project_id,
    ADD COLUMN IF NOT EXISTS is_default BOOLEAN NOT NULL DEFAULT FALSE,
    DROP CONSTRAINT IF EXISTS project_implies_silo,
    DROP CONSTRAINT IF EXISTS internal_pools_have_null_silo_and_project;

...is that bad?

Copy link
Collaborator Author

@smklein smklein Sep 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, for what it's worth, I think a lot of the failures that CRDB docs are nervous about are "more real" in the context of live update, and are really bad if you're doing a DDL + adding an invariant (like a unique index, or constraint) concurrently, where you could insert data that violates that constraint before the upgrade finishes.

So when I say "bad", I want to answer from that context. "Bad" may be "not actually that bad" when we're doing our current "offline upgrade" scheme, where traffic is basically halted until a single Nexus node finishes issuing the update.

But:

  • As far as I can tell, the risk of the DROP COLUMN is that it may partially occur, if one of the other DDL statements fails and otherwise rolls back. This could result in a state where the IP pool project_id is... partially dropped? I don't actually know how this would manifest.
  • The ordering of "adding and removing constraints" is kinda fuzzy to me. https://www.cockroachlabs.com/docs/stable/online-schema-changes#schema-change-ddl-statements-inside-a-multi-statement-transaction-can-fail-while-other-statements-succeed talks at length about scenarios where similar looking statements can insert {data, constraint on that data} concurrently, but the evaluation is not done until the end of the transaction, and rollback cannot occur. In your specific case, I don't see how it could break, but CRDB also says "This error will occur in various scenarios, including but not limited to...", so I dunno about all potential errors that could occur.

All this is to say, the following might be safer, in different files. Is it required? I honestly don't know.

ALTER TABLE omicron.public.ip_pool
    DROP COLUMN IF EXISTS project_id,

ALTER TABLE omicron.public.ip_pool
    ADD COLUMN IF NOT EXISTS is_default BOOLEAN NOT NULL DEFAULT FALSE,

ALTER TABLE omicron.public.ip_pool
    DROP CONSTRAINT IF EXISTS project_implies_silo,

ALTER TABLE omicron.public.ip_pool
    DROP CONSTRAINT IF EXISTS internal_pools_have_null_silo_and_project;

@smklein smklein merged commit ca311de into main Sep 15, 2023
@smklein smklein deleted the better-schema-commitment branch September 15, 2023 19:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI test failure on "main" in concurrent_nexus_instances_only_move_forward
2 participants