Explicitly manage schema upgrade transactions #4096

smklein · 2023-09-15T06:46:18Z

Previously, schema updates encouraged authors on each change to add their own transactions, validating that the "current" and "target" versions are correct.
This unfortunately is not handled particularly well in scripted SQL. I incorrectly thought that failing a transaction while batch_execute-ing it (e.g., via a CAST error) would cause the transaction to fail, and rollback. This is not true. In CockroachDB, an error is thrown, but the transaction is not closed. This was the cause of CI test failure on "main" in concurrent_nexus_instances_only_move_forward #4093 , where connections stuck in this mangled ongoing transaction state were placed back into the connection pool.
To fix this: Nexus now explicitly wraps each schema change in a transaction using Diesel, which ensures that "on success, they're committed, and on failure, they're rolled back").
Additionally, this PR upgrades all existing schema changes to conform to this "implied transaction from Nexus" policy, and makes it possible to upgrade using multiple transactions in a single version change.

smklein · 2023-09-15T06:54:55Z

TL;DR:

  connection.batch_execute("BEGIN; <Perform any transaction that can conditionally fail>; COMMIT")

Cannot work, because Cockroachdb will not automatically issue ROLLBACK for a failed transaction. Some databases support this behavior behind a toggle named XACT_ABORT, but CockroachDB is not one of them. Even Cockroach docs basically suggest "retry if you can, then explicitly issue a rollback in application-level code".

david-crespo · 2023-09-15T15:42:18Z

schema/crdb/3.0.3/up.sql

 ALTER TABLE omicron.public.ip_pool
    DROP COLUMN IF EXISTS internal;
-
-COMMIT;


This is so much better for the migration author anyway. The version check was clearly something to automate.

david-crespo · 2023-09-15T15:56:13Z

schema/crdb/README.adoc

-  previous version of CockroachDB to this version.
+  previous version of CockroachDB to this version. These migrations will always be placed
+  within one transaction per file.
+** If more than one change is needed per version, any number of files starting with `up`


"More than one change" is easily misinterpreted — might be good to link to something that would tell you what needs to be in its own transaction. I guess it would be that very complicated CRDB doc. It might be cool if we had our own version of that doc that tried to distill the relevant bits into something reasonable — so far that might be stuck in PR comments.

I expanded on this section, and added a link to the CRDB docs. I do still think it's difficult to get right, but hopefully this clarifies things more than before.

david-crespo · 2023-09-15T16:34:11Z

schema/crdb/README.adoc

+** CockroachDB documentation recommends the following: "Execute schema changes... in a single
+   explicit transaction consisting of the single schema change statement".
+   Practically this means: If you want to change multiple tables, columns,
+   types, indices, or constraints, do so in separate files.


This seems too strong — you can do a bunch of column changes in a single alter table. I wouldn't want people to think they have to do one drop column per file. Maybe we can make the rule of thumb one statement per file?

Can you find the source for this? I saw that there was an exception if you're creating a table, but where are we clear to do multiple column changes?

For example:

https://www.cockroachlabs.com/docs/stable/online-schema-changes#:~:text=DROP%20COLUMN%20can%20result%20in%20data%20loss%20if%20one%20of%20the%20other%20schema%20changes%20in%20the%20transaction%20fails%20or%20is%20canceled.%20To%20work%20around%20this%2C%20move%20the%20DROP%20COLUMN%20statement%20to%20its%20own%20explicit%20transaction%20or%20run%20it%20in%20a%20single%20statement%20outside%20the%20existing%20transaction.

DROP COLUMN can result in data loss if one of the other schema changes in the transaction fails or is canceled. To work around this, move the DROP COLUMN statement to its own explicit transaction or run it in a single statement outside the existing transaction.

Ah, I see https://www.cockroachlabs.com/docs/v23.1/alter-table#add-and-rename-columns-atomically , but I think that the "DROP COLUMN" case is still an exception, and any modifications to constraints and indices also probably need to be separate?

... All this is to say: I think there may be exceptions, but I'd rather folks stick to "one DDL per file" until they have a citation from the CRDB docs telling them it's safe to do something else. This stuff is super hard to get right, and if we have to lean in any direction, I'd rather we be conservative.

https://github.com/oxidecomputer/omicron/blob/03b7588bc365a5afa8f0069347dadc80f8636454/schema/crdb/3.0.1/up1.sql

ALTER TABLE omicron.public.ip_pool DROP COLUMN IF EXISTS project_id, ADD COLUMN IF NOT EXISTS is_default BOOLEAN NOT NULL DEFAULT FALSE, DROP CONSTRAINT IF EXISTS project_implies_silo, DROP CONSTRAINT IF EXISTS internal_pools_have_null_silo_and_project;

...is that bad?

So, for what it's worth, I think a lot of the failures that CRDB docs are nervous about are "more real" in the context of live update, and are really bad if you're doing a DDL + adding an invariant (like a unique index, or constraint) concurrently, where you could insert data that violates that constraint before the upgrade finishes.

So when I say "bad", I want to answer from that context. "Bad" may be "not actually that bad" when we're doing our current "offline upgrade" scheme, where traffic is basically halted until a single Nexus node finishes issuing the update.

But:

As far as I can tell, the risk of the DROP COLUMN is that it may partially occur, if one of the other DDL statements fails and otherwise rolls back. This could result in a state where the IP pool project_id is... partially dropped? I don't actually know how this would manifest.

The ordering of "adding and removing constraints" is kinda fuzzy to me. https://www.cockroachlabs.com/docs/stable/online-schema-changes#schema-change-ddl-statements-inside-a-multi-statement-transaction-can-fail-while-other-statements-succeed talks at length about scenarios where similar looking statements can insert {data, constraint on that data} concurrently, but the evaluation is not done until the end of the transaction, and rollback cannot occur. In your specific case, I don't see how it could break, but CRDB also says "This error will occur in various scenarios, including but not limited to...", so I dunno about all potential errors that could occur.

All this is to say, the following might be safer, in different files. Is it required? I honestly don't know.

ALTER TABLE omicron.public.ip_pool DROP COLUMN IF EXISTS project_id, ALTER TABLE omicron.public.ip_pool ADD COLUMN IF NOT EXISTS is_default BOOLEAN NOT NULL DEFAULT FALSE, ALTER TABLE omicron.public.ip_pool DROP CONSTRAINT IF EXISTS project_implies_silo, ALTER TABLE omicron.public.ip_pool DROP CONSTRAINT IF EXISTS internal_pools_have_null_silo_and_project;

smklein added 2 commits September 14, 2023 23:42

Explicitly manage schema upgrade transactions

b5e04fb

cleanup

e35a27a

smklein requested review from davepacheco and david-crespo September 15, 2023 06:49

david-crespo reviewed Sep 15, 2023

View reviewed changes

Update docs

03b7588

david-crespo reviewed Sep 15, 2023

View reviewed changes

david-crespo approved these changes Sep 15, 2023

View reviewed changes

smklein merged commit ca311de into main Sep 15, 2023

smklein deleted the better-schema-commitment branch September 15, 2023 19:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Explicitly manage schema upgrade transactions #4096

Explicitly manage schema upgrade transactions #4096

smklein commented Sep 15, 2023 •

edited

Loading

smklein commented Sep 15, 2023

david-crespo Sep 15, 2023

david-crespo Sep 15, 2023

smklein Sep 15, 2023

david-crespo Sep 15, 2023

smklein Sep 15, 2023

smklein Sep 15, 2023

smklein Sep 15, 2023

smklein Sep 15, 2023

david-crespo Sep 15, 2023

smklein Sep 15, 2023 •

edited

Loading

Explicitly manage schema upgrade transactions #4096

Explicitly manage schema upgrade transactions #4096

Conversation

smklein commented Sep 15, 2023 • edited Loading

smklein commented Sep 15, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

smklein Sep 15, 2023 • edited Loading

Choose a reason for hiding this comment

smklein commented Sep 15, 2023 •

edited

Loading

smklein Sep 15, 2023 •

edited

Loading