Explicitly manage schema upgrade transactions #4096
Conversation
TL;DR: `connection.batch_execute("BEGIN; <Perform any transaction that can conditionally fail>; COMMIT")` cannot work, because CockroachDB will not automatically issue a ROLLBACK when a statement inside the batch fails.

```sql
ALTER TABLE omicron.public.ip_pool
DROP COLUMN IF EXISTS internal;

COMMIT;
```
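The alternative is for the migration runner to manage the transaction explicitly: issue `BEGIN` itself, and issue `ROLLBACK` itself when any statement errors, rather than trusting the database to abort. CockroachDB isn't available here, so this is a minimal sketch of that pattern using Python's `sqlite3` as a stand-in (SQLite DDL is also transactional); the table and statements are illustrative, not the real omicron migrations:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # autocommit mode; we manage transactions by hand
conn.execute("CREATE TABLE ip_pool (id INTEGER PRIMARY KEY, internal INTEGER)")
conn.execute("INSERT INTO ip_pool (id, internal) VALUES (1, 0)")

def apply_migration(conn, statements):
    """Run statements inside one explicitly managed transaction.

    On any error, issue ROLLBACK ourselves rather than assuming the
    database aborted the transaction for us.
    """
    conn.execute("BEGIN")
    try:
        for stmt in statements:
            conn.execute(stmt)
    except sqlite3.Error:
        conn.execute("ROLLBACK")
        return False
    conn.execute("COMMIT")
    return True

# A migration with a failing statement: nothing should be left applied.
ok = apply_migration(conn, [
    "ALTER TABLE ip_pool ADD COLUMN is_default INTEGER",
    "SELECT * FROM no_such_table",   # fails, triggers the explicit ROLLBACK
])
cols = [row[1] for row in conn.execute("PRAGMA table_info(ip_pool)")]
assert not ok
assert "is_default" not in cols  # the partial ADD COLUMN was rolled back
```

The key point matches the TL;DR: the `ROLLBACK` is our responsibility, not something the server does for us when a mid-batch statement fails.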
This is so much better for the migration author anyway. The version check was clearly something to automate.
previous version of CockroachDB to this version. These migrations will always be placed
within one transaction per file.
** If more than one change is needed per version, any number of files starting with `up`
"More than one change" is easily misinterpreted — might be good to link to something that would tell you what needs to be in its own transaction. I guess it would be that very complicated CRDB doc. It might be cool if we had our own version of that doc that tried to distill the relevant bits into something reasonable — so far that might be stuck in PR comments.
I expanded on this section, and added a link to the CRDB docs. I do still think it's difficult to get right, but hopefully this clarifies things more than before.
** CockroachDB documentation recommends the following: "Execute schema changes... in a single
explicit transaction consisting of the single schema change statement".
Practically this means: If you want to change multiple tables, columns,
types, indices, or constraints, do so in separate files.
This seems too strong — you can do a bunch of column changes in a single `ALTER TABLE`. I wouldn't want people to think they have to do one `DROP COLUMN` per file. Maybe we can make the rule of thumb one statement per file?
Can you find the source for this? I saw that there was an exception if you're creating a table, but where are we clear to do multiple column changes?
For example:

> DROP COLUMN can result in data loss if one of the other schema changes in the transaction fails or is canceled. To work around this, move the DROP COLUMN statement to its own explicit transaction or run it in a single statement outside the existing transaction.
Ah, I see https://www.cockroachlabs.com/docs/v23.1/alter-table#add-and-rename-columns-atomically, but I think that the "DROP COLUMN" case is still an exception, and any modifications to constraints and indices also probably need to be separate?
... All this is to say: I think there may be exceptions, but I'd rather folks stick to "one DDL per file" until they have a citation from the CRDB docs telling them it's safe to do something else. This stuff is super hard to get right, and if we have to lean in any direction, I'd rather we be conservative.
```sql
ALTER TABLE omicron.public.ip_pool
DROP COLUMN IF EXISTS project_id,
ADD COLUMN IF NOT EXISTS is_default BOOLEAN NOT NULL DEFAULT FALSE,
DROP CONSTRAINT IF EXISTS project_implies_silo,
DROP CONSTRAINT IF EXISTS internal_pools_have_null_silo_and_project;
```

...is that bad?
So, for what it's worth, I think a lot of the failures that CRDB docs are nervous about are "more real" in the context of live update, and are really bad if you're doing a DDL + adding an invariant (like a unique index, or constraint) concurrently, where you could insert data that violates that constraint before the upgrade finishes.
So when I say "bad", I want to answer from that context. "Bad" may be "not actually that bad" when we're doing our current "offline upgrade" scheme, where traffic is basically halted until a single Nexus node finishes issuing the update.
But:
- As far as I can tell, the risk of the `DROP COLUMN` is that it may partially occur, if one of the other DDL statements fails and otherwise rolls back. This could result in a state where the IP pool `project_id` is... partially dropped? I don't actually know how this would manifest.
- The ordering of "adding and removing constraints" is kinda fuzzy to me. https://www.cockroachlabs.com/docs/stable/online-schema-changes#schema-change-ddl-statements-inside-a-multi-statement-transaction-can-fail-while-other-statements-succeed talks at length about scenarios where similar looking statements can insert {data, constraint on that data} concurrently, but the evaluation is not done until the end of the transaction, and rollback cannot occur. In your specific case, I don't see how it could break, but CRDB also says "This error will occur in various scenarios, including but not limited to...", so I dunno about all potential errors that could occur.
All this is to say, the following might be safer, in different files. Is it required? I honestly don't know.

```sql
ALTER TABLE omicron.public.ip_pool
DROP COLUMN IF EXISTS project_id;

ALTER TABLE omicron.public.ip_pool
ADD COLUMN IF NOT EXISTS is_default BOOLEAN NOT NULL DEFAULT FALSE;

ALTER TABLE omicron.public.ip_pool
DROP CONSTRAINT IF EXISTS project_implies_silo;

ALTER TABLE omicron.public.ip_pool
DROP CONSTRAINT IF EXISTS internal_pools_have_null_silo_and_project;
```
`batch_execute`-ing it (e.g., via a `CAST` error) would cause the transaction to fail, and roll back. This is not true. In CockroachDB, an error is thrown, but the transaction is not closed. This was the cause of the CI test failure on "main" in `concurrent_nexus_instances_only_move_forward` #4093, where connections stuck in this mangled ongoing-transaction state were placed back into the connection pool.

Fixes #4093
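The pool-poisoning failure mode can be sketched with a toy model — this is not Nexus code; `Conn` and `Pool` are made-up stand-ins — showing how an error that leaves the transaction open poisons a naively recycled connection, and the rollback-on-check-in that avoids it:

```python
class Conn:
    """Stand-in for a database connection with CockroachDB-like error semantics."""

    def __init__(self):
        self.open_txn = False

    def batch_execute(self, sql):
        for stmt in (s.strip() for s in sql.split(";")):
            if stmt == "BEGIN":
                self.open_txn = True
            elif stmt in ("COMMIT", "ROLLBACK"):
                self.open_txn = False
            elif stmt == "BOOM":
                # Mimic the behavior described above: the statement errors,
                # but the transaction is NOT closed.
                raise RuntimeError("statement failed mid-transaction")

class Pool:
    """Stand-in connection pool that sanitizes connections on check-in."""

    def __init__(self):
        self.idle = [Conn()]

    def checkout(self):
        return self.idle.pop()

    def checkin(self, conn):
        # Skipping this rollback is exactly the bug: a connection with an
        # ongoing transaction would be handed to the next caller.
        if conn.open_txn:
            conn.batch_execute("ROLLBACK")
        self.idle.append(conn)

pool = Pool()
conn = pool.checkout()
try:
    conn.batch_execute("BEGIN; BOOM; COMMIT")
except RuntimeError:
    pass
assert conn.open_txn  # the error did not close the transaction
pool.checkin(conn)
assert not pool.idle[0].open_txn  # sanitized before reuse
```

Explicitly managing the transaction (or sanitizing on check-in, as sketched here) ensures no connection re-enters the pool mid-transaction.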