Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/schemachanger: declarative state generated on gc_job_mixed is not backwards compatible #91528

Closed
fqazi opened this issue Nov 8, 2022 · 12 comments · Fixed by #92774
Closed
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@fqazi
Copy link
Collaborator

fqazi commented Nov 8, 2022

The mixed version corpus validation test for gc_job_mixed encountered a potential state that is not backwards compatible between master and 22.2. As a result when we attempt to plan on this state generated on a mixed version state from master:

failed to validate TestLogic_gc_job_mixed_TestLogic_gc_job_mixed_DROP TABLE db.kv_0 with error [[Column:{DescID: 108, ColumnID: 1}, ABSENT], ABSENT] should not yet be scheduled for this stage, ERROR: [[Column:{DescID: 160, ColumnID: 1}, ABSENT], ABSENT] should not yet be scheduled for this stage SQLSTATE: XX000 DETAIL: dependencies graphviz: https://cockroachdb.github.io/scplan/viz.html#H4sIAAAAAAAA/+xc727juBH/HD+F4AJFe9juWv+lXmIgG2fbPbRJEGfRD8HBoEXGZleWVIrabK5Y4F6rfZx7kkKiTVmMbFMWrTjFfow4nOH8ZjicGdKBeEZAMte0f/dO0mzK/gjCLKWITFI9/1zz3Sy+n4RgisKzfkoBRQsU0bT/Y/45su4f4oim+Bd01vf7bxjZ6SkF0xBpQQjS9KwfZxSR/vCUkuEphauvcwRg/jkNkunb8Yrv6TsKh6fvCtKCfFjlVfzBeQ1vEQQBRVCYT+FwdHt9o43O787fn48vtd9+/Q/87df/ahfn44vz0eVzKbsYQImpd2BWP3ttzrtCgWH5pbf61Bu+SecgQWf9KI5Q/+cC397JybdasxgVs1BAZojbxG5tkx8Ko1yBBUoTEKCGRoEoDQhOaEw+Qg6IbnvP2URggTgF3B+myFGk8vVjhMjB1I1L7hQOSRzTFiq7ilT+lCJyQ/AXHKIZSg+meyKIoHBoPKfKUkSu1p0CwAWOWqDkvX6UBlYN4TOgkmwa4qAFUv7rR0oCppabTh8oQmkEKJiCtHFoXU7biE5jhXTFCt3GIRojSnE0U63bko7EIarYdDJJQhCgeRxCRCb58CQ/WCaTNrgYL32ISqKxYcf4Ow7b1uFCN4/vyK1Runrktj1N9Pbp7uGCZI32L3To6qoy0JeDydadLg5dXVXietwO1fbUVZXrjoM5WoCG6OD0Zs3MFA4pyWqieVrw3ghgY51VZa5M5xtAmlfXSTFpJHkSqQZAVUL6MiewM9hxAlN7MHD8SY03KgbSUJWzqjup68BRWhwbqtLaA8TVOuVf6KA2VGW5xw1TW2dSlere5RMaglP8uRGXxqqoSmCvp/9EAd3rVImLqTtNXbAeKw6FqhLTizjMFtEHsMDhU0P1q2UYwQtAalgoNruqNJPp3VDjoJi0rkyNuWfnlBI8zSi6yhbbKBUjoyq1ZMiUYUcpOhWf+XpwTFSlngyTu6fkEJjg9CoLwzKibkzMt4OzIuKL3LW2h7VNT+HwY0Q3hYEYl0KNGpmPGNI5p3CsbSZrbkRV6bOaLV9zTOP0rxhCFO0w34bIUMNQ7S4wVaXN6iJDjc5RNc95xPDguKhKrdVFh8a+8P+66U1V+TyzzQg9gCykl18TgtIUx4eIAuhrUlZ9WYT/laFJ4cd/+OPBPVlVXq8mRFpvDd/yHdfwbfT731nmj4OaBpxk0MTp+CmlaLG+sqYRVmo9ii2iqjxRF3OlUKhE4YDA6QRHFJEIhJPFlyCYULxAKQWL5OAAqi1wVARnSbc+imRuhAK8AKFMbNfdwVZDNTfdcdVopdmsI4tGW9ajeDMdX20ohUK1zZxzjDtIC4+vaJR04aOIPNcYSmWUjtqYo6pI/BhB9PVwlWLOfXs3QK0zW6pqP4W41LVAduPyGUfl+Pju+vZydHDwVBWIN6w5WmDYGL0opQTgiLZGEKefiqqkVXhojKGqQq4Ab4+TRwKY2tvMSfIZHbqdvXwUvXz7DFHCHj5Hlnm//Db48/n78eXVHZsQ2dZqQBdHbD6iCyMuHzGEEWdtZHR7fXNzOVoNeXzIFCb5fMSqjjgDPmILIzofcYQRg4+4wkipqieMmGsj1WU7JQp+dZJVshO4cRAMAVKHA2eIczg6hgCpw9ExBNxcjo4h4OZydAwBN5ejYwi4uRwDQ8DN5YoaAm4uB8cQwSlHhCkcHFMAx+XgmAI4LgfHFLlxcEwBHI+DYwrgeBwcUwDH4+CYAjgeB8cUwLFKfYQBro4g3+JrFsRbXjkg7B2ujLAum+siLMvmqggms7kmvrh1/zTMd3bb2Hr/cxrEydvrmnp+a1Qd8AD4Q8Hgb/Hs8svqBrW3c/plWP2Jylbi9bdOO6P9YZ441XxZaZJr/Z6/f961vvOMzmOCfwGUNyF3TfnU9BEAx03ZT4lkfg+0nHNX/IDn74gCCKik0cZxRgK09ImPo20J9Dib/iMmn9eI9P3sxtaZa5iVTzDYFpOZrgv+P0IhomjE339I7oNywrpGTX1xLclYjw76PSAkfsy3+lkfYrCII9h/E8RhTM76BMFVwOiXz1Y0SOJEI3g2p9oUPcQEaRAlKIIoohpiBtIIWsRfQFiJREaHsswOZVkdyrI7lOXeb+e7PNE0HGmUgCgFQR6wVjLqOHrdrd5Rg1TBGUezNaGPmM41QCnRCHrgiWgu0t1D5HLpzTQsxLmDbsXt4+YtxHULprdPcNpPnJm7Svurn9eQld1VOo0qX/59T7ZeItkyX0+ydUEQoOgvFz/F0w8xKR1xt4dzkD/EZETi5Kd4KonaHr+q3wJHvh9G+++HZwlfEXhsv5MjmAW5fQ6ofWV1k8gyWd0ksquDQl0Sxjh2k0IyWU6HsrpJLpmsDneR26Fnux16tttNicZkdejzboc+73bo826HPu91eHJ4He4vryOf91g1Wst3OTVlcxBcMWVXsPyKoi0HV2clagsORmsOphoO0lGRzdXoU7JmpFJCxYhrtPzuhZXZLZbrtObgsdq7BQe/LQdv0NZsBQfPaMahuATmN0pt/b/g0GoJZksGVm5Kq/2bl9fQaKj+g5ld1M3/r8z3dsNLtBtqnmG8jnZDxR0lbnhW/jja4I/H0ZnYCIZx0IuuhhvzWd+DxcFuMi8mq5sLISarm8qGyeqmsmGy9mn/7yvLV9hjKTja3dQvTJaa+kXyoouJVFN2NhKppvqUE9k7+dY76Z1ENgscS8Y01pbPa96k9ClEZ30I0jmCxZzIYRtfitZlG1eKdrnxpGh9tnGkaPU8RS7fJ+0g1plTyxEbzCvliE32MkCO2GJX+3LENrublyNeXq7LEef2syVtoi83jByxz1xditgYsB67HLHOmuRyxAbrcssRFw1RSaMYFutAyxHbrIUsR+ywik2O2GUNYzlij3V85Yh9VrtKEZsD1t+VI9ZZK0KO2GAdVjlik3UN5Igt1k+VI7ZZQ1SO2GGNAzlil7U/5Yg91r+UI/ZZA1KK2BqwDqIc8bIlIEdssH6hDHHvW+9/AQAA//+FBjL+71wAAA== HINT: You have encountered an unexpected error.

We need to root cause and identify the first post commit phase in this plan has a non-plannable state on 22.2

For manual reproduction fetch the corpus:

  1. gsutil cp gs://cockroach-corpus/corpus-mixed-master/corpus .
  2. ./cockroach debug declarative-corpus-validate corpus # This command will fail with the error observed above

Jira issue: CRDB-21304

@fqazi fqazi added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-schema-deprecated Use T-sql-foundations instead labels Nov 8, 2022
@postamar
Copy link
Contributor

postamar commented Nov 9, 2022

@chengxiong-ruan can you take a look at this when you have a moment, please? @fqazi added these corpus tests and they have found a serious bug for the first time! Unfortunately, reproducing these failures is not straightforward and there is no tooling to help do so.

I'm deliberately relying on the fact that you know little about these tests, to ask @fqazi the right questions with a fresh perspective, the goal of this task is not particularly to fix this bug but for you to figure out how to make it easy to repro these test failures to identify the root cause. We'll deal with the bug later (probably me, because I probably was the one that introduced it)

@fqazi fqazi changed the title sql/schemachanger: declarative state generated on gc_job_mixed is not backwards comaptible sql/schemachanger: declarative state generated on gc_job_mixed is not backwards compatible Nov 9, 2022
@chengxiong-ruan
Copy link
Contributor

chengxiong-ruan commented Nov 18, 2022

I finally had a chance to repro this, sorry for the delay
Steps to repro:
0. be mindful that the current declarative-corpus-validate sub-command only panic on the first validation error. So with the specific corpus dump in this issue, there're actually many errors, not just 2 heh...

  1. create a mixed version cluster with v22.2 and a build from master (remember to set var developmentBranch = false in your local master!)
# create the cluster (2 nodes cluster is fine as well)
roachprod create local -n 3
# stage with latest v22.2 branch build and start
roachprod stage local cockroach 87ce2ca0f017e92e86b04f76de12347b47de1dd8
roachprod start local
# stop one node, replace cockroach binary with your binary built and restart the node
roachprod stop local:1
roachprod put local:1 ./cockroach
roachprod start local:1
  1. disable job adoption node #1 and node #2
roachprod run local:1-2 -- touch data/DISABLE_STARTING_BACKGROUND_JOBS
  1. run one example from the corpus validation failure (testdata/logic_test/drop_sequence) against node #1 and watch logging one node #3
CREATE SEQUENCE s6;
CREATE VIEW v AS SELECT nextval('s6');
DROP SEQUENCE s6 CASCADE;

Now you should see panics from the node #3 logging.
I can try to root cause this...hope the rules are not too complicated, but I'll ask for help..

@chengxiong-ruan
Copy link
Contributor

If we can have a tool to replay a logic test file until the failing statement, that'd be super cool that we don't need to figure what's going on with the statements before the failing statement.

@chengxiong-ruan
Copy link
Contributor

chengxiong-ruan commented Nov 21, 2022

Looks like we are moving columns from WRITE_ONLY to ABSENT directly in PreCommitPhase when dropping a table/view. While in v22.2, columns are moved to DELETE_ONLY in PreCommitPhase and moved to ABSENT in PostCommitPhase . This is causing mixed version incompatibility here when a v22.2 node picked up a v23.1 plan, because the ABSENT column status make it confused because it's expecting something earlier. The new behavior still satisfy this rule though. @postamar does this sound interesting to you?

@chengxiong-ruan
Copy link
Contributor

Looks like this new behavior was introduced by #90544

@chengxiong-ruan
Copy link
Contributor

@Xiang-Gu this might be interesting to you.

@ajwerner
Copy link
Contributor

If we can have a tool to replay a logic test file until the failing statement, that'd be super cool that we don't need to figure what's going on with the statements before the failing statement.

Another thing that might help debug this sort of failure might just be to have a string associated with each state in the corpus that tells us the test and line it came from. It's not as fancy, but it's a lot simpler to implement and almost as useful.

@Xiang-Gu
Copy link
Contributor

Something is suspicious here:

  1. We have seen incompatibility issues caused by dep-rule changes (e.g. rules: suppress a dep-rule in special case #87196 is a forward compatibility issue where newly introduced dep-rule is too strict when new node adopts job planned by old node). The error msg will be something like "failed to satisfy .... dep rule". However, this one's error msg comes from the planner when determining whether a node can be scheduled in the current stage (hasUnmeetableOutboundDeps).
  2. The column is already in ABSENT, its terminal state, when the old node adopts the schema change job. The planner supposedly shouldn't do anything about it; indeed, for any node without an op edge (a node in its terminal state is one such node), it should be skipped by the planner here.
    Yet the fact that we proceeded to call hasUnmeetableOutboundDeps on this column node is very surprising. Recall this function is implemented recursively, so my guess here is it's called on this node as a result of recursion -- maybe we initially called this function on another node that transitively connects to this node, so when we eventually called this function on such a terminal node, the error occurs. This sounds like a good direction to start investigating further to me: do we have guardrails to not recurse on neighbors that have already reached its terminal state, in addition to the use of epoch?
  3. The error message actually complains about the same thing on two column nodes. The error is a panic so I'm not sure why that would happen (This function hasUnmeetableOutboundDeps is implemented recursively so it might have something to do with this, but I am not sure).

I will try to look a bit more into this after I'm done with some other work at hand.

@Xiang-Gu
Copy link
Contributor

@chengxiong-ruan is right: The error is bc of the change in behavior on column element when dropping the table.

  • Previously (v22.2): The column goes from WRITE_ONLY to DELETE_ONLY in PrecommitPhase.
  • Now (master): The column goes from WRITE_ONLY to ABSENT in PrecommitPhase.

Detailed explanation:
In the new node, after the preCommitPhase, the table element will be in DROPPED status. This is carried as the initial status of this element when the old node adopted the schema change job, at which point the planner sees it's not at its terminal status and hence will try to decide when can we schedule this operation that transition this Table element from DROPPED to ABSENT. One step to determine whether we can schedule this op is to call hasUnmeetableOutboundDeps on the op edge's TO node (i.e. t.e.To(), which will be (elem=Table, TargetStatus=absent, currentStatus=absent)).

So far everything is fine bc ideally we should just allow such an operation to be scheduled. However, the logic inside hasUnmeetableOutboundDeps will transitively check whether any other nodes, connected to from this node via a SameStagePrecedence dep rule, also have unmeetable outbound deps (bc if so, the current node cannot be scheduled). We have such a SameStagePrecedence rule descriptor removed right before dependent and is applicable here from the Table element (of absent status) to the Column element (of absent status). Then, when we recurse on the Column element (target=absent, current=absent), we found out that such a node is already fulfilled, as a result of the PrecommitPhase in the new old. This's why we hit this error.

@ajwerner
Copy link
Contributor

@Xiang-Gu, if we were to backport #90378 into 22.2, would it solve the problem?

@postamar
Copy link
Contributor

@Xiang-Gu please look into this at your leisure. If this is not enough throw it back into triage please.

Xiang-Gu added a commit to Xiang-Gu/cockroach that referenced this issue Nov 30, 2022
Previously, there is a niche case where we might run into a backward
incompatible issue (see cockroachdb#91528). We decided to fix it by relaxing the
sanity check that caused the issue and backport the change to
relase-22.2, so this backward incompatibility issue is contained only
between release-22.2.0 and master, which we think should be a rare
occurrance in the wild.

Epic: None
Release note: None
@craig craig bot closed this as completed in fe683e3 Dec 8, 2022
@postamar
Copy link
Contributor

postamar commented Dec 9, 2022

Reopening to not forget about the backport.

@postamar postamar reopened this Dec 9, 2022
Xiang-Gu added a commit to Xiang-Gu/cockroach that referenced this issue Dec 12, 2022
Previously, there is a niche case where we might run into a backward
incompatible issue (see cockroachdb#91528). We decided to fix it by relaxing the
sanity check that caused the issue and backport the change to
relase-22.2, so this backward incompatibility issue is contained only
between release-22.2.0 and master, which we think should be a rare
occurrance in the wild.

Epic: None
Release note: None
@exalate-issue-sync exalate-issue-sync bot added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-sql-schema-deprecated Use T-sql-foundations instead labels May 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants