Reverting schema change job result in cluster inconsistent state #74324

colprog · 2021-12-30T08:49:30Z

Describe the problem

The cluster is in an inconsistent state with no clear way to recover. Backup/restore commands report a missing descriptor:

ERROR: failed to resolve targets specified in the BACKUP stmt: relation "Users" (77): invalid foreign key backreference: missing table=123: referenced table ID 123: descriptor not found

Dropping the database also fails:

ERROR: internal error: relation "Users" (77): invalid foreign key backreference: missing table=123: referenced table ID 123: descriptor not found
SQLSTATE: XX000
DETAIL: stack trace:
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/errors.go:84: init()
/usr/local/go/src/runtime/proc.go:6309: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:208: main()
/usr/local/go/src/runtime/asm_amd64.s:1371: goexit()

HINT: You have encountered an unexpected error.

Please check the public issue tracker to check whether this problem is
already tracked. If you cannot find it there, please report the error
with details by creating a new issue.

If you would rather not post publicly, please contact us directly
using the support form.

We appreciate your feedback.

We're now stuck with a malfunctioning cluster. New SQL connections fail because type-discovery queries such as
SELECT pg_type.oid, enumlabel
FROM pg_enum
JOIN pg_type ON pg_type.oid=enumtypid;
also fail.

To Reproduce

1. Apply a schema change that drops a column used by a partial index in database A.
2. Cancel the job after a few hours of waiting.
3. Wait for the revert, which takes another few hours.
4. Drop the aforementioned partial index.
5. 24 hours after step 1, the cluster started reporting errors.

Expected behavior
A way to recover from this state. A way to forcefully remove this database would be good enough.

Environment:

CockroachDB 21.2.3
Server OS: Linux
Client app: C# with dotnet ef


Jira issue: CRDB-12033

blathers-crl · 2021-12-30T08:49:32Z

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

@cockroachdb/bulk-io (found keywords: backup,restore)
@cockroachdb/sql-experience (found keywords: pg_)
@cockroachdb/sql-schema (found keywords: schema change)

If we have not gotten back to your issue within a few business days, you can try the following:

Join our community slack channel and ask on #cockroachdb.
Try to find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

blathers-crl · 2021-12-30T08:49:34Z

cc @cockroachdb/bulk-io

postamar · 2021-12-31T15:13:09Z

This misbehaviour, unfortunately, is caused by known limitations of the legacy schema changer. Reverting schema change jobs can leave the table descriptors in invalid states, effectively making the table inaccessible. That being said, I was surprised to find out that it's not possible to drop the table. This is most definitely not expected. We're looking into this.
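To make the reported error concrete, here is a toy model of the invalid state: table 77 ("Users") still records a foreign key backreference from table 123, but no descriptor with ID 123 exists any more. The `TableDescriptor` class, the `validate_backrefs` function, and the dictionary-based catalog are hypothetical names invented for this illustration. This is a sketch of the idea only, not CockroachDB's actual descriptor types or validation code.

```python
from dataclasses import dataclass, field


@dataclass
class TableDescriptor:
    """Toy stand-in for a table descriptor (hypothetical, not CockroachDB's type)."""
    id: int
    name: str
    # IDs of tables whose foreign keys point at this table (backreferences).
    inbound_fk_origin_ids: list = field(default_factory=list)


def validate_backrefs(desc, catalog):
    """Return one error string per backreference whose origin table is missing."""
    errors = []
    for origin_id in desc.inbound_fk_origin_ids:
        if origin_id not in catalog:
            errors.append(
                f'relation "{desc.name}" ({desc.id}): invalid foreign key '
                f"backreference: missing table={origin_id}: descriptor not found"
            )
    return errors


# Table 77 still records a backreference from table 123, but 123 is gone.
users = TableDescriptor(id=77, name="Users", inbound_fk_origin_ids=[123])
catalog = {users.id: users}
problems = validate_backrefs(users, catalog)
print(problems)
```

In this model, any operation that validates the descriptor before acting on it (backup, drop, or catalog queries) would hit the same error, which matches the symptoms in the report.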

In the long term, we're addressing this limitation by doing a massive overhaul of how schema changes are implemented. This effort is already underway but it will take a while before it bears fruit.

ajwerner · 2022-02-01T15:15:04Z

This is loosely related to #50651.

To do this we'd need some way to tell the resolution in the drop case to allow some invalid outgoing references. This is hard. For now, we'll say that we need to repair the graph.

In some cases there's a desire to drop a whole database which might be corrupted. In that case, once there are no cross-database references, we can just destroy all the descriptors and data without thinking too hard.
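The whole-database case can be sketched as a precondition check followed by a bulk delete. Everything here is an assumption made for illustration: the dictionary layout, the database IDs 50 and 60, and the function names `cross_database_refs` and `force_drop_database` are hypothetical and do not correspond to any CockroachDB API. The sketch also assumes that dangling references (to descriptors that no longer exist) can be ignored when deciding whether the drop is safe.

```python
def cross_database_refs(db_id, descriptors):
    """List (from_id, to_id) references that cross the boundary of database db_id.

    descriptors maps descriptor ID -> {"db": parent database ID, "refs": [IDs]}.
    References to IDs that no longer exist are skipped (assumed ignorable).
    """
    crossing = []
    for desc_id, desc in descriptors.items():
        for ref_id in desc["refs"]:
            target = descriptors.get(ref_id)
            if target is None:
                continue  # dangling reference, e.g. the missing table 123
            # A reference crosses the boundary if exactly one end is in db_id.
            if (desc["db"] == db_id) != (target["db"] == db_id):
                crossing.append((desc_id, ref_id))
    return crossing


def force_drop_database(db_id, descriptors):
    """Return the catalog without db_id's descriptors, refusing if references cross."""
    crossing = cross_database_refs(db_id, descriptors)
    if crossing:
        raise ValueError(f"cross-database references remain: {crossing}")
    return {i: d for i, d in descriptors.items() if d["db"] != db_id}


descriptors = {
    77: {"db": 50, "refs": [123]},  # "Users", with a dangling reference
    78: {"db": 50, "refs": [77]},   # another table in the same database
    90: {"db": 60, "refs": []},     # table in an unrelated database
}
remaining = force_drop_database(50, descriptors)
print(sorted(remaining))
```

The point of the check is that, with no references crossing the database boundary, deleting every descriptor in the database cannot leave a dangling reference anywhere else, so the corrupted graph inside it never has to be repaired.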

This will depend on cross-database reference removal. CC @postamar

ajwerner · 2022-02-01T15:15:30Z

The assignment here is to remember this discussion and the cross database reference work.

colprog added the C-bug label on Dec 30, 2021

blathers-crl bot added the A-disaster-recovery, O-community, X-blathers-triaged, and T-disaster-recovery labels on Dec 30, 2021

blathers-crl bot added the T-sql-schema-deprecated label on Dec 31, 2021

postamar removed the T-disaster-recovery and A-disaster-recovery labels on Dec 31, 2021

postamar self-assigned this on Jan 4, 2022

postamar removed their assignment on Jan 26, 2022

ajwerner assigned postamar on Feb 1, 2022

postamar removed their assignment on Mar 17, 2022

exalate-issue-sync bot added the T-sql-foundations label and removed the T-sql-schema-deprecated label on May 10, 2023
