Reverting schema change job result in cluster inconsistent state #74324

Open
colprog opened this issue Dec 30, 2021 · 5 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) X-blathers-triaged blathers was able to find an owner

Comments


colprog commented Dec 30, 2021

Describe the problem

The cluster is in an inconsistent state with no clear way to recover. Backup and restore commands report a missing descriptor:

ERROR: failed to resolve targets specified in the BACKUP stmt: relation "Users" (77): invalid foreign key backreference: missing table=123: referenced table ID 123: descriptor not found

I tried dropping the database, which also fails:

ERROR: internal error: relation "Users" (77): invalid foreign key backreference: missing table=123: referenced table ID 123: descriptor not found
SQLSTATE: XX000
DETAIL: stack trace:
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/errors.go:84: init()
/usr/local/go/src/runtime/proc.go:6309: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:6286: doInit()
/usr/local/go/src/runtime/proc.go:208: main()
/usr/local/go/src/runtime/asm_amd64.s:1371: goexit()

HINT: You have encountered an unexpected error.

Please check the public issue tracker to check whether this problem is
already tracked. If you cannot find it there, please report the error
with details by creating a new issue.

If you would rather not post publicly, please contact us directly
using the support form.

We appreciate your feedback.

Now we're stuck with a malfunctioning cluster. New SQL connections fail, because type-discovery queries such as
SELECT pg_type.oid, enumlabel
FROM pg_enum
JOIN pg_type ON pg_type.oid=enumtypid;
also fail.

To Reproduce

  1. Apply a schema change in database A that deletes a column that's in a partial index.
  2. Cancel the job after a few hours of waiting.
  3. Wait for the revert, which takes another few hours.
  4. Drop the aforementioned partial index.
  5. 24 hours after step 1, the cluster starts reporting errors.

Expected behavior
A way to recover from this state. A way to forcefully remove this database would be good enough.

Environment:

  • CockroachDB 21.2.3
  • Server OS: Linux
  • Client app: C# with dotnet ef


Jira issue: CRDB-12033

@colprog colprog added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Dec 30, 2021

blathers-crl bot commented Dec 30, 2021

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

  • @cockroachdb/bulk-io (found keywords: backup,restore)
  • @cockroachdb/sql-experience (found keywords: pg_)
  • @cockroachdb/sql-schema (found keywords: schema change)

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added A-disaster-recovery O-community Originated from the community X-blathers-triaged blathers was able to find an owner T-disaster-recovery labels Dec 30, 2021

blathers-crl bot commented Dec 30, 2021

cc @cockroachdb/bulk-io

@postamar
Contributor

This misbehaviour, unfortunately, is caused by known limitations of the legacy schema changer. Reverting schema change jobs can leave the table descriptors in invalid states, effectively making the table inaccessible. That being said, I was surprised to find out that it's not possible to drop the table. This is most definitely not expected. We're looking into this.
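
For readers unfamiliar with the error text in the report, here is a rough illustration of what an invalid back-reference looks like. It is a toy model only: `TableDescriptor`, `inbound_fk_table_ids`, and `validate_backreferences` are hypothetical names, not CockroachDB's real structures or validation code, and the sketch says nothing about how the revert produced the state. The IDs 77 and 123 are taken from the reported error.

```python
from dataclasses import dataclass, field


@dataclass
class TableDescriptor:
    """Toy stand-in for a table descriptor (hypothetical, not the real structure)."""
    id: int
    name: str
    # IDs of tables whose foreign keys point at this table (back-references).
    inbound_fk_table_ids: list = field(default_factory=list)


def validate_backreferences(descriptors):
    """Return one error string per back-reference whose origin table is missing."""
    errors = []
    for desc in descriptors.values():
        for origin_id in desc.inbound_fk_table_ids:
            if origin_id not in descriptors:
                errors.append(
                    f'relation "{desc.name}" ({desc.id}): invalid foreign key '
                    f"backreference: missing table={origin_id}"
                )
    return errors


# Table 77 still records a back-reference from table 123, but no descriptor
# with ID 123 exists in the catalog, mirroring the IDs in the reported error.
catalog = {77: TableDescriptor(77, "Users", inbound_fk_table_ids=[123])}
print(validate_backreferences(catalog))
```

In this model, any operation that validates table 77 before acting on it will fail for as long as the stale entry remains. That is consistent with BACKUP and DROP DATABASE both failing with the same message.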

In the long term, we're addressing this limitation by doing a massive overhaul of how schema changes are implemented. This effort is already underway but it will take a while before it bears fruit.

@postamar postamar self-assigned this Jan 4, 2022
@postamar postamar removed their assignment Jan 26, 2022
Contributor

ajwerner commented Feb 1, 2022

This is loosely related to #50651.

To do this we'd need some way to tell the resolution in the drop case to allow some invalid outgoing references. This is hard. For now, we'll say that we need to repair the graph.

In some cases there's a desire to drop a whole database which might be corrupted. In that case, once there are no cross-database references, we can just destroy all the descriptors and data without thinking too hard.

This will depend on cross-database reference removal. CC @postamar
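
The precondition described above (no references crossing the database boundary before everything inside is destroyed) can be sketched as a simple check. This is an assumption-laden toy, not how CockroachDB implements or plans to implement it: the dictionary layout and the function names `cross_database_references` and `can_destroy_database` are hypothetical, and dangling references inside the database are deliberately tolerated.

```python
def cross_database_references(descriptors, database_id):
    """List (from_id, to_id) references that cross the boundary of database_id.

    `descriptors` maps descriptor ID -> {"db": parent database ID,
    "refs": IDs of descriptors it references}. References to missing
    descriptors are skipped, because a corrupted database may contain them.
    """
    crossing = []
    for desc_id, desc in descriptors.items():
        for ref_id in desc["refs"]:
            target = descriptors.get(ref_id)
            if target is None:
                continue  # dangling reference: tolerated in this sketch
            # A reference crosses the boundary when exactly one end is inside.
            if (desc["db"] == database_id) != (target["db"] == database_id):
                crossing.append((desc_id, ref_id))
    return crossing


def can_destroy_database(descriptors, database_id):
    """True when nothing inside the database references, or is referenced from, outside."""
    return not cross_database_references(descriptors, database_id)


# Hypothetical catalog: database 50 holds a corrupted table (77) with a
# dangling reference to 123; database 60 is unrelated.
descriptors = {
    77: {"db": 50, "refs": [123]},
    78: {"db": 50, "refs": [77]},
    90: {"db": 60, "refs": []},
}
print(can_destroy_database(descriptors, 50))
```

If a table in another database referenced table 78, the check would return False and that reference would need to be removed first.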

Contributor

ajwerner commented Feb 1, 2022

The assignment here is to remember this discussion and the cross database reference work.

@postamar postamar removed their assignment Mar 17, 2022
@exalate-issue-sync exalate-issue-sync bot added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-sql-schema-deprecated Use T-sql-foundations instead labels May 10, 2023