-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reverting schema change job result in cluster inconsistent state #74324
Comments
Hello, I am Blathers. I am here to help you get the issue triaged. Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here. I have CC'd a few people who may be able to assist you:
If we have not gotten back to your issue within a few business days, you can try the following:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
cc @cockroachdb/bulk-io |
This misbehaviour, unfortunately, is caused by known limitations of the legacy schema changer. Reverting schema change jobs can leave the table descriptors in invalid states, effectively making the table inaccessible. That being said, I was surprised to find out that it's not possible to drop the table. This is most definitely not expected. We're looking into this. In the long term, we're addressing this limitation by doing a massive overhaul of how schema changes are implemented. This effort is already underway but it will take a while before it bears fruit. |
This is loosely related to #50651. To do this we'd need some way to tell the resolution in the drop case to allow some invalid outgoing references. This is hard. For now, we'll say that we need to repair the graph. In some cases there's a desire to drop a whole database which might be corrupted. In that case, once there are no cross-database references, we can just destroy all the descriptors and data without thinking too hard. This will depend on cross-database reference removal. CC @postamar |
The assignment here is to remember this discussion and the cross database reference work. |
Describe the problem
Cluster in a inconsistent state with no clear way to recover, backup/restore commands reports descriptor missing:
tried dropping the database, also fails:
Now we're stuck with malfunctioning cluster, new sql connection would fail since type discovery related queries like
SELECT pg_type.oid, enumlabel
FROM pg_enum
JOIN pg_type ON pg_type.oid=enumtypid;
would also fail
To Reproduce
Expected behavior
A way to recover from this state. A way to forcefully remove this database seems good enough
Additional data / screenshots
If the problem is SQL-related, include a copy of the SQL query and the schema
of the supporting tables.
If a node in your cluster encountered a fatal error, supply the contents of the
log directories (at minimum of the affected node(s), but preferably all nodes).
Note that log files can contain confidential information. Please continue
creating this issue, but contact [email protected] to submit the log
files in private.
If applicable, add screenshots to help explain your problem.
Environment:
Additional context
What was the impact?
Add any other context about the problem here.
Jira issue: CRDB-12033
The text was updated successfully, but these errors were encountered: