
sql: DistSQL physical planner might use SQL instances of different DistSQL versions during upgrade #88927

Closed
yuzefovich opened this issue Sep 28, 2022 · 10 comments
Labels
A-multitenancy Related to multi-tenancy C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-queries SQL Queries Team

Comments

@yuzefovich
Member

yuzefovich commented Sep 28, 2022

This issue tracks addressing this TODO.

In particular, in 22.1 in 350188b we enabled the usage of DistSQL in multi-tenant environments. The DistSQL physical planner uses sqlinstance.GetAllInstances to get all healthy SQL pods for a particular tenant and then might schedule a part of the distributed query plan on any of those pods. However, it does not appear to be guaranteed that all of those SQL instances run the same binary, so during an upgrade distributed queries might hit "incompatible DistSQL version" (version mismatch in flow request) errors.
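Roughly, the failure comes from a gate of this shape; the following is a self-contained sketch rather than the actual code, and the constant, function, and message names are illustrative only:

```go
package main

import "fmt"

// DistSQLVersion identifies the wire-compatibility version of a pod's DistSQL
// execution engine (illustrative type, not the real identifier).
type DistSQLVersion uint32

const (
	serverVersion     DistSQLVersion = 71 // version this pod runs
	serverMinAccepted DistSQLVersion = 71 // oldest version this pod still accepts
)

// checkFlowVersion mirrors the gate that a flow-setup request must pass on the
// remote pod: the requester's version has to fall inside the accepted range.
func checkFlowVersion(req DistSQLVersion) error {
	if req < serverMinAccepted || req > serverVersion {
		return fmt.Errorf(
			"version mismatch in flow request: %d; this node accepts %d through %d",
			req, serverMinAccepted, serverVersion)
	}
	return nil
}

func main() {
	// During a rolling upgrade, a pod still on the old binary asks a newer pod
	// to run part of a distributed plan.
	fmt.Println(checkFlowVersion(70)) // rejected: the mixed-version window
	fmt.Println(checkFlowVersion(71)) // accepted
}
```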

Jira issue: CRDB-20040

@yuzefovich yuzefovich added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Sep 28, 2022
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Oct 19, 2022
@mgartner
Collaborator

Yahor says that the best way to solve this would be to update the sqlinstance provider to only return pods of a certain version, though he's not sure how much work this would be. @rafiss, do you think this is feasible?

Another solution would be to prune the list of pods returned by the provider so that DistSQL only uses pods of a compatible version. This might be simpler.
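As a rough sketch of that pruning, assuming the instance records exposed a version range (the type and field names below are hypothetical; they are not part of sqlinstance.InstanceInfo today):

```go
package main

import "fmt"

type instance struct {
	ID             int
	DistSQLVersion uint32 // version the pod runs
	MinAccepted    uint32 // oldest version the pod still accepts
}

// accepts reports whether the pod would accept a flow planned at
// gatewayVersion, mirroring the check described in the issue.
func accepts(in instance, gatewayVersion uint32) bool {
	return gatewayVersion >= in.MinAccepted && gatewayVersion <= in.DistSQLVersion
}

// pruneIncompatible filters the provider's result down to pods the gateway can
// safely place flows on.
func pruneIncompatible(all []instance, gatewayVersion uint32) []instance {
	var out []instance
	for _, in := range all {
		if accepts(in, gatewayVersion) {
			out = append(out, in)
		}
	}
	return out
}

func main() {
	pods := []instance{
		{ID: 1, DistSQLVersion: 71, MinAccepted: 71},
		{ID: 2, DistSQLVersion: 70, MinAccepted: 70}, // still on the old binary
	}
	fmt.Println(pruneIncompatible(pods, 71)) // pod 2 is pruned
}
```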

@blathers-crl blathers-crl bot added the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Nov 22, 2022
@yuzefovich
Member Author

We do have the same problem in the single-tenant world too (tracked in #87199), so it would be nice to implement something similar to what is suggested there; that would improve things for this issue as well. However, I believe the correct solution is to improve the sqlinstance provider.

@rafiss
Collaborator

rafiss commented Nov 23, 2022

I think the @cockroachdb/multi-tenant team would have the best thoughts here, since they've been adding more safeguards around mixed-version clusters lately.

@knz
Contributor

knz commented Nov 25, 2022

cc @ajstorm for tracking

@knz knz added A-multitenancy Related to multi-tenancy T-multitenant Issues owned by the multi-tenant virtual team labels Nov 25, 2022
@ajstorm ajstorm added T-multitenant Issues owned by the multi-tenant virtual team and removed T-multitenant Issues owned by the multi-tenant virtual team labels Nov 25, 2022
@ajstorm
Collaborator

ajstorm commented Nov 25, 2022 via email

@rafiss rafiss removed the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Nov 29, 2022
@ajstorm
Collaborator

ajstorm commented Jan 9, 2023

I agree with the thoughts above that improving sqlinstance would help. To do so, I think we'd want to plumb the version information through to the InstanceInfo struct so that we can filter out all of the SQL pods that are at incompatible versions.
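Something along these lines, as a sketch only (the BinaryVersion field and the filtering helper below are hypothetical, not what InstanceInfo exposes today):

```go
package main

import "fmt"

// InstanceInfo stands in for the real struct, extended with a version field
// that each pod would write to its sql_instances record at startup.
type InstanceInfo struct {
	InstanceID    int
	InstanceAddr  string
	BinaryVersion string // hypothetical field, not present today
}

// getCompatibleInstances returns only pods running the same binary version as
// the caller, the simplest policy that avoids mixed-version flows.
func getCompatibleInstances(all []InstanceInfo, callerVersion string) []InstanceInfo {
	var out []InstanceInfo
	for _, inst := range all {
		if inst.BinaryVersion == callerVersion {
			out = append(out, inst)
		}
	}
	return out
}

func main() {
	pods := []InstanceInfo{
		{InstanceID: 1, InstanceAddr: "10.0.0.1:26257", BinaryVersion: "v23.1.0"},
		{InstanceID: 2, InstanceAddr: "10.0.0.2:26257", BinaryVersion: "v22.2.5"},
	}
	fmt.Println(getCompatibleInstances(pods, "v23.1.0")) // pod 2 is filtered out
}
```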

I'll also point out that this problem seems to be much worse in MT than in single-tenant. If you upgrade the binary of one of your SQL pods and try to run DistSQL queries, they seem to fail deterministically until all SQL pods are at the same binary version (in some tests I've been running, at least). This is more painful than the single-tenant case, where the exposure is limited to gossip inaccuracies.

The one thing that's not clear to me is how big a deal this is for 23.1. In the UA world, this behaviour seems bad, but we're not supporting upgrades to UA in 23.1. In 23.2, would we hit this problem in all mixed-version clusters? Or will the gossip information bail us out, considering that we have only one in-process SQL pod on each node? If it's all mixed-version clusters, will we need to fix this in 23.1 (or backport something to that release) to avoid hitting this issue on upgrades to 23.2?

@exalate-issue-sync exalate-issue-sync bot removed the T-multitenant Issues owned by the multi-tenant virtual team label Jan 10, 2023
@ajstorm
Collaborator

ajstorm commented Jan 23, 2023

For 23.1 (and perhaps even 22.2), and based on this thread, we'd like to add a retry loop such that if the attempt to run DistSQL fails (say, due to a mismatched pod version), we'll rerun the query without DistSQL. This gets around having to call a consistent version of GetInstances every time we plan a query.
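As a sketch of the shape of that retry (the function and error names are illustrative, not the actual ones):

```go
package main

import (
	"errors"
	"fmt"
)

// errVersionMismatch stands in for the error a remote pod returns when it
// rejects a flow request from an incompatible DistSQL version.
var errVersionMismatch = errors.New("version mismatch in flow request")

// runQuery attempts a distributed plan first and falls back to local-only
// execution when the failure looks like a mixed-version artifact. Local
// execution only involves the gateway, so it cannot hit this error.
func runQuery(run func(distributed bool) error) error {
	err := run(true /* distributed */)
	if err != nil && errors.Is(err, errVersionMismatch) {
		return run(false /* distributed */)
	}
	return err
}

func main() {
	attempts := 0
	err := runQuery(func(distributed bool) error {
		attempts++
		if distributed {
			return errVersionMismatch // a remote pod rejected the flow
		}
		return nil
	})
	fmt.Println(attempts, err) // 2 <nil>: the local retry succeeded
}
```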

@yuzefovich FYI

@rytaft
Collaborator

rytaft commented Jan 31, 2023

@yuzefovich is this the issue we were discussing that would require a more involved solution that is not backportable and could take up to two weeks? I moved this to 23.2 since I think this is the same issue, but please correct me if I'm wrong.

@yuzefovich
Member Author

> @yuzefovich is this the issue we were discussing that would require a more involved solution that is not backportable and could take up to two weeks? I moved this to 23.2 since I think this is the same issue, but please correct me if I'm wrong.

Yes, this is the one. I do hope to prototype something here in the coming weeks to get a better sense for the amount of work required.

@mgartner mgartner moved this to New Backlog in SQL Queries Jul 24, 2023
@yuzefovich
Member Author

yuzefovich commented Sep 7, 2023

We introduced a mechanism to retry some distributed query errors without DistSQL in #105451. There have also been improvements from Jeff in #99941. I think this issue can now be closed, and the remaining work in this area is tracked in https://cockroachlabs.atlassian.net/browse/CRDB-26692 / #100578.

@github-project-automation github-project-automation bot moved this from New Backlog to Done in SQL Queries Sep 7, 2023