sql: DistSQL physical planner might use SQL instances of different DistSQL versions during upgrade #88927
Comments
Yahor says that the best way to solve this would be to update the Another solution would be to prune the list of pods returned by the provider so that DistSQL only uses pods of a compatible version. This might be a simpler solution. |
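To make the pruning idea concrete, here is a minimal Go sketch. It is not CockroachDB code: the Instance type, its BinaryVersion field, and pruneToVersion are hypothetical names used only for illustration, and it assumes that each pod advertises a version string and that an exact match with the gateway's version is what counts as "compatible".

```go
package main

import "fmt"

// Instance is a hypothetical stand-in for a SQL pod record.
// It is NOT the real type returned by sqlinstance.GetAllInstances.
type Instance struct {
	ID            int
	BinaryVersion string // assumed: every pod advertises its version
}

// pruneToVersion keeps only the instances whose advertised version
// matches the gateway's version exactly. Exact match is an assumption;
// a real compatibility check could accept a range of versions.
func pruneToVersion(all []Instance, gatewayVersion string) []Instance {
	compatible := make([]Instance, 0, len(all))
	for _, inst := range all {
		if inst.BinaryVersion == gatewayVersion {
			compatible = append(compatible, inst)
		}
	}
	return compatible
}

func main() {
	pods := []Instance{
		{ID: 1, BinaryVersion: "22.2"},
		{ID: 2, BinaryVersion: "23.1"},
		{ID: 3, BinaryVersion: "22.2"},
	}
	// Only pods on the gateway's version are eligible for planning.
	for _, inst := range pruneToVersion(pods, "22.2") {
		fmt.Println(inst.ID, inst.BinaryVersion)
	}
}
```

With a filter like this in front of the planner, a query issued during an upgrade would be distributed over fewer pods rather than failing, at the cost of relying on the advertised version being up to date.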
We do have the same problem in the single-tenant world too (tracked in #87199), so it would be nice to implement something similar to what was suggested there, and that would improve things for this issue too. However, I believe the correct solution would be to improve the |
I think the @cockroachdb/multi-tenant would have the best thoughts here since they've been adding more safeguards around mixed version clusters lately. |
cc @ajstorm for tracking |
It's on my radar, and I may actually be working on something which makes
this easier to solve. Will update in a couple of weeks.
|
I agree with the thoughts above that improving I'll also point out that this problem seems to be much worse in MT than in single tenant. If you upgrade the binary of one of your SQL pods and try to run DistSQL queries, they seem to fail deterministically until all SQL pods are at the same binary version (in some tests I've been running, at least). This is more painful than the single-tenant case, where we're limited to gossip inaccuracies. The one thing that's not clear to me is how big of a deal this is for 23.1. In the UA world, this behaviour seems bad, but we're not supporting upgrades to UA in 23.1. In 23.2, would we hit this problem in all mixed-version clusters? Or will the gossip information bail us out, considering that we have only one in-process SQL pod on each node? If it's all mixed-version clusters, will we need to fix this in 23.1 (or backport something to that release) to avoid hitting this issue on upgrades to 23.2? |
For 23.1 (and perhaps even 22.2), and based on this thread, we'd like to add a retry loop such that if the attempt to run DistSQL fails (say, due to mismatched pod version), we'll rerun the query without DistSQL. This will get around having to call a consistent version of @yuzefovich FYI |
@yuzefovich is this the issue we were discussing that would require a more involved solution that is not backportable and could take up to two weeks? I moved this to 23.2 since I think this is the same issue, but please correct me if I'm wrong. |
Yes, this is the one. I do hope to prototype something here in the coming weeks to get a better sense for the amount of work required. |
We introduced the mechanism to retry some distributed query errors without DistSQL in #105451. There have also been improvements from Jeff in #99941. I think this issue can now be closed, and the remaining work in this area is tracked in https://cockroachlabs.atlassian.net/browse/CRDB-26692 / #100578. |
This issue tracks addressing this TODO.
In particular, in 22.1 (in 350188b) we enabled the usage of DistSQL in multi-tenant environments. The DistSQL physical planner uses sqlinstance.GetAllInstances to get all healthy SQL pods for a particular tenant and then might schedule a part of the distributed query plan on any of those pods. However, it does not appear to be guaranteed that all those SQL instances run the same binary, so during an upgrade distributed queries might hit "incompatible DistSQL version" (version mismatch in flow request) errors.
Jira issue: CRDB-20040