
sql: DistSQL physical planner might use SQL instances of different DistSQL versions during upgrade #88927

Closed
yuzefovich opened this issue Sep 28, 2022 · 10 comments
Labels
A-multitenancy Related to multi-tenancy C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-queries SQL Queries Team

Comments

@yuzefovich
Member

yuzefovich commented Sep 28, 2022

This issue tracks addressing this TODO.

In particular, in 22.1 in 350188b we enabled the usage of DistSQL in multi-tenant environments. The DistSQL physical planner uses sqlinstance.GetAllInstances to get all healthy SQL pods for a particular tenant and then might schedule a part of the distributed query plan on any of those pods. However, it does not appear to be guaranteed that all of those SQL instances run the same binary, so during an upgrade distributed queries might hit "incompatible DistSQL version" (version mismatch in flow request) errors.
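Roughly, the failure comes from a gate of this shape; the following is a self-contained sketch rather than the actual code, and the constant, function, and message names are illustrative only:

```go
package main

import "fmt"

// DistSQLVersion identifies the wire-compatibility version of a pod's DistSQL
// execution engine (illustrative type, not the real identifier).
type DistSQLVersion uint32

const (
	serverVersion     DistSQLVersion = 71 // version this pod runs
	serverMinAccepted DistSQLVersion = 71 // oldest version this pod still accepts
)

// checkFlowVersion mirrors the gate that a flow-setup request must pass on the
// remote pod: the requester's version has to fall inside the accepted range.
func checkFlowVersion(req DistSQLVersion) error {
	if req < serverMinAccepted || req > serverVersion {
		return fmt.Errorf(
			"version mismatch in flow request: %d; this node accepts %d through %d",
			req, serverMinAccepted, serverVersion)
	}
	return nil
}

func main() {
	// During a rolling upgrade, a pod still on the old binary asks a newer pod
	// to run part of a distributed plan.
	fmt.Println(checkFlowVersion(70)) // rejected: the mixed-version window
	fmt.Println(checkFlowVersion(71)) // accepted
}
```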

Jira issue: CRDB-20040

@yuzefovich yuzefovich added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Sep 28, 2022
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Oct 19, 2022
@mgartner
Collaborator

Yahor says that the best way to solve this would be to update the sqlinstance provider to only return pods of a certain version, though he's not sure how much work this would be. @rafiss, do you think this is feasible?

Another solution would be to prune the list of pods returned by the provider so that DistSQL only uses pods of a compatible version. This might be simpler.
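As a rough sketch of that pruning, assuming the instance records exposed a version range (the type and field names below are hypothetical; they are not part of sqlinstance.InstanceInfo today):

```go
package main

import "fmt"

type instance struct {
	ID             int
	DistSQLVersion uint32 // version the pod runs
	MinAccepted    uint32 // oldest version the pod still accepts
}

// accepts reports whether the pod would accept a flow planned at
// gatewayVersion, mirroring the check described in the issue.
func accepts(in instance, gatewayVersion uint32) bool {
	return gatewayVersion >= in.MinAccepted && gatewayVersion <= in.DistSQLVersion
}

// pruneIncompatible filters the provider's result down to pods the gateway can
// safely place flows on.
func pruneIncompatible(all []instance, gatewayVersion uint32) []instance {
	var out []instance
	for _, in := range all {
		if accepts(in, gatewayVersion) {
			out = append(out, in)
		}
	}
	return out
}

func main() {
	pods := []instance{
		{ID: 1, DistSQLVersion: 71, MinAccepted: 71},
		{ID: 2, DistSQLVersion: 70, MinAccepted: 70}, // still on the old binary
	}
	fmt.Println(pruneIncompatible(pods, 71)) // pod 2 is pruned
}
```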

@blathers-crl blathers-crl bot added the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Nov 22, 2022
@yuzefovich
Member Author

We do have the same problem in the single-tenant world too (tracked in #87199), so it would be nice to implement something similar to what is suggested there; that would improve things for this issue as well. However, I believe the correct solution is to improve the sqlinstance provider.

@rafiss
Collaborator

rafiss commented Nov 23, 2022

I think the @cockroachdb/multi-tenant team would have the best thoughts here, since they've been adding more safeguards around mixed-version clusters lately.

@knz
Contributor

knz commented Nov 25, 2022

cc @ajstorm for tracking

@knz knz added A-multitenancy Related to multi-tenancy T-multitenant Issues owned by the multi-tenant virtual team labels Nov 25, 2022
@ajstorm ajstorm added T-multitenant Issues owned by the multi-tenant virtual team and removed T-multitenant Issues owned by the multi-tenant virtual team labels Nov 25, 2022
@ajstorm
Collaborator

ajstorm commented Nov 25, 2022 via email

@rafiss rafiss removed the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Nov 29, 2022
@ajstorm
Collaborator

ajstorm commented Jan 9, 2023

I agree with the thoughts above that improving sqlinstance would help. To do so, I think we'd want to plumb the version information through to the InstanceInfo struct so that we can filter out all of the SQL pods that are at incompatible versions.
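Something along these lines, as a sketch only (the BinaryVersion field and the filtering helper below are hypothetical, not what InstanceInfo exposes today):

```go
package main

import "fmt"

// InstanceInfo stands in for the real struct, extended with a version field
// that each pod would write to its sql_instances record at startup.
type InstanceInfo struct {
	InstanceID    int
	InstanceAddr  string
	BinaryVersion string // hypothetical field, not present today
}

// getCompatibleInstances returns only pods running the same binary version as
// the caller, the simplest policy that avoids mixed-version flows.
func getCompatibleInstances(all []InstanceInfo, callerVersion string) []InstanceInfo {
	var out []InstanceInfo
	for _, inst := range all {
		if inst.BinaryVersion == callerVersion {
			out = append(out, inst)
		}
	}
	return out
}

func main() {
	pods := []InstanceInfo{
		{InstanceID: 1, InstanceAddr: "10.0.0.1:26257", BinaryVersion: "v23.1.0"},
		{InstanceID: 2, InstanceAddr: "10.0.0.2:26257", BinaryVersion: "v22.2.5"},
	}
	fmt.Println(getCompatibleInstances(pods, "v23.1.0")) // pod 2 is filtered out
}
```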

I'll also point out that this problem seems to be much worse in MT than in single-tenant. If you upgrade the binary of one of your SQL pods and try to run DistSQL queries, they seem to fail deterministically until all SQL pods are at the same binary version (in some tests I've been running, at least). This is more painful than the single-tenant case, where the exposure is limited to gossip inaccuracies.

The one thing that's not clear to me is how big a deal this is for 23.1. In the UA world, this behaviour seems bad, but we're not supporting upgrades to UA in 23.1. In 23.2, would we hit this problem in all mixed-version clusters? Or will the gossip information bail us out, considering that we have only one in-process SQL pod on each node? If it's all mixed-version clusters, will we need to fix this in 23.1 (or backport something to that release) to avoid hitting this issue on upgrades to 23.2?

@exalate-issue-sync exalate-issue-sync bot removed the T-multitenant Issues owned by the multi-tenant virtual team label Jan 10, 2023
@ajstorm
Collaborator

ajstorm commented Jan 23, 2023

For 23.1 (and perhaps even 22.2), and based on this thread, we'd like to add a retry loop such that if the attempt to run DistSQL fails (say, due to a mismatched pod version), we'll rerun the query without DistSQL. This gets around having to call a consistent version of GetInstances every time we plan a query.
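As a sketch of the shape of that retry (the function and error names are illustrative, not the actual ones):

```go
package main

import (
	"errors"
	"fmt"
)

// errVersionMismatch stands in for the error a remote pod returns when it
// rejects a flow request from an incompatible DistSQL version.
var errVersionMismatch = errors.New("version mismatch in flow request")

// runQuery attempts a distributed plan first and falls back to local-only
// execution when the failure looks like a mixed-version artifact. Local
// execution only involves the gateway, so it cannot hit this error.
func runQuery(run func(distributed bool) error) error {
	err := run(true /* distributed */)
	if err != nil && errors.Is(err, errVersionMismatch) {
		return run(false /* distributed */)
	}
	return err
}

func main() {
	attempts := 0
	err := runQuery(func(distributed bool) error {
		attempts++
		if distributed {
			return errVersionMismatch // a remote pod rejected the flow
		}
		return nil
	})
	fmt.Println(attempts, err) // 2 <nil>: the local retry succeeded
}
```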

@yuzefovich FYI

@rytaft
Collaborator

rytaft commented Jan 31, 2023

@yuzefovich is this the issue we were discussing that would require a more involved solution that is not backportable and could take up to two weeks? I moved this to 23.2 since I think this is the same issue, but please correct me if I'm wrong.

@yuzefovich
Member Author

> @yuzefovich is this the issue we were discussing that would require a more involved solution that is not backportable and could take up to two weeks? I moved this to 23.2 since I think this is the same issue, but please correct me if I'm wrong.

Yes, this is the one. I do hope to prototype something here in the coming weeks to get a better sense for the amount of work required.

@mgartner mgartner moved this to New Backlog in SQL Queries Jul 24, 2023
@yuzefovich
Member Author

yuzefovich commented Sep 7, 2023

We introduced a mechanism to retry some distributed query errors without DistSQL in #105451. There have also been improvements from Jeff in #99941. I think this issue can now be closed, and the remaining work in this area is tracked in https://cockroachlabs.atlassian.net/browse/CRDB-26692 / #100578.

@github-project-automation github-project-automation bot moved this from New Backlog to Done in SQL Queries Sep 7, 2023