-
Notifications
You must be signed in to change notification settings - Fork 3.9k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
105451: sql: re-execute distributed query as local for some errors r=yuzefovich a=yuzefovich This commit teaches the main query code path (i.e. ignoring sub- and post-queries) to retry distributed plans as local in some cases. In particular, we use this retry mechanism if: - the error is SQL retryable (i.e. it'll have a high chance of success during the local execution) - no data has been pushed to the result writer by the distributed query (this shouldn't be a frequent scenario since most SQL retryable errors are likely to occur during the plan setup / before any data can be produced by the query). This retry mechanism allows us to hide transient network problems, and - more importantly - in the multi-tenant model it allows us to go around the problem when "not ready" SQL instance is being used for DistSQL planning (e.g. the instance might have been brought down, but the cache on top of the system table hasn't been updated accordingly). I believe that no matter the improvements that we can make to the instance cache, there will also be a window (which should hopefully getting smaller - according to David T it's currently 45s but he hopes to bring it down to 7s or so) during which the instance cache is stale, so DistSQL planner could use "not ready" instances. The rationale for why it is ok to do this retry is that we create brand-new processors that aren't affiliated to the distributed plan that was just cleaned up. It's worth mentioning that the planNode tree couldn't haven been reused in this way, but if we needed to execute any planNodes directly, then we would have to run such a plan in a local fashion. In other words, the fact that we had a distributed plan initially guarantees that we don't have any planNodes to be concerned about. Possible downside to this approach is that it increases overall query latency, so ideally we wouldn't plan on "not ready" instances in the first place (and we have issues to improve there), but given that we now fully parallelize the setup of distributed plans, the latency increase should be bound, assuming that most retryable errors occur during the distributed plan setup. Addresses: #100578. Epic: None Release note: None 105715: dev: ignore output_base return r=dt a=dt Release note: none. Epic: none. Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: David Taylor <[email protected]>
- Loading branch information
Showing
5 changed files
with
243 additions
and
7 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters