-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
distsqlrun: schedule failed streams on the gateway #17497
Conversation
... to be returned by SetupFlow when the server doesn't support the requested version, instead of the "internal error" we used to return. Even though we won't be able to use the typed error for now because 1.0 didn't have it, I still think it's a good idea to have it from now on.
Before this patch, failure to schedule any remote flows was causing the query to fail. This is not great, as scheduling can fail because the remote node went down (but it must have gone down recently, otherwise we wouldn't even have scheduled a flow on it), but also because the remote node is running an older version. This patch makes distSQLPlanner.Run() schedule locally the flows that failed to schedule remotely. It does this by merging the processors in those flows with the local flow. Moving flows to the gateway means that any other remote processor that was planned to connect to a moved processor will have a bad time connecting; we fix it by adding a fallback to outboxes - all outboxes are programmed to fallback to connecting to the gateway if the connection to their primary target doesn't succeed in within a timeout.
…al flow on the gateway Before this patch, when a query was compiled to flows A,B,G, and A was successfully scheduled but B's scheduling errors, we didn't schedule the gateway flow G. If A was trying to connect to G, the process of stream establishment would have waited for the connection timeout before giving up. This patch improves this by registering a "poisoned entry" on G that informs anybody trying to connect that the flow is toast.
I still have to figure out the testing story, but sending out for opinions. |
This is cool stuff! Thanks for getting this done so quickly! For testing, we could add a testing knob to inject version mismatch errors and verify a query while turning it on and off for various nodes. I worked on a change at some point that introduced an "finished flow cache" which would fail-fast requests for those flows, but never made it into a PR. The idea was that if a flow fails and gets unregistered before other flows connect to it, we don't want those to wait for the timeout. The same could be used instead of the poisoning stuff. If you want to look at it, this is the commit: RaduBerinde@3eb5dd0 We can of course add this later, just thought I should point it out. Review status: 0 of 13 files reviewed at latest revision, 7 unresolved discussions, some commit checks failed. pkg/sql/distsql_running.go, line 41 at r3 (raw file):
this can be longer? the connection timeouts are a few seconds I think pkg/sql/distsqlrun/flow_registry.go, line 205 at r3 (raw file):
[nit] I would call this "duration" (as in duration of the poisoning) pkg/sql/distsqlrun/flow_registry.go, line 229 at r3 (raw file):
[nit] extra blank line pkg/sql/distsqlrun/outbox.go, line 320 at r2 (raw file):
[nit] would be a bit more readable to use a pkg/sql/distsqlrun/outbox.go, line 323 at r2 (raw file):
Pretty smart stuff going on here with the timer! pkg/sql/distsqlrun/outbox.go, line 359 at r2 (raw file):
I think it's ok to have the same message (with empty fallback if there is none); it's a bit more explicit than the other one. pkg/sql/distsqlrun/outbox.go, line 365 at r2 (raw file):
Can this not have the same name as the method above? Comments from Reviewable |
Maybe rebase this? Also it will want a release note. |
This PR is dead, at least for now. The fundamental issue is that it's hard
to decide when to fallback because it's hard to conclude that you've
connected to the wrong node.
…On Fri, Apr 27, 2018 at 10:53 AM, kena ***@***.***> wrote:
Maybe rebase this? Also it will want a release note.
—
You are receiving this because your review was requested.
Reply to this email directly, view it on GitHub
<#17497 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAXBcQgxFL82hW_yujocKKiiG-WrBCleks5tszDUgaJpZM4OwApb>
.
|
cc @asubiotto |
Closing this as obsolete. |
Before this patch, failure to schedule any remote flows was causing the
query to fail. This is not great, as scheduling can fail because the
remote node went down (but it must have gone down recently, otherwise we
wouldn't even have scheduled a flow on it), but also because the remote
node is running an older version.
This patch makes distSQLPlanner.Run() schedule locally the flows that
failed to schedule remotely. It does this by merging the processors in
those flows with the local flow.
Moving flows to the gateway means that any other remote processor that
was planned to connect to a moved processor will have a bad time
connecting; we fix it by adding a fallback to outboxes - all outboxes
are programmed to fallback to connecting to the gateway if the
connection to their primary target doesn't succeed in within a timeout.