distsql: better handling of failures and dead nodes #15637
Comments
How is this looking for 1.1?
We have a good idea of what we want to do here but nothing is implemented at this point.
This is being addressed by #17497, which should merge in the 1.1 timeframe.
The version mismatch problem has been partially addressed. The case of a recently failed node remains open.
Probably obvious, but this is a high-priority issue for us to fix.
Radu and Andrei are thinking about this problem and trying to figure out what to do in the next release.
This issue is preventing me from doing standard reporting. Is there an update? I'm getting
@nstewart, does the query complete if you try it again? Or are you never able to complete your query?
My script runs ~10 queries serially. Each query runs for several minutes (but not hours). I've run the script several times over the weekend and it fails at different queries each time with I'm betting that after I add retry loops this thing will eventually complete (I'm going to do this in the short term so I'm not blocked), but since these aren't serializable retries and there is only one single-threaded client for this database, I wonder if that would be masking a bigger problem.
Yes, surely doing that would be masking a problem. During your script's run, are nodes restarting a lot?
Yes, I am running a 5-node cluster and I saw one node restart once and another restart 4 times. I saved the debug zip here: #28278
We have marked this issue as stale because it has been inactive for
In 1.0, we avoid hosts that are known to be dead, but the first attempt of a query may fail if we don't yet know that a host is dead.
This issue tracks implementing better handling of these situations. The idea is to support multiple flows for the same query on a single node, and to reschedule flows (which we failed to schedule) on the gateway node.
One important aspect of "error" in this discussion is version mismatch: if we can reschedule flows that can't run on a host because of its version, we can provide query availability (with decreased performance) during an upgrade even when there are incompatible changes in distsql.
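The rescheduling idea described above could be sketched roughly as follows. This is only an illustration of the fallback logic, not the actual distsql code: `NodeID`, `Flow`, `SetupFunc`, `ScheduleWithFallback`, and both error values are hypothetical names invented for this sketch. It assumes the gateway can always run a flow locally and that a failed remote setup (dead node or version mismatch) is reported as an error.

```go
package main

import (
	"errors"
	"fmt"
)

// NodeID and Flow are hypothetical stand-ins, not CockroachDB types.
type NodeID int

type Flow struct {
	ID     int
	Target NodeID // node the planner originally chose for this flow
}

// Hypothetical errors a remote flow setup might return.
var (
	errNodeUnavailable = errors.New("node unavailable")
	errVersionMismatch = errors.New("incompatible distsql version")
)

// SetupFunc attempts to schedule a flow on its target node (assumed API).
type SetupFunc func(f Flow) error

// ScheduleWithFallback tries every flow on its planned node. Any flow whose
// remote setup fails is reassigned to the gateway, so the gateway may end up
// running several flows for the same query. It returns flow ID -> node.
func ScheduleWithFallback(flows []Flow, gateway NodeID, setup SetupFunc) map[int]NodeID {
	placement := make(map[int]NodeID, len(flows))
	for _, f := range flows {
		if f.Target != gateway {
			if err := setup(f); err != nil {
				// Dead node or version mismatch: run this flow locally.
				placement[f.ID] = gateway
				continue
			}
		}
		placement[f.ID] = f.Target
	}
	return placement
}

func main() {
	const gateway NodeID = 1
	flows := []Flow{{ID: 1, Target: 1}, {ID: 2, Target: 2}, {ID: 3, Target: 3}}

	// Simulated cluster: node 2 just died, node 3 runs an incompatible version.
	setup := func(f Flow) error {
		switch f.Target {
		case 2:
			return errNodeUnavailable
		case 3:
			return errVersionMismatch
		}
		return nil
	}

	placement := ScheduleWithFallback(flows, gateway, setup)
	for _, f := range flows {
		fmt.Printf("flow %d: planned on n%d, runs on n%d\n", f.ID, f.Target, placement[f.ID])
	}
}
```

In this sketch the query stays available at the cost of concentrating work on the gateway, which matches the "decreased performance" trade-off mentioned for upgrades.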