
distsql: better handling of failures and dead nodes #15637

Closed · RaduBerinde opened this issue May 3, 2017 · 12 comments
Labels: C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception) · no-issue-activity · S-3 (Medium-low impact: incurs increased costs for some users, incl. lower availability, recoverable bad data) · X-stale

Comments

RaduBerinde (Member) commented May 3, 2017

In 1.0 we avoid hosts that are known to be dead, but we may fail the first attempt of a query if we don't yet know that a host is dead.

This issue tracks implementing better handling of these situations. The idea is to support multiple flows for the same query on a single node, and to reschedule flows that we failed to schedule on the gateway node.

One important aspect of "error" in this discussion is version mismatch: if we can reschedule flows that can't run on a host because of its version, we can preserve query availability (with decreased performance) during an upgrade, even when there are incompatible changes in distsql.
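To make the rescheduling idea concrete, here is a minimal, hypothetical Go sketch. FlowSpec, setupFlowOnNode, and scheduleFlows are illustrative stand-ins, not CockroachDB's actual distsql types or RPCs: flows whose assigned node fails to accept them get reassigned to the gateway, which is possible precisely because a node may host multiple flows of the same query.

```go
package main

import (
	"errors"
	"fmt"
)

// FlowSpec is a hypothetical stand-in for a planned flow, not
// CockroachDB's actual distsql type; it only illustrates the
// "reschedule failed flows on the gateway" idea from this issue.
type FlowSpec struct {
	QueryID string
	NodeID  int // node the physical planner originally assigned
}

// setupFlowOnNode simulates the per-node flow-setup RPC. Here node 2
// stands in for a dead (or version-incompatible) host.
func setupFlowOnNode(nodeID int, f FlowSpec) error {
	if nodeID == 2 {
		return errors.New("node unreachable")
	}
	fmt.Printf("flow for %s scheduled on n%d\n", f.QueryID, nodeID)
	return nil
}

// scheduleFlows tries each flow on its assigned node and, on failure,
// falls back to the gateway. Supporting multiple flows of the same
// query on a single node is what makes this fallback possible.
func scheduleFlows(gatewayID int, flows []FlowSpec) error {
	for i := range flows {
		f := &flows[i]
		if err := setupFlowOnNode(f.NodeID, *f); err != nil {
			fmt.Printf("n%d failed (%v); rescheduling on gateway n%d\n",
				f.NodeID, err, gatewayID)
			f.NodeID = gatewayID
			if err := setupFlowOnNode(gatewayID, *f); err != nil {
				return err // the gateway itself failed; surface the error
			}
		}
	}
	return nil
}

func main() {
	flows := []FlowSpec{{"q1", 1}, {"q1", 2}, {"q1", 3}}
	if err := scheduleFlows(1, flows); err != nil {
		fmt.Println("query failed:", err)
	}
}
```

The same fallback path covers the version-mismatch case: a flow rejected because of an incompatible distsql version is just another scheduling failure to retry on the gateway, at the cost of losing distribution for that part of the query.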

@RaduBerinde RaduBerinde added this to the 1.1 milestone May 3, 2017
@RaduBerinde RaduBerinde self-assigned this May 3, 2017
cuongdo (Contributor) commented Jul 7, 2017

How is this looking for 1.1?

RaduBerinde (Member, Author)

We have a good idea of what we want to do here but nothing is implemented at this point.

rjnn (Contributor) commented Aug 15, 2017

This is being addressed by #17497, which should merge in the 1.1 timeframe.

andreimatei (Contributor)

The version mismatch problem has been addressed to some extent. The recently-failed-node case remains open.

@andreimatei andreimatei modified the milestones: 1.2, 1.1 Sep 1, 2017
vivekmenezes (Contributor)

Probably obvious, but this is a high-priority issue for us to fix.

vivekmenezes (Contributor)

Radu and Andrei are thinking about this problem and trying to figure out what to do in the next release.

@andreimatei andreimatei modified the milestones: 2.0, 2.1 Feb 26, 2018
@knz knz added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Apr 27, 2018
nstewart (Contributor) commented Aug 5, 2018

This issue is preventing me from doing standard reporting. Is there an update? I'm getting psycopg2.InternalError: [n3] communication error: rpc error: code = Canceled desc = context canceled, and I see that a couple of nodes in my 5-node cluster restarted. Happy to create a new ticket, but I see #19882 was closed in favor of this one.

jordanlewis (Member)

@nstewart, does the query complete if you try it again? Or are you never able to complete your query?

nstewart (Contributor) commented Aug 6, 2018

My script runs ~10 queries serially. Each query runs for several minutes (but not hours). I've run the script several times over the weekend, and it fails at a different query each time with context canceled. So it looks like retrying does work in some cases.

I'm betting after I add retry loops this thing will eventually complete (I'm going to do this in the short term so I'm not blocked), but since these aren't serializable retries and there is only one, single-threaded client for this database, I wonder if that would be masking a bigger problem.
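For reference, a minimal sketch of such a retry loop, written in Go with database/sql rather than the psycopg2 script described above. The connection string, table name, and the substring-based error check are all assumptions for illustration, not a recommended classifier for transient errors:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"strings"
	"time"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks pgwire
)

// runWithRetry retries a statement when the error looks like the
// transient "context canceled" failure seen in this thread. The
// substring match is a naive illustration only.
func runWithRetry(db *sql.DB, stmt string, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if _, err = db.Exec(stmt); err == nil {
			return nil
		}
		if !strings.Contains(err.Error(), "context canceled") {
			return err // a different failure; don't mask it with retries
		}
		log.Printf("attempt %d failed: %v; retrying", i+1, err)
		time.Sleep(time.Second << uint(i)) // simple exponential backoff
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	// Connection string and statement are placeholders.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := runWithRetry(db, "SELECT count(*) FROM t", 5); err != nil {
		log.Fatal(err)
	}
}
```

As the comment above notes, a loop like this can mask real instability, which is why non-matching errors are deliberately returned without retrying.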

@nstewart nstewart added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label Aug 6, 2018
jordanlewis (Member)

Yes, surely doing that would be masking a problem. During your script's run, are nodes restarting a lot?

nstewart (Contributor) commented Aug 6, 2018

Yes, I am running a 5 node cluster and I saw 1 node restart once and another restart 4 times. I saved the debug zip here: #28278

@jordanlewis jordanlewis added S-3 Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data) and removed S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting labels Apr 30, 2019
github-actions bot commented Jun 9, 2021

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!
