
distsql: better handling of failures and dead nodes #15637

Closed · RaduBerinde opened this issue May 3, 2017 · 12 comments
Labels: C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception) · no-issue-activity · S-3 (Medium-low impact: incurs increased costs for some users, incl. lower availability, recoverable bad data) · X-stale

Comments

RaduBerinde (Member) commented May 3, 2017

In 1.0 we avoid hosts that are known to be dead, but we may fail the first attempt of a query if we don't yet know that a host is dead.

This issue tracks implementing better handling of these situations. The idea is to support multiple flows for the same query on a single node, and to reschedule flows that we failed to schedule on the gateway node.

One important aspect of "error" in this discussion is version mismatch: if we can reschedule flows that can't run on a host because of its version, we can preserve query availability (with decreased performance) during an upgrade, even when there are incompatible changes in distsql.
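To make the rescheduling idea concrete, here is a minimal, hypothetical Go sketch. FlowSpec, setupFlowOnNode, and scheduleFlows are illustrative stand-ins, not CockroachDB's actual distsql types or RPCs: flows whose assigned node fails to accept them get reassigned to the gateway, which is possible precisely because a node may host multiple flows of the same query.

```go
package main

import (
	"errors"
	"fmt"
)

// FlowSpec is a hypothetical stand-in for a planned flow, not
// CockroachDB's actual distsql type; it only illustrates the
// "reschedule failed flows on the gateway" idea from this issue.
type FlowSpec struct {
	QueryID string
	NodeID  int // node the physical planner originally assigned
}

// setupFlowOnNode simulates the per-node flow-setup RPC. Here node 2
// stands in for a dead (or version-incompatible) host.
func setupFlowOnNode(nodeID int, f FlowSpec) error {
	if nodeID == 2 {
		return errors.New("node unreachable")
	}
	fmt.Printf("flow for %s scheduled on n%d\n", f.QueryID, nodeID)
	return nil
}

// scheduleFlows tries each flow on its assigned node and, on failure,
// falls back to the gateway. Supporting multiple flows of the same
// query on a single node is what makes this fallback possible.
func scheduleFlows(gatewayID int, flows []FlowSpec) error {
	for i := range flows {
		f := &flows[i]
		if err := setupFlowOnNode(f.NodeID, *f); err != nil {
			fmt.Printf("n%d failed (%v); rescheduling on gateway n%d\n",
				f.NodeID, err, gatewayID)
			f.NodeID = gatewayID
			if err := setupFlowOnNode(gatewayID, *f); err != nil {
				return err // the gateway itself failed; surface the error
			}
		}
	}
	return nil
}

func main() {
	flows := []FlowSpec{{"q1", 1}, {"q1", 2}, {"q1", 3}}
	if err := scheduleFlows(1, flows); err != nil {
		fmt.Println("query failed:", err)
	}
}
```

The same fallback path covers the version-mismatch case: a flow rejected because of an incompatible distsql version is just another scheduling failure to retry on the gateway, at the cost of losing distribution for that part of the query.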

@RaduBerinde RaduBerinde added this to the 1.1 milestone May 3, 2017
@RaduBerinde RaduBerinde self-assigned this May 3, 2017
cuongdo (Contributor) commented Jul 7, 2017

How is this looking for 1.1?

RaduBerinde (Member, Author)

We have a good idea of what we want to do here but nothing is implemented at this point.

rjnn (Contributor) commented Aug 15, 2017

This is being addressed by #17497, which should merge in the 1.1 timeframe.

andreimatei (Contributor)

The version mismatch problem has been addressed to some extent. The recently-failed-node case remains open.

@andreimatei andreimatei modified the milestones: 1.2, 1.1 Sep 1, 2017
vivekmenezes (Contributor)

Probably obvious, but this is a high-priority issue for us to fix.

vivekmenezes (Contributor)

Radu and Andrei are thinking about this problem and trying to figure out what to do in the next release.

@andreimatei andreimatei modified the milestones: 2.0, 2.1 Feb 26, 2018
@knz knz added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Apr 27, 2018
nstewart (Contributor) commented Aug 5, 2018

This issue is preventing me from doing standard reporting. Is there an update? I'm getting psycopg2.InternalError: [n3] communication error: rpc error: code = Canceled desc = context canceled, and I see that a couple of nodes in my 5-node cluster restarted. Happy to create a new ticket, but I see #19882 was closed in favor of this one.

jordanlewis (Member)

@nstewart, does the query complete if you try it again? Or are you never able to complete your query?

nstewart (Contributor) commented Aug 6, 2018

My script runs ~10 queries serially. Each query runs for several minutes (but not hours). I've run the script several times over the weekend, and it fails at a different query each time with context canceled. So it looks like retrying does work in some cases.

I'm betting after I add retry loops this thing will eventually complete (I'm going to do this in the short term so I'm not blocked), but since these aren't serializable retries and there is only one, single-threaded client for this database, I wonder if that would be masking a bigger problem.
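For reference, a minimal sketch of such a retry loop, written in Go with database/sql rather than the psycopg2 script described above. The connection string, table name, and the substring-based error check are all assumptions for illustration, not a recommended classifier for transient errors:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"strings"
	"time"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks pgwire
)

// runWithRetry retries a statement when the error looks like the
// transient "context canceled" failure seen in this thread. The
// substring match is a naive illustration only.
func runWithRetry(db *sql.DB, stmt string, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if _, err = db.Exec(stmt); err == nil {
			return nil
		}
		if !strings.Contains(err.Error(), "context canceled") {
			return err // a different failure; don't mask it with retries
		}
		log.Printf("attempt %d failed: %v; retrying", i+1, err)
		time.Sleep(time.Second << uint(i)) // simple exponential backoff
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	// Connection string and statement are placeholders.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := runWithRetry(db, "SELECT count(*) FROM t", 5); err != nil {
		log.Fatal(err)
	}
}
```

As the comment above notes, a loop like this can mask real instability, which is why non-matching errors are deliberately returned without retrying.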

@nstewart nstewart added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label Aug 6, 2018
jordanlewis (Member)

Yes, surely doing that would be masking a problem. During your script's run, are nodes restarting a lot?

nstewart (Contributor) commented Aug 6, 2018

Yes, I am running a 5 node cluster and I saw 1 node restart once and another restart 4 times. I saved the debug zip here: #28278

@jordanlewis jordanlewis added S-3 Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data) and removed S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting labels Apr 30, 2019
github-actions bot commented Jun 9, 2021

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!
