debug: The problemranges debug page hangs if nodes aren't responsive #15342

a-robinson · 2017-04-25T20:26:54Z

On a cluster that's having some problems (e.g. the one from #15341), the problem ranges page is really tough to get to actually load. It just hangs for minutes on end, presumably because one or more of the nodes is being unresponsive. Assuming that this is the problem, the logic for rendering the page should just cut its losses at some point, return what it knows, and warn the user that the page is missing information from such-and-such nodes.

This doesn't appear to be deterministic though -- sometimes the page finishes loading in tens of seconds, and other times it continues hanging. It might just be a matter of the unhealthy nodes sometimes responding and sometimes not, but I'm not sure.

BramGruneir · 2017-04-25T21:33:45Z

One thing you can try as a temporary fix is to look at just a specific node:
debug/problemranges?node_id=N

a-robinson · 2017-04-25T21:41:32Z

Yup, that did work much better.

BramGruneir · 2017-08-04T15:34:36Z

So I was looking into this. Right now, we only send requests to live nodes, and if there's an issue reading from the node liveness, we might not send any requests at all. This also means that if nodes have a flaky liveness record, they may or may not be included in the request.

So I can forgo the liveness check and have it just try to hit every node with our standard timeout, then report the results of each attempted connection, similar to how the new range page does it. See the last table on the page in #17433.

Not sure if that will solve the problem, but it should make it a bit easier to work with (and provide more info).

BramGruneir · 2017-08-30T19:57:23Z

So I've merged #17913, which should do a better job of dealing with down nodes on the problem ranges report.

Reopen this if there are new issues.

Bram

a-robinson assigned BramGruneir Apr 25, 2017

BramGruneir mentioned this issue Apr 25, 2017

server/status: debug pages tracking #14671

Closed

37 tasks

dianasaur323 added this to the 1.1 milestone Apr 26, 2017

a-robinson mentioned this issue May 5, 2017

gossip, server, ui: Node liveness records from dead nodes never time out #15609

Closed

BramGruneir added the C-bug label (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.) Aug 4, 2017

a-robinson mentioned this issue Aug 24, 2017

stability: make debug tools useful for wedged/underreplicated clusters #17904

Closed

BramGruneir closed this as completed Aug 30, 2017
