debug: The problemranges debug page hangs if nodes aren't responsive #15342

Closed
a-robinson opened this issue Apr 25, 2017 · 4 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

@a-robinson
Contributor

On a cluster that's having problems (e.g. the one from #15341), the problem ranges page is very hard to get to load. It hangs for minutes on end, presumably because one or more of the nodes is unresponsive. Assuming that's the cause, the logic for rendering the page should cut its losses at some point, return what it knows, and warn the user that the page is missing information from such-and-such nodes.

This doesn't appear to be deterministic though -- sometimes the page finishes loading in tens of seconds, and other times it continues hanging. It might just be a matter of the unhealthy nodes sometimes responding and sometimes not, but I'm not sure.

@BramGruneir
Member

One thing you can try as a temporary workaround is to look at just a specific node:
debug/problemranges?node_id=N

@a-robinson
Contributor Author

Yup, that did work much better.

@dianasaur323 dianasaur323 added this to the 1.1 milestone Apr 26, 2017
@BramGruneir BramGruneir added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Aug 4, 2017
@BramGruneir
Member

So I was looking into this. Right now, we only send requests to live nodes, and if there's an issue reading from node liveness, we might not send any requests at all. This also means that if nodes have a flaky liveness record, they may or may not be included in the request.

So I can forgo the liveness check and have it just try to hit every node with our standard timeout, then report the results of each attempted connection, similar to how the new range page does it. See the last table on the page in #17433.

Not sure if that will solve the problem, but it should make it a bit easier to work with (and provide more info).

@BramGruneir
Member

I've merged #17913, which should do a better job of dealing with down nodes on the problem ranges report.

Reopen this if there are new issues.

Bram
