3 nodes, replicas going only to one node #25316
Comments
Thanks for your report, @thstart. This looks unexpected; could you send me a screenshot of the page at
I also have the same problem 😢
@474420502 could you send the screenshot I asked for above from your cluster that has this problem?
@474420502 could you set `server.remote_debugging.mode = 'any'` so I can take a closer look?
@tschottdorf ok, now server.remote_debugging.mode = 'any'.
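For readers following along, the setting mentioned above is a cluster setting and can be changed from any SQL shell. A minimal sketch, using the setting name quoted in the comments (reverting to 'local' afterwards is an assumption about typical cleanup, not something stated in the thread):

```sql
-- Allow the web UI's debug pages to be reached remotely, as requested above.
SET CLUSTER SETTING server.remote_debugging.mode = 'any';

-- Once debugging is done, the setting can be tightened again.
SET CLUSTER SETTING server.remote_debugging.mode = 'local';
```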
Thanks! What strikes me as the most obvious problem here is a lack of network connectivity: In the logs, I see these pretty much right away:
This suggests that you need to set a proper

Some more ramblings from earlier investigation below, but there's nothing actionable there. This is even though gossip is healthy. Note that the first graph is asymmetric. For example,

Looking at the node addresses, something is odd: this looks like you're running node 1 and node 3 on the same machine, and similarly node 2 and node 4. Or, you have set up nodes that have identical host names. I was able to pull the startup command line for those two from the logs:

1: [config] arguments: [cockroach start --insecure --http-port=5050 --port=8800 --store=data --host=cockroach01]
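One way to cross-check the addresses the nodes advertise, assuming a build that exposes the `crdb_internal.gossip_nodes` table (it may not be present on older releases), is a quick query from any node; identical host names showing up here would match the diagnosis above:

```sql
-- Hedged sketch: list the address each node gossips about itself.
SELECT node_id, address
FROM crdb_internal.gossip_nodes
ORDER BY node_id;
```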
@tschottdorf Thanks! The problem is solved! 👍 http://14.17.96.14:5050
Thanks! We need to do a better job of exposing this problem to operators.
@thstart -- hopefully you are running into the same problem; just let us know!
I investigated and discovered the following. I was importing using n3. I checked everything until I found out that n3 was behind a firewall that wasn't accepting incoming connections. Still, only n1 got replicas. n1 and n2 were fine as far as the firewall allowing incoming connections, but only n1 got the replicas. Because of time constraints I tore down the cluster and started from scratch, and now it is working fine.
Thanks @thstart! I'll leave this open until I've filed/found follow-up issues to surface these kinds of problems better.
25343: server: add health checks and distress gossip r=bdarnell a=tschottdorf

While generating NodeStatus summaries, check for metrics that indicate a severe store, node, or cluster-level problem. Metrics can either be counters or gauges, and for the former we have to keep state so that we can notify only when the counter increments (not when it's nonzero).

Flagged metrics are gossiped under a newly introduced key. These infos are in turn picked up by the newly introduced internal `gossip_alerts` table. In effect, operators can monitor the `crdb_internal.gossip_alerts` table on any node (though they'll want to do it on all nodes if there are network connectivity issues). Similarly, it'll be straightforward to plumb these warnings into the UI, though to the best of my knowledge the UI can't just query `crdb_internal`, and we may expose them in another more suitable location (this is trivial since Gossip is available pretty much everywhere).

For starters, we only check the metrics for underreplicated and unavailable ranges as well as liveness errors, but there is no limitation on how elaborate these health checks can become. In fact, they aren't limited to sourcing solely from `NodeStatus`, and in light of #25316 we should consider alerting when nodes can't (bidirectionally) communicate, so that operators can easily diagnose DNS or firewall issues.

NB: I had originally envisioned polling the `node_metrics` table because that would have allowed us to write the health checks in SQL. I had this code but ultimately deleted it as it seemed too roundabout and less extensible.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
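As described in the change above, the alerts can then be read with plain SQL from any node; a minimal sketch:

```sql
-- Surface any gossiped health alerts (e.g. underreplicated or
-- unavailable ranges) via the internal table introduced by this change.
SELECT * FROM crdb_internal.gossip_alerts;
```

Per the caveat in the description, on a cluster with suspected connectivity problems this is worth running on every node rather than on just one.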
This is now tracked as part of #18850.
Hello, what is the recommended way to properly detect this situation? I tried /_status/vars, but that also says liveness_livenodes=3 while only 2 nodes are replicating and 1 node is not, due to a firewall issue. How can I detect non-working nodes?
@dimovnike this is unfortunately a blind spot today. One way to see it would be under

PS: this problem is now tracked as part of #18850. I'll refer to your comment there, but please post further comments over there as well to save me the work :) Thanks.
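Until that lands, one hedged workaround, assuming your build exposes the `crdb_internal.gossip_liveness` table (availability and column names vary between versions), is to look at each node's gossiped liveness record from a couple of different nodes; a record that stops updating points at the node that has dropped off:

```sql
-- Hedged sketch: show the liveness record each node last gossiped.
SELECT node_id, epoch, expiration
FROM crdb_internal.gossip_liveness
ORDER BY node_id;
```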
Is this a question, feature request, or bug report?
QUESTION
I have 3 nodes and am importing a large database through node 3. All replicas are going only to node 1.
Have you checked our documentation at https://cockroachlabs.com/docs/stable/? If you could not find an answer there, please consider asking your question in our community forum at https://forum.cockroachlabs.com/, as it would benefit other members of our community.
Prefer live chat? Message our engineers on our Gitter channel at https://gitter.im/cockroachdb/cockroach.
BUG REPORT
log file for each node in your cluster. On most unix-based systems
running with defaults, this boils down to the output of
==========================
commands:
n1:
/usr/local/bin/cockroach
start
--insecure
--store=/TC.CockRoachDB.store
--host=my ip address 1
--http-port=9000
--cache=25%
--max-sql-memory=25%
n2:
/usr/local/bin/cockroach
start
--insecure
--store=/TC.CockRoachDB.store
--host=my ip address 2
--http-port=9000
--cache=25%
--max-sql-memory=25%
--join=my ip address 1:26257
n3:
/usr/local/bin/cockroach
start
--insecure
--store=/TC.CockRoachDB.store
--host=my ip address 3
--http-port=9000
--cache=25%
--max-sql-memory=25%
--join=my ip address 1:26257
==========================
web console:

| ID | ADDRESS | UPTIME | BYTES | REPLICAS | MEM USAGE | VERSION |
|----|---------|--------|-------|----------|-----------|---------|
| n1 | my ip address 1 | 16 hours | 2.7 GiB | 270 | 5.9 GiB | v2.0.1 |
| n2 | my ip address 2 | 16 hours | 70.9 MiB | 8 | 320.7 MiB | v2.0.1 |
| n3 | my ip address 3 | 15 hours | 5.9 MiB | 0 | 463.5 MiB | v2.0.1 |
Installed v2.0.1 from scratch. On node n3 I began importing data using:
UPSERT INTO
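The statement itself is cut off above; purely as an illustration of the shape such an import takes (the table and column names below are made up, not the ones actually used):

```sql
-- Hypothetical example only: batched UPSERTs issued through node n3.
UPSERT INTO my_table (id, payload)
VALUES (1, 'first row'), (2, 'second row');
```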
What did you expect to see?
The same number of replicas on all nodes.
What did you see instead?
Importing via n3, there is no data on this node, and there is no data on n2.
The only data I see is on node n1.