3 nodes, replicas going only to one node #25316

Closed
thstart opened this issue May 4, 2018 · 15 comments
Labels
A-kv-distribution Relating to rebalancing and leasing. C-investigation Further steps needed to qualify. C-label will change. O-community Originated from the community

Comments

thstart commented May 4, 2018

Is this a question, feature request, or bug report?

QUESTION
I have 3 nodes. I am importing a large database through node 3, but all replicas are going only to node 1.

Have you checked our documentation at https://cockroachlabs.com/docs/stable/? If you could not find an answer there, please consider asking your question in our community forum at https://forum.cockroachlabs.com/, as it would benefit other members of our community.

Prefer live chat? Message our engineers on our Gitter channel at https://gitter.im/cockroachdb/cockroach.

BUG REPORT

  1. Please supply the header (i.e. the first few lines) of your most recent
    log file for each node in your cluster. On most unix-based systems
    running with defaults, this boils down to the output of

==========================
commands:

n1:
/usr/local/bin/cockroach
start
--insecure
--store=/TC.CockRoachDB.store
--host=my ip address 1
--http-port=9000
--cache=25%
--max-sql-memory=25%

n2:
/usr/local/bin/cockroach
start
--insecure
/TC.CockRoachDB.store
--host=my ip address 2
--http-port=9000
--cache=25%
--max-sql-memory=25%
--join=my ip address 1:26257

n3:
/usr/local/bin/cockroach
start
--insecure
/TC.CockRoachDB.store
--host=my ip address 3
--http-port=9000
--cache=25%
--max-sql-memory=25%
--join=my ip address 1:26257

==========================
web console:

ID n1
ADDRESS my ip address 1
UPTIME 16 hours
BYTES 2.7 GiB
REPLICAS 270
MEM USAGE 5.9 GiB
VERSION v2.0.1

ID n2
ADDRESS my ip address 2
UPTIME 16 hours
BYTES 70.9 MiB
REPLICAS 8
MEM USAGE 320.7 MiB
VERSION v2.0.1

ID n3
ADDRESS my ip address 3
UPTIME 15 hours
BYTES 5.9 MiB
REPLICAS 0
MEM USAGE 463.5 MiB
VERSION v2.0.1

  2. Please describe the issue you observed:
  • What did you do?

Installed v2.0.1 from scratch. On node n3, I began importing data using UPSERT INTO statements (a sketch of this kind of import follows after the screenshot below).

  • What did you expect to see?
    The same number of replicas on all nodes.

  • What did you see instead?
    Importing via n3, there is no data on this node and none on n2.
    The only data I see is on node n1.

[screenshot: cockroachdb01]
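
A minimal sketch of this kind of row-by-row import, run from a shell; the table and column names are made up for illustration, not from the original report:

# Hypothetical table/columns; the real import used the reporter's own schema.
cockroach sql --insecure --host=<node 3 address> \
  -e "UPSERT INTO mytable (id, val) VALUES (1, 'a'), (2, 'b');"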

@rytaft rytaft added C-investigation Further steps needed to qualify. C-label will change. O-community Originated from the community A-sql-mutations Mutation statements: UPDATE/INSERT/UPSERT/DELETE. A-kv-distribution Relating to rebalancing and leasing. labels May 4, 2018

tbg commented May 5, 2018

Thanks for your report, @thstart. This looks unexpected. Could you send me a screenshot of the page at /#/reports/range/21 (from any node, but preferably n1)? I'm particularly interested in the "simulated allocator output".

@474420502

I also have the same problem 😢


tbg commented May 6, 2018

@474420502 could you send the screenshot I asked for above from your cluster that has this problem?

@474420502
[screenshot]


tbg commented May 6, 2018

@474420502 could you set cluster setting server.remote_debugging.mode = 'any'; so that I can access the debug pages? Please note that you don't want to do this if there's sensitive data in that cluster, as enabling debugging can in principle expose some of it.
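
A minimal sketch of running that setting from a shell, assuming an insecure cluster; the address is a placeholder:

# Placeholder address; run against any live node of the cluster.
cockroach sql --insecure --host=<any node address> \
  -e "SET CLUSTER SETTING server.remote_debugging.mode = 'any';"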

@474420502

@tschottdorf ok, server.remote_debugging.mode = 'any' is now set.


tbg commented May 6, 2018

Thanks! What strikes me as the most obvious problem here is a lack of network connectivity:

[two screenshots: network connectivity graphs]

In the logs, I see these pretty much right away:

grpc: addrConn.createTransport failed to connect to {cockroach02:8802 0 }. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup cockroach02 on 114.114.114.114:53: no such host". Reconnecting...

This suggests that you need to set a proper --advertise-host flag on each node so that the nodes can connect to each other, or make the hostnames cockroachXX resolvable. Can you try either option and report back? Thanks!
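
A minimal sketch of the first option, assuming an insecure cluster; every address below is a placeholder, and the command would be repeated on each node with that node's own reachable address:

# Placeholders throughout; each node advertises an address the other nodes can actually dial.
cockroach start --insecure --store=data \
  --host=<address this node listens on> \
  --advertise-host=<address the other nodes can reach this node at> \
  --join=<address of the first node>:<its port>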

Some more ramblings from my earlier investigation are below, but there's nothing actionable there.


This is even though gossip is healthy. Note that the first graph is asymmetric. For example, n2 can talk to n1 but n1 cannot talk to n2. Only n3 can go both ways, and in fact n3 managed to get four replicas at some point, which may be related to that.

Looking at the node addresses, something is odd:

[screenshot: node addresses]

This looks like you're running node 1 and node 3 on the same machine, and similarly node 2 and node 4. Or, you have set up nodes that have identical host names. I was able to pull the startup command line for those two from the logs:

1: [config] arguments: [cockroach start --insecure --http-port=5050 --port=8800 --store=data --host=cockroach01]
3: [config] arguments: [cockroach start --insecure --http-port=5051 --join=cockroach01:8800 --port=8803 --store=data --host=cockroach01]

@474420502

@tschottdorf Thanks! The problem is solved! 👍 http://14.17.96.14:5050


tbg commented May 6, 2018

Thanks! We need to do a better job exposing this problem to the operators.


tbg commented May 6, 2018

@thstart -- hopefully you are running into the same problem, just let us know!


thstart commented May 6, 2018

I investigated and discovered the following. I was importing using n3. I checked everything until I found that n3 was behind a firewall that was not accepting incoming connections. Still, only n1 got replicas. n1 and n2 were fine as far as the firewall was concerned (both allowed incoming connections), but only n1 got the replicas. Because of time constraints I tore down the cluster, started from scratch, and now it is working fine.


tbg commented May 6, 2018

Thanks @thstart! I'll leave this open until I've filed/found follow-up issues to surface these kinds of problems better.

tbg added a commit to tbg/cockroach that referenced this issue May 7, 2018
tbg added a commit to tbg/cockroach that referenced this issue May 8, 2018
tbg added a commit to tbg/cockroach that referenced this issue May 9, 2018
craig bot pushed a commit that referenced this issue May 9, 2018
25343: server: add health checks and distress gossip r=bdarnell a=tschottdorf

While generating NodeStatus summaries, check for metrics that indicate a
severe store, node, or cluster-level problem. Metrics can either be
counters or gauges, and for the former we have to keep state so that we
can notify only when the counter increments (not when it's nonzero).

Flagged metrics are gossiped under a newly introduced key. These infos
are in turn picked up by the newly introduced internal `gossip_alerts`
table.

In effect, operators can monitor the `crdb_internal.gossip_alerts`
table on any node (though they'll want to do it on all nodes if there
are network connectivity issues). Similarly, it'll be straightforward
to plumb these warnings into the UI, though to the best of my knowledge
the UI can't just query `crdb_internal`, and we may expose them in
another more suitable location (this is trivial since Gossip is
available pretty much everywhere).

For starters, we only check the metrics for underreplicated and
unavailable ranges as well as liveness errors, but there is no
limitation on how elaborate these health checks can become. In fact,
they aren't limited to sourcing solely from `NodeStatus`, and in light
of #25316 we should consider alerting when nodes can't (bidirectionally)
communicate, so that operators can easily diagnose DNS or firewall
issues.

NB: I had originally envisioned polling the `node_metrics` table because
that would have allowed us to write the health checks in SQL. I had this
code but ultimately deleted it as it seemed too roundabout and less
extensible.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
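
A minimal sketch of the monitoring described in that message, assuming a build that includes this change; the address is a placeholder:

# An empty result means no alerts were flagged; check from every node if connectivity is suspect.
cockroach sql --insecure --host=<any node address> \
  -e "SELECT * FROM crdb_internal.gossip_alerts;"
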
@knz knz removed the A-sql-mutations Mutation statements: UPDATE/INSERT/UPSERT/DELETE. label May 15, 2018

tbg commented Jun 6, 2018

This is now tracked as part of #18850.

@tbg tbg closed this as completed Jun 6, 2018
@dimovnike

Hello, what is the recommended way to properly detect this situation? I tried /_status/vars, but it also says liveness_livenodes=3 even though only 2 nodes are replicating and 1 node is not (due to a firewall issue). How can I detect non-working nodes?


tbg commented Oct 18, 2018

@dimovnike this is unfortunately a blind spot today. One way to see it would be under /debug/ in the admin UI; there's a network connectivity page. But ideally this would give you a yellow cluster status.

PS this problem is now tracked as part of #18850. I'll refer to your comment there, but please post further comments over there as well to save me the work :) Thanks.
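
In the meantime, one rough way to watch for the symptom from this issue is to scrape each node's metrics endpoint at /_status/vars; a sketch, where the port and the exact metric names (ranges_underreplicated, ranges_unavailable) are assumptions based on the metrics mentioned in the health-check PR above, not an official recipe:

# Check every node; persistently nonzero values point at replication problems on that node's stores.
curl -s http://<node address>:<http port>/_status/vars | grep -E '^ranges_(underreplicated|unavailable)'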
