locality-advertise-addr is not working #42741
I suspect that this may be related to the gossip bootstrap persistence feature (cockroach/pkg/gossip/gossip.go, line 845, at a7f0af0).
This feature was built with the assumption that the primary/public address for the node would be reachable from anywhere; the locality-specific address would just be an optimization. We haven't done much testing in cases where the primary address is sometimes unreachable, so it's possible that we're relying on the primary address in some bootstrapping cases (or maybe it's not just bootstrapping, which would be a more significant bug in this feature). It looks like you're (intentionally) in a single region/AZ for now, but what is your plan when you go to multiple regions? Will there be a public IP that works across regions, or will you be using multiple private IPs? Assuming the former is your goal (which is what we usually see), you'll need to adjust your firewall to get there, and making that adjustment now should get things working (although we'll need to confirm that after bootstrapping it transitions onto the more efficient private IPs).
This is redundant: the flag takes a list of rules with first-match-wins semantics, so you only want to specify the level that determines access to the private IP.
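For illustration only (addresses are placeholders, and the choice of tier is an assumption), a non-redundant invocation would name just the single tier that implies reachability of the private IP, for example the region tier if the private IP is only reachable within the region:

  # Sketch, not the original command: only the region tier carries the
  # private IP; nodes outside that region fall back to --advertise-addr.
  cockroach start \
    --certs-dir=certs \
    --locality=cloud=gcp,region=us-east1,datacenter=us-east1-c \
    --advertise-addr={Public IP}:26257 \
    --locality-advertise-addr=region=us-east1@{Private IP}:26257 \
    --join={N1 Private IP},{N2 Private IP},{N3 Private IP}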
+1, we're experiencing this issue as well. Similar setup, where our default advertise-addr is used for external clusters and we set a locality-advertise-addr for in-cluster use. We're attempting to connect with Istio + ingress/egress for a multicluster connection, so not having internal traffic loop back through the external path would be nice. We may also add/connect clusters at will, so it's nice to default to a globally available address for all localities except the local one.
+1, the deployment YAML is as follows (not captured in this copy of the thread).
Looked at the code briefly, and I'm wondering if it's the gossip protocol itself that is the issue here. Say I have nodes A, B, and C with flags along these lines:
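(The exact flags aren't preserved here; the hostnames and localities below are a hypothetical sketch of the kind of setup being described.)

  # Nodes A, B, C: public hostname as the primary advertise address, plus a
  # cluster-local Kubernetes service name for peers in the same locality.
  cockroach start \
    --locality=region=us-east \
    --advertise-addr=node-c.example.com:26257 \
    --locality-advertise-addr=region=us-east@cockroachdb-2.cockroachdb.svc.cluster.local:26257 \
    --join=node-a.example.com,node-b.example.com,node-c.example.com

  # Nodes X, Y, Z: a different locality, joining only via node C.
  cockroach start \
    --locality=region=us-west \
    --advertise-addr=node-x.example.com:26257 \
    --locality-advertise-addr=region=us-west@cockroachdb-0.cockroachdb.svc.cluster.local:26257 \
    --join=node-c.example.com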
Then after bootstrapping, nodes A, B, and C all share the local hostname. Then I turn up three new nodes X, Y, Z with a join flag pointing to node C. Wouldn't node C share those local hostnames with them? Do you know where in the code the addresses of other nodes are shared, specifically? Perusing the gossip package I didn't see it, unless it's treated as data and requested from the node's database directly via SQL?
Hi, is there a reason you're using this addressing scheme? Thanks,
We run a multi-cloud setup over the public internet, so we give internal nodes the k8s service name, and external nodes get a full hostname.
Sorry about the radio silence. I looked into this, and while I haven't been able to reproduce the problem yet, I wanted to share what I've tried, as I might be missing a vital ingredient. Locally, I am starting a three-node cluster, following in spirit what @steeling described here:
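(The original invocation isn't preserved in this copy; the following is a sketch of the kind of local three-node setup described, with hypothetical ports, stores, and bogus hostnames.)

  # Node 1; nodes 2 and 3 look the same with their own ports, stores, and
  # bogus hostnames. The primary advertised address is unreachable on
  # purpose; the loopback address is advertised only for the shared locality.
  cockroach start --insecure \
    --store=cockroach-data/1 \
    --listen-addr=127.0.0.1:26257 \
    --http-addr=127.0.0.1:8080 \
    --locality=region=test \
    --advertise-addr=node1.invalid:26257 \
    --locality-advertise-addr=region=test@127.0.0.1:26257 \
    --join=127.0.0.1:26257,127.0.0.1:26258,127.0.0.1:26259
  # Then bootstrap the cluster once all three are running:
  # cockroach init --insecure --host=127.0.0.1:26257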
Note how the nodes all advertise an "unreachable" address, but advertise a usable one via --locality-advertise-addr. The cluster will show up as green in the UI, and the latency page will work (note that this is on the 20.1-alpha, so the UI looks different, but it works just the same under the hood).

I restarted the cluster and brought it up with the same command-line invocation, and it recovered just fine. I'm not even seeing any connection attempts to the bogus address.

One thing that's maybe silly, but I do want to point it out: the network latency page has a bug where it sometimes won't show complete results when it is first used right after a cluster comes up. It's unclear to me why that is, but refreshing the page once fixes it for me. I think it has to do with the addresses taking a little bit of time to percolate between all of the nodes, and the latency page not working properly until that has happened. This doesn't, however, explain the log messages folks have been posting higher up in this thread.
I did, however, confirm that gossip bootstrap persistence is useless in this case, as all it does is write the bogus addresses down. This confirms @bdarnell's comment above and basically disables gossip bootstrap persistence in this example. However, we also see that, at least in my setup, the cluster still comes up despite that.

I also echo @bdarnell's comment that this feature was built around the expectation that the primary/public address is reachable from anywhere. The code at cockroach/pkg/gossip/client.go, lines 184 to 190 (at c097a16), indicates that when a node gossips to another node, it will claim that the request originated from its primary advertised address.

In summary, I would love for someone to tweak my example to highlight the problem others here are experiencing.
Zendesk ticket #4692 has been linked to this issue.
OK, I was able to reproduce the issue. The issue surfaces when we run the above configuration (which is typical when trying to join multiple clusters) in a more test-like environment with only a single cluster and a non-existent hostname. For example, if we run with --advertise-addr=my-fake-addr.com, the whole thing fails to bootstrap, because it attempts to resolve these addresses (only on initial bootstrap, it seems), even when they are not in the right locality.
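Concretely, a single-cluster sketch of that failure mode (everything except the fake hostname from the report is a hypothetical placeholder) might look like this:

  # The primary advertise address does not resolve; only the locality-specific
  # private address is reachable. Per the report above, bootstrap fails with
  # this shape of configuration.
  cockroach start --insecure \
    --locality=region=us-east \
    --advertise-addr=my-fake-addr.com:26257 \
    --locality-advertise-addr=region=us-east@10.0.0.5:26257 \
    --join=10.0.0.5:26257,10.0.0.6:26257,10.0.0.7:26257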
I just stumbled upon this. The node needs to be able to communicate with itself on whatever is specified with --advertise-addr. I added firewall rules in GCP to allow the node to connect to its own external IP address.
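A rule of roughly this shape (the rule name, network tag, and IP are placeholders, not a verbatim copy of what was used) is one way to express that on GCP:

  # Allow inbound CockroachDB traffic on 26257 from the node's own external IP,
  # so the node can reach itself on the address given by --advertise-addr.
  gcloud compute firewall-rules create allow-cockroach-self \
    --allow=tcp:26257 \
    --source-ranges={node external IP}/32 \
    --target-tags=cockroachdb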
What's the status of this issue? Is this still an error in the current release version? In order to cut down on some bandwidth costs, it would be great to use an internal address for nodes in the same locality group, and fall back to a public address for communication between nodes in different locality groups. Would appreciate any update here!
Hi Morgan, thanks for the request. We will possibly look into this for v20.2. In the meantime, I would recommend you apply either of the following solutions, separately or in combination:
Could you check if any of this is applicable in your environment? I appreciate that these methods are slightly more complex to set up. That is why we are not losing sight of this limitation and still plan to improve CockroachDB accordingly.
We have marked this issue as stale because it has been inactive for an extended period.
Still relevant.
This came up again today; I created a new docs issue since the one from 4 years ago has not been touched: https://cockroachlabs.atlassian.net/browse/DOC-9161
Running a 9-node cluster on GCP with 19.2.0.
./cockroach start \
  --cache=25% \
  --max-sql-memory=35% \
  --background \
  --locality=cloud=gcp,region=us-east1,datacenter=us-east1-c \
  --store=path=/mnt/d1,attrs=ssd,size=90% \
  --log-dir=log \
  --certs-dir=certs \
  --max-disk-temp-storage=100GB \
  --locality-advertise-addr=cloud=gcp@{Private IP},region=us-east1@{Private IP},datacenter=us-east1-c@{Private IP} \
  --join={N1 Private IP},{N2 Private IP},{Nx Private IP} \
  --advertise-addr={Public IP}
Start all nodes, and it looks like all nodes are healthy.
However, the network diagnostics page shows the following (screenshot omitted).
Confirmed that all nodes are in the same region.
If I shut down the cluster and restart it, the network diagnostics page becomes the following (screenshot omitted).
On the problematic node, the log is spammed with these entries:
W191125 17:58:47.883009 19657 vendor/google.golang.org/grpc/clientconn.go:1206 grpc: addrConn.createTransport failed to connect to {{Public IP N3}:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp {Public IP N3}:26257: i/o timeout". Reconnecting...
I191125 17:58:48.663426 20622 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 [n4] circuitbreaker: gossip [::]:26257->{Public IP N9}:26257 tripped: initial connection heartbeat failed: operation "rpc heartbeat" timed out after 6s: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I191125 17:58:48.663437 20622 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447 [n4] circuitbreaker: gossip [::]:26257->{Public IP N9}:26257 event: BreakerTripped
W191125 17:58:48.883192 19657 vendor/google.golang.org/grpc/clientconn.go:1206 grpc: addrConn.createTransport failed to connect to {{Public IP N3}:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
I191125 17:58:52.045207 187 server/status/runtime.go:498 [n4] runtime stats: 5.0 GiB RSS, 363 goroutines, 174 MiB/60 MiB/271 MiB GO alloc/idle/total, 4.1 GiB/4.8 GiB CGO alloc/total, 91.6 CGO/sec, 14.8/0.8 %(u/s)time, 0.0 %gc (1x), 606 KiB/456 KiB (r/w)net
W191125 17:58:52.057512 182 server/node.go:745 [n4] [n4,s4]: unable to compute metrics: [n4,s4]: system config not yet available
W191125 17:58:52.217886 161 storage/replica_range_lease.go:554 can't determine lease status due to node liveness error: node not in the liveness table
  github.com/cockroachdb/cockroach/pkg/storage.init.ializers
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/node_liveness.go:44
  runtime.main
    /usr/local/go/src/runtime/proc.go:188
  runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1337
W191125 17:58:57.217893 162 storage/replica_range_lease.go:554 can't determine lease status due to node liveness error: node not in the liveness table
  github.com/cockroachdb/cockroach/pkg/storage.init.ializers
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/node_liveness.go:44
  runtime.main
    /usr/local/go/src/runtime/proc.go:188
  runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1337
W191125 17:58:58.008692 20241 vendor/google.golang.org/grpc/clientconn.go:1206 grpc: addrConn.createTransport failed to connect to {{Public IP N2}:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp {Public IP N2}:26257: i/o timeout". Reconnecting...
I191125 17:58:58.361063 19445 storage/store_snapshot.go:978 [n4,raftsnapshot,s4,r262/3:/Table/60/2/"5{9aaca…-b073e…}] sending LEARNER snapshot fcabe123 at applied index 2404159
I191125 17:58:58.517305 155 storage/store_remove_replica.go:129 [n4,s4,r262/3:/Table/60/2/"5{9aaca…-b073e…}] removing replica r262/3
W191125 17:58:59.008852 20241 vendor/google.golang.org/grpc/clientconn.go:1206 grpc: addrConn.createTransport failed to connect to {{Public IP N2}:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W191125 17:58:59.008859 20391 vendor/google.golang.org/grpc/clientconn.go:1206 grpc: addrConn.createTransport failed to connect to {{Public IP N8}:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp {Public IP N8}:26257: i/o timeout". Reconnecting...
I191125 17:58:59.010597 21293 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 [n4] circuitbreaker: gossip [::]:26257->{Public IP N3}:26257 tripped: initial connection heartbeat failed: operation "rpc heartbeat" timed out after 6s: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I191125 17:58:59.010610 21293 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447 [n4] circuitbreaker: gossip [::]:26257->{Public IP N3}:26257 event: BreakerTripped
If I start all nodes with --advertise-addr={Private IP}, everything goes back to normal.
Jira issue: CRDB-5327