
locality-advertise-addr is not working #42741

Open
leomkkwan opened this issue Nov 25, 2019 · 16 comments
Labels
A-server-architecture: Relates to the internal APIs and src org for server code
A-server-networking: Pertains to network addressing, routing, initialization
C-investigation: Further steps needed to qualify. C-label will change.
O-community: Originated from the community
T-server-and-security: DB Server & Security
X-nostale: Marks an issue/pr that should be ignored by the stale bot

Comments

@leomkkwan

leomkkwan commented Nov 25, 2019

Running a 9-node cluster on GCP with 19.2.0.

./cockroach start --cache=25% --max-sql-memory=35% --background --locality=cloud=gcp,region=us-east1,datacenter=us-east1-c --store=path=/mnt/d1,attrs=ssd,size=90% --log-dir=log --certs-dir=certs --max-disk-temp-storage=100GB --locality-advertise-addr=cloud=gcp@{Private IP},region=us-east1@{Private IP},datacenter=us-east1-c@{Private IP} --join={N1 Private IP},{N2 Private IP},{Nx Private IP} --advertise-addr={Public IP}

After starting all the nodes, everything looks healthy:
[screenshot]

However, the network diagnostics page shows:
[screenshot]

Confirmed that all nodes are in the same region:
[screenshot]

If I shut down the cluster and restart it, the network diagnostics page becomes:
[screenshot]

On the problematic node, the log is spammed with entries like these:

W191125 17:58:47.883009 19657 vendor/google.golang.org/grpc/clientconn.go:1206 grpc: addrConn.createTransport failed to connect to {{Public IP N3}:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp {Public IP N3}:26257: i/o timeout". Reconnecting...
I191125 17:58:48.663426 20622 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 [n4] circuitbreaker: gossip [::]:26257->{Public IP N9}:26257 tripped: initial connection heartbeat failed: operation "rpc heartbeat" timed out after 6s: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I191125 17:58:48.663437 20622 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447 [n4] circuitbreaker: gossip [::]:26257->{Public IP N9}:26257 event: BreakerTripped
W191125 17:58:48.883192 19657 vendor/google.golang.org/grpc/clientconn.go:1206 grpc: addrConn.createTransport failed to connect to {{Public IP N3}:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
I191125 17:58:52.045207 187 server/status/runtime.go:498 [n4] runtime stats: 5.0 GiB RSS, 363 goroutines, 174 MiB/60 MiB/271 MiB GO alloc/idle/total, 4.1 GiB/4.8 GiB CGO alloc/total, 91.6 CGO/sec, 14.8/0.8 %(u/s)time, 0.0 %gc (1x), 606 KiB/456 KiB (r/w)net
W191125 17:58:52.057512 182 server/node.go:745 [n4] [n4,s4]: unable to compute metrics: [n4,s4]: system config not yet available
W191125 17:58:52.217886 161 storage/replica_range_lease.go:554 can't determine lease status due to node liveness error: node not in the liveness table
  github.com/cockroachdb/cockroach/pkg/storage.init.ializers
      /go/src/github.com/cockroachdb/cockroach/pkg/storage/node_liveness.go:44
  runtime.main
      /usr/local/go/src/runtime/proc.go:188
  runtime.goexit
      /usr/local/go/src/runtime/asm_amd64.s:1337
W191125 17:58:57.217893 162 storage/replica_range_lease.go:554 can't determine lease status due to node liveness error: node not in the liveness table
  github.com/cockroachdb/cockroach/pkg/storage.init.ializers
      /go/src/github.com/cockroachdb/cockroach/pkg/storage/node_liveness.go:44
  runtime.main
      /usr/local/go/src/runtime/proc.go:188
  runtime.goexit
      /usr/local/go/src/runtime/asm_amd64.s:1337
W191125 17:58:58.008692 20241 vendor/google.golang.org/grpc/clientconn.go:1206 grpc: addrConn.createTransport failed to connect to {{Public IP N2}:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp {Public IP N2}:26257: i/o timeout". Reconnecting...
I191125 17:58:58.361063 19445 storage/store_snapshot.go:978 [n4,raftsnapshot,s4,r262/3:/Table/60/2/"5{9aaca…-b073e…}] sending LEARNER snapshot fcabe123 at applied index 2404159
I191125 17:58:58.517305 155 storage/store_remove_replica.go:129 [n4,s4,r262/3:/Table/60/2/"5{9aaca…-b073e…}] removing replica r262/3
W191125 17:58:59.008852 20241 vendor/google.golang.org/grpc/clientconn.go:1206 grpc: addrConn.createTransport failed to connect to {{Public IP N2}:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W191125 17:58:59.008859 20391 vendor/google.golang.org/grpc/clientconn.go:1206 grpc: addrConn.createTransport failed to connect to {{Public IP N8}:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp {Public IP N8}:26257: i/o timeout". Reconnecting...
I191125 17:58:59.010597 21293 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 [n4] circuitbreaker: gossip [::]:26257->{Public IP N3}:26257 tripped: initial connection heartbeat failed: operation "rpc heartbeat" timed out after 6s: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I191125 17:58:59.010610 21293 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447 [n4] circuitbreaker: gossip [::]:26257->{Public IP N3}:26257 event: BreakerTripped

If I start all the nodes with --advertise-addr={Private IP}, everything goes back to normal.

Jira issue: CRDB-5327

@bdarnell
Contributor

I suspect that this may be related to the gossip bootstrap persistence feature:

added := g.maybeAddBootstrapAddressLocked(desc.Address, desc.NodeID)

This uses desc.Address, which is the primary address for the node, without considering localities. But the persisted info is merged with the --join flag, so it should be able to self-heal, and I'm not sure why it's not. Maybe there's something else that's missing the locality-aware lookup.

This feature was built with the assumption that the primary/public address for the node would be reachable from anywhere; the locality-specific address would just be an optimization. We haven't done much testing in cases where the primary address is sometimes unreachable and so it's possible that we're relying on the primary address in some bootstrapping cases (or maybe it's not just bootstrapping, which would be a more significant bug in this feature).

It looks like you're (intentionally) in a single region/AZ for now, but what is your plan when you go to multiple regions? Will there be a public IP that works across regions or will you be using multiple private IPs? Assuming the former is your goal (which is what we usually see), you'll need to adjust your firewall to get there, and making that adjustment now should get things working (although we'll need to confirm that after bootstrapping it is transitioning onto the more efficient private IPs).

--locality-advertise-addr=cloud=gcp@{Private IP},region=us-east1@{Private IP},datacenter=us-east1-c@{Private IP}

This is redundant: it's a list of rules with first-match-wins semantics, so you only want to specify the level that determines access to the private IP (typically region=us-east1). This single-match limitation also means that you may want to label it region=gcp-us-east1 to guard against region name collisions if you ever span multiple clouds.
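For illustration, a minimal sketch of the trimmed-down flag this suggests, reusing the placeholder IPs from the original report (this is an assumption about intent, not a verified invocation; the rest of the start command would stay as posted above):

--locality=cloud=gcp,region=us-east1,datacenter=us-east1-c --locality-advertise-addr=region=us-east1@{Private IP} --advertise-addr={Public IP}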

@steeling

+1 we're experiencing this issue as well. Similar setup where our default advertise-addr is used for external clusters, and we set a locality-advertise-addr for in-cluster use.

We're attempting to connect with Istio + ingress/egress for multi-cluster connections, so not having internal traffic loop back through them would be nice.

We may also add/connect clusters at will, so it's nice to default to a globally available address for all localities except the local one.

@rleiwang

+1
I am experiencing this issue too. I deployed a 3-node cluster on GKE through the helm stable/cockroachdb chart. This cluster is for internal use only, no ingress.

The deployment YAML is as follows:

      - args:
        - shell
        - -ecx
        - exec /cockroach/cockroach start --join=${STATEFULSET_NAME}-0.${STATEFULSET_FQDN}:26257,${STATEFULSET_NAME}-1.${STATEFULSET_FQDN}:26257,${STATEFULSET_NAME}-2.${STATEFULSET_FQDN}:26257
          --advertise-host=$(hostname).${STATEFULSET_FQDN} --logtostderr=INFO --insecure
          --http-port=8080 --port=26257 --cache=25% --max-disk-temp-storage=0 --max-offset=500ms
          --max-sql-memory=25%
        env:
        - name: STATEFULSET_NAME
          value: bw-cockroachdb
        - name: STATEFULSET_FQDN
          value: bw-cockroachdb.demo.svc.cluster.local
        - name: COCKROACH_CHANNEL
          value: kubernetes-helm
        image: cockroachdb/cockroach:v19.2.2

@steeling

Looked at the code briefly, and I'm wondering if it's the gossip protocol itself that is the issue here.

If I have nodes A, B, C with flags

--advertise-addr="$(hostname -f)" --locality=abc --locality-advertise-addr="xyz@${ORDINAL}.mydomain.com"

Then after bootstrapping, nodes A, B, and C all share the local hostname.

Then I turn up 3 new nodes X, Y, Z, with a join flag pointing to node C. Wouldn't node C share the addresses of A, B, and C that it has, which are supposed to be unique to A, B, and C?

Do you know where in the code the addresses of other nodes are shared, specifically? Perusing the gossip package I didn't see it, unless it's treated as data and requested from the node's database directly via SQL?

@mattcrdb

Hi Steeling,

Is there a reason you're using --locality-advertise-addr? Are you able to use --advertise-addr=<public address>?

Thanks,
Matt

@steeling

We run a multi-cloud setup over the public internet, so we give internal nodes the k8s service name, and external ones get a full hostname.

@tbg
Member

tbg commented Feb 20, 2020

Sorry about the radio silence. I looked into this, and while I haven't been able to reproduce the problem yet, I wanted to share what I've tried as I might be missing a vital ingredient. Locally, I am starting a three-node cluster, following in spirit what @steeling described here:

./cockroach start --insecure --logtostderr=INFO --background --advertise-addr=doesnotexist --locality region=abc --locality-advertise-addr region=abc@127.0.0.1:26257

./cockroach start --insecure --logtostderr=INFO --background --advertise-addr=doesnotexist --locality region=abc --locality-advertise-addr region=abc@127.0.0.1:26258 --join 127.0.0.1:26257 --store cockroach-data2 --http-addr :8081 --listen-addr :26258

./cockroach start --insecure --logtostderr=INFO --background --advertise-addr=doesnotexist --locality region=abc --locality-advertise-addr region=abc@127.0.0.1:26259 --join 127.0.0.1:26257 --store cockroach-data3 --http-addr :8082 --listen-addr :26259

Note how the nodes all advertise an "unreachable" address, but advertise a usable one via --locality-advertise-addr. Note also how the latter two nodes join only to n1, so for n3 to be able to connect to n2, n1 necessarily has to share n2's locality-advertise-addr (as opposed to the bogus unreachable one).

The cluster will show up as green in the UI, and the latency page will work (note that this is on the 20.1-alpha, so the UI looks different, but it works just the same under the hood):

[screenshot]

I restarted the cluster and brought it up with the same command line invocation, and it recovered just fine. I'm not even seeing any connection attempts to the bogus address.

One thing that's maybe silly - but I do want to point it out - is that the network latency page has some bug where it sometimes won't show complete results when it is first used right after a cluster comes up. It's unclear to me why that is, but just refreshing the page once does fix it for me. I think it has to do with the addresses taking a little bit of time to percolate between all of the nodes, and the latency page not working properly until that has happened. This doesn't however explain the log messages folks have been posting higher up in this thread.

@tbg
Member

tbg commented Feb 20, 2020

I did, however, confirm that gossip bootstrap persistence is useless in this case, as all it does is write the bogus addresses down. This confirms @bdarnell's comment here and basically disables gossip bootstrap persistence in this example. However, we also see that, at least in my setup, the --join flags are enough (as they ought to be; I've argued elsewhere that gossip bootstrap persistence should be removed).

I also echo @bdarnell's comment that this feature was built around the expectation that --advertise-addr is reachable by all nodes in the cluster (and that using the locality-advertised address is just an optimization). We see this throughout the Gossip code; for example,

args := Request{
    NodeID:          g.NodeID.Get(),
    Addr:            g.mu.is.NodeAddr,
    Delta:           delta,
    HighWaterStamps: g.mu.is.getHighWaterStamps(),
    ClusterID:       g.clusterID.Get(),
}

indicates that when a node gossips to another node, it will claim that the request originated from --advertise-addr, meaning that the recipient will be unaware of the "real" address at which the origin node can be reached, at least at the level of Gossip. It is not trivial to think of examples where this causes a concrete problem. The node descriptor (which contains the locality-aware addresses) is gossiped at an interval, so as long as gossip stabilizes (as it should using the join flags, if they're set up correctly), the locality addresses should become available to the code that uses them quickly (i.e., within seconds).

In summary, I would love for someone to tweak my example to highlight the problem others here are experiencing.

@RoachietheSupportRoach
Collaborator

Zendesk ticket #4692 has been linked to this issue.

@steeling

steeling commented Mar 3, 2020

OK, I was able to reproduce the issue. It surfaces when we run the above configuration (which is typical when trying to join multiple clusters) in a more test-like environment with only a single cluster and a non-existent hostname.

For example, if we run with --advertise-addr=my-fake-addr.com, the whole thing fails to bootstrap itself, because it attempts to resolve these addresses (only on initial bootstrap, it seems) even when they are not in the right locality.
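To make that concrete, here's a rough sketch of the failing setup, adapting @tbg's local three-node example above with the unresolvable hostname swapped in (the exact flag set is an assumption for illustration, not our production configuration):

./cockroach start --insecure --background --advertise-addr=my-fake-addr.com --locality region=abc --locality-advertise-addr region=abc@127.0.0.1:26257

./cockroach start --insecure --background --advertise-addr=my-fake-addr.com --locality region=abc --locality-advertise-addr region=abc@127.0.0.1:26258 --join 127.0.0.1:26257 --store cockroach-data2 --http-addr :8081 --listen-addr :26258

./cockroach start --insecure --background --advertise-addr=my-fake-addr.com --locality region=abc --locality-advertise-addr region=abc@127.0.0.1:26259 --join 127.0.0.1:26257 --store cockroach-data3 --http-addr :8082 --listen-addr :26259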

@stickenhoffen

I just stumbled upon this. The node needs to be able to communicate with itself on whatever is specified with --advertise-addr. I added firewall rules in GCP to allow the node to connect to its own external IP address.
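For anyone else hitting this, a minimal sketch of the kind of GCP rule I mean (the rule name, network, target tag, and source range are placeholders for your own values, not the exact rule I used):

gcloud compute firewall-rules create allow-cockroach-self --network=default --allow=tcp:26257 --source-ranges={node external IP}/32 --target-tags=cockroachdb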

@morgangallant

What's the status of this issue? Is this still an error in the current release version? In order to cut down on some bandwidth costs, it would be great to use an internal address for nodes in the same locality group, and fall back to a public address for communication between nodes in different locality groups.

Would appreciate any update here!

@knz added the C-investigation label (Further steps needed to qualify. C-label will change.) May 4, 2020
@knz
Contributor

knz commented May 7, 2020

Hi Morgan, thanks for the request.
We understand the use case; unfortunately, the behavior of --locality-advertise-addr is not going to help much here (given the restriction on --advertise-addr).

We will possibly look into this for v20.2. In the meantime, I would recommend you apply either of the following solutions, separately or in combination:

  • use VPC peering to make the private IP addresses of each node available from every other node. This way, your cloud routing layer will automatically select whether to bill traffic as local or cross-DC bandwidth.

  • use IP routing/firewalling rules in your OS to set up reverse-NAT: when the OS detects that a node is attempting to connect to the internal address of a node that's in a different DC, it redirects the request to that other node's public address. There would need to be an inverse rule (port forwarding) on the other side; see the sketch below.
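A minimal sketch of the second option with iptables, purely for illustration (the addresses are placeholders: 10.0.1.5 stands in for the remote node's private IP, 203.0.113.7 for its public IP, and 26257 for the RPC port):

# On the connecting node: rewrite outbound connections aimed at the remote
# node's private address so they are sent to its public address instead.
iptables -t nat -A OUTPUT -p tcp -d 10.0.1.5 --dport 26257 -j DNAT --to-destination 203.0.113.7:26257

# The inverse (port-forwarding) rule on the other side depends on how the
# public address reaches that node (cloud NAT vs. an address on the host),
# so it is not shown here.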

Could you check if any of this is applicable in your environment?

I appreciate that these methods are slightly more complex to set up. That is why we are not losing sight of this limitation and still plan to improve CockroachDB accordingly.

@github-actions

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@knz added the X-nostale label (Marks an issue/pr that should be ignored by the stale bot) and removed the no-issue-activity label Sep 19, 2023
@knz
Contributor

knz commented Sep 19, 2023

still relevant

@daniel-crlabs
Contributor

This came up again today; I created a new docs issue since the one from 4 years ago has not been touched: https://cockroachlabs.atlassian.net/browse/DOC-9161
