Support host names in YugaByte DB #128
Possible solutions:

Preferred option

Option 1 is not preferable because a failure to resolve a single peer fails the whole Raft config update. This means that even though we could have moved forward, we would not do so until the peer is removed from the Raft quorum.

Option 2 details

Option 2 is preferable because it works well with Kubernetes, but we would need a way to safely re-resolve the DNS, or remove the node from the list after a while even if that node ends up alive. Basically, if the node that was removed is added back a few seconds later, we should re-resolve it at a later point.

Assuming we leave the peer connection object as a shell with just the DNS name and no IP address: if the node that failed DNS resolution comes back, we may never refresh the connection to that host. The master also will not remove the shell host from the tablet Raft groups, because the shell host will already appear as a valid part of various quorums.

@spolitov's proposal: we could do async DNS resolution, so it would not block the current thread and the other peers could still be resolved. Currently the DNS resolution happens while holding the lock, which is not good practice. Even in a well-behaved setup, DNS resolution can add significant latency. This is ok when we are refreshing peers upon becoming a new master, but not ok at steady state if we want to re-resolve on the fly.
cc @robertpang who is looking at something similar.
This is still happening even after @spolitov fixed the DNS resolution and made it async. The problem now is that, when DNS resolution for a peer fails, the proxy to that peer never gets created. So in short: no DNS resolution -> no proxy to the peer.

I've also figured out a way to repro this locally: setting up local hostnames, then with a bit of tweaking to yb-ctl to use those hostnames instead of IPs, it can repro.
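For the local repro, the setup presumably maps throwaway hostnames to loopback addresses, so that a "node" can be made unresolvable simply by removing its line. The entries below are hypothetical placeholders, not the names from the original setup:

```
# Hypothetical /etc/hosts entries for a local multi-node cluster;
# deleting one line simulates the DNS entry disappearing.
127.0.0.1   yb-node-1
127.0.0.2   yb-node-2
127.0.0.3   yb-node-3
```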
Fixed
Commit 9ba8436
Scenario
Seeing this issue when shrinking the cluster, running with yb-docker-ctl and Kubernetes.
Steps to repro
Details
Docker will stop that container but also remove the DNS entry for that node name (in this case, `yb-tserver-n4`). Subsequently, the Raft groups that still have that node as a peer and list it by hostname end up failing to resolve the address for this peer. Here is a code snippet where the failure happens:
src/yb/consensus/consensus_peers.cc:457
Another node wins the election upon the node removal; the winning leader then loops through its current RaftConfig set of peers (including the dead one, since the failure could be temporary) and tries to set up a new Proxy to each, which goes down this name-resolution path and fails.
Thanks to @bmatican for investigating/writing up a lot of this.