Support host names in YugaByte DB #128
Possible solutions:

Preferred option

Option 1 is not preferable because a failure to resolve a single peer fails the whole Raft config update. This means that even though we could have moved forward, we would not do so until the peer is removed from the Raft quorum.

Option 2 details

Option 2 is preferable because it works well with Kubernetes, but we would need a way to safely re-resolve the DNS, or remove the node from the list after a while even if that node ends up alive. Basically, if the node that was removed is added back a few seconds later, we should re-resolve it at a later point.

Assuming we leave the peer connection object as a shell with just the DNS name and no IP address: if the node that failed DNS resolution comes back, we may never refresh the connection to that host. The master also will not remove the shell host from the tablet Raft groups, because the shell host will already appear as a valid part of various quorums.

@spolitov's proposal: we could do async DNS resolution, so it would not block the current thread and the other peers could still be resolved. Currently the DNS resolution happens while holding the lock, which is not good practice. Even in a well-behaved setup, DNS resolution can add significant latency. This is ok when we are refreshing peers upon becoming a new master, but not ok at steady state if we want to re-resolve on the fly.
cc @robertpang who is looking at something similar.
This is still happening even after @spolitov fixed the DNS resolution and made it async. The problem now is that, when DNS resolution for a peer fails, the proxy to that peer never gets created. So in short: no DNS resolution -> no proxy to the peer.

I've also figured out a way to repro this locally: setting up local hostnames, then with a bit of tweaking to yb-ctl to use those hostnames instead of IPs, it can repro.
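For the local repro, the setup presumably maps throwaway hostnames to loopback addresses, so that a "node" can be made unresolvable simply by removing its line. The entries below are hypothetical placeholders, not the names from the original setup:

```
# Hypothetical /etc/hosts entries for a local multi-node cluster;
# deleting one line simulates the DNS entry disappearing.
127.0.0.1   yb-node-1
127.0.0.2   yb-node-2
127.0.0.3   yb-node-3
```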
Fixed
Commit 9ba8436
Scenario
Seeing this issue when shrinking the cluster, running with yb-docker-ctl and Kubernetes.
Steps to repro
Details
Docker will stop that container but also remove the DNS entry for that node name (in this case, `yb-tserver-n4`). Subsequently, the Raft groups that still have that node as a peer and list it by hostname end up failing to resolve the address for this peer. Here is a code snippet where the failure happens:
src/yb/consensus/consensus_peers.cc:457
Another node wins the election upon the node removal; the winning leader then loops through its current RaftConfig set of peers (including the dead one, since the failure could be temporary) and tries to set up a new Proxy to each, which goes down this name-resolution path and fails.
Thanks to @bmatican for investigating/writing up a lot of this.