DNS resolution fails when cluster members are offline #833
Comments
Hi @ekampp, thank you for reporting this issue. It has also been discovered internally, and we are going to fix it. An update will be posted here later. For the time being, you might want to try the latest 4.2.1 version, which fixed another issue but may result in more reliable behaviour (especially in combination with the retry mechanism in the tx functions).
@injectives, thank you for the quick and positive feedback. To be crystal clear, when you said "For the time being, you might want to try the latest 4.2.1 version," did you mean the Java driver or the Neo4j version?
I meant the Java Driver :)
This update fixes the following issue: neo4j#833

The driver will make sure to resolve all initial router domain names to IPs and will try them all until one is successful.

Additionally, the following has been added:
- a new config item to make the domain name resolution configurable, which is currently used for testing
- the testkit backend has been updated to support the domain name resolution configuration and a new test has been added to testkit to cover the issue described above
- rediscovery unit tests have been updated

Adding connection timeout support
This update fixes the following issue: neo4j#833

The driver will resolve all initial router domain names to IPs and will try them all until one is successful.

Additional items:
- a new config option to make the domain name resolution configurable, which is currently used for testing
- the testkit backend has been updated to support the domain name resolution configuration (a new test has been added to testkit to cover the issue described above)
- the testkit backend has been updated to support connection timeout driver configuration
- several tests have been updated
This update fixes the following issue: neo4j#833

The desired behaviour for getting a routing table from the initial router (either on bootstrap or when all known routers have failed) is:
- resolve the domain name to all IPs
- attempt getting a routing table from all of them until the first one succeeds by:
  - getting a connection
  - trying to get a successful routing table response

Prior to this change, the connection pools were created for host and port pairs. When the domain name of the host resolves to multiple IP addresses, such pools provide connections to those IPs as a group. While this works for readers and writers, it negatively impacts the routing table fetching process as there is no guarantee which IP address the provided connection is set up for.

This update delivers the following changes:
- connection pools for routers are IP address based, which allows for deterministic connection retrieval
- the resolved IP address set is kept up-to-date (in case known router IPs change) to make sure that the unused connection pools are flushed
- the domain name resolution logic has been made configurable (it is private at the moment and is used to facilitate testing)
- the testkit backend has been updated to support the domain name resolution configuration (a new test has been added to testkit to cover the issue described above)
- the testkit backend has been updated to support connection timeout driver configuration
- several tests have been updated to adopt the new changes
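To make the pooling change easier to picture, here is a minimal sketch of pools keyed by resolved IP address and port, with stale pools flushed when the resolved set changes. `RouterPoolRegistry`, `Pool`, `poolFor` and `retainOnly` are hypothetical names invented for this illustration and are not the driver's internal classes; only `java.net.InetSocketAddress` and the collection calls are real JDK API.

```java
import java.net.InetSocketAddress;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not the Neo4j Java Driver's actual implementation.
final class RouterPoolRegistry {

    // Placeholder for a real connection pool bound to exactly one IP address and port.
    static final class Pool {
        final InetSocketAddress target;

        Pool(InetSocketAddress target) {
            this.target = target;
        }
    }

    // Keyed by resolved IP and port, so the caller knows which member a connection belongs to.
    private final Map<InetSocketAddress, Pool> pools = new LinkedHashMap<>();

    // Returns the pool for this exact address, creating it on first use.
    Pool poolFor(InetSocketAddress resolved) {
        return pools.computeIfAbsent(resolved, Pool::new);
    }

    // Drops pools whose address is absent from the freshly resolved set.
    // Returns the number of pools that were flushed.
    int retainOnly(Set<InetSocketAddress> currentlyResolved) {
        int before = pools.size();
        pools.keySet().retainAll(currentlyResolved);
        return before - pools.size();
    }

    int size() {
        return pools.size();
    }
}
```

With host-and-port keys, one pool could hand out a connection to any of the resolved IPs; keying by the resolved address is what makes the lookup deterministic in this sketch.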
* Fix for initial routers DNS resolution
* Updating test name
* Imported testkit directory

* Migrating tests to testkit (#832)

* Migrating tests to testkit

Short summary of this update:
- removed migrated tests
- verifyConnectivity support
- resolver support
- consume support

Test mapping (dest: stub/routing.py):
- shouldHandleAcquireReadSession -> test_should_read_successfully_from_reader_using_session_run
- shouldHandleAcquireReadTransaction -> test_should_read_successfully_from_reader_using_tx_function
- shouldHandleAcquireReadSessionAndTransaction -> test_should_read_successfully_from_reader_using_tx_run
- shouldRoundRobinReadServers -> test_should_round_robin_readers_when_reading_using_session_run
- shouldRoundRobinReadServersWhenUsingTransaction -> test_should_round_robin_readers_when_reading_using_tx_run
- shouldThrowSessionExpiredIfReadServerDisappears -> test_should_fail_when_reading_from_unexpectedly_interrupting_reader_using_session_run
- shouldThrowSessionExpiredIfReadServerDisappearsWhenUsingTransaction -> test_should_fail_when_reading_from_unexpectedly_interrupting_reader_using_tx_run
- shouldThrowSessionExpiredIfWriteServerDisappears -> test_should_fail_when_writing_on_unexpectedly_interrupting_writer_using_session_run
- shouldThrowSessionExpiredIfWriteServerDisappearsWhenUsingTransaction -> test_should_fail_when_writing_on_unexpectedly_interrupting_writer_using_tx_run
- shouldHandleAcquireWriteSession -> test_should_write_successfully_on_writer_using_session_run
- shouldHandleAcquireWriteTransaction -> test_should_write_successfully_on_writer_using_tx_function
- shouldHandleAcquireWriteSessionAndTransaction -> test_should_write_successfully_on_writer_using_tx_run
- shouldRoundRobinWriteSessions -> test_should_round_robin_writers_when_writing_using_session_run
- shouldRoundRobinWriteSessionsInTransaction -> test_should_round_robin_writers_when_writing_using_tx_run
- shouldFailOnNonDiscoverableServer -> test_should_fail_discovery_when_router_fails_with_procedure_not_found_code
- shouldFailRandomFailureInGetServers -> test_should_fail_discovery_when_router_fails_with_unknown_code
- shouldHandleLeaderSwitchWhenWriting -> test_should_fail_when_writing_on_writer_that_returns_not_a_leader_code
- shouldHandleLeaderSwitchWhenWritingWithoutConsuming -> test_should_fail_when_writing_without_explicit_consumption_on_writer_that_returns_not_a_leader_code
- shouldHandleLeaderSwitchWhenWritingInTransaction -> test_should_fail_when_writing_on_writer_that_returns_not_a_leader_code_using_tx_run
- shouldUseWriteSessionModeAndInitialBookmark -> test_should_use_write_session_mode_and_initial_bookmark_when_writing_using_tx_run
- shouldUseReadSessionModeAndInitialBookmark -> test_should_use_read_session_mode_and_initial_bookmark_when_reading_using_tx_run
- shouldPassBookmarkFromTransactionToTransaction -> test_should_pass_bookmark_from_tx_to_tx_using_tx_run
- shouldRetryReadTransactionUntilSuccess -> test_should_retry_read_tx_until_success
- shouldRetryWriteTransactionUntilSuccess -> test_should_retry_write_tx_until_success
- shouldRetryReadTransactionAndPerformRediscoveryUntilSuccess -> test_should_retry_read_tx_and_rediscovery_until_success
- shouldRetryWriteTransactionAndPerformRediscoveryUntilSuccess -> test_should_retry_write_tx_and_rediscovery_until_success
- shouldUseInitialRouterForRediscoveryWhenAllOtherRoutersAreDead -> test_should_use_initial_router_for_discovery_when_others_unavailable
- shouldInvokeProcedureGetRoutingTableWhenServerVersionPermits -> test_should_successfully_read_from_readable_router_using_tx_function
- shouldSendEmptyRoutingContextInHelloMessage -> test_should_send_empty_hello
- shouldServeReadsButFailWritesWhenNoWritersAvailable -> test_should_serve_reads_and_fail_writes_when_no_writers_available
- shouldAcceptRoutingTableWithoutWritersAndThenRediscover -> test_should_accept_routing_table_without_writers_and_then_rediscover
- shouldTreatRoutingTableWithSingleRouterAsValid -> test_should_accept_routing_table_with_single_router
- shouldSendMultipleBookmarks -> test_should_successfully_send_multiple_bookmarks
- shouldForgetAddressOnDatabaseUnavailableError -> test_should_forget_address_on_database_unavailable_error
- shouldUseResolverDuringRediscoveryWhenExistingRoutersFail -> test_should_use_resolver_during_rediscovery_when_existing_routers_fail
- shouldRevertToInitialRouterIfKnownRouterThrowsProtocolErrors -> test_should_revert_to_initial_router_if_known_router_throws_protocol_errors

* Removing redundant stub server scripts

* Migrating tests to testkit part 2 (#839)

- shouldSendRoutingContextToServer -> test_should_successfully_get_routing_table_with_context
- shouldSendRoutingContextInHelloMessage -> test_should_successfully_get_routing_table_with_context
- shouldHandleLeaderSwitchAndRetryWhenWritingInTxFunction -> test_should_write_successfully_on_leader_switch_using_tx_function
- shouldSendInitialBookmark -> test_should_use_write_session_mode_and_initial_bookmark_when_writing_using_tx_run
- shouldRetryWriteTransactionUntilSuccessWithWhenLeaderIsRemoved -> test_should_retry_write_until_success_with_leader_change_using_tx_function
- shouldRetryWriteTransactionUntilSuccessWithWhenLeaderIsRemovedV3 -> test_should_retry_write_until_success_with_leader_shutdown_during_tx_using_tx_function

* Migrating tests to testkit part 3 (#840)

Adding support for supportsMultiDB call. And exporting the following tests to testkit:
- shouldServerWithBoltV4SupportMultiDb -> test_should_successfully_check_if_support_for_multi_db_is_available
- shouldServerWithBoltV3NotSupportMultiDb -> test_should_successfully_check_if_support_for_multi_db_is_available

Removing redundant scripts

* Stub tests migration part 4 (#847)

Removed RoutingDriverMultidatabaseBoltKitIT

Migrated tests:
- shouldDiscoverForDatabase -> test_should_read_successfully_from_reader_using_session_run (this test seems to cover the same use-case and uses a non-default DB for 4+ versions)
- shouldRetryOnEmptyDiscoveryResult -> test_should_read_successfully_on_empty_discovery_result_using_session_run
- shouldThrowRoutingErrorIfDatabaseNotFound -> test_should_fail_with_routing_failure_on_db_not_found_discovery_failure
- shouldBeAbleToServeReachableDatabase -> test_should_read_successfully_from_reachable_db_after_trying_unreachable_db (message check has been removed)
- shouldPassSystemBookmarkWhenGettingRoutingTableForMultiDB -> test_should_pass_system_bookmark_when_getting_rt_for_multi_db (seems to be applicable to V4 only, also the stub server doesn't seem to check bookmarks)
- shouldIgnoreSystemBookmarkWhenGettingRoutingTable -> test_should_ignore_system_bookmark_when_getting_rt_for_multi_db
- shouldDriverVerifyConnectivity -> test_should_successfully_get_routing_table_with_context (pre-existing test that already tests the connectivity)

Also removed redundant scripts and added code support to DriverError

* Fix for initial routers DNS resolution (#849)

* Fix for initial routers DNS resolution

This update fixes the following issue: #833

The desired behaviour for getting a routing table from the initial router (either on bootstrap or when all known routers have failed) is:
- resolve the domain name to all IPs
- attempt getting a routing table from all of them until the first one succeeds by:
  - getting a connection
  - trying to get a successful routing table response

Prior to this change, the connection pools were created for host and port pairs. When the domain name of the host resolves to multiple IP addresses, such pools provide connections to those IPs as a group. While this works for readers and writers, it negatively impacts the routing table fetching process as there is no guarantee which IP address the provided connection is set up for.

This update delivers the following changes:
- connection pools for routers are IP address based, which allows for deterministic connection retrieval
- the resolved IP address set is kept up-to-date (in case known router IPs change) to make sure that the unused connection pools are flushed
- the domain name resolution logic has been made configurable (it is private at the moment and is used to facilitate testing)
- the testkit backend has been updated to support the domain name resolution configuration (a new test has been added to testkit to cover the issue described above)
- the testkit backend has been updated to support connection timeout driver configuration
- several tests have been updated to adopt the new changes

* Updating test name

* Fixed SSL handling (#851)

This update fixes a number of SSL-related tests in testkit and the CausalClusteringIT.shouldDropBrokenOldConnections test. The connection pooling strategy has been updated to use the same connection pool when the connection host is unambiguous. Removed hardcoded domain name resolution from the BoltServerAddress and moved the logic to ChannelConnectorImpl, which uses the DomainNameResolver.

* Updated .gitignore
@ekampp, we have just released a new Neo4j Java Driver version 4.2.4 that fixes this issue. Please let us know if you experience this problem again with the new version.
@injectives, I will let you know if anything pops up again. Thanks for your help!
Hi team.
First off, thanks for a super nice tool, and sorry if we're just not understanding it correctly. We have recently migrated to JRuby to leverage the Java driver for a variety of reasons.
We found that neo4j-java-driver does not behave correctly when the host name in the connection URI resolves to multiple addresses via DNS.
Steps to reproduce
The following connection URI was used:
The DNS query for neo4j.infra.prod.internal returns 3 addresses:
When all aforementioned members are available, everything is fine. The driver connects to one of them successfully and gets the neo4j routing table:
However, if one of those members is down and the gethostbyname() library call returns its address at the top of the resulting list, then the neo4j-java-driver will fail to bootstrap:
Expected behavior
When a node member goes away, the connection should roll over to the next online member.
Actual behavior
The driver is not trying to connect to other hosts returned in the list.
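For illustration, the rollover described under "Expected behavior" could look roughly like the sketch below. It is an assumption-laden example, not the driver's code: `InitialRouterFailover`, `resolveAll` and `firstSuccessful` are hypothetical names, the 10.0.0.x literals stand in for the cluster members, and the "routing table" is just a string. The only real API used is the JDK's `InetAddress.getAllByName`, which returns every address a name resolves to rather than only the first.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper; not part of the Neo4j Java Driver API.
final class InitialRouterFailover {

    // Resolves a host name to all of its addresses instead of only the first one.
    static List<InetAddress> resolveAll(String host) throws UnknownHostException {
        return Arrays.asList(InetAddress.getAllByName(host));
    }

    // Tries each candidate in order and returns the first non-empty result.
    // A candidate that throws or returns empty is skipped in favour of the next one.
    static <A, T> Optional<T> firstSuccessful(List<A> candidates,
                                              Function<A, Optional<T>> attempt) {
        for (A candidate : candidates) {
            try {
                Optional<T> result = attempt.apply(candidate);
                if (result.isPresent()) {
                    return result;
                }
            } catch (RuntimeException e) {
                // This member is offline or misbehaving; roll over to the next address.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws UnknownHostException {
        // IP literals are used so that no real DNS lookup is needed for the demo.
        List<InetAddress> members = new ArrayList<>();
        for (String ip : new String[] {"10.0.0.1", "10.0.0.2", "10.0.0.3"}) {
            members.addAll(resolveAll(ip));
        }
        InetAddress offline = members.get(0); // simulate the first member being down

        Optional<String> routingTable = firstSuccessful(members, address -> {
            if (address.equals(offline)) {
                throw new IllegalStateException("connection refused");
            }
            return Optional.of("routing table from " + address.getHostAddress());
        });
        System.out.println(routingTable.orElse("no initial router reachable"));
    }
}
```

In this sketch the bootstrap only fails when every resolved address has been tried, which is the opposite of giving up after the first entry in the list.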
Logs
logs.txt