Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DNS resolution fails when cluster members are offline #833

Closed
ekampp opened this issue Feb 25, 2021 · 5 comments
Closed

DNS resolution fails when cluster members are offline #833

ekampp opened this issue Feb 25, 2021 · 5 comments
Assignees

Comments

@ekampp
Copy link

ekampp commented Feb 25, 2021

Hi team.

First off, thanks for a super nice tool, and sorry if we're just not understanding it correctly. We have recently migrated to JRuby to leverage the java driver for a variety of reasons.

We realized that neo4j-java-driver is incorrectly operating with DNS resolution mode.

  • Neo4j version: Enterprise 4.2.2
  • Neo4j Mode: A cluster with 5 members/Casual cluster with 5 core 0 read-replica
  • Driver version: neo4j-java-driver-4.2.0.beta.0-java
  • Operating system: Ubuntu 20.04.2 (official neo4j AMI)

Steps to reproduce

The following connection URI was used:

NEO4J_URL=neo4j://<user>:<password>@neo4j.infra.prod.internal:7687

The DNS query for neo4j.infra.prod.internal returns 3 addresses:

neo4j.infra.prod.internal has address 10.26.219.30
neo4j.infra.prod.internal has address 10.26.221.188
neo4j.infra.prod.internal has address 10.26.204.91

When all aforementioned members are available, everything is fine. The driver connects to one of them successfully and gets the neo4j routing table:

INFO: Closing connection pool towards neo4j.infra.prod.internal(10.26.204.91):7687, it has no active connections and is not in the routing table registry.

However, if one of those members is down and the gethostbyname() library call returns its address at the top of the resulting list, then the neo4j-java-driver will fail to bootstrap:

Caused by: org.neo4j.driver.exceptions.ServiceUnavailableException: Unable to connect to neo4j.infra.prod.internal(10.26.221.188):7687, ensure the database is running and that there is a working network connection
to it. 

Expected behavior

When a node member goes away, the connection should roll over to the next online member.

Actual behavior

The driver is not trying to connect to other hosts returned in the list.

Logs

logs.txt

@injectives
Copy link
Contributor

Hi @ekampp,

Thank you for reporting this issue. It has actually been discovered internally too and we are going to fix it. An update will be posted here later on.

For the time being you might want to try the latest 4.2.1 version, which fixed another issue, but may result in a more reliable behaviour (especially in combination with the retry mechanism in the tx functions).

@ekampp
Copy link
Author

ekampp commented Feb 26, 2021

@injectives, thank you for the quick and positive feedback.

To be crystal clear, when you said "For the time being, you might want to try the latest 4.2.1 version," did you mean java driver or neo4j version?

@injectives
Copy link
Contributor

I meant the Java Driver :)

@injectives injectives self-assigned this Mar 10, 2021
injectives added a commit to injectives/neo4j-java-driver that referenced this issue Mar 10, 2021
This update fixes the following issue: neo4j#833

The driver will make sure to resolve all initial router domain names to IPs and will try them all until one is successful.

Additionally, the following has been added:
- a new config item to make the domain name resolution configurable, which is currently used for testing
- the testkit backend has been updated to support the domain name resolution configuration and a new test has been added to testkit to cover the issue described above
- rediscovery unit tests have been updated

Adding connection timeout support
injectives added a commit to injectives/neo4j-java-driver that referenced this issue Mar 10, 2021
This update fixes the following issue: neo4j#833

The driver will make sure to resolve all initial router domain names to IPs and will try them all until one is successful.

Additionally, the following has been added:
- a new config item to make the domain name resolution configurable, which is currently used for testing
- the testkit backend has been updated to support the domain name resolution configuration and a new test has been added to testkit to cover the issue described above
- rediscovery unit tests have been updated

Adding connection timeout support
injectives added a commit to injectives/neo4j-java-driver that referenced this issue Mar 10, 2021
This update fixes the following issue: neo4j#833

The driver will resolve all initial router domain names to IPs and will try them all until one is successful.

Additional items:
- a new config option to make the domain name resolution configurable, which is currently used for testing
- the testkit backend has been updated to support the domain name resolution configuration (a new test has been added to testkit to cover the issue described above)
- the testkit backend has been updated to support connection timeout driver configuration
- several tests have been updated
injectives added a commit to injectives/neo4j-java-driver that referenced this issue Mar 10, 2021
This update fixes the following issue: neo4j#833

The driver will resolve all initial router domain names to IPs and will try them all until one is successful.

Additional items:
- a new config option to make the domain name resolution configurable, which is currently used for testing
- the testkit backend has been updated to support the domain name resolution configuration (a new test has been added to testkit to cover the issue described above)
- the testkit backend has been updated to support connection timeout driver configuration
- several tests have been updated
injectives added a commit to injectives/neo4j-java-driver that referenced this issue Mar 15, 2021
This update fixes the following issue: neo4j#833

The desired behaviour for getting a routing table from the initial router (either on bootstrap or when all known routers have failed) is:
- resolve the domain name to all IPs
- attempt getting a routing table from all of them until first one succeeds by:
  - getting a connection
  - trying to get a successful routing table response

Prior to this change, the connection pools were created for host and port pairs. When domain name of the host resolves to multiple IP addresses, such pools provide connections to those IPs as a group. While this works for readers and writers, it negatively impacts the routing table fetching process as there is no guarantee which IP address the provided connection is setup for.

This update delivers the following changes:
- connection pools for routers are IP address based, which allows for deterministic connection retrieval
- the resolved IP address set is kept up-to-date (in case known router IPs change) to make sure that the unused connection pools are flushed
- the domain name resolution logic has been made configurable (it is private at the moment and is used to facilitate testing)
- the testkit backend has been updated to support the domain name resolution configuration (a new test has been added to testkit to cover the issue described above)
- the testkit backend has been updated to support connection timeout driver configuration
- several tests have been updated to adopt the new changes
injectives added a commit to injectives/neo4j-java-driver that referenced this issue Mar 15, 2021
This update fixes the following issue: neo4j#833

The desired behaviour for getting a routing table from the initial router (either on bootstrap or when all known routers have failed) is:
- resolve the domain name to all IPs
- attempt getting a routing table from all of them until first one succeeds by:
  - getting a connection
  - trying to get a successful routing table response

Prior to this change, the connection pools were created for host and port pairs. When domain name of the host resolves to multiple IP addresses, such pools provide connections to those IPs as a group. While this works for readers and writers, it negatively impacts the routing table fetching process as there is no guarantee which IP address the provided connection is setup for.

This update delivers the following changes:
- connection pools for routers are IP address based, which allows for deterministic connection retrieval
- the resolved IP address set is kept up-to-date (in case known router IPs change) to make sure that the unused connection pools are flushed
- the domain name resolution logic has been made configurable (it is private at the moment and is used to facilitate testing)
- the testkit backend has been updated to support the domain name resolution configuration (a new test has been added to testkit to cover the issue described above)
- the testkit backend has been updated to support connection timeout driver configuration
- several tests have been updated to adopt the new changes
injectives added a commit to injectives/neo4j-java-driver that referenced this issue Mar 15, 2021
This update fixes the following issue: neo4j#833

The desired behaviour for getting a routing table from the initial router (either on bootstrap or when all known routers have failed) is:
- resolve the domain name to all IPs
- attempt getting a routing table from all of them until first one succeeds by:
  - getting a connection
  - trying to get a successful routing table response

Prior to this change, the connection pools were created for host and port pairs. When domain name of the host resolves to multiple IP addresses, such pools provide connections to those IPs as a group. While this works for readers and writers, it negatively impacts the routing table fetching process as there is no guarantee which IP address the provided connection is setup for.

This update delivers the following changes:
- connection pools for routers are IP address based, which allows for deterministic connection retrieval
- the resolved IP address set is kept up-to-date (in case known router IPs change) to make sure that the unused connection pools are flushed
- the domain name resolution logic has been made configurable (it is private at the moment and is used to facilitate testing)
- the testkit backend has been updated to support the domain name resolution configuration (a new test has been added to testkit to cover the issue described above)
- the testkit backend has been updated to support connection timeout driver configuration
- several tests have been updated to adopt the new changes
injectives added a commit to injectives/neo4j-java-driver that referenced this issue Mar 15, 2021
This update fixes the following issue: neo4j#833

The desired behaviour for getting a routing table from the initial router (either on bootstrap or when all known routers have failed) is:
- resolve the domain name to all IPs
- attempt getting a routing table from all of them until first one succeeds by:
  - getting a connection
  - trying to get a successful routing table response

Prior to this change, the connection pools were created for host and port pairs. When domain name of the host resolves to multiple IP addresses, such pools provide connections to those IPs as a group. While this works for readers and writers, it negatively impacts the routing table fetching process as there is no guarantee which IP address the provided connection is setup for.

This update delivers the following changes:
- connection pools for routers are IP address based, which allows for deterministic connection retrieval
- the resolved IP address set is kept up-to-date (in case known router IPs change) to make sure that the unused connection pools are flushed
- the domain name resolution logic has been made configurable (it is private at the moment and is used to facilitate testing)
- the testkit backend has been updated to support the domain name resolution configuration (a new test has been added to testkit to cover the issue described above)
- the testkit backend has been updated to support connection timeout driver configuration
- several tests have been updated to adopt the new changes
injectives added a commit to injectives/neo4j-java-driver that referenced this issue Mar 15, 2021
This update fixes the following issue: neo4j#833

The desired behaviour for getting a routing table from the initial router (either on bootstrap or when all known routers have failed) is:
- resolve the domain name to all IPs
- attempt getting a routing table from all of them until first one succeeds by:
  - getting a connection
  - trying to get a successful routing table response

Prior to this change, the connection pools were created for host and port pairs. When domain name of the host resolves to multiple IP addresses, such pools provide connections to those IPs as a group. While this works for readers and writers, it negatively impacts the routing table fetching process as there is no guarantee which IP address the provided connection is setup for.

This update delivers the following changes:
- connection pools for routers are IP address based, which allows for deterministic connection retrieval
- the resolved IP address set is kept up-to-date (in case known router IPs change) to make sure that the unused connection pools are flushed
- the domain name resolution logic has been made configurable (it is private at the moment and is used to facilitate testing)
- the testkit backend has been updated to support the domain name resolution configuration (a new test has been added to testkit to cover the issue described above)
- the testkit backend has been updated to support connection timeout driver configuration
- several tests have been updated to adopt the new changes
injectives added a commit to injectives/neo4j-java-driver that referenced this issue Mar 15, 2021
This update fixes the following issue: neo4j#833

The desired behaviour for getting a routing table from the initial router (either on bootstrap or when all known routers have failed) is:
- resolve the domain name to all IPs
- attempt getting a routing table from all of them until first one succeeds by:
  - getting a connection
  - trying to get a successful routing table response

Prior to this change, the connection pools were created for host and port pairs. When domain name of the host resolves to multiple IP addresses, such pools provide connections to those IPs as a group. While this works for readers and writers, it negatively impacts the routing table fetching process as there is no guarantee which IP address the provided connection is setup for.

This update delivers the following changes:
- connection pools for routers are IP address based, which allows for deterministic connection retrieval
- the resolved IP address set is kept up-to-date (in case known router IPs change) to make sure that the unused connection pools are flushed
- the domain name resolution logic has been made configurable (it is private at the moment and is used to facilitate testing)
- the testkit backend has been updated to support the domain name resolution configuration (a new test has been added to testkit to cover the issue described above)
- the testkit backend has been updated to support connection timeout driver configuration
- several tests have been updated to adopt the new changes
injectives added a commit that referenced this issue Mar 17, 2021
* Fix for initial routers DNS resolution

This update fixes the following issue: #833

The desired behaviour for getting a routing table from the initial router (either on bootstrap or when all known routers have failed) is:
- resolve the domain name to all IPs
- attempt getting a routing table from all of them until first one succeeds by:
  - getting a connection
  - trying to get a successful routing table response

Prior to this change, the connection pools were created for host and port pairs. When domain name of the host resolves to multiple IP addresses, such pools provide connections to those IPs as a group. While this works for readers and writers, it negatively impacts the routing table fetching process as there is no guarantee which IP address the provided connection is setup for.

This update delivers the following changes:
- connection pools for routers are IP address based, which allows for deterministic connection retrieval
- the resolved IP address set is kept up-to-date (in case known router IPs change) to make sure that the unused connection pools are flushed
- the domain name resolution logic has been made configurable (it is private at the moment and is used to facilitate testing)
- the testkit backend has been updated to support the domain name resolution configuration (a new test has been added to testkit to cover the issue described above)
- the testkit backend has been updated to support connection timeout driver configuration
- several tests have been updated to adopt the new changes

* Updating test name
injectives added a commit to injectives/neo4j-java-driver that referenced this issue Mar 17, 2021
* Fix for initial routers DNS resolution

This update fixes the following issue: neo4j#833

The desired behaviour for getting a routing table from the initial router (either on bootstrap or when all known routers have failed) is:
- resolve the domain name to all IPs
- attempt getting a routing table from all of them until first one succeeds by:
  - getting a connection
  - trying to get a successful routing table response

Prior to this change, the connection pools were created for host and port pairs. When domain name of the host resolves to multiple IP addresses, such pools provide connections to those IPs as a group. While this works for readers and writers, it negatively impacts the routing table fetching process as there is no guarantee which IP address the provided connection is setup for.

This update delivers the following changes:
- connection pools for routers are IP address based, which allows for deterministic connection retrieval
- the resolved IP address set is kept up-to-date (in case known router IPs change) to make sure that the unused connection pools are flushed
- the domain name resolution logic has been made configurable (it is private at the moment and is used to facilitate testing)
- the testkit backend has been updated to support the domain name resolution configuration (a new test has been added to testkit to cover the issue described above)
- the testkit backend has been updated to support connection timeout driver configuration
- several tests have been updated to adopt the new changes

* Updating test name
injectives added a commit to injectives/neo4j-java-driver that referenced this issue Mar 22, 2021
* Fix for initial routers DNS resolution

This update fixes the following issue: neo4j#833

The desired behaviour for getting a routing table from the initial router (either on bootstrap or when all known routers have failed) is:
- resolve the domain name to all IPs
- attempt getting a routing table from all of them until first one succeeds by:
  - getting a connection
  - trying to get a successful routing table response

Prior to this change, the connection pools were created for host and port pairs. When domain name of the host resolves to multiple IP addresses, such pools provide connections to those IPs as a group. While this works for readers and writers, it negatively impacts the routing table fetching process as there is no guarantee which IP address the provided connection is setup for.

This update delivers the following changes:
- connection pools for routers are IP address based, which allows for deterministic connection retrieval
- the resolved IP address set is kept up-to-date (in case known router IPs change) to make sure that the unused connection pools are flushed
- the domain name resolution logic has been made configurable (it is private at the moment and is used to facilitate testing)
- the testkit backend has been updated to support the domain name resolution configuration (a new test has been added to testkit to cover the issue described above)
- the testkit backend has been updated to support connection timeout driver configuration
- several tests have been updated to adopt the new changes

* Updating test name
injectives added a commit that referenced this issue Mar 23, 2021
* Imported testkit directory

* Migrating tests to testkit (#832)

* Migrating tests to testkit

Short summary of this update:
- removed migrated tests
- verifyConnectivity support
- resolver support
- consume support

Test mapping (dest: stub/routing.py):
- shouldHandleAcquireReadSession -> test_should_read_successfully_from_reader_using_session_run
- shouldHandleAcquireReadTransaction -> test_should_read_successfully_from_reader_using_tx_function
- shouldHandleAcquireReadSessionAndTransaction -> test_should_read_successfully_from_reader_using_tx_run
- shouldRoundRobinReadServers -> test_should_round_robin_readers_when_reading_using_session_run
- shouldRoundRobinReadServersWhenUsingTransaction -> test_should_round_robin_readers_when_reading_using_tx_run
- shouldThrowSessionExpiredIfReadServerDisappears -> test_should_fail_when_reading_from_unexpectedly_interrupting_reader_using_session_run
- shouldThrowSessionExpiredIfReadServerDisappearsWhenUsingTransaction -> test_should_fail_when_reading_from_unexpectedly_interrupting_reader_using_tx_run
- shouldThrowSessionExpiredIfWriteServerDisappears -> test_should_fail_when_writing_on_unexpectedly_interrupting_writer_using_session_run
- shouldThrowSessionExpiredIfWriteServerDisappearsWhenUsingTransaction -> test_should_fail_when_writing_on_unexpectedly_interrupting_writer_using_tx_run
- shouldHandleAcquireWriteSession -> test_should_write_successfully_on_writer_using_session_run
- shouldHandleAcquireWriteTransaction -> test_should_write_successfully_on_writer_using_tx_function
- shouldHandleAcquireWriteSessionAndTransaction -> test_should_write_successfully_on_writer_using_tx_run
- shouldRoundRobinWriteSessions -> test_should_round_robin_writers_when_writing_using_session_run
- shouldRoundRobinWriteSessionsInTransaction -> test_should_round_robin_writers_when_writing_using_tx_run
- shouldFailOnNonDiscoverableServer -> test_should_fail_discovery_when_router_fails_with_procedure_not_found_code
- shouldFailRandomFailureInGetServers -> test_should_fail_discovery_when_router_fails_with_unknown_code
- shouldHandleLeaderSwitchWhenWriting -> test_should_fail_when_writing_on_writer_that_returns_not_a_leader_code
- shouldHandleLeaderSwitchWhenWritingWithoutConsuming -> test_should_fail_when_writing_without_explicit_consumption_on_writer_that_returns_not_a_leader_code
- shouldHandleLeaderSwitchWhenWritingInTransaction -> test_should_fail_when_writing_on_writer_that_returns_not_a_leader_code_using_tx_run
- shouldUseWriteSessionModeAndInitialBookmark -> test_should_use_write_session_mode_and_initial_bookmark_when_writing_using_tx_run
- shouldUseReadSessionModeAndInitialBookmark -> test_should_use_read_session_mode_and_initial_bookmark_when_reading_using_tx_run
- shouldPassBookmarkFromTransactionToTransaction -> test_should_pass_bookmark_from_tx_to_tx_using_tx_run
- shouldRetryReadTransactionUntilSuccess -> test_should_retry_read_tx_until_success
- shouldRetryWriteTransactionUntilSuccess -> test_should_retry_write_tx_until_success
- shouldRetryReadTransactionAndPerformRediscoveryUntilSuccess -> test_should_retry_read_tx_and_rediscovery_until_success
- shouldRetryWriteTransactionAndPerformRediscoveryUntilSuccess -> test_should_retry_write_tx_and_rediscovery_until_success
- shouldUseInitialRouterForRediscoveryWhenAllOtherRoutersAreDead -> test_should_use_initial_router_for_discovery_when_others_unavailable
- shouldInvokeProcedureGetRoutingTableWhenServerVersionPermits -> test_should_successfully_read_from_readable_router_using_tx_function
- shouldSendEmptyRoutingContextInHelloMessage -> test_should_send_empty_hello
- shouldServeReadsButFailWritesWhenNoWritersAvailable -> test_should_serve_reads_and_fail_writes_when_no_writers_available
- shouldAcceptRoutingTableWithoutWritersAndThenRediscover -> test_should_accept_routing_table_without_writers_and_then_rediscover
- shouldTreatRoutingTableWithSingleRouterAsValid -> test_should_accept_routing_table_with_single_router
- shouldSendMultipleBookmarks -> test_should_successfully_send_multiple_bookmarks
- shouldForgetAddressOnDatabaseUnavailableError -> test_should_forget_address_on_database_unavailable_error
- shouldUseResolverDuringRediscoveryWhenExistingRoutersFail -> test_should_use_resolver_during_rediscovery_when_existing_routers_fail
- shouldRevertToInitialRouterIfKnownRouterThrowsProtocolErrors -> test_should_revert_to_initial_router_if_known_router_throws_protocol_errors

* Removing redundant stub server scripts

* Migrating tests to testkit part 2 (#839)

- shouldSendRoutingContextToServer -> test_should_successfully_get_routing_table_with_context
- shouldSendRoutingContextInHelloMessage -> test_should_successfully_get_routing_table_with_context
- shouldHandleLeaderSwitchAndRetryWhenWritingInTxFunction -> test_should_write_successfully_on_leader_switch_using_tx_function
- shouldSendInitialBookmark -> test_should_use_write_session_mode_and_initial_bookmark_when_writing_using_tx_run
- shouldRetryWriteTransactionUntilSuccessWithWhenLeaderIsRemoved -> test_should_retry_write_until_success_with_leader_change_using_tx_function
- shouldRetryWriteTransactionUntilSuccessWithWhenLeaderIsRemovedV3 -> test_should_retry_write_until_success_with_leader_shutdown_during_tx_using_tx_function

* Migrating tests to testkit part 3 (#840)

Adding support for supportsMultiDB call.

And exporting the following tests to testkit:
- shouldServerWithBoltV4SupportMultiDb -> test_should_successfully_check_if_support_for_multi_db_is_available
- shouldServerWithBoltV3NotSupportMultiDb -> test_should_successfully_check_if_support_for_multi_db_is_available

Removing redundant scripts

* Stub tests migration part 4 (#847)

Removed RoutingDriverMultidatabaseBoltKitIT

Migrated tests:
- shouldDiscoverForDatabase -> test_should_read_successfully_from_reader_using_session_run (this tests seems to cover the same use-case and uses a non-default DB for 4+ versions)
- shouldRetryOnEmptyDiscoveryResult -> test_should_read_successfully_on_empty_discovery_result_using_session_run
- shouldThrowRoutingErrorIfDatabaseNotFound -> test_should_fail_with_routing_failure_on_db_not_found_discovery_failure
- shouldBeAbleToServeReachableDatabase -> test_should_read_successfully_from_reachable_db_after_trying_unreachable_db (message check has been removed)
- shouldPassSystemBookmarkWhenGettingRoutingTableForMultiDB -> test_should_pass_system_bookmark_when_getting_rt_for_multi_db (seems to be applicable to V4 only, also the stub server doesn't seem to check bookmarks)
- shouldIgnoreSystemBookmarkWhenGettingRoutingTable -> test_should_ignore_system_bookmark_when_getting_rt_for_multi_db
- shouldDriverVerifyConnectivity -> test_should_successfully_get_routing_table_with_context (pre-existing test that already tests the connectivity)

Also removed redundant scripts and added code support to DriverError

* Fix for initial routers DNS resolution (#849)

* Fix for initial routers DNS resolution

This update fixes the following issue: #833

The desired behaviour for getting a routing table from the initial router (either on bootstrap or when all known routers have failed) is:
- resolve the domain name to all IPs
- attempt getting a routing table from all of them until first one succeeds by:
  - getting a connection
  - trying to get a successful routing table response

Prior to this change, the connection pools were created for host and port pairs. When domain name of the host resolves to multiple IP addresses, such pools provide connections to those IPs as a group. While this works for readers and writers, it negatively impacts the routing table fetching process as there is no guarantee which IP address the provided connection is setup for.

This update delivers the following changes:
- connection pools for routers are IP address based, which allows for deterministic connection retrieval
- the resolved IP address set is kept up-to-date (in case known router IPs change) to make sure that the unused connection pools are flushed
- the domain name resolution logic has been made configurable (it is private at the moment and is used to facilitate testing)
- the testkit backend has been updated to support the domain name resolution configuration (a new test has been added to testkit to cover the issue described above)
- the testkit backend has been updated to support connection timeout driver configuration
- several tests have been updated to adopt the new changes

* Updating test name

* Fixed SSL handling (#851)

This update fixes a number of SSL-related tests in testkit and CausalClusteringIT.shouldDropBrokenOldConnections test.
The connection pooling strategy has been updated to use the same connection pool when the connection host is unambiguous.
Removed hardcoded domain name resolution from the BoltServerAddress and moved the logic to ChannelConnectorImpl that uses the DomainNameResolver.

* Updated .gitignore
@injectives
Copy link
Contributor

@ekampp, we have just released a new Neo4j Java Driver version 4.2.4 that fixes this issue.

Please let us know if you experience this problem again with the new version.

@ekampp
Copy link
Author

ekampp commented Mar 26, 2021

@injectives, I will let you know if anything pops up again. Thanks for your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants