Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use file-based discovery not MockUncasedHostsProvider #33554

Merged

Conversation

DaveCTurner
Copy link
Contributor

Today we use a special unicast hosts provider, the MockUncasedHostsProvider,
in many integration tests, to deal with the dynamic nature of the allocation of
ports to nodes. However #33241 allows us to use file-based discovery to achieve
the same goal, so the special test-only MockUncasedHostsProvider is no longer
required.

This change removes MockUncasedHostProvider and replaces it with file-based
discovery in tests based on EsIntegTestCase.

Today we use a special unicast hosts provider, the `MockUncasedHostsProvider`,
in many integration tests, to deal with the dynamic nature of the allocation of
ports to nodes. However elastic#33241 allows us to use file-based discovery to achieve
the same goal, so the special test-only `MockUncasedHostsProvider` is no longer
required.

This change removes `MockUncasedHostProvider` and replaces it with file-based
discovery in tests based on `EsIntegTestCase`.
@DaveCTurner DaveCTurner added >non-issue v7.0.0 :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v6.5.0 labels Sep 10, 2018
@DaveCTurner DaveCTurner requested a review from ywelsch September 10, 2018 06:37
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

import static org.elasticsearch.transport.TcpTransport.PORT;

@ESIntegTestCase.ClusterScope(scope = ESIntegTestCase.Scope.TEST, numDataNodes = 0, numClientNodes = 0)
public class SettingsBasedHostProviderIT extends ESIntegTestCase {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we were effectively testing the settings-based host provider via all the other integ tests, but not after this change, I thought it prudent to add this.

logger.info("configuring discovery with {} at {}", discoveryFileContents, configPaths);
for (final Path configPath : configPaths) {
Files.createDirectories(configPath);
Files.write(configPath.resolve(UNICAST_HOSTS_FILE), discoveryFileContents); // TODO do we need to do this atomically?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NB open question: do we need to do this atomically?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't think so? We never call this concurrently for the same destination file?

@DaveCTurner DaveCTurner removed the request for review from ywelsch September 10, 2018 14:09
@DaveCTurner
Copy link
Contributor Author

Marking this as WIP as it needs more work on BWC tests than I realised.

@DaveCTurner
Copy link
Contributor Author

@elasticmachine retest this please
@dnhatn you mentioned you were looking at sporadic failures in GatewayIndexStateIT - this looks like another instance of this.

@@ -705,6 +705,8 @@ public Node start() throws NodeValidationException {
assert localNodeFactory.getNode() != null;
assert transportService.getLocalNode().equals(localNodeFactory.getNode())
: "transportService has a different local node than the factory provided";
onTransportServiceStarted();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can avoid adding these changes here in Node (and MockNode), but instead after initializing the node (but before starting it), call node.injector().getInstance(TransportService.class) to get the TransportService and then register a lifecycle listener on that (addLifecycleListener) which implements afterStart.

@DaveCTurner DaveCTurner force-pushed the 2018-09-09-unmock-host-provider branch from 4b82985 to 924768f Compare September 11, 2018 19:39
@jasontedor
Copy link
Member

@dnhatn I had such a failure earlier too, in my exposing CCR to the transport client PR. You can check the build history there.

@dnhatn
Copy link
Member

dnhatn commented Sep 11, 2018

@DaveCTurner and @jasontedor Thanks for the ping. I opened #33613 and muted the test.

@DaveCTurner
Copy link
Contributor Author

BWC test failures were unrelated to this change and cleared up after merging a more recent master. @ywelsch thanks for the addLifecycleListener suggestion, I've made that change, and this is ready for a review.

@DaveCTurner DaveCTurner requested a review from ywelsch September 12, 2018 07:00
Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few nits. Looks good otherwise.

@@ -705,6 +705,7 @@ public Node start() throws NodeValidationException {
assert localNodeFactory.getNode() != null;
assert transportService.getLocalNode().equals(localNodeFactory.getNode())
: "transportService has a different local node than the factory provided";

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@ESIntegTestCase.ClusterScope(scope = ESIntegTestCase.Scope.TEST, numDataNodes = 0, numClientNodes = 0)
public class SettingsBasedHostProviderIT extends ESIntegTestCase {

private Consumer<Builder> configureDiscovery;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this feels a little hacky to me. Can't you explicitly pass the extra settings to the nodes when you're starting them up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This came about due to a need to remove a setting from the builder, but that's no longer necessary.

logger.info("configuring discovery with {} at {}", discoveryFileContents, configPaths);
for (final Path configPath : configPaths) {
Files.createDirectories(configPath);
Files.write(configPath.resolve(UNICAST_HOSTS_FILE), discoveryFileContents); // TODO do we need to do this atomically?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't think so? We never call this concurrently for the same destination file?

@@ -291,7 +299,7 @@ public void testNodeJoinWithoutSecurityExplicitlyEnabled() throws Exception {
.put("path.home", home)
.put(TestZenDiscovery.USE_MOCK_PINGS.getKey(), false)
.put(DiscoveryModule.DISCOVERY_TYPE_SETTING.getKey(), "test-zen")
.put(DiscoveryModule.DISCOVERY_HOSTS_PROVIDER_SETTING.getKey(), "test-zen")
.putList(DiscoveryModule.DISCOVERY_HOSTS_PROVIDER_SETTING.getKey(), "file")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could also just use settings based discovery here (no need to write a file then)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we like file-based discovery. Fixed :(

}

void countDown() {
logger.info("transport service started: {} of {} remaining", countDownLatch.getCount() - 1, initialCount);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if INFO logging is too verbose here. If start up fails, we should already have failures. So what's the benefit of this?

Copy link
Contributor Author

@DaveCTurner DaveCTurner Sep 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This really came about from attempts to track down the need for this:

onTransportServiceStarted.run(); // reusing an existing node implies its transport service already started

We discussed and decided not to use a countdown at all - see 31c1d7d.

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -223,6 +221,8 @@
private ServiceDisruptionScheme activeDisruptionScheme;
private Function<Client, Client> clientWrapper;

private final Object discoveryFileMutex = new Object();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to just move this next to the rebuildUnicastHostFiles method as it's only used there.

@DaveCTurner DaveCTurner merged commit 5a3fd8e into elastic:master Sep 13, 2018
@DaveCTurner DaveCTurner deleted the 2018-09-09-unmock-host-provider branch September 13, 2018 05:37
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Sep 13, 2018
Today we use a special unicast hosts provider, the `MockUncasedHostsProvider`,
in many integration tests, to deal with the dynamic nature of the allocation of
ports to nodes. However elastic#33241 allows us to use file-based discovery to achieve
the same goal, so the special test-only `MockUncasedHostsProvider` is no longer
required.

This change removes `MockUncasedHostProvider` and replaces it with file-based
discovery in tests based on `EsIntegTestCase`.
@DaveCTurner DaveCTurner removed the WIP label Sep 13, 2018
@DaveCTurner
Copy link
Contributor Author

DaveCTurner commented Sep 13, 2018

6.x is not passing CI so I opened a separate PR for the backport: #33658.

DaveCTurner added a commit that referenced this pull request Sep 18, 2018
Today we use a special unicast hosts provider, the `MockUncasedHostsProvider`,
in many integration tests, to deal with the dynamic nature of the allocation of
ports to nodes. However #33241 allows us to use file-based discovery to achieve
the same goal, so the special test-only `MockUncasedHostsProvider` is no longer
required.

This change removes `MockUncasedHostProvider` and replaces it with file-based
discovery in tests based on `EsIntegTestCase`.

Backport of #33554 to 6.x.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Oct 31, 2018
Today when ESIntegTestCase starts some nodes it writes out the unicast hosts
files each time a node starts its transport service. This does mean that a
number of nodes can start and perform their first pinging round without any
unicast hosts which, if the timing is unlucky and a lot of nodes are all
started at the same time, can lead to a split brain as in elastic#35052.

Prior to elastic#33554 this was unlikely to happen since the MockUncasedHostsProvider
would always have yielded the existing hosts, so the timing would have to have
been implausibly unlucky. Since elastic#33554, however, it's more likely because the
race occurs between the start of the first round of pinging and the writing of
the unicast hosts file. It is realistic that new nodes will be configured with
the existing nodes from startup, so this change reinstates that behaviour

Closes elastic#35052.
DaveCTurner added a commit that referenced this pull request Oct 31, 2018
Today when ESIntegTestCase starts some nodes it writes out the unicast hosts
files each time a node starts its transport service. This does mean that a
number of nodes can start and perform their first pinging round without any
unicast hosts which, if the timing is unlucky and a lot of nodes are all
started at the same time, can lead to a split brain as in #35052.

Prior to #33554 this was unlikely to happen since the MockUncasedHostsProvider
would always have yielded the existing hosts, so the timing would have to have
been implausibly unlucky. Since #33554, however, it's more likely because the
race occurs between the start of the first round of pinging and the writing of
the unicast hosts file. It is realistic that new nodes will be configured with
the existing nodes from startup, so this change reinstates that behaviour.

Closes #35052.
DaveCTurner added a commit that referenced this pull request Oct 31, 2018
Today when ESIntegTestCase starts some nodes it writes out the unicast hosts
files each time a node starts its transport service. This does mean that a
number of nodes can start and perform their first pinging round without any
unicast hosts which, if the timing is unlucky and a lot of nodes are all
started at the same time, can lead to a split brain as in #35052.

Prior to #33554 this was unlikely to happen since the MockUncasedHostsProvider
would always have yielded the existing hosts, so the timing would have to have
been implausibly unlucky. Since #33554, however, it's more likely because the
race occurs between the start of the first round of pinging and the writing of
the unicast hosts file. It is realistic that new nodes will be configured with
the existing nodes from startup, so this change reinstates that behaviour.

Closes #35052.
DaveCTurner added a commit that referenced this pull request Oct 31, 2018
Today when ESIntegTestCase starts some nodes it writes out the unicast hosts
files each time a node starts its transport service. This does mean that a
number of nodes can start and perform their first pinging round without any
unicast hosts which, if the timing is unlucky and a lot of nodes are all
started at the same time, can lead to a split brain as in #35052.

Prior to #33554 this was unlikely to happen since the MockUncasedHostsProvider
would always have yielded the existing hosts, so the timing would have to have
been implausibly unlucky. Since #33554, however, it's more likely because the
race occurs between the start of the first round of pinging and the writing of
the unicast hosts file. It is realistic that new nodes will be configured with
the existing nodes from startup, so this change reinstates that behaviour.

Closes #35052.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >non-issue v6.5.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants