Apply cluster states in system context #53785

Merged

DaveCTurner
Contributor

Today cluster states are sometimes (rarely) applied in the default context
rather than the system context, which means that any appliers that capture
their context cannot perform operations such as remote transport actions when
security is enabled.

There are at least two ways that we end up applying the cluster state in the
default context:

  1. locally applying a cluster state that indicates that the master has failed
  2. the elected master times out while waiting for a response from another node

This commit ensures that cluster states are always applied in the system
context.

Mitigates #53751

@DaveCTurner DaveCTurner added >bug :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.0.0 v7.7.0 v7.6.3 labels Mar 19, 2020
@DaveCTurner DaveCTurner requested a review from jasontedor March 19, 2020 10:46
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

try {
    final ThreadContext threadContext = threadPool.getThreadContext();
    try (ThreadContext.StoredContext ignored = threadContext.stashContext()) {
        threadContext.markAsSystemContext();
Contributor Author

This is the key change. I would have preferred to avoid an explicit markAsSystemContext and instead start everything in the system context and preserve that context everywhere. That approach fell apart because the system context is not propagated across transport messages, and I ran out of time trying to find a reliable way to assert that things are happening in the right context.
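
For readers unfamiliar with the stash-then-mark pattern in the diff above, here is a minimal, self-contained sketch of the idea. `ToyThreadContext` and `SystemContextSketch` are hypothetical stand-ins written for this illustration; they only mimic the method names visible in the diff (`stashContext`, `markAsSystemContext`) and are not the Elasticsearch classes.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a per-thread context; NOT the Elasticsearch ThreadContext. */
final class ToyThreadContext {
    // Assumption for this sketch: the whole context is a single "is system" flag.
    private final ThreadLocal<Boolean> system = ThreadLocal.withInitial(() -> false);

    /** Handle that restores the stashed state when closed. */
    interface StoredContext extends AutoCloseable {
        @Override
        void close();
    }

    /** Resets the calling thread to the default context and remembers the old one. */
    StoredContext stashContext() {
        final boolean previous = system.get();
        system.set(false);
        return () -> system.set(previous);
    }

    void markAsSystemContext() {
        system.set(true);
    }

    boolean isSystemContext() {
        return system.get();
    }
}

final class SystemContextSketch {
    /** Runs the task in system context, then restores the caller's context. */
    static <T> T runInSystemContext(ToyThreadContext ctx, Supplier<T> task) {
        try (ToyThreadContext.StoredContext ignored = ctx.stashContext()) {
            ctx.markAsSystemContext();
            return task.get();
        }
    }

    public static void main(String[] args) {
        ToyThreadContext ctx = new ToyThreadContext();
        boolean inside = runInSystemContext(ctx, ctx::isSystemContext);
        // prints "inside=true after=false"
        System.out.println("inside=" + inside + " after=" + ctx.isSystemContext());
    }
}
```

The try-with-resources block is what guarantees the caller's context comes back even if the task throws, which is why the explicit mark is scoped rather than global.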

@@ -1163,15 +1163,15 @@ public TestClusterNode currentMaster(ClusterState state) {
    TestClusterNode(DiscoveryNode node) throws IOException {
        this.node = node;
        final Environment environment = createEnvironment(node.getName());
        masterService = new FakeThreadPoolMasterService(node.getName(), "test", deterministicTaskQueue::scheduleNow);
        threadPool = deterministicTaskQueue.getThreadPool(runnable -> CoordinatorTests.onNodeLog(node, runnable));
Contributor Author

Most of the remaining changes, like this one, ensure that we use the same ThreadPool instance both for entering system context in the master service and for asserting that publications are sent in system context. Before this change we created multiple ThreadPool instances, which was OK because we ignored their stateful behaviour.
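
A small sketch of why the instance matters, under the assumption that each pool owns its own context state. `ToyPool` and `SharedPoolSketch` are hypothetical names invented for this illustration and do not correspond to the test classes in the diff.

```java
/** Hypothetical pool whose context state is private to the instance. */
final class ToyPool {
    private final ThreadLocal<Boolean> system = ThreadLocal.withInitial(() -> false);

    void markAsSystemContext() {
        system.set(true);
    }

    boolean isSystemContext() {
        return system.get();
    }
}

final class SharedPoolSketch {
    /** Enters system context via one pool and checks it via another (or the same). */
    static boolean observedAsSystem(ToyPool entering, ToyPool asserting) {
        entering.markAsSystemContext();
        return asserting.isSystemContext();
    }

    public static void main(String[] args) {
        ToyPool shared = new ToyPool();
        System.out.println("same instance: " + observedAsSystem(shared, shared));
        System.out.println("different instances: " + observedAsSystem(new ToyPool(), new ToyPool()));
    }
}
```

With two separate instances the check sees the default context even though system context was entered, so an assertion against the wrong instance would fail spuriously.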

        NamedXContentRegistry xContentRegistry, Environment environment,
        NodeEnvironment nodeEnvironment, NamedWriteableRegistry namedWriteableRegistry,
        IndexNameExpressionResolver indexNameExpressionResolver) {
    clusterService.addStateApplier(event -> {
Contributor Author

These are the key assertions. It seemed a bit vacuous to put them directly in the ClusterApplierService since that's where we enter system context too, but that's another option...
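
To illustrate the trade-off described here, the following sketch registers the check as an ordinary applier from outside the service, so it exercises whatever context the service actually supplies. `ToyApplierService` and its methods are hypothetical, and the cluster state is reduced to a `String` for brevity; none of this is the real ClusterApplierService API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical applier service; the cluster state is modelled as a plain String. */
final class ToyApplierService {
    // Assumption for this sketch: "system context" is a per-thread boolean flag.
    private final ThreadLocal<Boolean> system = ThreadLocal.withInitial(() -> false);
    private final List<Consumer<String>> appliers = new ArrayList<>();

    boolean isSystemContext() {
        return system.get();
    }

    void addStateApplier(Consumer<String> applier) {
        appliers.add(applier);
    }

    /** Applies the state with the system flag set, restoring the old value afterwards. */
    void applyInSystemContext(String state) {
        final boolean previous = system.get();
        system.set(true);
        try {
            for (Consumer<String> applier : appliers) {
                applier.accept(state);
            }
        } finally {
            system.set(previous);
        }
    }

    /** Applies the state in whatever context the caller happens to be in. */
    void applyInCallerContext(String state) {
        for (Consumer<String> applier : appliers) {
            applier.accept(state);
        }
    }
}

final class ApplierAssertionSketch {
    /** Builds a service with one externally registered applier that checks the context. */
    static ToyApplierService newServiceWithAssertingApplier() {
        final ToyApplierService service = new ToyApplierService();
        service.addStateApplier(state -> {
            if (!service.isSystemContext()) {
                throw new IllegalStateException("state [" + state + "] applied outside system context");
            }
        });
        return service;
    }

    public static void main(String[] args) {
        ToyApplierService service = newServiceWithAssertingApplier();
        service.applyInSystemContext("v1"); // passes the check
        try {
            service.applyInCallerContext("v2"); // default context, check fails
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Placing the check in an applier rather than inside the service means it would still catch a future code path that invokes appliers without first entering system context.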

Member

This is good.

@DaveCTurner
Contributor Author

@elasticmachine update branch

Member

@jasontedor jasontedor left a comment

LGTM.

@DaveCTurner DaveCTurner merged commit c1dc523 into elastic:master Mar 19, 2020
@DaveCTurner DaveCTurner deleted the 2020-03-19-system-context-in-applier branch March 19, 2020 14:13
DaveCTurner added a commit that referenced this pull request Mar 19, 2020
DaveCTurner added a commit that referenced this pull request Mar 19, 2020
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Mar 19, 2020
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Mar 20, 2020
@DaveCTurner
Contributor Author

It turned out that this was not the right approach, but we only discovered this when trying to backport it. I immediately reverted the change in 7.x (7d3ac4f) and will revert the master change in #53842.

Labels
>bug :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.0.0-alpha1
4 participants