Timed out cluster state publication is applied in an empty context #53751

DaveCTurner · 2020-03-18T17:10:02Z

Today the elected master waits for all other nodes to acknowledge a cluster state publication before applying it locally, although it will time out if the other nodes are not all fast enough. The timeout is performed by a delayed action scheduled with ThreadPool#schedule at the start of the publication.

However, ThreadPool#schedule does not preserve the context of the caller, so when the timeout fires the cluster state is applied with an empty context rather than in the system context. Any cluster state applier that uses the context of the application (e.g. captures it for future use, or tries to send transport messages) will therefore not work correctly if security is enabled.
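As a rough illustration of the mechanism (not the Elasticsearch code itself), the sketch below uses a plain JDK `ScheduledExecutorService` and a `ThreadLocal` as hypothetical stand-ins for the thread pool and the thread context. The class name, the `CONTEXT` variable and the `"system"`/`"empty"` labels are all assumptions made for the example. It shows that a delayed task sees the scheduler thread's context unless the caller's context is explicitly captured and restored.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class ContextLossSketch {
    // Hypothetical stand-in for a thread context: a per-thread label, "empty" by default.
    static final ThreadLocal<String> CONTEXT = ThreadLocal.withInitial(() -> "empty");

    // Schedules a delayed task from a caller in the "system" context and
    // returns the context that the task observed on the scheduler thread.
    static String observeOnScheduler(boolean preserve) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            CONTEXT.set("system");               // the caller's context
            final String captured = CONTEXT.get();
            Callable<String> task = () -> {
                if (preserve) {
                    // Explicitly restore the caller's context around the task body.
                    String previous = CONTEXT.get();
                    CONTEXT.set(captured);
                    try {
                        return CONTEXT.get();
                    } finally {
                        CONTEXT.set(previous);
                    }
                }
                // Without restoration the task sees the scheduler thread's own context.
                return CONTEXT.get();
            };
            return scheduler.schedule(task, 10, TimeUnit.MILLISECONDS).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            CONTEXT.remove();
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(observeOnScheduler(false)); // prints "empty"
        System.out.println(observeOnScheduler(true));  // prints "system"
    }
}
```

The first call corresponds to the timeout path described above: the delayed action runs without the caller's context.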

One such case was introduced in #48430: retention lease syncs now run in the context in which the IndexService was created, which happens during cluster state application. Thus if the elected master is also a data node, and the cluster state publication that assigns a shard to it times out after committing, then the retention lease syncs will fail.
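To make the failure mode concrete, here is a minimal sketch of a component that captures its context at creation time and uses it later. `LeaseSyncer`, `createAndSync` and the rule that only a `"system"` context may sync are hypothetical simplifications for this example, not the actual IndexService or security logic.

```java
final class CapturedContextSketch {
    // Hypothetical stand-in for a thread context: a per-thread label, "empty" by default.
    static final ThreadLocal<String> CONTEXT = ThreadLocal.withInitial(() -> "empty");

    // Hypothetical stand-in for a service that is created during cluster state application.
    static final class LeaseSyncer {
        private final String capturedContext;

        LeaseSyncer() {
            // The context is captured once, at creation time.
            this.capturedContext = CONTEXT.get();
        }

        // Assumed rule for the sketch: a sync only succeeds under the "system" context.
        boolean sync() {
            return "system".equals(capturedContext);
        }
    }

    // Creates the syncer under the given applier context, then syncs afterwards.
    static boolean createAndSync(String applierContext) {
        CONTEXT.set(applierContext);
        LeaseSyncer syncer;
        try {
            syncer = new LeaseSyncer();
        } finally {
            CONTEXT.remove();
        }
        // The sync runs after application has finished, using the captured context.
        return syncer.sync();
    }

    public static void main(String[] args) {
        System.out.println(createAndSync("system")); // prints "true"
        System.out.println(createAndSync("empty"));  // prints "false"
    }
}
```

Because the context is captured only once, a single application in the wrong context keeps affecting every later sync, which is consistent with a restart being needed to recover.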

On an affected cluster, the workaround is to restart the elected master.

This exposes a gap in the CoordinatorTests framework, which does not properly simulate how thread contexts behave.


elasticmachine · 2020-03-18T17:10:04Z

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

Today cluster states are sometimes (rarely) applied in the default context rather than system context, which means that any appliers which capture their contexts cannot do things like remote transport actions when security is enabled. There are at least two ways that we end up applying the cluster state in the default context:

1. locally applying a cluster state that indicates that the master has failed
2. the elected master times out while waiting for a response from another node

This commit ensures that cluster states are always applied in the system context.

Mitigates elastic#53751
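One way to picture "always applied in the system context" is a wrapper that switches to the system context before running the appliers and restores the previous context afterwards, whichever path triggered the application. The sketch below is an assumption-laden illustration in plain JDK Java: `StoredContext`, `switchToSystemContext` and `applyInSystemContext` are hypothetical names for this example and do not show what the commit actually changes.

```java
final class SystemContextApplierSketch {
    // Hypothetical stand-in for a thread context: a per-thread label, "default" by default.
    static final ThreadLocal<String> CONTEXT = ThreadLocal.withInitial(() -> "default");

    // Restores the previous context when closed; close() throws no checked exception.
    interface StoredContext extends AutoCloseable {
        @Override
        void close();
    }

    // Switches the current thread to the "system" context and remembers the old one.
    static StoredContext switchToSystemContext() {
        final String previous = CONTEXT.get();
        CONTEXT.set("system");
        return () -> CONTEXT.set(previous);
    }

    // Runs the applier in the system context regardless of the caller's context.
    static void applyInSystemContext(Runnable applier) {
        try (StoredContext ignored = switchToSystemContext()) {
            applier.run();
        }
    }

    // Reports "<context seen by the applier>/<caller context after application>".
    static String contextSeenByApplier(String callerContext) {
        CONTEXT.set(callerContext);
        try {
            String[] seen = new String[1];
            applyInSystemContext(() -> seen[0] = CONTEXT.get());
            return seen[0] + "/" + CONTEXT.get();
        } finally {
            CONTEXT.remove();
        }
    }

    public static void main(String[] args) {
        System.out.println(contextSeenByApplier("default")); // prints "system/default"
    }
}
```

Switching at the point of application, rather than relying on every caller to arrive in the right context, covers both paths listed above with a single change.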


ywelsch · 2020-07-23T07:40:15Z

Closed by #57792

DaveCTurner added the >bug and :Distributed Coordination/Cluster Coordination labels Mar 18, 2020

DaveCTurner added the blocker and v7.6.2 labels Mar 18, 2020

DaveCTurner mentioned this issue Mar 19, 2020

Apply cluster states in system context #53785

Merged

DaveCTurner mentioned this issue Mar 19, 2020

Apply cluster states in system context #53819

Closed

jasontedor mentioned this issue Mar 20, 2020

Execute retention lease syncs under system context #53838

Merged

jasontedor removed the blocker label Mar 20, 2020

jimczi removed the v7.6.2 label Mar 24, 2020

codebrain mentioned this issue Apr 1, 2020

7.7.0 meta ticket (Part 3) elastic/elasticsearch-net#4534

Closed

rjernst added the Team:Distributed (Obsolete) label May 4, 2020

ywelsch closed this as completed Jul 23, 2020
