Timed out cluster state publication is applied in an empty context #53751
Labels
>bug
:Distributed Coordination/Cluster Coordination
Cluster formation and cluster state publication, including cluster membership and fault detection.
Team:Distributed (Obsolete)
Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
Comments
DaveCTurner added the >bug and :Distributed Coordination/Cluster Coordination labels Mar 18, 2020
Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Mar 19, 2020
Today cluster states are sometimes (rarely) applied in the default context rather than system context, which means that any appliers which capture their contexts cannot do things like remote transport actions when security is enabled. There are at least two ways that we end up applying the cluster state in the default context:
1. locally applying a cluster state that indicates that the master has failed
2. the elected master times out while waiting for a response from another node
This commit ensures that cluster states are always applied in the system context. Mitigates elastic#53751
DaveCTurner added a commit that referenced this issue Mar 19, 2020
Today cluster states are sometimes (rarely) applied in the default context rather than system context, which means that any appliers which capture their contexts cannot do things like remote transport actions when security is enabled. There are at least two ways that we end up applying the cluster state in the default context:
1. locally applying a cluster state that indicates that the master has failed
2. the elected master times out while waiting for a response from another node
This commit ensures that cluster states are always applied in the system context. Mitigates #53751
rjernst added the Team:Distributed (Obsolete) label May 4, 2020
Closed by #57792
Today the elected master waits for all other nodes to acknowledge a cluster state publication before applying it locally, although it will time out if the other nodes are not all fast enough. The timeout is performed by a delayed action scheduled with `ThreadPool#schedule` at the start of the publication. `ThreadPool#schedule` does not preserve the context of the caller, however, which means that the cluster state is applied with an empty context rather than in the system context. As a result, any cluster state appliers which use the context of the application (e.g. capture it for future use, or try to send transport messages) will not work correctly if security is enabled.

One such case was introduced in #48430: retention lease syncs now run in the context in which the `IndexService` was created, which happens during cluster state application. Thus if the elected master is also a data node, and the cluster state publication that assigns a shard to it times out after committing, then the retention lease syncs will fail.

If affected, the workaround is to restart the elected master.

This exposes a gap in the `CoordinatorTests` framework, which does not properly simulate how thread contexts behave.
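
The context loss described above can be reproduced in miniature. The following is a minimal sketch, not Elasticsearch code: it assumes a plain `ThreadLocal` as a stand-in for the thread context and a JDK `ScheduledExecutorService` as a stand-in for `ThreadPool#schedule`. The names `ContextSketch`, `CONTEXT`, `preserveContext` and `runDelayed` are hypothetical and exist only for this illustration. It shows that a delayed task runs on the scheduler thread with that thread's own (default) context unless the caller's context is explicitly captured and restored.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class ContextSketch {
    // Hypothetical stand-in for a thread context: a per-thread marker
    // that reads "default" on any thread that has not set it.
    static final ThreadLocal<String> CONTEXT = ThreadLocal.withInitial(() -> "default");

    // Wraps a task so that it runs under the context captured at wrap time,
    // restoring the executing thread's previous context afterwards.
    static <T> Callable<T> preserveContext(Callable<T> task) {
        final String captured = CONTEXT.get();
        return () -> {
            final String previous = CONTEXT.get();
            CONTEXT.set(captured);
            try {
                return task.call();
            } finally {
                CONTEXT.set(previous);
            }
        };
    }

    // Schedules a delayed "timeout handler" from a caller that is in the
    // "system" context and returns the context the handler actually observed.
    static String runDelayed(boolean preserve) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        final String previous = CONTEXT.get();
        CONTEXT.set("system");
        try {
            Callable<String> timeoutHandler = CONTEXT::get;
            Callable<String> task = preserve ? preserveContext(timeoutHandler) : timeoutHandler;
            return scheduler.schedule(task, 10, TimeUnit.MILLISECONDS).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            CONTEXT.set(previous);
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("plain schedule:   " + runDelayed(false)); // prints "default"
        System.out.println("wrapped schedule: " + runDelayed(true));  // prints "system"
    }
}
```

In this sketch the plain scheduled task observes the default context, which is analogous to the timed-out publication being applied in an empty context. Wrapping the task at scheduling time is one way to carry the caller's context across the delay. The fix referenced in this issue may differ in detail.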