Remove node from cluster when node locks are broken #58373

Closed
DaveCTurner opened this issue Jun 19, 2020 · 5 comments · Fixed by #61400
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.

Comments

@DaveCTurner
Contributor

DaveCTurner commented Jun 19, 2020

In #52680 we are introducing a mechanism that will allow nodes to remove themselves from the cluster if they locally determine themselves to be unhealthy. The only check today is that their data paths are all empirically writeable. We could also check `NodeEnvironment#assertEnvIsLocked()` here; indeed we already call this method during the health check but do not consider a failure to be fatal (see #52680 (comment)).

A broken node lock today blocks things like allocating new shards to the node, but I think it does not block indexing or searching on existing shards, since these are protected by shard-level locks instead. On the other hand, there's something very wrong with your environment if the node lock is broken, and it seems reasonable to treat it pretty seriously.
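
To make the proposal concrete, here is a minimal, self-contained sketch of a health check that treats a broken node lock as fatal, alongside the existing writeability probe. It is not the Elasticsearch implementation: `NodeHealthSketch`, `NodeLock`, `ensureValid()`, and `Status` are hypothetical names invented for illustration, and `ensureValid()` only loosely stands in for `NodeEnvironment#assertEnvIsLocked()`.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative sketch only; all names here are hypothetical.
public class NodeHealthSketch {
    public enum Status { HEALTHY, UNHEALTHY }

    /** Hypothetical stand-in for the node-level lock held on a data path. */
    public static final class NodeLock implements AutoCloseable {
        private final FileChannel channel;
        private final FileLock lock;

        public NodeLock(Path lockFile) throws IOException {
            channel = FileChannel.open(lockFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
            lock = channel.tryLock();
            if (lock == null) {
                channel.close();
                throw new IOException("lock is held by another process: " + lockFile);
            }
        }

        /** Throws if the lock is no longer held (assumed analogue of assertEnvIsLocked()). */
        public void ensureValid() {
            if (!lock.isValid()) {
                throw new IllegalStateException("node lock is no longer valid");
            }
        }

        /** Closing the channel releases, and therefore invalidates, the lock. */
        @Override
        public void close() throws IOException {
            channel.close();
        }
    }

    /** Reports UNHEALTHY if the node lock is broken or the data path is not writeable. */
    public static Status check(Path dataPath, NodeLock nodeLock) {
        try {
            // Proposed addition: a broken node lock is fatal to the health check.
            nodeLock.ensureValid();
            // Existing style of check: empirically write to the data path.
            Path probe = Files.createTempFile(dataPath, ".health-probe", ".tmp");
            Files.write(probe, new byte[] {1});
            Files.delete(probe);
            return Status.HEALTHY;
        } catch (IOException | IllegalStateException e) {
            return Status.UNHEALTHY;
        }
    }
}
```

In this sketch, a node that reports `UNHEALTHY` would then be expected to leave the cluster via the mechanism from #52680; that wiring is omitted here.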

@DaveCTurner DaveCTurner added >enhancement :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. team-discuss labels Jun 19, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Jun 19, 2020
@Bukhtawar
Contributor

> In #52860 we are introducing a mechanism that will allow nodes to remove themselves from the cluster if they locally determine themselves to be unhealthy

nit: #52680 :)

Thanks for raising this.

@DaveCTurner
Contributor Author

We discussed this today and agreed to proceed.

@Bukhtawar
Contributor

Thanks @DaveCTurner. Should we work on the existing PR, or start a new one?

@DaveCTurner
Contributor Author

I'd prefer a new one once #52680 is merged.

DaveCTurner pushed a commit that referenced this issue Sep 22, 2020
In #52680 we introduced a mechanism that will allow nodes to remove
themselves from the cluster if they locally determine themselves to be
unhealthy. The only check today is that their data paths are all
empirically writeable. This commit extends this check to consider a
failure of `NodeEnvironment#assertEnvIsLocked()` to be an indication of
unhealthiness.

Closes #58373