
Remove node from cluster when node locks broken #61400

Merged
merged 8 commits
Sep 22, 2020

Conversation

@amoghRZP amoghRZP commented Aug 21, 2020

In #52680 we introduced a mechanism that will allow nodes to remove
themselves from the cluster if they locally determine themselves to be
unhealthy. The only check today is that their data paths are all
empirically writeable. This commit extends this check to consider a
failure of NodeEnvironment#assertEnvIsLocked() to be an indication of
unhealthiness.

Closes #58373
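The "empirically writeable" check the commit message describes can be sketched roughly as follows: write and fsync a small temp file in each data path, treating any IOException as a failure. This is an illustrative, self-contained sketch under those assumptions, not the actual FsHealthService code; the class and method names are invented for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class WriteCheckSketch {
    // Returns true if we can create, write, and fsync a temp file in the
    // given directory -- a rough stand-in for the per-path check described.
    static boolean pathIsWriteable(Path dir) {
        Path temp = dir.resolve(".es_temp_file");
        try (FileChannel ch = FileChannel.open(temp,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap(new byte[] { 0 }));
            ch.force(true); // fsync so device-level errors surface here
            return true;
        } catch (IOException e) {
            return false;   // any I/O failure marks the path unhealthy
        } finally {
            try {
                Files.deleteIfExists(temp);
            } catch (IOException ignored) {
            }
        }
    }
}
```

The fsync matters: without it, a write to a failed device can appear to succeed because it only reaches the page cache.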

@DaveCTurner DaveCTurner left a comment


Thanks @amoghRZP, this is good work. I requested a few small changes but nothing fundamental.

}
} catch (IllegalStateException e) {

I'd prefer the try {} block to contain only the call to nodeEnv.nodeDataPaths(); would you reduce its scope? That way you don't need a local lockAssertionFailed: you can set brokenLock and exit immediately.
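A minimal sketch of the structure being suggested, with the node environment simulated; the names (nodeDataPaths, brokenLock, monitorFSHealth) are illustrative stand-ins for the real FsHealthService members, not the actual implementation:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class MonitorSketch {
    static volatile boolean brokenLock = false;
    static boolean failLock = false; // test knob simulating a broken lock

    // Stand-in for NodeEnvironment#nodeDataPaths(), which throws
    // IllegalStateException when the lock assertion fails.
    static Path[] nodeDataPaths() {
        if (failLock) {
            throw new IllegalStateException("environment is not locked");
        }
        return new Path[] { Paths.get("data") };
    }

    static void monitorFSHealth() {
        Path[] paths;
        try {
            paths = nodeDataPaths(); // the only call inside the try block
        } catch (IllegalStateException e) {
            brokenLock = true;       // set the flag and exit immediately
            return;
        }
        // ... per-path write checks would run here ...
        brokenLock = false;          // cleared only after checks complete
    }
}
```

Keeping the try block narrow means no other IllegalStateException can be mistaken for a lock failure.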

@@ -117,6 +118,8 @@ public StatusInfo getHealth() {
Set<Path> unhealthyPaths = this.unhealthyPaths;
if (enabled == false) {
statusInfo = new StatusInfo(HEALTHY, "health check disabled");
} else if (brokenLock == true) {
statusInfo = new StatusInfo(UNHEALTHY, "health check failed on node due to broken locks");

Minor wording nit: specify which lock was broken (and remove the redundant "on node"); suggest this:

Suggested change
statusInfo = new StatusInfo(UNHEALTHY, "health check failed on node due to broken locks");
statusInfo = new StatusInfo(UNHEALTHY, "health check failed due to broken node lock");
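The resulting branching in getHealth() might look like the following simplified sketch, with StatusInfo and the status constants modeled minimally here rather than taken from the real code:

```java
import java.util.Set;

public class HealthSketch {
    static final String HEALTHY = "HEALTHY";
    static final String UNHEALTHY = "UNHEALTHY";

    // Minimal stand-in for the real StatusInfo class.
    static class StatusInfo {
        final String status;
        final String info;
        StatusInfo(String status, String info) {
            this.status = status;
            this.info = info;
        }
    }

    static StatusInfo getHealth(boolean enabled, boolean brokenLock,
                                Set<String> unhealthyPaths) {
        if (enabled == false) {
            return new StatusInfo(HEALTHY, "health check disabled");
        } else if (brokenLock) {
            // the wording suggested in the review comment above
            return new StatusInfo(UNHEALTHY,
                    "health check failed due to broken node lock");
        } else if (unhealthyPaths != null && unhealthyPaths.isEmpty() == false) {
            return new StatusInfo(UNHEALTHY,
                    "health check failed on paths " + unhealthyPaths);
        }
        return new StatusInfo(HEALTHY, "health check passed");
    }
}
```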

@@ -254,7 +383,8 @@ public int getInjectedPathCount(){
public OutputStream newOutputStream(Path path, OpenOption... options) throws IOException {
if (injectIOException.get()){
assert pathPrefix != null : "must set pathPrefix before starting disruptions";
if (path.toString().startsWith(pathPrefix) && path.toString().endsWith(".es_temp_file")) {
if (path.toString().startsWith(pathPrefix) && path.toString().
endsWith(FsHealthService.FsHealthMonitor.TEMP_FILE_NAME)) {

👍

@@ -289,7 +419,8 @@ public FileChannel newFileChannel(Path path, Set<? extends OpenOption> options,
public void force(boolean metaData) throws IOException {
if (injectIOException.get()) {
assert pathPrefix != null : "must set pathPrefix before starting disruptions";
if (path.toString().startsWith(pathPrefix) && path.toString().endsWith(".es_temp_file")) {
if (path.toString().startsWith(pathPrefix) && path.toString().
endsWith(FsHealthService.FsHealthMonitor.TEMP_FILE_NAME)) {

👍

@@ -231,6 +234,132 @@ public void testFailsHealthOnSinglePathWriteFailure() throws IOException {
}
}

public void testFailsHealthOnMissingLockFile() throws IOException {

Thorough tests 😄 However they're not really testing anything in the FsHealthService so much as testing the details of the implementation of the NativeFSLock. Let's just have one of these here, and maybe consider filling in any gaps in Lucene's TestNativeFSLockFactory separately.

@amoghRZP amoghRZP Aug 24, 2020

Thanks, I am thinking of keeping two of them: one that throws an IOException and another for AlreadyClosedException.


No, one is all we need here.

NodeEnvironmentTests would be the right place to verify that NodeEnvironment#assertEnvIsLocked throws an IllegalStateException in both of those cases. I think we don't do that today, but again that's a question for a separate PR.


OK, got it.

amoghRZP commented Aug 24, 2020

Thanks @amoghRZP, this is good work. I requested a few small changes but nothing fundamental.

Thanks @DaveCTurner, I have made changes as suggested.

@DaveCTurner DaveCTurner left a comment


One issue remains (and another minor wording change)

@@ -150,7 +153,17 @@ public void run() {

private void monitorFSHealth() {
Set<Path> currentUnhealthyPaths = null;
for (Path path : nodeEnv.nodeDataPaths()) {
brokenLock = false;

Clearing this flag here may mean the node reports itself as healthy even though it hasn't actually passed this health check yet. I think we should clear this flag only after setting unhealthyPaths at the very bottom of this method.
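A toy sketch of the ordering being asked for, recording events so the late clearing of the flag is visible; every name here is an illustrative stand-in for the corresponding FsHealthService field, not the real implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class FlagOrderSketch {
    static volatile boolean brokenLock = true;
    static List<String> unhealthyPaths = null;
    static final List<String> events = new ArrayList<>();

    static void monitorFSHealth() {
        List<String> currentUnhealthyPaths = null;
        // ... per-path checks run here; a lock failure would return early,
        //     leaving brokenLock set ...
        unhealthyPaths = currentUnhealthyPaths;
        events.add("setUnhealthyPaths");
        brokenLock = false;            // cleared only at the very bottom,
        events.add("clearBrokenLock"); // after the check has fully passed
    }
}
```

Clearing the flag at the top of the loop instead would let getHealth() report the node healthy during a pass that has not yet succeeded.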


Sure, done.

try {
paths = nodeEnv.nodeDataPaths();
} catch (IllegalStateException e) {
logger.error("Lock assertions failed due to", e);

Minor wording nit:

Suggested change
logger.error("Lock assertions failed due to", e);
logger.error("health check failed", e);


changed 👍

amoghRZP commented Sep 1, 2020

@DaveCTurner I have made the changes as suggested.

amoghRZP commented Sep 8, 2020

@DaveCTurner please let me know if any further changes are required, whenever you get a chance to look at it.

@DaveCTurner
@elasticmachine update branch

@DaveCTurner
@elasticmachine ok to test

@DaveCTurner DaveCTurner left a comment


LGTM

@DaveCTurner DaveCTurner changed the title Remove node from cluster when node locks are broken. Remove node from cluster when node locks are broken Sep 22, 2020
@DaveCTurner DaveCTurner changed the title Remove node from cluster when node locks are broken Remove node from cluster when node locks broken Sep 22, 2020
@DaveCTurner DaveCTurner added :Core/Infra/Resiliency Keep running when everything is ok. Die quickly if things go horribly wrong. v7.10.0 v8.0.0 labels Sep 22, 2020
@elasticmachine

Pinging @elastic/es-core-infra (:Core/Infra/Resiliency)

@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Sep 22, 2020
@DaveCTurner DaveCTurner merged commit 71d0958 into elastic:master Sep 22, 2020
DaveCTurner pushed a commit that referenced this pull request Sep 22, 2020
In #52680 we introduced a mechanism that will allow nodes to remove
themselves from the cluster if they locally determine themselves to be
unhealthy. The only check today is that their data paths are all
empirically writeable. This commit extends this check to consider a
failure of `NodeEnvironment#assertEnvIsLocked()` to be an indication of
unhealthiness.

Closes #58373
@amoghRZP amoghRZP deleted the broken_nl_handling branch September 22, 2020 09:28