-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove node from cluster when node locks broken #61400
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @amoghRZP, this is good work. I requested a few small changes but nothing fundamental.
} | ||
} catch (IllegalStateException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer the try{}
block only to contain the call to nodeEnv.nodeDataPaths()
, would you reduce its scope? That way you don't need a local lockAssertionFailed
, you can set brokenLock
and exit immediately.
@@ -117,6 +118,8 @@ public StatusInfo getHealth() { | |||
Set<Path> unhealthyPaths = this.unhealthyPaths; | |||
if (enabled == false) { | |||
statusInfo = new StatusInfo(HEALTHY, "health check disabled"); | |||
} else if (brokenLock == true) { | |||
statusInfo = new StatusInfo(UNHEALTHY, "health check failed on node due to broken locks"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor wording nit: specify which lock was broken (and remove redundant on node
), suggest this:
statusInfo = new StatusInfo(UNHEALTHY, "health check failed on node due to broken locks"); | |
statusInfo = new StatusInfo(UNHEALTHY, "health check failed due to broken node lock"); |
@@ -254,7 +383,8 @@ public int getInjectedPathCount(){ | |||
public OutputStream newOutputStream(Path path, OpenOption... options) throws IOException { | |||
if (injectIOException.get()){ | |||
assert pathPrefix != null : "must set pathPrefix before starting disruptions"; | |||
if (path.toString().startsWith(pathPrefix) && path.toString().endsWith(".es_temp_file")) { | |||
if (path.toString().startsWith(pathPrefix) && path.toString(). | |||
endsWith(FsHealthService.FsHealthMonitor.TEMP_FILE_NAME)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -289,7 +419,8 @@ public FileChannel newFileChannel(Path path, Set<? extends OpenOption> options, | |||
public void force(boolean metaData) throws IOException { | |||
if (injectIOException.get()) { | |||
assert pathPrefix != null : "must set pathPrefix before starting disruptions"; | |||
if (path.toString().startsWith(pathPrefix) && path.toString().endsWith(".es_temp_file")) { | |||
if (path.toString().startsWith(pathPrefix) && path.toString(). | |||
endsWith(FsHealthService.FsHealthMonitor.TEMP_FILE_NAME)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -231,6 +234,132 @@ public void testFailsHealthOnSinglePathWriteFailure() throws IOException { | |||
} | |||
} | |||
|
|||
public void testFailsHealthOnMissingLockFile() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thorough tests 😄 However they're not really testing anything in the FsHealthService
so much as testing the details of the implementation of the NativeFSLock
. Let's just have one of these here, and maybe consider filling in any gaps in Lucene's TestNativeFSLockFactory separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, i am thinking to keep two of them where one throws an IOException
and another for AlreadyClosedException
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, one is all we need here.
NodeEnvironmentTests
would be the right place to verify that NodeEnvironment#assertEnvIsLocked
throws an IllegalStateException
in both of those cases. I think we don't do that today, but again that's a question for a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, Got it.
Thanks @DaveCTurner, I have made changes as suggested. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One issue remains (and another minor wording change)
@@ -150,7 +153,17 @@ public void run() { | |||
|
|||
private void monitorFSHealth() { | |||
Set<Path> currentUnhealthyPaths = null; | |||
for (Path path : nodeEnv.nodeDataPaths()) { | |||
brokenLock = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clearing this flag here may mean the node reports itself as healthy even though it hasn't actually passed this health check yet. I think we should clear this flag only after setting unhealthyPaths
at the very bottom of this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, done.
try { | ||
paths = nodeEnv.nodeDataPaths(); | ||
} catch (IllegalStateException e) { | ||
logger.error("Lock assertions failed due to", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor wording nit:
logger.error("Lock assertions failed due to", e); | |
logger.error("health check failed", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed 👍
@DaveCTurner i have made changes as suggested. |
@DaveCTurner let me know if any change etc is required, if you got chance to look at it. |
@elasticmachine update branch |
@elasticmachine ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Pinging @elastic/es-core-infra (:Core/Infra/Resiliency) |
In #52680 we introduced a mechanism that will allow nodes to remove themselves from the cluster if they locally determine themselves to be unhealthy. The only check today is that their data paths are all empirically writeable. This commit extends this check to consider a failure of `NodeEnvironment#assertEnvIsLocked()` to be an indication of unhealthiness. Closes #58373
In #52680 we introduced a mechanism that will allow nodes to remove
themselves from the cluster if they locally determine themselves to be
unhealthy. The only check today is that their data paths are all
empirically writeable. This commit extends this check to consider a
failure of
NodeEnvironment#assertEnvIsLocked()
to be an indication ofunhealthiness.
Closes #58373