-
Notifications
You must be signed in to change notification settings - Fork 24.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not log unsuccessful join attempt each time #39756
Do not log unsuccessful join attempt each time #39756
Conversation
Pinging @elastic/es-distributed |
run elasticsearch-ci/bwc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a great move. I left a handful of small suggestions.
...src/test/java/org/elasticsearch/cluster/coordination/ClusterFormationFailureHelperTests.java
Show resolved
Hide resolved
|
||
void logLastFailedJoinAttempt() { | ||
if (lastFailedJoinAttempt != null) { | ||
lastFailedJoinAttempt.logWarnWithTimestamp(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can throw a NPE because we read the volatile field twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! 0b19f03
} | ||
} | ||
|
||
boolean isSuspiciousTransportException(TransportException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be static
? Also, perhaps return the Level
and then you can just call logger.log(isSuspiciousTransportException(exception),...
. Also, this is worthy of a test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done as a part of this commit 737e7f8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing is added here ce3e69a
server/src/main/java/org/elasticsearch/cluster/coordination/JoinHelper.java
Outdated
Show resolved
Hide resolved
this.timestamp = System.nanoTime(); | ||
} | ||
|
||
void maybeLogNow() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you make isSuspiciousTransportException
return the level then this becomes a one-liner so it's probably simpler to inline it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, done as a part of this commit. 737e7f8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe inline this? At least rename it to avoid using maybe
since it always logs at some level or other.
@DaveCTurner thanks for your review, it's ready for the second pass. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raised a couple of nits. No need for another review, LGTM either way.
this.timestamp = System.nanoTime(); | ||
} | ||
|
||
void maybeLogNow() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe inline this? At least rename it to avoid using maybe
since it always logs at some level or other.
server/src/main/java/org/elasticsearch/cluster/coordination/ClusterFormationFailureHelper.java
Show resolved
Hide resolved
} | ||
|
||
static Level getLogLevel(TransportException e) { | ||
if (e instanceof RemoteTransportException) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps simpler (and more streamlined with other similar code) is to just call unwrapCause on TransportException, i.e. Throwable cause = e.unwrapCause();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, 5409831
void logLastFailedJoinAttempt() { | ||
FailedJoinAttempt attempt = lastFailedJoinAttempt; | ||
if (attempt != null) { | ||
attempt.logWarnWithTimestamp(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we avoid repeatedly logging the same failed join attempt? I wonder if we should set lastFailedJoinAttempt
back to null here (and use AtomicReference)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we need to log lastFailedJoinAttempt each time we log a warn in ClusterFormationHelper
, because that way we know that cluster still can not form and failed join attempt still could be the reason for it. Otherwise, next time ClusterFormationHelper
logs a warning and no failed join attempt, we don't know if the issue with join is actually resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm with @ywelsch on this one. Imagine one join attempt fails with an exception and then the network is disconnected. We will continue to log this join exception, with a new timestamp each time, and I think that's confusing because this exception has nothing to do with the ongoing failure to form a cluster. It's true that we log how many milliseconds ago the exception occurred but I think that could easily be overlooked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I fixed it here 283fdea
run elasticsearch-ci/2 |
run elasticsearch-ci/bwc |
When performing the test with 57 master-eligible nodes and one node crash, we saw messy elections, when multiple nodes were attempting to become master. JoinHelper has logged 105 long log messages with lengthy stack traces during one such election. To address this, we decided to log these messages every time only on debug level. We will log last unsuccessful join attempt (along with a timestamp) if any with WARN level if the cluster is failing to form. (cherry picked from commit 17a148c)
When performing the test with 57 master-eligible nodes and one node crash, we saw messy elections, when multiple nodes were attempting to become master. JoinHelper has logged 105 long log messages with lengthy stack traces during one such election. To address this, we decided to log these messages every time only on debug level. We will log last unsuccessful join attempt (along with a timestamp) if any with WARN level if the cluster is failing to form. (cherry picked from commit 17a148c)
When performing the test with 57 master-eligible nodes and one node
crash, we saw messy elections, when multiple nodes were attempting to
become master.
JoinHelper
has logged 105 long log messages with lengthy stacktraces during one such election.
To address this, we decided to log these messages every time only on
debug level.
We will log last unsuccessful join attempt (along with a timestamp)
if any with WARN level if the cluster is failing to form.