Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not log unsuccessful join attempt each time #39756

Merged
merged 11 commits into from
Mar 13, 2019

Conversation

andrershov
Copy link
Contributor

When performing the test with 57 master-eligible nodes and one node
crash, we saw messy elections, when multiple nodes were attempting to
become master.
JoinHelper has logged 105 long log messages with lengthy stack
traces during one such election.
To address this, we decided to log these messages every time only on
debug level.
We will log last unsuccessful join attempt (along with a timestamp)
if any with WARN level if the cluster is failing to form.

@andrershov andrershov added the :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. label Mar 6, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@andrershov andrershov changed the title Log last failed join attempt Do not log unsuccessful join attempt each time Mar 6, 2019
@andrershov
Copy link
Contributor Author

run elasticsearch-ci/bwc

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a great move. I left a handful of small suggestions.


void logLastFailedJoinAttempt() {
if (lastFailedJoinAttempt != null) {
lastFailedJoinAttempt.logWarnWithTimestamp();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can throw a NPE because we read the volatile field twice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! 0b19f03

}
}

boolean isSuspiciousTransportException(TransportException e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be static? Also, perhaps return the Level and then you can just call logger.log(isSuspiciousTransportException(exception),.... Also, this is worthy of a test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done as a part of this commit 737e7f8

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing is added here ce3e69a

this.timestamp = System.nanoTime();
}

void maybeLogNow() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you make isSuspiciousTransportException return the level then this becomes a one-liner so it's probably simpler to inline it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, done as a part of this commit. 737e7f8

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe inline this? At least rename it to avoid using maybe since it always logs at some level or other.

@andrershov andrershov removed the request for review from ywelsch March 8, 2019 10:04
@andrershov
Copy link
Contributor Author

@DaveCTurner thanks for your review, it's ready for the second pass.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raised a couple of nits. No need for another review, LGTM either way.

this.timestamp = System.nanoTime();
}

void maybeLogNow() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe inline this? At least rename it to avoid using maybe since it always logs at some level or other.

}

static Level getLogLevel(TransportException e) {
if (e instanceof RemoteTransportException) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps simpler (and more streamlined with other similar code) is to just call unwrapCause on TransportException, i.e. Throwable cause = e.unwrapCause();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, 5409831

void logLastFailedJoinAttempt() {
FailedJoinAttempt attempt = lastFailedJoinAttempt;
if (attempt != null) {
attempt.logWarnWithTimestamp();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we avoid repeatedly logging the same failed join attempt? I wonder if we should set lastFailedJoinAttempt back to null here (and use AtomicReference)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we need to log lastFailedJoinAttempt each time we log a warn in ClusterFormationHelper, because that way we know that cluster still can not form and failed join attempt still could be the reason for it. Otherwise, next time ClusterFormationHelper logs a warning and no failed join attempt, we don't know if the issue with join is actually resolved.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm with @ywelsch on this one. Imagine one join attempt fails with an exception and then the network is disconnected. We will continue to log this join exception, with a new timestamp each time, and I think that's confusing because this exception has nothing to do with the ongoing failure to form a cluster. It's true that we log how many milliseconds ago the exception occurred but I think that could easily be overlooked.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I fixed it here 283fdea

@ywelsch ywelsch mentioned this pull request Mar 11, 2019
61 tasks
@andrershov
Copy link
Contributor Author

run elasticsearch-ci/2

@andrershov
Copy link
Contributor Author

run elasticsearch-ci/bwc

@andrershov andrershov merged commit 17a148c into elastic:master Mar 13, 2019
andrershov pushed a commit to andrershov/elasticsearch that referenced this pull request Mar 13, 2019
When performing the test with 57 master-eligible nodes and one node
crash, we saw messy elections, when multiple nodes were attempting to
become master.
JoinHelper has logged 105 long log messages with lengthy stack
traces during one such election.
To address this, we decided to log these messages every time only on
debug level.
We will log last unsuccessful join attempt (along with a timestamp)
if any with WARN level if the cluster is failing to form.

(cherry picked from commit 17a148c)
andrershov pushed a commit that referenced this pull request Mar 13, 2019
When performing the test with 57 master-eligible nodes and one node
crash, we saw messy elections, when multiple nodes were attempting to
become master.
JoinHelper has logged 105 long log messages with lengthy stack
traces during one such election.
To address this, we decided to log these messages every time only on
debug level.
We will log last unsuccessful join attempt (along with a timestamp)
if any with WARN level if the cluster is failing to form.

(cherry picked from commit 17a148c)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement v7.0.0-rc2 v7.2.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants