Do not log unsuccessful join attempt each time #39756

andrershov · 2019-03-06T17:16:50Z

In a test with 57 master-eligible nodes in which one node crashed,
we saw messy elections in which multiple nodes attempted to become
master.
JoinHelper logged 105 long messages with lengthy stack traces
during one such election.
To address this, we now log each unsuccessful join attempt only at
DEBUG level.
If the cluster is failing to form, we log the last unsuccessful join
attempt, if any, at WARN level along with a timestamp.

elasticmachine · 2019-03-06T17:17:19Z

Pinging @elastic/es-distributed

andrershov · 2019-03-07T08:33:25Z

run elasticsearch-ci/bwc

DaveCTurner

I think this is a great move. I left a handful of small suggestions.

...src/test/java/org/elasticsearch/cluster/coordination/ClusterFormationFailureHelperTests.java

DaveCTurner · 2019-03-07T11:42:36Z

server/src/main/java/org/elasticsearch/cluster/coordination/JoinHelper.java

+
+    void logLastFailedJoinAttempt() {
+        if (lastFailedJoinAttempt != null) {
+            lastFailedJoinAttempt.logWarnWithTimestamp();


I think this can throw an NPE because we read the volatile field twice.

Good catch! 0b19f03
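
For readers following the thread, here is a minimal sketch of the hazard and the usual fix. It is not the code from the PR: `VolatileReadSketch`, `describe()` and the `warnings` list are hypothetical stand-ins, and only the field name `lastFailedJoinAttempt` and the method name `logLastFailedJoinAttempt` come from the diff. The point is that a volatile field can change between two reads, so the null check and the dereference should use one local copy.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for JoinHelper; only the field and method names
// lastFailedJoinAttempt / logLastFailedJoinAttempt come from the diff.
class VolatileReadSketch {
    static final class FailedJoinAttempt {
        final String message;

        FailedJoinAttempt(String message) {
            this.message = message;
        }

        String describe() {
            return "last failed join attempt: " + message;
        }
    }

    // Written by one thread and read by another, so its value may change
    // between two consecutive reads.
    volatile FailedJoinAttempt lastFailedJoinAttempt;

    // Collects messages instead of using a real logger (an assumption made
    // to keep the sketch self-contained).
    final List<String> warnings = new ArrayList<>();

    // Racy variant: the field is read twice, so another thread can set it
    // to null between the check and the dereference.
    void logLastFailedJoinAttemptRacy() {
        if (lastFailedJoinAttempt != null) {
            warnings.add(lastFailedJoinAttempt.describe());
        }
    }

    // Safe variant: read the field once into a local and use only the local.
    void logLastFailedJoinAttempt() {
        FailedJoinAttempt attempt = lastFailedJoinAttempt;
        if (attempt != null) {
            warnings.add(attempt.describe());
        }
    }

    public static void main(String[] args) {
        VolatileReadSketch sketch = new VolatileReadSketch();
        sketch.logLastFailedJoinAttempt(); // nothing recorded yet, no warning
        sketch.lastFailedJoinAttempt = new FailedJoinAttempt("connection refused");
        sketch.logLastFailedJoinAttempt();
        System.out.println(sketch.warnings);
    }
}
```

The racy variant will usually work in a single-threaded run, which is why this kind of bug tends to be found in review rather than in tests.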

DaveCTurner · 2019-03-07T11:51:25Z

server/src/main/java/org/elasticsearch/cluster/coordination/JoinHelper.java

+            }
+        }
+
+        boolean isSuspiciousTransportException(TransportException e) {


Can this be static? Also, perhaps return the Level and then you can just call logger.log(isSuspiciousTransportException(exception),.... Also, this is worthy of a test.

Done as a part of this commit 737e7f8

Testing is added here ce3e69a
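
A minimal sketch of the suggestion above, under stated assumptions: it uses `java.util.logging` rather than the logging library used by Elasticsearch, and `ExpectedRejectionException` is a hypothetical type standing in for whichever failures are considered routine during an election. Only the name `getLogLevel` comes from the commit title. Returning the level from a static, side-effect-free method lets the caller log in one line and makes the decision easy to unit test.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class LogLevelSketch {
    // Hypothetical exception type: a failure that is expected during a
    // contested election and therefore not worth a prominent log line.
    static class ExpectedRejectionException extends RuntimeException {
        ExpectedRejectionException(String message) {
            super(message);
        }
    }

    private static final Logger logger = Logger.getLogger(LogLevelSketch.class.getName());

    // Static and free of side effects, so a test can call it directly.
    // FINE plays the role of a debug level in java.util.logging.
    static Level getLogLevel(Throwable cause) {
        if (cause instanceof ExpectedRejectionException) {
            return Level.FINE;
        }
        return Level.INFO;
    }

    // With the level returned by getLogLevel, logging is a one-liner.
    static void logNow(Throwable cause) {
        logger.log(getLogLevel(cause), "failed to join", cause);
    }

    public static void main(String[] args) {
        logNow(new ExpectedRejectionException("term is stale"));
        logNow(new IllegalStateException("unexpected failure"));
    }
}
```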


DaveCTurner · 2019-03-07T11:53:50Z

server/src/main/java/org/elasticsearch/cluster/coordination/JoinHelper.java

+            this.timestamp = System.nanoTime();
+        }
+
+        void maybeLogNow() {


If you make isSuspiciousTransportException return the level then this becomes a one-liner so it's probably simpler to inline it.

Good idea, done as a part of this commit. 737e7f8


andrershov · 2019-03-08T10:44:10Z

@DaveCTurner thanks for your review, it's ready for the second pass.

DaveCTurner

Raised a couple of nits. No need for another review, LGTM either way.

DaveCTurner · 2019-03-08T12:34:29Z

server/src/main/java/org/elasticsearch/cluster/coordination/JoinHelper.java

+            this.timestamp = System.nanoTime();
+        }
+
+        void maybeLogNow() {


Maybe inline this? At least rename it to avoid using maybe since it always logs at some level or other.


ywelsch · 2019-03-11T10:57:41Z

server/src/main/java/org/elasticsearch/cluster/coordination/JoinHelper.java

+        }
+
+        static Level getLogLevel(TransportException e) {
+            if (e instanceof RemoteTransportException) {


perhaps simpler (and more streamlined with other similar code) is to just call unwrapCause on TransportException, i.e. Throwable cause = e.unwrapCause();

Good idea, 5409831
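
To illustrate the idea of unwrapping before classifying, here is a small sketch. It does not reproduce the Elasticsearch method: `WrapperException` and this `unwrapCause` helper are hypothetical, and the depth limit is an assumption added to guard against cyclic cause chains. The helper strips layers that only carry another failure, so that one `instanceof` check on the result replaces a separate branch per wrapper type.

```java
class UnwrapCauseSketch {
    // Hypothetical wrapper type: an exception whose only job is to carry a
    // failure that happened elsewhere (for example on a remote node).
    static class WrapperException extends RuntimeException {
        WrapperException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Strips wrapper layers and returns the first throwable that is not a
    // wrapper. The depth limit guards against cyclic cause chains.
    static Throwable unwrapCause(Throwable t) {
        Throwable result = t;
        int depth = 0;
        while (result instanceof WrapperException && result.getCause() != null && depth < 10) {
            result = result.getCause();
            depth++;
        }
        return result;
    }

    public static void main(String[] args) {
        Throwable root = new IllegalStateException("not master");
        Throwable wrapped = new WrapperException("remote", new WrapperException("transport", root));
        System.out.println(unwrapCause(wrapped) == root);
    }
}
```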

ywelsch · 2019-03-11T11:02:54Z

server/src/main/java/org/elasticsearch/cluster/coordination/JoinHelper.java

+    void logLastFailedJoinAttempt() {
+        FailedJoinAttempt attempt = lastFailedJoinAttempt;
+        if (attempt != null) {
+            attempt.logWarnWithTimestamp();


should we avoid repeatedly logging the same failed join attempt? I wonder if we should set lastFailedJoinAttempt back to null here (and use AtomicReference)

I think we need to log lastFailedJoinAttempt each time we log a warning in ClusterFormationFailureHelper, because that way we know that the cluster still cannot form and that the failed join attempt could still be the reason. Otherwise, the next time ClusterFormationFailureHelper logs a warning with no failed join attempt, we don't know whether the join issue has actually been resolved.

I'm with @ywelsch on this one. Imagine one join attempt fails with an exception and then the network is disconnected. We will continue to log this join exception, with a new timestamp each time, and I think that's confusing because this exception has nothing to do with the ongoing failure to form a cluster. It's true that we log how many milliseconds ago the exception occurred but I think that could easily be overlooked.

Ok, I fixed it here 283fdea
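
A minimal sketch of the log-once behaviour discussed in this thread, assuming the `AtomicReference` approach suggested above. `LogOnceSketch`, `onJoinFailure` and the `warnings` list are hypothetical, and a `String` stands in for the recorded attempt. `getAndSet(null)` reads the stored attempt and clears it in a single atomic step, so each failed attempt is reported at most once even if the logging method runs concurrently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

class LogOnceSketch {
    // Holds the most recent failed join attempt, or null if none is pending.
    // A String stands in for the real attempt object (an assumption).
    final AtomicReference<String> lastFailedJoinAttempt = new AtomicReference<>();

    // Collects messages instead of using a real logger.
    final List<String> warnings = new ArrayList<>();

    // Called whenever a join attempt fails; overwrites any older attempt.
    void onJoinFailure(String description) {
        lastFailedJoinAttempt.set(description);
    }

    // Called from the periodic "cluster is failing to form" warning.
    // Returns true if a pending attempt was logged.
    boolean logLastFailedJoinAttempt() {
        String attempt = lastFailedJoinAttempt.getAndSet(null);
        if (attempt == null) {
            return false;
        }
        warnings.add("last failed join attempt: " + attempt);
        return true;
    }

    public static void main(String[] args) {
        LogOnceSketch sketch = new LogOnceSketch();
        sketch.onJoinFailure("connection refused");
        System.out.println(sketch.logLastFailedJoinAttempt());
        System.out.println(sketch.logLastFailedJoinAttempt());
    }
}
```

The trade-off raised earlier in the thread still applies: after the reset, a warning with no join failure attached means only that no new failure was recorded since the previous warning.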

andrershov · 2019-03-12T15:54:49Z

run elasticsearch-ci/2

andrershov · 2019-03-12T15:54:58Z

run elasticsearch-ci/bwc

In a test with 57 master-eligible nodes in which one node crashed, we saw messy elections in which multiple nodes attempted to become master. JoinHelper logged 105 long messages with lengthy stack traces during one such election. To address this, we now log each unsuccessful join attempt only at DEBUG level. If the cluster is failing to form, we log the last unsuccessful join attempt, if any, at WARN level along with a timestamp. (cherry picked from commit 17a148c)

Log last failed join attempt

a48b6ca

andrershov requested review from ywelsch and DaveCTurner March 6, 2019 17:16

andrershov added the :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. label Mar 6, 2019

andrershov added >enhancement v8.0.0 v7.2.0 v7.0.0 labels Mar 6, 2019

andrershov changed the title ~~Log last failed join attempt~~ Do not log unsuccessful join attempt each time Mar 6, 2019

DaveCTurner requested changes Mar 7, 2019


andrershov removed the request for review from ywelsch March 8, 2019 10:04

Andrey Ershov added 4 commits March 8, 2019 11:18

Test that runnable is called

452af96

Avoid NPE due to volatile double read

0b19f03

isSuspiciousTransportException -> getLogLevel

737e7f8

Test for log level

ce3e69a

andrershov requested a review from DaveCTurner March 8, 2019 10:44

DaveCTurner approved these changes Mar 8, 2019


ywelsch reviewed Mar 11, 2019


Andrey Ershov added 3 commits March 11, 2019 15:00

maybeLogNow -> logNow

7c0798f

Change log order

765f66f

unwrapCause

5409831

ywelsch mentioned this pull request Mar 11, 2019

A new cluster coordination layer #32006

Closed

61 tasks

Andrey Ershov added 2 commits March 12, 2019 15:39

Remove unused import

83f1bb5

Reset lastFailedJoinAttempt

283fdea

Merge branch 'master' into log_last_failed_join_attempt

463a9b2

andrershov merged commit 17a148c into elastic:master Mar 13, 2019

jakelandis added v7.0.0-rc2 and removed v7.0.0 labels Apr 3, 2019

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
