Move monitoring collection timeouts to coordinator #67084

DaveCTurner · 2021-01-06T11:38:35Z

With #66993 there is now support for coordinator-side timeouts on a
BroadcastRequest, which includes requests for indices stats and
recoveries. This commit adjusts Monitoring to use these coordinator-side
timeouts where applicable, which will prevent partial stats responses
from accumulating on the master while one or more nodes are not
responding quickly enough. It also enhances the message logged on a
timeout to include the IDs of the nodes which did not respond in time.

Closes #60188.
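
To illustrate the idea, here is a minimal, self-contained sketch of a coordinator-side timeout. It is not the Elasticsearch implementation: `CoordinatorTimeoutSketch`, `Result`, and `collect` are hypothetical names, and plain `CompletableFuture`s stand in for per-node transport requests. The coordinator waits once for the whole fan-out, keeps whatever arrived in time, records the IDs of the nodes that did not respond, and cancels the stragglers so their late responses are not held onto.

```java
import java.time.Duration;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for a coordinator that fans one request out to many nodes.
final class CoordinatorTimeoutSketch {

    // What the caller gets back: the responses that arrived in time, plus the slow node IDs.
    static final class Result {
        final Map<String, String> responses;
        final Set<String> timedOutNodeIds;

        Result(Map<String, String> responses, Set<String> timedOutNodeIds) {
            this.responses = responses;
            this.timedOutNodeIds = timedOutNodeIds;
        }
    }

    static Result collect(Map<String, CompletableFuture<String>> perNode, Duration timeout) {
        CompletableFuture<Void> all =
            CompletableFuture.allOf(perNode.values().toArray(new CompletableFuture<?>[0]));
        try {
            // One wait for the whole fan-out, enforced where the responses are gathered.
            all.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            // Fall through and inspect each node's future individually.
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }

        Map<String, String> responses = new TreeMap<>();
        Set<String> timedOutNodeIds = new TreeSet<>();
        for (Map.Entry<String, CompletableFuture<String>> entry : perNode.entrySet()) {
            CompletableFuture<String> future = entry.getValue();
            if (future.isDone() == false) {
                // Too slow: remember the node ID and stop waiting for its response.
                timedOutNodeIds.add(entry.getKey());
                future.cancel(true);
            } else if (future.isCompletedExceptionally() == false) {
                responses.put(entry.getKey(), future.join());
            }
            // Nodes that failed for other reasons are ignored in this sketch.
        }
        return new Result(responses, timedOutNodeIds);
    }
}
```

With a caller-side timeout alone (the old `actionGet(getCollectionTimeout())` pattern), the caller stops waiting but the coordinator keeps gathering responses from the nodes that did answer.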

elasticmachine · 2021-01-06T11:38:38Z

Pinging @elastic/es-core-features (Team:Core/Features)

dakrone

I left some minor comments on this, but I'll defer to Jake for its approval/disapproval

dakrone · 2021-01-06T16:35:18Z

.../plugin/monitoring/src/main/java/org/elasticsearch/xpack/monitoring/collector/Collector.java

+            logger.error((Supplier<?>) () ->
+                    new ParameterizedMessage("collector [{}] timed out when collecting data: {}", name(), e.getMessage()));


(super minor nit)

Why do we use a ParameterizedMessage in any case here, since the exception is not actually passed? Could this instead be

Suggested change

-            logger.error((Supplier<?>) () ->
-                    new ParameterizedMessage("collector [{}] timed out when collecting data: {}", name(), e.getMessage()));
+            logger.error("collector [{}] timed out when collecting data: {}", name(), e.getMessage());

?

It's a valid question, I don't know, I was just making minimal adjustments to the existing code. Fixed in f68ef2e.

dakrone · 2021-01-06T16:35:42Z

...ugin/monitoring/src/main/java/org/elasticsearch/xpack/monitoring/collector/TimeoutUtils.java

+import java.util.HashSet;
+import java.util.concurrent.TimeoutException;
+
+public final class TimeoutUtils {


Can you add javadocs for this class as well as its public static methods please?

Sure, see 6b845ba.

dakrone · 2021-01-06T16:39:26Z

...ring/src/main/java/org/elasticsearch/xpack/monitoring/collector/node/NodeStatsCollector.java


-        final NodesStatsResponse response = client.admin().cluster().nodesStats(request).actionGet(getCollectionTimeout());
+        final NodesStatsResponse response = client.admin().cluster().nodesStats(request).actionGet();


Do we need an ensureNoTimeouts(getCollectionTimeout(), response); after this line as well?

Hmm, I was thinking no since we throw response.failures().get(0) anyway, but on reflection that'll be a FailedNodeException not the inner timeout. I'll address that.

Ok done, see 56ce978.
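
For readers following along, the point above is that a timeout arrives wrapped inside a per-node failure, so throwing the first failure hides it. Below is a rough sketch of such a check under stated assumptions. It is not the actual `TimeoutUtils` code: `TimeoutCheckSketch` and `NodeFailure` are hypothetical stand-ins (the latter for `FailedNodeException`), and `java.util.concurrent.TimeoutException` stands in for whichever timeout exceptions the transport layer raises.

```java
import java.time.Duration;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.TimeoutException;

final class TimeoutCheckSketch {

    // Hypothetical per-node failure wrapper: carries the node ID and the underlying cause.
    static final class NodeFailure extends Exception {
        final String nodeId;

        NodeFailure(String nodeId, Throwable cause) {
            super("failed node [" + nodeId + "]", cause);
            this.nodeId = nodeId;
        }
    }

    // Throws a single TimeoutException naming every node whose failure was caused by a timeout.
    static void ensureNoTimeouts(Duration collectionTimeout, List<NodeFailure> failures)
        throws TimeoutException {
        Set<String> timedOutNodeIds = new TreeSet<>(); // sorted, so the message is stable
        for (NodeFailure failure : failures) {
            if (hasTimeoutCause(failure)) {
                timedOutNodeIds.add(failure.nodeId);
            }
        }
        if (timedOutNodeIds.isEmpty() == false) {
            throw new TimeoutException(
                "timed out after [" + collectionTimeout.toMillis()
                    + "ms] while collecting data from nodes " + timedOutNodeIds);
        }
    }

    // Walks the cause chain looking for the inner timeout hidden behind the wrapper.
    private static boolean hasTimeoutCause(Throwable wrapper) {
        for (Throwable cause = wrapper.getCause(); cause != null; cause = cause.getCause()) {
            if (cause instanceof TimeoutException) {
                return true;
            }
        }
        return false;
    }
}
```

Failures that are not timeouts pass through this check untouched, so the existing "throw the first failure" handling can still deal with them afterwards.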

dakrone · 2021-01-06T16:41:58Z

Also, if this is truly a bug fix rather than enhancement, maybe it should be backported to 7.11.x also?

DaveCTurner · 2021-01-06T17:35:24Z

> Also, if this is truly a bug fix rather than enhancement, maybe it should be backported to 7.11.x also?

I'm ambivalent. It's not a bug we see very often.

DaveCTurner · 2021-01-06T18:31:51Z

@elasticmachine please run elasticsearch-ci/2 -- I opened #67119 for the failure.

jakelandis

LGTM, thanks for addressing this!

Also, I don't think this needs to be backported to the current patch release. IMO, without a specific compelling reason to backport to the patch, this is a bit too much internal behavior change for a patch release.

DaveCTurner · 2021-01-11T07:29:43Z

Thanks @dakrone & @jakelandis.

With #66993 there is now support for coordinator-side timeouts on a `BroadcastRequest`, which includes requests for node stats and recoveries. This commit adjusts Monitoring to use these coordinator-side timeouts where applicable, which will prevent partial stats responses from accumulating on the master while one or more nodes are not responding quickly enough. It also enhances the message logged on a timeout to include the IDs of the nodes which did not respond in time. Closes #60188.

DaveCTurner added >bug :Data Management/Monitoring v8.0.0 v7.12.0 labels Jan 6, 2021

DaveCTurner requested a review from jakelandis January 6, 2021 11:38

elasticmachine added the Team:Data Management label Jan 6, 2021

DaveCTurner added 4 commits January 6, 2021 12:11

Handle other timeout exceptions too

7b284a3

Better grammar in message

d0da26b

JobStatsCollector can have the same treatment

3c07511

Imports

7588ecd

dakrone reviewed Jan 6, 2021

DaveCTurner added 2 commits January 6, 2021 17:22

Just log the darn message

f68ef2e

Javadoc

6b845ba

Ensure no timeouts in NodeStatsCollector too

56ce978

jakelandis approved these changes Jan 7, 2021

DaveCTurner merged commit 1d2462e into elastic:master Jan 11, 2021

DaveCTurner deleted the 2021-01-06-monitoring-timeouts-in-coordinator branch January 11, 2021 07:29

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
