Move monitoring collection timeouts to coordinator #67084

Contributor

@DaveCTurner DaveCTurner commented Jan 6, 2021

With #66993 there is now support for coordinator-side timeouts on a
BroadcastRequest, which includes requests for node stats and
recoveries. This commit adjusts Monitoring to use these coordinator-side
timeouts where applicable, which will prevent partial stats responses
from accumulating on the master while one or more nodes are not
responding quickly enough. It also enhances the message logged on a
timeout to include the IDs of the nodes which did not respond in time.

Closes #60188.
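
To make the idea of a coordinator-side timeout concrete, here is a minimal sketch using only `java.util.concurrent`. It is not Elasticsearch's implementation: `CoordinatorTimeoutSketch`, `Result`, and `collect` are hypothetical names, and each node's reply is modelled as a `CompletableFuture<String>`. The coordinator applies one shared deadline, cancels any reply still pending when it expires, and records the IDs of the nodes that missed it.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; none of these names come from Elasticsearch.
final class CoordinatorTimeoutSketch {

    /** Replies received in time, plus the IDs of nodes that missed the deadline. */
    static final class Result {
        final Map<String, String> stats = new TreeMap<>();
        final Set<String> timedOutNodeIds = new TreeSet<>();
    }

    /** Waits for every node's reply, but never past one shared deadline. */
    static Result collect(Map<String, CompletableFuture<String>> perNode, long timeoutMillis) {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        final Result result = new Result();
        for (Map.Entry<String, CompletableFuture<String>> entry : perNode.entrySet()) {
            final long remainingNanos = Math.max(0L, deadline - System.nanoTime());
            try {
                result.stats.put(entry.getKey(), entry.getValue().get(remainingNanos, TimeUnit.NANOSECONDS));
            } catch (TimeoutException e) {
                // Cancel the pending reply so the coordinator stops waiting on it.
                entry.getValue().cancel(true);
                result.timedOutNodeIds.add(entry.getKey());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while collecting", e);
            } catch (ExecutionException e) {
                throw new IllegalStateException("node [" + entry.getKey() + "] failed", e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, CompletableFuture<String>> perNode = new TreeMap<>();
        perNode.put("node-a", CompletableFuture.completedFuture("ok"));
        perNode.put("node-b", new CompletableFuture<>()); // never completes
        Result result = collect(perNode, 50);
        System.out.println("stats=" + result.stats + " timedOut=" + result.timedOutNodeIds);
    }
}
```

With a purely client-side timeout, the caller would stop waiting but nothing would cancel the outstanding replies. In this sketch the deadline lives in the collecting code itself, which is the property the description above relies on.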

@elasticmachine
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

Member

@dakrone dakrone left a comment

I left some minor comments on this, but I'll defer to Jake for its approval/disapproval

Comment on lines 87 to 88
logger.error((Supplier<?>) () ->
new ParameterizedMessage("collector [{}] timed out when collecting data: {}", name(), e.getMessage()));
Member

(super minor nit)

Why do we use a ParameterizedMessage in any case here, since the exception is not actually passed? Could this be instead

Suggested change
- logger.error((Supplier<?>) () ->
-     new ParameterizedMessage("collector [{}] timed out when collecting data: {}", name(), e.getMessage()));
+ logger.error("collector [{}] timed out when collecting data: {}", name(), e.getMessage());

?

Contributor Author

It's a valid question. I don't know; I was just making minimal adjustments to the existing code. Fixed in f68ef2e.

import java.util.HashSet;
import java.util.concurrent.TimeoutException;

public final class TimeoutUtils {
Member

Can you add javadocs for this class as well as its public static methods please?

Contributor Author

Sure, see 6b845ba.


- final NodesStatsResponse response = client.admin().cluster().nodesStats(request).actionGet(getCollectionTimeout());
+ final NodesStatsResponse response = client.admin().cluster().nodesStats(request).actionGet();
Member

Do we need an ensureNoTimeouts(getCollectionTimeout(), response); after this line as well?

Contributor Author

Hmm, I was thinking no since we throw response.failures().get(0) anyway, but on reflection that'll be a FailedNodeException not the inner timeout. I'll address that.

Contributor Author

Ok done, see 56ce978.
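
The check discussed in this thread can be sketched roughly as follows. This is a hedged illustration, not the `TimeoutUtils` code from the PR: `NodeFailure` is a hypothetical stand-in for a per-node failure wrapper such as `FailedNodeException`, and the method signature and message format are assumptions. The point is that the timeout is the wrapper's cause rather than the wrapper itself, so the check has to look through the cause chain before it can name the slow nodes.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; not the TimeoutUtils class added by the PR.
final class TimeoutCheckSketch {

    /** Hypothetical per-node failure that wraps the underlying cause. */
    static final class NodeFailure extends Exception {
        final String nodeId;

        NodeFailure(String nodeId, Throwable cause) {
            super("failed node [" + nodeId + "]", cause);
            this.nodeId = nodeId;
        }
    }

    /** True if the throwable, or anything in its cause chain, is a TimeoutException. */
    static boolean isTimeout(Throwable t) {
        for (Throwable current = t; current != null; current = current.getCause()) {
            if (current instanceof TimeoutException) {
                return true;
            }
        }
        return false;
    }

    /** Throws a single TimeoutException naming every node whose failure was a timeout. */
    static void ensureNoTimeouts(long timeoutMillis, List<NodeFailure> failures) throws TimeoutException {
        final Set<String> timedOutNodeIds = new TreeSet<>();
        for (NodeFailure failure : failures) {
            if (isTimeout(failure)) {
                timedOutNodeIds.add(failure.nodeId);
            }
        }
        if (!timedOutNodeIds.isEmpty()) {
            throw new TimeoutException(
                "timed out after [" + timeoutMillis + "ms] while collecting data from nodes " + timedOutNodeIds);
        }
    }

    public static void main(String[] args) {
        List<NodeFailure> failures = new ArrayList<>();
        failures.add(new NodeFailure("node-1", new TimeoutException("no reply")));
        failures.add(new NodeFailure("node-2", new IOException("disconnected")));
        try {
            ensureNoTimeouts(10_000L, failures);
        } catch (TimeoutException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Non-timeout failures such as the `IOException` above pass through this check untouched, so the caller can still surface them separately, for example by rethrowing the first failure as the collector already does.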

@dakrone
Member

dakrone commented Jan 6, 2021

Also, if this is truly a bug fix rather than enhancement, maybe it should be backported to 7.11.x also?

@DaveCTurner
Contributor Author

> Also, if this is truly a bug fix rather than enhancement, maybe it should be backported to 7.11.x also?

I'm ambivalent. It's not a bug we see very often.

@DaveCTurner
Contributor Author

@elasticmachine please run elasticsearch-ci/2 -- I opened #67119 for the failure.

Contributor

@jakelandis jakelandis left a comment

LGTM, thanks for addressing this!

Also, I don't think this needs to be backported to the current patch release. IMO, without a specific compelling reason to backport it, this is a bit too much internal behavior change for a patch release.

@DaveCTurner DaveCTurner merged commit 1d2462e into elastic:master Jan 11, 2021
@DaveCTurner DaveCTurner deleted the 2021-01-06-monitoring-timeouts-in-coordinator branch January 11, 2021 07:29
@DaveCTurner
Contributor Author

Thanks @dakrone & @jakelandis.

DaveCTurner added a commit that referenced this pull request Jan 11, 2021
Successfully merging this pull request may close these issues.

Client-side stats collection timeouts can result in overloaded master