-
Notifications
You must be signed in to change notification settings - Fork 24.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move monitoring collection timeouts to coordinator #67084
Move monitoring collection timeouts to coordinator #67084
Conversation
With elastic#66993 there is now support for coordinator-side timeouts on a `BroadcastRequest`, which includes requests for node stats and recoveries. This commit adjusts Monitoring to use these coordinator-side timeouts where applicable, which will prevent partial stats responses from accumulating on the master while one or more nodes are not responding quickly enough. It also enhances the message logged on a timeout to include the IDs of the nodes which did not respond in time. Closes elastic#60188.
Pinging @elastic/es-core-features (Team:Core/Features) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some minor comments on this, but I'll defer to Jake for its approval/disapproval
logger.error((Supplier<?>) () -> | ||
new ParameterizedMessage("collector [{}] timed out when collecting data: {}", name(), e.getMessage())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(super minor nit)
Why do we use a ParameterizedMessage
in the any case here, since the exception is not actually passed? Could this be instead
logger.error((Supplier<?>) () -> | |
new ParameterizedMessage("collector [{}] timed out when collecting data: {}", name(), e.getMessage())); | |
logger.error("collector [{}] timed out when collecting data: {}", name(), e.getMessage()); |
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a valid question, I don't know, I was just making minimal adjustments to the existing code. Fixed in f68ef2e.
import java.util.HashSet; | ||
import java.util.concurrent.TimeoutException; | ||
|
||
public final class TimeoutUtils { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add javadocs for this class as well as its public static methods please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, see 6b845ba.
|
||
final NodesStatsResponse response = client.admin().cluster().nodesStats(request).actionGet(getCollectionTimeout()); | ||
final NodesStatsResponse response = client.admin().cluster().nodesStats(request).actionGet(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need an ensureNoTimeouts(getCollectionTimeout(), response);
after this line as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I was thinking no since we throw response.failures().get(0)
anyway, but on reflection that'll be a FailedNodeException
not the inner timeout. I'll address that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok done, see 56ce978.
Also, if this is truly a bug fix rather than enhancement, maybe it should be backported to 7.11.x also? |
I'm ambivalent. It's not a bug we see very often. |
@elasticmachine please run elasticsearch-ci/2 -- I opened #67119 for the failure. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for addressing this !
Also, I don't think this needs to be backported to the current patch release. IMO without any specific compelling reason to back port to the patch, this is abit too much internal behavior changes for a patch release.
Thanks @dakrone & @jakelandis. |
With #66993 there is now support for coordinator-side timeouts on a `BroadcastRequest`, which includes requests for node stats and recoveries. This commit adjusts Monitoring to use these coordinator-side timeouts where applicable, which will prevent partial stats responses from accumulating on the master while one or more nodes are not responding quickly enough. It also enhances the message logged on a timeout to include the IDs of the nodes which did not respond in time. Closes #60188.
With #66993 there is now support for coordinator-side timeouts on a
BroadcastRequest
, which includes requests for indices stats andrecoveries. This commit adjusts Monitoring to use these coordinator-side
timeouts where applicable, which will prevent partial stats responses
from accumulating on the master while one or more nodes are not
responding quickly enough. It also enhances the message logged on a
timeout to include the IDs of the nodes which did not respond in time.
Closes #60188.