-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Overly Aggressive Request DeDuplication #51270
Merged
Merged
Changes from all commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -53,6 +53,14 @@ public void executeOnce(T request, ActionListener<Void> listener, BiConsumer<T, | |
} | ||
} | ||
|
||
/** | ||
* Remove all tracked requests from this instance so that the first time {@link #executeOnce} is invoked with any request it triggers | ||
* an actual request execution. Use this e.g. for requests to master that need to be sent again on master failover. | ||
*/ | ||
public void clear() { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I know there's just one use case for this now, but just dropping all deduplication seems to be the right approach to master failovers in general if we ever want to reuse this for e.g. put mapping calls or so. |
||
requests.clear(); | ||
} | ||
|
||
public int size() { | ||
return requests.size(); | ||
} | ||
|
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why this is needed.
UpdateSnapshotStatusAction
is aTransportMasterNodeAction
and should resend the request in case of master failover.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think currently we're only retrying on
FailedToCommitClusterStateException
andNotMasterException
but not on things like the node closed exception. That's why we have the manual retry/re-send of these messages inSnapshotShardsService
in the first place don't we?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused by this stack trace. It shows that the node that's sending the request is closed, no? Why would we want to retry sending a request from a node that's shut down?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry pasted the wrong trace:
We have the same for the remote end as well. I'm not 100% sure we can get that exception outside of a test (because of the shut down order, but as you can see at least in the test requests are clearly not retried on the master node restart
tm_0
and just go to the listener'sonFailure
). We could improve the master action to retry this case as well, but as long as we have the fail-safe retry on master failover in the snapshot shard's service already, we should make that actually work?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a bug in
TransportMasterNodeAction
, then. In case where a node is shut down, we first shut down the cluster service, and only then do we shut down the transport service. This means that a node that has sent aMasterNodeRequest
to a master node that is in the process of shutting down can receive aTransportException
with the cause being aNodeClosedException
, for which we don't trigger a retry (we only do that for aConnectTransportException
). Can we fix that instead of this hack here (which also fixes this situation for any other master-level request)? Long term, the logic insyncShardStatsOnNewMaster
can go away. I think it's mainly still here because theShardSnapshotStatus
action was not aTransportMasterNodeAction
in some older versions.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ywelsch sounds good, I'll look into fixing the
TransportMasterNodeAction
and removing the retry in the snapshot shards service then shortly :)Still, so long as we have
syncShardStatsOnNewMaster
I feel like we should have this hack in place as without it that method doesn't really make sense?