-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
allow rate-limiting on data nodes (for shards.tolerant=true) #239
base: fs/branch_9_3
Are you sure you want to change the base?
Conversation
opening this against fs/branch_9_3 for quick evaluation |
this is a quick-and-dirty hack that will work for our usage, but this should be reconsidered and something more general committed upstream. Honestly it doesn't make sense to hardcode handling of this at the level of RateLimitManager. Individual rate limiters should be specified as plugins, with the context necessary to make their own decisions.
16e02d9
to
1f52d51
Compare
SolrRequest.SolrClientContext context = getContext(); | ||
req.header(CommonParams.SOLR_REQUEST_CONTEXT_PARAM, context.toString()); | ||
if (context == SolrRequest.SolrClientContext.CLIENT | ||
|| solrRequest.getParams().getBool(ShardParams.SHARDS_TOLERANT, false)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whole purpose of ratelimiter to limit the resources on all nodes. Not sure we really need to make any special case here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for differentiating between shards.tolarant=true
vs. shards.tolerant=false
is explained in the comment immediately below:
// NOTE: if `shards.tolerant=false`, do _not_ set the `Solr-Request-Type` header, because we
// could end up doing a lot of extra work at the cluster level, retrying requests that may
// only have failed to obtain a ratelimit permit on a single shard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For our case in particular, practically we actually do want to avoid failing any requests that are shards.tolerant=false
. Notably these are also the most likely requests to be retried on failure, so if we end up repeatedly executing requests on all nodes, only to repeatedly fail because of 1 struggling node (for example), that could easily cause load to increase on the other nodes to the point where the problem spreads to the entire cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess these are two different problems. Rate limiting should be agnostic to any parameters.
if we see any issue with shards.tolerant=false
or single node, then we need to track that in that context. We don't want to increase the load on other node same time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you're suggesting to do here. If we see an issue with shards.tolerant=false
on a single node, we will already have increased the load on other nodes (requests to nodes are issued in parallel), and if the top-level request fails due to the one node rate-limiting, then the client is likely to retry, increasing the load on the cluster overall (the exact situation that we both agree we want to avoid).
Rate limiting should be agnostic to any parameters
Why do you say this? I think the status quo (evaluate rate limiting only on the coordinator node) is due to the potential for rate-limiting evaluation on data-nodes to amplify request load. So if we really want rate limiting to be agnostic to any parameters, then I think rate-limiting on data nodes must be avoided entirely (due to the request amplification issue).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's chat on this.
this is a quick-and-dirty hack that will work for our usage, but this should be reconsidered and something more general committed upstream.
Honestly it doesn't make sense to hardcode handling of this at the level of RateLimitManager. Individual rate limiters should be specified as plugins, with the context necessary to make their own decisions.