
allow rate-limiting on data nodes (for shards.tolerant=true) #239

Open · wants to merge 1 commit into fs/branch_9_3 from michaelgibney/data-node-ratelimiting

Conversation

@magibney (Collaborator) commented Dec 9, 2024

this is a quick-and-dirty hack that will work for our usage, but this should be reconsidered and something more general committed upstream.

Honestly it doesn't make sense to hardcode handling of this at the level of RateLimitManager. Individual rate limiters should be specified as plugins, with the context necessary to make their own decisions.

@magibney (Collaborator, Author) commented Dec 9, 2024

opening this against fs/branch_9_3 for quick evaluation

@magibney force-pushed the michaelgibney/data-node-ratelimiting branch from 16e02d9 to 1f52d51 on December 9, 2024 at 15:44
SolrRequest.SolrClientContext context = getContext();
req.header(CommonParams.SOLR_REQUEST_CONTEXT_PARAM, context.toString());
if (context == SolrRequest.SolrClientContext.CLIENT
|| solrRequest.getParams().getBool(ShardParams.SHARDS_TOLERANT, false)) {
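The gating condition in the diff can be sketched in isolation (a minimal sketch: `SolrClientContext` here is a simplified stand-in for the real Solr enum, and `HeaderGate`/`shouldSetRequestTypeHeader` are hypothetical names, not Solr API):

```java
// Hypothetical stand-ins mirroring the diff above, not actual Solr classes.
enum SolrClientContext { CLIENT, SERVER }

class HeaderGate {
    // Decide whether the request-type header should be set on the outgoing
    // request: always for external (CLIENT) traffic, and for internal traffic
    // only when the request is shards.tolerant=true, so that a rate-limit
    // rejection on one shard does not fail an intolerant fan-out request.
    static boolean shouldSetRequestTypeHeader(SolrClientContext context, boolean shardsTolerant) {
        return context == SolrClientContext.CLIENT || shardsTolerant;
    }
}
```

In other words, the header (and hence data-node rate limiting) is skipped only for internal, shards.tolerant=false requests.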
Collaborator:
The whole purpose of the rate limiter is to limit resource usage on all nodes. I'm not sure we really need to make a special case here.

Collaborator (Author):

The reason for differentiating between shards.tolerant=true and shards.tolerant=false is explained in the comment immediately below:

      // NOTE: if `shards.tolerant=false`, do _not_ set the `Solr-Request-Type` header, because we
      // could end up doing a lot of extra work at the cluster level, retrying requests that may
      // only have failed to obtain a ratelimit permit on a single shard.

Collaborator (Author):

For our case in particular, we do in practice want to avoid failing any requests that are shards.tolerant=false. Notably, these are also the requests most likely to be retried on failure, so if we end up repeatedly executing requests on all nodes, only to repeatedly fail because of one struggling node (for example), load on the other nodes could easily increase to the point where the problem spreads to the entire cluster.
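The amplification concern can be made concrete with a back-of-the-envelope model (illustrative numbers and the `RetryAmplification` class are mine, not from the PR): a fan-out request touches N data nodes in parallel, so if one node rejects it and the client retries R times, every retry re-executes work on all N nodes even though only one node was overloaded.

```java
// Illustrative model of client-retry amplification for fan-out requests.
class RetryAmplification {
    // Total shard-level requests issued: the initial attempt plus each retry,
    // every one of which hits all nodes in parallel.
    static int totalShardRequests(int nodes, int retries) {
        return nodes * (1 + retries);
    }
}
```

With, say, 10 nodes and 3 client retries triggered by a single rate-limited shard, the cluster absorbs 40 shard requests instead of 10, most of them against healthy nodes.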

Collaborator:

I guess these are two different problems. Rate limiting should be agnostic to request parameters.
If we see an issue with shards.tolerant=false or with a single node, then we need to track it in that context. We don't want to increase the load on the other nodes at the same time.

Collaborator (Author):

I'm not sure what you're suggesting to do here. If we see an issue with shards.tolerant=false on a single node, we will already have increased the load on other nodes (requests to nodes are issued in parallel), and if the top-level request fails due to the one node rate-limiting, then the client is likely to retry, increasing the load on the cluster overall (the exact situation that we both agree we want to avoid).

> Rate limiting should be agnostic to any parameters

Why do you say this? I think the status quo (evaluate rate limiting only on the coordinator node) is due to the potential for rate-limiting evaluation on data-nodes to amplify request load. So if we really want rate limiting to be agnostic to any parameters, then I think rate-limiting on data nodes must be avoided entirely (due to the request amplification issue).
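The per-shard decision being debated can be sketched as a data-node-side permit check (a minimal sketch with hypothetical names; the real Solr rate limiters are more involved, though they are similarly permit-based):

```java
import java.util.concurrent.Semaphore;

// Hypothetical data-node admission check illustrating the PR's approach:
// shards.tolerant=true requests may be shed on an individual shard, while
// shards.tolerant=false requests bypass the limiter so that one saturated
// node cannot fail (and trigger retries of) the whole fan-out request.
class DataNodeLimiter {
    private final Semaphore permits;

    DataNodeLimiter(int maxConcurrent) {
        permits = new Semaphore(maxConcurrent);
    }

    // Returns true if the shard request may proceed on this node.
    boolean admit(boolean shardsTolerant) {
        if (!shardsTolerant) return true;  // never rate-limit intolerant fan-out requests
        return permits.tryAcquire();       // tolerant requests may be rejected per-shard
    }

    void release(boolean shardsTolerant) {
        if (shardsTolerant) permits.release();
    }
}
```

Under this sketch a tolerant request that loses the permit race simply yields a partial result on that shard, which is exactly the degradation shards.tolerant=true opts into.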

Collaborator:

Let's chat on this.
