
allow rate-limiting on data nodes (for shards.tolerant=true) #239

Open · wants to merge 1 commit into fs/branch_9_3 from michaelgibney/data-node-ratelimiting

Conversation

@magibney (Collaborator) commented Dec 9, 2024

this is a quick-and-dirty hack that will work for our usage, but this should be reconsidered and something more general committed upstream.

Honestly it doesn't make sense to hardcode handling of this at the level of RateLimitManager. Individual rate limiters should be specified as plugins, with the context necessary to make their own decisions.

@magibney (Collaborator, Author) commented Dec 9, 2024

opening this against fs/branch_9_3 for quick evaluation

@magibney force-pushed the michaelgibney/data-node-ratelimiting branch from 16e02d9 to 1f52d51 on December 9, 2024 at 15:44
SolrRequest.SolrClientContext context = getContext();
req.header(CommonParams.SOLR_REQUEST_CONTEXT_PARAM, context.toString());
if (context == SolrRequest.SolrClientContext.CLIENT
|| solrRequest.getParams().getBool(ShardParams.SHARDS_TOLERANT, false)) {
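The gating condition in the diff can be sketched in isolation (a minimal sketch: `SolrClientContext` here is a simplified stand-in for the real Solr enum, and `HeaderGate`/`shouldSetRequestTypeHeader` are hypothetical names, not Solr API):

```java
// Hypothetical stand-ins mirroring the diff above, not actual Solr classes.
enum SolrClientContext { CLIENT, SERVER }

class HeaderGate {
    // Decide whether the request-type header should be set on the outgoing
    // request: always for external (CLIENT) traffic, and for internal traffic
    // only when the request is shards.tolerant=true, so that a rate-limit
    // rejection on one shard does not fail an intolerant fan-out request.
    static boolean shouldSetRequestTypeHeader(SolrClientContext context, boolean shardsTolerant) {
        return context == SolrClientContext.CLIENT || shardsTolerant;
    }
}
```

In other words, the header (and hence data-node rate limiting) is skipped only for internal, shards.tolerant=false requests.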
Collaborator:
The whole purpose of the rate limiter is to limit resource usage on all nodes. I'm not sure we really need to make a special case here.

Collaborator (Author):

The reason for differentiating between shards.tolerant=true and shards.tolerant=false is explained in the comment immediately below:

      // NOTE: if `shards.tolerant=false`, do _not_ set the `Solr-Request-Type` header, because we
      // could end up doing a lot of extra work at the cluster level, retrying requests that may
      // only have failed to obtain a ratelimit permit on a single shard.

Collaborator (Author):

For our case in particular, we do in practice want to avoid failing any requests that are shards.tolerant=false. Notably, these are also the requests most likely to be retried on failure, so if we end up repeatedly executing requests on all nodes, only to repeatedly fail because of one struggling node (for example), load on the other nodes could easily increase to the point where the problem spreads to the entire cluster.
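The amplification concern can be made concrete with a back-of-the-envelope model (illustrative numbers and the `RetryAmplification` class are mine, not from the PR): a fan-out request touches N data nodes in parallel, so if one node rejects it and the client retries R times, every retry re-executes work on all N nodes even though only one node was overloaded.

```java
// Illustrative model of client-retry amplification for fan-out requests.
class RetryAmplification {
    // Total shard-level requests issued: the initial attempt plus each retry,
    // every one of which hits all nodes in parallel.
    static int totalShardRequests(int nodes, int retries) {
        return nodes * (1 + retries);
    }
}
```

With, say, 10 nodes and 3 client retries triggered by a single rate-limited shard, the cluster absorbs 40 shard requests instead of 10, most of them against healthy nodes.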

Collaborator:

I guess these are two different problems. Rate limiting should be agnostic to request parameters.
If we see an issue with shards.tolerant=false or with a single node, then we need to track it in that context. We don't want to increase the load on the other nodes at the same time.

Collaborator (Author):

I'm not sure what you're suggesting to do here. If we see an issue with shards.tolerant=false on a single node, we will already have increased the load on other nodes (requests to nodes are issued in parallel), and if the top-level request fails due to the one node rate-limiting, then the client is likely to retry, increasing the load on the cluster overall (the exact situation that we both agree we want to avoid).

> Rate limiting should be agnostic to any parameters

Why do you say this? I think the status quo (evaluate rate limiting only on the coordinator node) is due to the potential for rate-limiting evaluation on data-nodes to amplify request load. So if we really want rate limiting to be agnostic to any parameters, then I think rate-limiting on data nodes must be avoided entirely (due to the request amplification issue).
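The per-shard decision being debated can be sketched as a data-node-side permit check (a minimal sketch with hypothetical names; the real Solr rate limiters are more involved, though they are similarly permit-based):

```java
import java.util.concurrent.Semaphore;

// Hypothetical data-node admission check illustrating the PR's approach:
// shards.tolerant=true requests may be shed on an individual shard, while
// shards.tolerant=false requests bypass the limiter so that one saturated
// node cannot fail (and trigger retries of) the whole fan-out request.
class DataNodeLimiter {
    private final Semaphore permits;

    DataNodeLimiter(int maxConcurrent) {
        permits = new Semaphore(maxConcurrent);
    }

    // Returns true if the shard request may proceed on this node.
    boolean admit(boolean shardsTolerant) {
        if (!shardsTolerant) return true;  // never rate-limit intolerant fan-out requests
        return permits.tryAcquire();       // tolerant requests may be rejected per-shard
    }

    void release(boolean shardsTolerant) {
        if (shardsTolerant) permits.release();
    }
}
```

Under this sketch a tolerant request that loses the permit race simply yields a partial result on that shard, which is exactly the degradation shards.tolerant=true opts into.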

Collaborator:

Let's chat on this.
