Target Allocator is not capable of scaling out with least-weighted strategy #2477
Thank you so much for putting the time and effort into this issue. Given how detailed your question is, I'm going to do my best to give you a worthy response... Honestly, I think we should switch to the consistent hashing strategy as the default. It lets you autoscale your target allocators, and it's a much better strategy in terms of efficiency as well.

To answer the question, though... this is actually the intended behavior. The worry is that if you redistribute targets too often, your metrics are more prone to data loss. Internally we call this "stable allocation". There are a few scenarios we have in mind for this; I'll describe a few, but let me know if you have any more questions.

You could imagine a scenario where one collector's allocated target is very busy (imagine something like kube-state-metrics, which handles many high-cardinality metrics). This would mean the assigned collector is more prone to OOMs and restarts. If we were to detect a restart and reallocate this target, we may actually cause another collector to OOM and drop more data. Ultimately this would cause metric gaps as your collectors come up, scrape some data, and die.

Another scenario is more dangerous – imagine a large-enough cluster with millions of potential targets being discovered. The collector pool doing the sharded scraping probably has relatively small per-pod resourcing to be more resilient to failures. And let's say you have 10000 targets per collector in this state. If your cluster is constantly scaling up and down, every time you add a collector you need to lock the target allocator to delete and re-allocate the entire pool of targets. If you hold on to that lock for too long, you will begin causing the collectors to time out.

The last reason we do this is a bit more challenging to explain, but I'll try... There's a known issue in the compatibility between otel and prometheus relating to start timestamps. It's explained in detail here, but to summarize: when we change an assigned collector's targets, the collector's prometheus receiver starts a new timeseries each time. If we were to re-distribute the targets from collector A to collector B and back to collector A, I'm unsure we would properly count a reset for a cumulative metric to keep the series accurate. This is mostly speculation on my end, I haven't tested it explicitly, but from the conversations I've had with my colleagues, we believe that keeping stable allocations is ultimately going to be best for users.

We discussed the implementation of the consistent hashing in our SIG meetings and a bit on the introductory PR here. The gist of our goal there is that when a new collector comes online, we minimize the possible redistributions by having replication of our collectors on a large enough hashring. cc @swiatekm-sumo who may have more info he'd like to add here.

Circling back to where I started, I do think it's time we change the default strategy, and I would support a PR that does exactly that. It wouldn't be a breaking change for target allocator users; in fact our average user may experience a performance boost. With that change we should also recommend the relabel filter strategy, which also contains performance improvements.
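For reference, a minimal sketch of what that recommended setup could look like in the OpenTelemetryCollector CR. The field names (spec.targetAllocator.allocationStrategy, spec.targetAllocator.filterStrategy) and the example scrape job are based on my reading of recent operator versions, so please double-check them against the CRD for the operator version you run:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: metrics-scraper        # hypothetical name
spec:
  mode: statefulset
  replicas: 3
  targetAllocator:
    enabled: true
    # consistent-hashing keeps most assignments stable while still
    # giving newly added collectors a share of the targets
    allocationStrategy: consistent-hashing
    # drop targets that relabel_configs would discard anyway,
    # reducing allocator memory and allocation work
    filterStrategy: relabel-config
  config: |
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: example          # placeholder job
              kubernetes_sd_configs:
                - role: pod
    exporters:
      debug: {}
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [debug]
```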
@jaronoff97 Thanks a lot for the great response! It's definitely helpful for me to understand the background intention.
Yeah! That would be great, thank you!
Component(s)
target allocator
What happened?
Description
I'm not sure whether this is a bug or an intentional implementation, however the least-weighted strategy of the Target Allocator is not capable of scaling out in terms of re-distributing the existing targets when a new collector instance is added.
Steps to Reproduce
With two collector instances running, the targets are assigned as follows:
collector1 -> target1, target3, target5
collector2 -> target2, target4
A third collector instance (collector3) is then added.
Expected Result
Expected outcome would be a re-distribution of the targets across all 3 instances:
collector1 -> target1, target4
collector2 -> target2, target5
collector3 -> target3
Actual Result
No re-distribution is triggered and the 3rd collector does not get assigned any target.
collector1 -> target1, target3, target5
collector2 -> target2, target4
collector3 -> (no targets assigned)
Kubernetes Version
1.27.1
Operator version
0.90.0
Collector version
0.91.0
Environment information
Environment
Log output
No response
Additional context
Implementation analysis
I've checked the code and if I'm not mistaken scaling out is not possible with the current implementation (please correct me if I'm wrong).
When newly added collector instances are found, the change is detected here and the SetCollectors() function of the allocator is executed. It checks the differences of collectors between the current and new states, finds a new one (addition) and therefore runs handleCollectors. The critical part is this if case. Since the number of collectors is not zero, the flag allocateTargets remains false and therefore we never re-distribute the targets here. However, even if we were to enter this branch, the rest of the implementation at the moment does not handle re-assigning targets from the old state of the collectors to the new one.
So, as far as I've understood, if one wants to scale out, one needs to delete and re-create the allocator instance after introducing a new collector instance.
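To make that control flow concrete, here is a small, self-contained Go sketch of the behavior as I understand it. This is illustrative only, not the operator's actual code; the real SetCollectors/handleCollectors signatures and diff logic differ, but the allocateTargets condition is the point:

```go
// Simplified sketch (not the operator's real code) of why adding a collector
// to a non-empty pool does not move any existing targets.
package main

import "fmt"

type Allocator struct {
	collectors map[string]bool
	targets    map[string]string // target -> assigned collector
}

// SetCollectors diffs the current and new collector sets and hands the
// additions to handleCollectors.
func (a *Allocator) SetCollectors(newCollectors []string) {
	var added []string
	for _, c := range newCollectors {
		if !a.collectors[c] {
			added = append(added, c)
		}
	}
	a.handleCollectors(added)
}

// handleCollectors only (re)allocates all targets when the allocator
// previously had no collectors at all; once the pool is non-empty the
// flag stays false and existing assignments are left untouched.
func (a *Allocator) handleCollectors(added []string) {
	allocateTargets := len(a.collectors) == 0
	for _, c := range added {
		a.collectors[c] = true
	}
	if allocateTargets {
		fmt.Println("allocating all targets across collectors")
		// ... assign every known target to the least-loaded collector ...
	}
	// otherwise: newly added collectors stay empty until new targets appear
}

func main() {
	a := &Allocator{collectors: map[string]bool{}, targets: map[string]string{}}
	a.SetCollectors([]string{"collector1", "collector2"})               // initial pool: targets allocated
	a.SetCollectors([]string{"collector1", "collector2", "collector3"}) // scale out: nothing moves
}
```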
Actual intention
The actual reason why I wanted to bring up this issue is that I am facing OOMKilled problems with the collectors due to uneven distribution of scrape targets, and scaling them out does not seem to help. Since the default value for the allocation strategy is least-weighted rather than consistent-hashing, I assumed that the community encourages us to go for least-weighted, even though the consistent-hashing strategy re-distributes all of the scrape targets when a new collector instance is added.
I would be more than happy to hear your experiences & recommendations.