[Loadbalancingexporter] - Seeing a lot fewer traces received on Tier 2 collectors after moving to v0.94.0 #31274
Comments
Pinging code owners for exporter/loadbalancing: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself.
I think this is a result of #30141, which aggregates exported data to make it more efficient. This likely reduced the values of the metric you're referencing, meaning there's no data actually being dropped or missed. Are you missing any data you expected? Is there any reason for concern, in addition to the change in that metric?
The number of spans should have remained stable; batching should reduce the number of requests but not the amount of data. Would you be able to share the relevant parts of your config? Bonus points if you can share a …
Just adding some more info here for others who may have found this issue; it may help other people investigate while we are also trying to figure things out internally. We are seeing the same pattern, and only in certain clusters with heavy traffic, in the region of 5 million spans per second. We also suspect it is the change from #30141.

Note that we have autoscaling in both layer-1 and layer-2, hence the massive fluctuations in the number of backends; we have double-checked that the backend count correctly matches the number of pods in layer-2. (Yes, we have that many pods in layer-2.)

One thing which we have not tried, at the time of writing, is to switch the … Also worth mentioning is that we are not seeing the same behaviour in other clusters that have a stable throughput of just under 1 million spans a second. Furthermore, we have in fact upgraded all the way to version … I will see if I can replicate this with …

Edit: happy to take any guidance from the maintainers on what we can look out for or which settings to adjust the next time we try an upgrade :)
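For readers following along, a minimal sketch of what a tier-1 (layer-1) configuration with the loadbalancing exporter and a Kubernetes resolver can look like is below. The service name, namespace, and ports are placeholders, not the reporter's actual values.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  loadbalancing:
    # keep all spans of a trace on the same tier-2 backend
    routing_key: "traceID"
    protocol:
      otlp:
        # settings here are passed to the per-backend OTLP exporter; placeholder values
        tls:
          insecure: true
    resolver:
      k8s:
        # hypothetical headless service fronting the tier-2 collectors
        service: otel-collector-tier2.observability
        ports:
          - 4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [loadbalancing]
```

With the `k8s` resolver, the exporter's backend list tracks the endpoints of that service, which is why autoscaling in layer-2 changes the number of backends being hashed across.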
After quite a bit of experimentation and trying to understand the new behaviour, we have successfully upgraded to … The default batch processor settings

```yaml
processors:
  batch:
    # the two settings below are the defaults
    # send_batch_max_size: 0
    # send_batch_size: 8192
```

can lead to a scenario where the batch is very large under heavy load, as it is not bounded. Lowering both … (a bounded configuration is sketched after this comment). I am not familiar with the codebase, but from my limited understanding after looking at the code, it feels like the aforementioned PR changed the interaction from:
… to …

This change means that we would now need 2x the previous memory (in the worst case), and there is also the possibility that spans are stuck in the exporter for far longer than desired while it tries to re-batch and fails to keep up with the continuous pressure from the receiver. I am also not sure whether a change in the backends (a layer-2 pod being terminated) during this re-batching will cause additional issues.

During our debugging we also discovered a potentially missing tag in the metrics. Namely, opentelemetry-collector-contrib/exporter/loadbalancingexporter/trace_exporter.go, lines 122 to 123 at 27840d5, suggests that …
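As mentioned above, a bounded batch configuration along these lines is sketched below; the numbers are illustrative, not the values the reporter settled on.

```yaml
processors:
  batch:
    # flush once this many spans have accumulated (or on timeout)
    send_batch_size: 4096
    # hard cap on a single outgoing batch; the default of 0 means "no limit",
    # which is what allows very large batches under sustained load
    send_batch_max_size: 8192
    # flush partially filled batches after this long
    timeout: 2s
```

Setting `send_batch_max_size` to a non-zero value forces oversized batches to be split before export, at the cost of more (but smaller) outgoing requests.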
Hey @edwintye, we are observing the same issue as you.
I have a feeling that this is fixed in v0.101.0, as part of another issue reporting a memory issue with the exporter. Once that version is out (should be out tomorrow, 21 May 2024), could you give it a try and report back?
Hey @edwintye, side question: how did you profile the OTel collector? I see you are sending the profile data to Grafana?
Sorry I didn't see the pings earlier. We run pyroscope and do a "pull" via grafana agent/alloy. The general setup is to add the pprof extension

```yaml
extensions:
  pprof:
    endpoint: ${POD_IP}:6060
    block_profile_fraction: 0
    mutex_profile_fraction: 0

service:
  extensions: [pprof]
```

and then the pull config

```
pyroscope.scrape "otel_collector" {
  targets    = <YOUR_OTEL_COLLECTOR_ADDRESS> // we get this from a k8s discovery
  forward_to = [pyroscope.write.agent.receiver] // needs to be defined

  profiling_config {
    profile.goroutine {
      enabled = true
      path    = "/debug/pprof/goroutine"
      delta   = false
    }

    profile.process_cpu {
      enabled = true
      path    = "/debug/pprof/profile"
      delta   = true
    }

    profile.godeltaprof_memory {
      enabled = false
      path    = "/debug/pprof/delta_heap"
    }

    profile.memory {
      enabled = true
      path    = "/debug/pprof/heap"
      delta   = false
    }

    profile.godeltaprof_mutex {
      enabled = false
      path    = "/debug/pprof/delta_mutex"
    }

    profile.mutex {
      enabled = false
      path    = "/debug/pprof/mutex"
      delta   = false
    }

    profile.godeltaprof_block {
      enabled = false
      path    = "/debug/pprof/delta_block"
    }

    profile.block {
      enabled = false
      path    = "/debug/pprof/block"
      delta   = false
    }
  }
}
```

which ships to the backend. Then we do various analyses of the CPU/memory/object usage and adjust our settings to fit the shape of our workload. We have since upgraded to …

Note that our improvements here may not be representative, since we have done a fair amount of adjustment to get ourselves past the initial upgrade of the load balancing exporter. From the load balancing exporter's perspective, our observation so far is that there has only been benefit from upgrading to and past …
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping …

Pinging code owners: …

See Adding Labels via Comments if you do not have permissions to add labels yourself.
This issue has been closed as inactive because it has been stale for 120 days with no activity.
Component(s)
No response
Describe the issue you're reporting
We have two tiers of collectors, tier 1 and tier 2. The tier 1 collectors have a loadbalancingexporter configured to export spans to the tier 2 collectors.

We are monitoring the number of spans received on the tier 1 and tier 2 collectors using the metric `otelcol_receiver_accepted_spans`.

When we upgraded all our opentelemetry libraries to v0.94.0, we started seeing a huge mismatch between the number of spans accepted on tier 1 vs tier 2. It worked fine up to v0.93.0. Can someone please help?
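For context, these self-observability metrics (including `otelcol_receiver_accepted_spans`) are exposed via the collector's service telemetry settings; a sketch of a v0.94.0-era configuration is below, with the address being a placeholder.

```yaml
service:
  telemetry:
    metrics:
      # level of detail for the collector's own metrics
      level: detailed
      # Prometheus endpoint serving self-metrics such as
      # otelcol_receiver_accepted_spans; the address is illustrative
      address: 0.0.0.0:8888
```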