[Loadbalancingexporter] - Seeing a lot fewer traces received on Tier 2 collectors after moving to v0.94.0 #31274
Comments
Pinging code owners for exporter/loadbalancing: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself.
I think this is a result of #30141, which aggregates exported data to make it more efficient. This likely reduced the values of the metric you're referencing, meaning there's no data actually being dropped or missed. Are you missing any data you expected? Is there any reason for concern, in addition to the change in that metric?
The number of spans should have remained stable; batching should reduce the number of requests but not the amount of data. Would you be able to share the relevant parts of your config? Bonus points if you can share a …
Just adding some more info here for others who may have found this issue; it may help other people investigate while we are also trying to figure things out internally. We are seeing the same pattern, and only in certain clusters with heavy traffic, in the region of 5 million spans per second. We also suspect it is the change from #30141.

Note that we have autoscaling in both layer-1 and layer-2, hence the massive fluctuations in the number of backends; we have double-checked that the backend count correctly matches the number of pods in layer-2. (Yes, we have that many pods in layer-2.)

One thing which we have not tried, at the time of writing, is to switch the … Also worth mentioning is that we are not seeing the same behaviour in other clusters that have a stable throughput of just under 1 million spans a second. Furthermore, we have in fact upgraded all the way to version … I will see if I can replicate this with …

Edit: happy to take any guidance from the maintainers on what we can look out for or which settings to adjust the next time we try an upgrade :)
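For readers following along, a minimal sketch of what a tier-1 (layer-1) configuration with the loadbalancing exporter and a Kubernetes resolver can look like is below. The service name, namespace, and ports are placeholders, not the reporter's actual values.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  loadbalancing:
    # keep all spans of a trace on the same tier-2 backend
    routing_key: "traceID"
    protocol:
      otlp:
        # settings here are passed to the per-backend OTLP exporter; placeholder values
        tls:
          insecure: true
    resolver:
      k8s:
        # hypothetical headless service fronting the tier-2 collectors
        service: otel-collector-tier2.observability
        ports:
          - 4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [loadbalancing]
```

With the `k8s` resolver, the exporter's backend list tracks the endpoints of that service, which is why autoscaling in layer-2 changes the number of backends being hashed across.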
After quite a bit of experimentation and trying to understand the new behaviour, we have successfully upgraded to … The default batch processor settings

```yaml
processors:
  batch:
    # the two settings below are the defaults
    # send_batch_max_size: 0
    # send_batch_size: 8192
```

can lead to a scenario where the batch is very large under heavy load, as it is not bounded. Lowering both … (a bounded configuration is sketched after this comment). I am not familiar with the codebase, but from my limited understanding after looking at the code, it feels like the aforementioned PR changed the interaction from:
… to …

This change means that we would now need 2x the previous memory (in the worst case), and there is also the possibility that spans are stuck in the exporter for far longer than desired while it tries to re-batch and fails to keep up with the continuous pressure from the receiver. I am also not sure whether a change in the backends (a layer-2 pod being terminated) during this re-batching will cause additional issues.

During our debugging we also discovered a potentially missing tag in the metrics. Namely, opentelemetry-collector-contrib/exporter/loadbalancingexporter/trace_exporter.go, lines 122 to 123 at 27840d5, suggests that …
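As mentioned above, a bounded batch configuration along these lines is sketched below; the numbers are illustrative, not the values the reporter settled on.

```yaml
processors:
  batch:
    # flush once this many spans have accumulated (or on timeout)
    send_batch_size: 4096
    # hard cap on a single outgoing batch; the default of 0 means "no limit",
    # which is what allows very large batches under sustained load
    send_batch_max_size: 8192
    # flush partially filled batches after this long
    timeout: 2s
```

Setting `send_batch_max_size` to a non-zero value forces oversized batches to be split before export, at the cost of more (but smaller) outgoing requests.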
Hey @edwintye, we are observing the same issue as you.
I have a feeling that this is fixed in v0.101.0, as part of another issue reporting a memory issue with the exporter. Once that version is out (should be out tomorrow, 21 May 2024), could you give it a try and report back?
Hey @edwintye, side question: how did you profile the OTel collector? I see you are sending the profile data to Grafana?
Sorry I didn't see the pings earlier. We run pyroscope and do a "pull" via grafana agent/alloy. The general setup is to add the pprof extension

```yaml
extensions:
  pprof:
    endpoint: ${POD_IP}:6060
    block_profile_fraction: 0
    mutex_profile_fraction: 0

service:
  extensions: [pprof]
```

and then the pull config

```
pyroscope.scrape "otel_collector" {
  targets    = <YOUR_OTEL_COLLECTOR_ADDRESS> // we get this from a k8s discovery
  forward_to = [pyroscope.write.agent.receiver] // needs to be defined

  profiling_config {
    profile.goroutine {
      enabled = true
      path    = "/debug/pprof/goroutine"
      delta   = false
    }

    profile.process_cpu {
      enabled = true
      path    = "/debug/pprof/profile"
      delta   = true
    }

    profile.godeltaprof_memory {
      enabled = false
      path    = "/debug/pprof/delta_heap"
    }

    profile.memory {
      enabled = true
      path    = "/debug/pprof/heap"
      delta   = false
    }

    profile.godeltaprof_mutex {
      enabled = false
      path    = "/debug/pprof/delta_mutex"
    }

    profile.mutex {
      enabled = false
      path    = "/debug/pprof/mutex"
      delta   = false
    }

    profile.godeltaprof_block {
      enabled = false
      path    = "/debug/pprof/delta_block"
    }

    profile.block {
      enabled = false
      path    = "/debug/pprof/block"
      delta   = false
    }
  }
}
```

which ships to the backend. Then we do various analyses of the CPU/memory/object usage and adjust our settings to fit the shape of our workload. We have since upgraded to …

Note that our improvements here may not be representative, since we have done a fair amount of adjustment to get ourselves past the initial upgrade of the load balancing exporter. From the load balancing exporter's perspective, our observation so far is that there has only been benefit from upgrading to and past …
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping …

Pinging code owners: …

See Adding Labels via Comments if you do not have permissions to add labels yourself.
This issue has been closed as inactive because it has been stale for 120 days with no activity.
Component(s)
No response
Describe the issue you're reporting
We have two tiers of collectors, tier 1 and tier 2. The tier 1 collectors have a loadbalancingexporter configured to export spans to the tier 2 collectors.

We are monitoring the number of spans received on the tier 1 and tier 2 collectors using the metric `otelcol_receiver_accepted_spans`.

When we upgraded all our opentelemetry libraries to v0.94.0, we started seeing a huge mismatch between the number of spans accepted on tier 1 vs tier 2. It worked fine up to v0.93.0. Can someone please help?
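For context, these self-observability metrics (including `otelcol_receiver_accepted_spans`) are exposed via the collector's service telemetry settings; a sketch of a v0.94.0-era configuration is below, with the address being a placeholder.

```yaml
service:
  telemetry:
    metrics:
      # level of detail for the collector's own metrics
      level: detailed
      # Prometheus endpoint serving self-metrics such as
      # otelcol_receiver_accepted_spans; the address is illustrative
      address: 0.0.0.0:8888
```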