OTel features overwhelmed during high load #2565

Closed · calebschoepp opened this issue Jun 14, 2024 · 5 comments · Fixed by #2572

@calebschoepp (Collaborator)

When o11y is enabled in Spin (some variation of OTEL_EXPORTER_OTLP_ENDPOINT is set) and a large amount of load is run against Spin, we start to see the OTel feature get overloaded:

2024-06-14T21:24:49.087649Z ERROR spin_telemetry: There has been an error with the OpenTelemetry system, traces and metrics are likely failing to export
2024-06-14T21:24:49.087664Z ERROR spin_telemetry: Further OpenTelemetry errors will be logged at DEBUG level
2024-06-14T21:24:49.087669Z  WARN spin_telemetry: OpenTelemetry error err=Trace(Other(ChannelFull))
2024-06-14T21:24:49.087675Z  WARN spin_telemetry: OpenTelemetry error err=Trace(Other(ChannelFull))
2024-06-14T21:24:49.087681Z  WARN spin_telemetry: OpenTelemetry error err=Trace(Other(ChannelFull))

Possible fixes to explore:

  • Figure out sampling and document how to configure it (see the sketch after this list).
  • Increase channel sizes.
  • Something else?
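
For the sampling item above, a rough sketch (not verified against Spin): the OpenTelemetry SDK spec defines the OTEL_TRACES_SAMPLER and OTEL_TRACES_SAMPLER_ARG environment variables, so assuming the OTel SDK that Spin uses honors them, head-sampling roughly 10% of traces might look like:

OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 OTEL_TRACES_SAMPLER=traceidratio OTEL_TRACES_SAMPLER_ARG=0.1 spin up

traceidratio keeps roughly that fraction of traces, which should reduce pressure on the export channel; whether Spin actually passes these variables through to its tracer setup would need to be confirmed.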

hpvd commented Jun 17, 2024

There is a lot of ongoing work to make OTel suitable for this kind of demanding use case.
Requirements from Apache Pulsar seem to be the starting point/driver for this work:

  • on the OTel side and
  • on the Pulsar side

This is an overview/parent issue with lots of insights for both sides:
apache/pulsar#21121

lann (Collaborator) commented Jun 17, 2024

I'm not sure that there is anything we can/should be doing about this in Spin itself apart from handling overflows gracefully. To that end: maybe we should rate-limit/collapse these otel errors (and add a metric?) to make sure we aren't blowing out logs?
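
To illustrate the rate-limit/collapse idea (a hypothetical sketch, not what Spin currently does), a simple window-based limiter in Rust could gate how many of these errors are logged and count the rest toward a metric:

use std::sync::Mutex;
use std::time::{Duration, Instant};

// Hypothetical helper: allow at most `max` error logs per `window`.
struct ErrorLogLimiter {
    window: Duration,
    max: u32,
    state: Mutex<(Instant, u32)>, // (start of current window, errors seen in it)
}

impl ErrorLogLimiter {
    fn new(window: Duration, max: u32) -> Self {
        Self { window, max, state: Mutex::new((Instant::now(), 0)) }
    }

    // Returns true if this error should be logged; false means "collapse it".
    fn should_log(&self) -> bool {
        let mut state = self.state.lock().unwrap();
        let now = Instant::now();
        if now.duration_since(state.0) >= self.window {
            *state = (now, 0); // start a new window
        }
        state.1 += 1;
        state.1 <= self.max
    }
}

fn main() {
    // Log at most 5 OTel errors per second; the rest only bump a counter.
    let limiter = ErrorLogLimiter::new(Duration::from_secs(1), 5);
    let mut suppressed = 0u64;
    for _ in 0..100 {
        if limiter.should_log() {
            eprintln!("OpenTelemetry error err=Trace(Other(ChannelFull))");
        } else {
            suppressed += 1; // in Spin this could increment an error metric instead
        }
    }
    eprintln!("suppressed {suppressed} duplicate OTel errors this window");
}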

@calebschoepp (Collaborator, Author)

I'm not sure that there is anything we can/should be doing about this in Spin itself apart from handling overflows gracefully. To that end: maybe we should rate-limit/collapse these otel errors (and add a metric?) to make sure we aren't blowing out logs?

Agreed. I like the idea of tracking errors with a metric.

OTel errors are already only emitted at DEBUG level. @lann are you saying you think they should still be rate limited, so that if you run with RUST_LOG=debug you don't get overwhelmed with OTel errors?

lann (Collaborator) commented Jun 17, 2024

@lann are you saying you think they should still be rate limited, so that if you run with RUST_LOG=debug you don't get overwhelmed with OTel errors?

It might be nice, especially for this specific ChannelFull error indicating backpressure from a telemetry sink.

@calebschoepp (Collaborator, Author)

Quick learning: lots of the parameters on the batch processor are configurable via env vars. Cranking up a bunch of them prevents us from dropping messages.

OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 OTEL_BSP_MAX_CONCURRENT_EXPORTS=4 OTEL_BSP_MAX_QUEUE_SIZE=4096 OTEL_BSP_SCHEDULE_DELAY=2000 spin up

I don't think we will want to actually hardcode this into Spin, but certain end users might want to tune it this aggressively. It's something worth documenting more explicitly.
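
For reference, and to the best of my knowledge, these map onto the OTel SDK batch span processor settings roughly as follows (spec defaults in parentheses; OTEL_BSP_MAX_CONCURRENT_EXPORTS is, I believe, specific to the Rust SDK):

OTEL_BSP_MAX_QUEUE_SIZE          # max spans buffered before dropping (default 2048)
OTEL_BSP_SCHEDULE_DELAY          # delay between exports, in ms (default 5000)
OTEL_BSP_MAX_CONCURRENT_EXPORTS  # parallel in-flight exports (default 1)

So the command above roughly doubles the queue, exports more frequently (every 2 s instead of 5 s), and allows four exports in flight instead of one.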
