Exporter jaeger encountered the following error(s): thrift agent failed with message too long #851

Closed
taladar opened this issue Jul 25, 2022 · 12 comments
Labels
M-exporter-jaeger Deprecated - we will likely close most related bugs.

Comments

@taladar

taladar commented Jul 25, 2022

I am using opentelemetry-jaeger 0.16.0 with tracing-opentelemetry 0.17.4 on Linux along with the latest Jaeger binary (non-docker) release.

I did see the old tickets on this issue (#648, #676 and #759), but none of the workarounds described there seem to have any significant impact on the issue.

I tried setting OTEL_BSP_MAX_EXPORT_BATCH_SIZE to progressively lower values down to 1 which did not make the error disappear. I tried using with_auto_split_batch(true) with with_max_packet_size with values of 8192, 4096, 1024, 512 and finally 256 and the error did not disappear. I tried install_simple() instead of install_batch(Tokio) and the error did not disappear.

I do not produce spans that are particularly large (just half a line of text and a host name as a parameter) but they are relatively short and plentiful.

Some data appears in jaeger but it is just a dozen or two spans before it aborts (per run of my program obviously).

I honestly have my doubts about the UDP packet size explanation in the previous tickets since everyone else seems to be able to handle sending larger amounts of data over UDP just fine.

Maybe it would be useful to add more information to the error message, e.g. how long it was, how many and which spans were batched and what the limit for 'too long' is.

@bes

bes commented Aug 1, 2022

I too am experiencing this issue and the proposed fixes (as seen in different issues) don't work for me.

OTEL_BSP_MAX_EXPORT_BATCH_SIZE=25 OTEL_BSP_MAX_QUEUE_SIZE=32768 cargo run

use std::error::Error;

use opentelemetry::sdk::trace as sdktrace;
use opentelemetry::trace::TraceError;
use tracing_subscriber::{layer::SubscriberExt, Registry};

// Build a Jaeger pipeline that exports batches over UDP to the local agent.
fn init_tracer() -> Result<sdktrace::Tracer, TraceError> {
    opentelemetry_jaeger::new_pipeline()
        .with_service_name("trace-demo")
        .with_max_packet_size(9216) // Default max UDP packet size on OSX
        .with_auto_split_batch(true) // Auto split batches so they fit under packet size
        .install_batch(opentelemetry::runtime::Tokio)
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error + Send + Sync + 'static>> {
    // Create a layer with the configured tracer
    let tracer = init_tracer()?;
    let otel_layer = tracing_opentelemetry::layer().with_tracer(tracer);
    let subscriber = Registry::default().with(otel_layer);
    tracing::subscriber::set_global_default(subscriber).expect("setting default subscriber failed");

    // Application code that emits spans runs here.
    Ok(())
}

But still seeing

OpenTelemetry trace error occurred. Exporter jaeger encountered the following error(s): thrift agent failed with message too long

Before the changes above I saw

OpenTelemetry trace error occurred. cannot send span to the batch span processor because the channel is full

I am running a few hundred async tasks in parallel, and a jaeger instance locally.

@fmassot

fmassot commented Nov 15, 2022

I observed the same issue on our OSS project quickwit-oss/quickwit#2295

I tried a bunch of different settings but did not manage to make it work. The errors happen when there are a lot of spans. I will try to isolate that and report it here.

@fmassot

fmassot commented Nov 18, 2022

I dug into the issue a bit, and the problem comes from the number of bytes of span data that is sent to Jaeger.

I guess the error that you have is the same as mine (I added some println! calls in the code to get this output):

upload error ExportFailed(ThriftAgentError(ProtocolError { kind: SizeLimit, message: "single span's jaeger exporter payload size of 28330 bytes over max UDP packet size of 10000 bytes" }))

I solved the issue by doing two things:

  • reduce the size of some of my spans: I discovered that I was putting way too much info in them, as I'm using the #[instrument] macro on methods and some of my arguments can print a lot of stuff
  • reduce the batch size to a small value such as 8 or 16.

@bes In your case, what is the max size of your UDP packet? I'm on macOS and I had to run sudo sysctl -w net.inet.udp.maxdgram=65535 to get a usable size.
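
One way to shrink spans produced by #[instrument] can be sketched as follows. By default the macro records each argument through its Debug implementation, so a hand-written Debug that prints a summary keeps the recorded value small. The Document type and its fields are hypothetical and only serve as an illustration; this is a minimal standard-library sketch, not code from the exporter.

```rust
use std::fmt;

/// Hypothetical argument type that carries a large payload.
struct Document {
    id: u64,
    body: Vec<u8>,
}

// Hand-written Debug that prints a summary instead of the whole payload.
impl fmt::Debug for Document {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("Document")
            .field("id", &self.id)
            .field("body_len", &self.body.len())
            .finish()
    }
}

fn main() {
    let doc = Document { id: 7, body: vec![0u8; 28_330] };
    // The rendered text stays short no matter how large `body` is.
    let rendered = format!("{:?}", doc);
    println!("{} ({} bytes)", rendered, rendered.len());
}
```

If changing the type is not an option, the macro's skip(...) argument, for example #[instrument(skip(doc))], leaves an argument out of the span entirely.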

@taladar
Author

taladar commented Nov 18, 2022

That whole protocol design seems broken if you need to tweak the max UDP packet size to work around its flaws.

@TommyCpp
Contributor

As an alternative, have you tried auto_split_batch? This option automatically splits a batch of spans when it exceeds the maximum UDP size for one packet.

Note that it has a performance overhead.

@taladar
Author

taladar commented Nov 21, 2022

As mentioned in my first post, I did try that and it was just as broken.

@TommyCpp
Contributor

TommyCpp commented Nov 21, 2022

> As mentioned in my first post, I did try that and it was just as broken.

Ah, sorry, I missed it. In that case, the most likely cause is that a single span exceeds the UDP packet limit. Since we cannot split a span, we have to fail the request.

As for the debugging information: we use the Apache Thrift Rust client, so the only information we have is the error passed to us from the Thrift agent. I will see what's available and add some more context to the error message.

As for the protocol design, the UDP limit for Jaeger is a known issue (see https://www.jaegertracing.io/docs/1.39/client-libraries/#emsgsize-and-udp-buffer-limits), so I don't think we can do more about that. One suggestion is to switch to the HTTP client with a collector.
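
To illustrate why batch splitting cannot help once a single span is larger than the packet limit, here is a minimal sketch of greedy splitting. It is not the exporter's actual implementation: split_batches and SpanTooLarge are hypothetical names, spans are represented only by their encoded size in bytes, and per-batch overhead is ignored.

```rust
/// Error returned when one span alone cannot fit in a packet.
#[derive(Debug, PartialEq)]
struct SpanTooLarge {
    index: usize,
    size: usize,
    max_packet_size: usize,
}

/// Greedily groups span sizes (in bytes) into batches whose totals
/// stay within `max_packet_size`.
fn split_batches(
    span_sizes: &[usize],
    max_packet_size: usize,
) -> Result<Vec<Vec<usize>>, SpanTooLarge> {
    let mut batches = Vec::new();
    let mut current: Vec<usize> = Vec::new();
    let mut current_total = 0usize;

    for (index, &size) in span_sizes.iter().enumerate() {
        // A span is the smallest unit, so an oversized one cannot be split.
        if size > max_packet_size {
            return Err(SpanTooLarge { index, size, max_packet_size });
        }
        // Start a new batch when adding this span would overflow the packet.
        if current_total + size > max_packet_size && !current.is_empty() {
            batches.push(std::mem::take(&mut current));
            current_total = 0;
        }
        current.push(size);
        current_total += size;
    }
    if !current.is_empty() {
        batches.push(current);
    }
    Ok(batches)
}

fn main() {
    // Three spans that fit individually are spread over two packets.
    println!("{:?}", split_batches(&[3000, 4000, 5000], 9216));
    // A 28330-byte span fails regardless of how the batch is split.
    println!("{:?}", split_batches(&[3000, 28330], 9216));
}
```

Under these assumptions, lowering the batch size, even to 1, changes nothing for the second call, which matches the behaviour reported at the top of this thread.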

@taladar
Author

taladar commented Nov 21, 2022

> add some more context in the error message.

It would probably be useful if we could get the size of the message that failed. The effects of the whole set of parameters I tried are very opaque. If you can't print the size, maybe you could print the content you pass to Thrift. I don't think performance matters very much at the point where nothing is really working.

> One suggestion is to switch to http client w/ collector

Could you elaborate on that?

@punkeel

punkeel commented Aug 5, 2023

macOS has a max UDP size set to 9216 by default. Would it be acceptable to adjust the values in this library to do the right thing out of the box? Requiring every project to figure this out through trial and error, then fix it, sounds like a waste of time.

$ uname -a
Darwin host 22.6.0 Darwin Kernel Version 22.6.0: Wed Jul  5 22:22:05 PDT 2023; root:xnu-8796.141.3~6/RELEASE_ARM64_T6000 arm64 arm Darwin
$ sysctl net.inet.udp.maxdgram
net.inet.udp.maxdgram: 9216
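
The limit can also be probed directly. The sketch below uses only the Rust standard library and sends datagrams of different sizes over loopback; send_datagram is a hypothetical helper, and the sizes at which sends start failing depend on the operating system and its settings, so no particular output is claimed for the values around 9216. A 70000-byte datagram is above the maximum UDP payload and should be rejected everywhere.

```rust
use std::io;
use std::net::UdpSocket;

/// Sends one datagram of `len` bytes over loopback and returns the
/// number of bytes sent, or the OS error if the send is rejected.
fn send_datagram(len: usize) -> io::Result<usize> {
    let receiver = UdpSocket::bind("127.0.0.1:0")?;
    let sender = UdpSocket::bind("127.0.0.1:0")?;
    let payload = vec![0u8; len];
    sender.send_to(&payload, receiver.local_addr()?)
}

fn main() {
    for len in [512, 9_216, 9_217, 70_000] {
        match send_datagram(len) {
            Ok(sent) => println!("{} bytes: sent {}", len, sent),
            Err(err) => println!("{} bytes: failed: {}", len, err),
        }
    }
}
```

Running it before and after changing net.inet.udp.maxdgram should show whether the new value took effect.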

@cijothomas
Member

> > add some more context in the error message.
>
> It would probably be useful if we could get the size of the message that failed. The effects of the whole set of parameters I tried are very opaque. If you can't print the size, maybe you could print the content you pass to Thrift. I don't think performance matters very much at the point where nothing is really working.
>
> > One suggestion is to switch to http client w/ collector
>
> Could you elaborate on that?

The dedicated Jaeger exporter is going to be deprecated, so it's unlikely to get bug fixes. The recommendation is to use the OTLP exporter, as Jaeger can now natively understand OTLP.
PR #1022 has an example. (During a recent refactoring, it looks like the example got lost; I will work to bring it back.)

hdost added the M-exporter-jaeger (Deprecated - we will likely close most related bugs.) label on Feb 21, 2024
@hdost
Contributor

hdost commented Feb 21, 2024

As mentioned, we're looking to stop producing the crate. If you have feedback, please leave it on #995.

@cijothomas
Member

Closing Jaeger-related issues, given the exporter's imminent removal. See #995.
