Exporter jaeger encountered the following error(s): thrift agent failed with message too long #851
I too am experiencing this issue, and the proposed fixes (as seen in other issues) don't work for me.

```
OTEL_BSP_MAX_EXPORT_BATCH_SIZE=25 OTEL_BSP_MAX_QUEUE_SIZE=32768 cargo run
```

```rust
fn init_tracer() -> Result<sdktrace::Tracer, TraceError> {
    opentelemetry_jaeger::new_pipeline()
        .with_service_name("trace-demo")
        .with_max_packet_size(9216) // Default max UDP packet size on OSX
        .with_auto_split_batch(true) // Auto split batches so they fit under packet size
        .install_batch(opentelemetry::runtime::Tokio)
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error + Send + Sync + 'static>> {
    // JAEGER
    // Create a layer with the configured tracer
    let tracer = init_tracer()?;
    let otel_layer = tracing_opentelemetry::layer().with_tracer(tracer);
    let subscriber = Registry::default().with(otel_layer);
    tracing::subscriber::set_global_default(subscriber).expect("setting default subscriber failed");
    Ok(())
}
```

But I am still seeing:

```
OpenTelemetry trace error occurred. Exporter jaeger encountered the following error(s): thrift agent failed with message too long
```

Before the changes above I saw:

```
OpenTelemetry trace error occurred. cannot send span to the batch span processor because the channel is full
```

I am running a few hundred async tasks in parallel, and a Jaeger instance locally.
I observed the same issue on our OSS project (quickwit-oss/quickwit#2295). I tried a bunch of different settings but did not manage to make it work. The errors happen when there are a lot of spans; I will try to isolate that and report it here.
I dug into the issue a bit, and the problem comes from the number of bytes of spans that will be sent to Jaeger. I guess the error that you have is the same as mine (I added some
I solved the issue by doing two things:
@bes In your case, what is the max size of your UDP packet? I'm on macOS and I had to run
That whole protocol design seems broken if you need to tweak the max UDP packet size to work around the flaws in its design.
As an alternative, have you tried `with_auto_split_batch`? This config automatically splits a span batch if it exceeds the UDP max size for one packet. Note that it has a performance overhead.
As mentioned in my first post, I did try that and it was just as broken.
Ah, sorry, I missed it. In that case, the most likely cause is that a single span exceeded the UDP packet limit; since we cannot split one span, we have to fail the request. As for the debugging information: we use the Apache Thrift Rust client, so the only information we have is the error passed back to us from the Thrift agent. I will see what's available and add some more context to the error message. As for the protocol design, the UDP limit is a known Jaeger issue (see https://www.jaegertracing.io/docs/1.39/client-libraries/#emsgsize-and-udp-buffer-limits), so I don't think we can do more about that. One suggestion is to switch to the HTTP client with a collector.
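For reference, a minimal sketch of the collector variant mentioned above, assuming opentelemetry-jaeger ~0.16 with the `collector_client` and `reqwest_collector_client` features enabled; the endpoint shown is the default local Jaeger collector port, adjust as needed:

```rust
// Sketch: export spans to the Jaeger collector over HTTP instead of the UDP agent.
// HTTP has no per-datagram size limit, so oversized batches are not dropped.
use opentelemetry::sdk::trace as sdktrace;
use opentelemetry::trace::TraceError;

fn init_tracer() -> Result<sdktrace::Tracer, TraceError> {
    opentelemetry_jaeger::new_pipeline()
        .with_service_name("trace-demo")
        // Default thrift-over-HTTP endpoint of a locally running Jaeger collector
        .with_collector_endpoint("http://localhost:14268/api/traces")
        .install_batch(opentelemetry::runtime::Tokio)
}
```

This trades the fire-and-forget UDP path for HTTP requests, which is slower per batch but avoids the EMSGSIZE failure mode entirely.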
It would probably be useful if we could get the size of the message that failed. The effect of the whole set of parameters I tried is very opaque. If you can't print the size, maybe you could print the content you pass to Thrift. I don't think performance matters very much at the point where nothing is really working.
Could you elaborate on that?
macOS has a max UDP size set to 9216 by default. Would it be acceptable to adjust the values in this library to do the right thing out of the box? Requiring every project to figure this out through trial and error, then fix it, sounds like a waste of time.
The dedicated Jaeger exporter is going to be deprecated, so it's unlikely to get bug fixes. The recommendation is to use the OTLP exporter, as Jaeger can now natively understand OTLP:
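A minimal sketch of the recommended replacement, assuming the opentelemetry-otlp crate with its `tonic` feature and a Jaeger instance started with OTLP ingestion enabled (`COLLECTOR_OTLP_ENABLED=true` on older Jaeger releases); the endpoint is Jaeger's default OTLP gRPC port:

```rust
// Sketch: replace the Jaeger-specific exporter with the generic OTLP exporter.
use opentelemetry::sdk::trace as sdktrace;
use opentelemetry::trace::TraceError;
use opentelemetry_otlp::WithExportConfig;

fn init_tracer() -> Result<sdktrace::Tracer, TraceError> {
    opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                // Jaeger accepts OTLP natively on gRPC port 4317
                .with_endpoint("http://localhost:4317"),
        )
        .install_batch(opentelemetry::runtime::Tokio)
}
```

Because OTLP runs over gRPC or HTTP, the UDP datagram-size failure mode discussed in this thread does not apply.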
As mentioned, we're looking to stop producing this crate; if you have feedback, please leave it on #995
Closing jaeger related issues, given its imminent removal. See #995 |
I am using opentelemetry-jaeger 0.16.0 with tracing-opentelemetry 0.17.4 on Linux along with the latest Jaeger binary (non-docker) release.
I did see the old tickets on this issue (#648 #676 and #759 ) but none of the workarounds described there seem to have any significant impact on the issue.
I tried setting `OTEL_BSP_MAX_EXPORT_BATCH_SIZE` to progressively lower values, down to 1, which did not make the error disappear. I tried using `with_auto_split_batch(true)` with `with_max_packet_size` values of 8192, 4096, 1024, 512, and finally 256, and the error did not disappear. I tried `install_simple()` instead of `install_batch(Tokio)` and the error did not disappear.
I do not produce spans that are particularly large (just half a line of text and a host name as a parameter) but they are relatively short and plentiful.
Some data appears in jaeger but it is just a dozen or two spans before it aborts (per run of my program obviously).
I honestly have my doubts about the UDP packet size explanation in the previous tickets since everyone else seems to be able to handle sending larger amounts of data over UDP just fine.
Maybe it would be useful to add more information to the error message, e.g. how long it was, how many and which spans were batched and what the limit for 'too long' is.