
OOM in Quarkus 3.0.0.Beta1 caused by okio via OpenTelemetry #32238

Closed
snazy opened this issue Mar 29, 2023 · 33 comments · Fixed by #34647
Labels: area/tracing, kind/bug

Comments

@snazy (Contributor) commented Mar 29, 2023

Describe the bug

okio starts a thread via the class okio.AsyncTimeout.Watchdog once it "has to deal with a timeout". The thread is configured as a daemon thread, seems to have some shutdown logic, and also aborts when it's interrupted.

This behavior isn't an issue in production use cases, but it is a problem when running tests.

okio.AsyncTimeout.Watchdog is loaded by a Quarkus class loader for every test, so the thread implicitly keeps a reference to its class loader and, transitively, to all the resources that loader holds - just because the thread is still running.
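
To illustrate the retention chain (a minimal, hypothetical sketch - not okio's actual code): a running thread keeps its class reachable, and the class keeps its defining class loader reachable.

```java
// Hypothetical sketch of the leak pattern, not okio's actual code:
// a never-terminating daemon thread pins the class loader that defined it.
public class LeakyWatchdog extends Thread {

    public LeakyWatchdog() {
        super("LeakyWatchdog");
        // Daemon threads don't block JVM exit, but they do keep running.
        setDaemon(true);
    }

    @Override
    public void run() {
        while (true) {
            try {
                Thread.sleep(60_000); // stand-in for "wait for the next timeout"
            } catch (InterruptedException e) {
                return; // only exits if something interrupts it
            }
            // While this thread is alive, LeakyWatchdog.class stays reachable,
            // which keeps its defining ClassLoader reachable, which keeps every
            // class and resource that loader holds - one per Quarkus test.
        }
    }
}
```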

Setting the following parameters disables timeouts, which in turn means the watchdog thread is never started and the OOM doesn't happen. It may be a legitimate workaround for tests until the issue is fixed.

quarkus.otel.exporter.otlp.timeout=0
quarkus.otel.exporter.otlp.traces.timeout=0

This behavior seems to have been introduced after Quarkus 3.0.0.Alpha5, but I'm not sure why, because okio appears to have behaved this way "forever".

Expected behavior

No response

Actual behavior

No response

How to Reproduce?

No response

Output of uname -a or ver

No response

Output of java -version

No response

GraalVM version (if different from Java)

No response

Quarkus version or git rev

No response

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

No response

@snazy snazy added the kind/bug Something isn't working label Mar 29, 2023
@quarkus-bot bot commented Mar 29, 2023

/cc @brunobat (opentelemetry), @radcortez (opentelemetry)

snazy added a commit to snazy/nessie that referenced this issue Mar 29, 2023
snazy added a commit to snazy/nessie that referenced this issue Mar 30, 2023
@brunobat (Contributor) commented Mar 30, 2023

Hi @snazy, I'm not sure what kind of tests you are performing that lead to an OOM... I would need an example.
The recommended config for tests with OTel, if using Quarkus 3, is something like this:

quarkus.otel.bsp.schedule.delay=50
quarkus.otel.bsp.export.timeout=1s

This makes sure everything is flushed in a timely fashion, because the defaults are tuned for production. See:
quarkus.otel.bsp.schedule.delay
quarkus.otel.bsp.export.timeout

Please let me know if this works for you.
I can document this issue.
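
Put together in application.properties, that could look like this (a sketch; the values come from the comment above, and the %test prefix is Quarkus' standard config-profile syntax):

```properties
# Test-profile tuning sketch (values from the comment above; the %test
# prefix is Quarkus' standard config-profile syntax):
%test.quarkus.otel.bsp.schedule.delay=50
%test.quarkus.otel.bsp.export.timeout=1s
```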

@snazy (Contributor, Author) commented Mar 30, 2023

It's a legitimate bug caused by the use of okio/okhttp with that watchdog thread in tests.

It's not about flushing; it's that the watchdog thread never terminates and keeps all classes on the heap, eventually producing an OOM.

I can repro at will:

You can also attach a debugger to it and check that the Watchdog thread keeps running and never terminates.
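
A debugger isn't strictly needed; a check like the following could confirm it from a test (a sketch - the "Okio Watchdog" thread name is taken from okio's sources, so treat it as an assumption):

```java
import java.util.Map;

public class WatchdogCheck {

    // Print any okio watchdog threads that are still alive, together with
    // the class loader that defined their class. The "Okio Watchdog" name
    // is an assumption based on okio's AsyncTimeout sources.
    public static void dumpWatchdogThreads() {
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = entry.getKey();
            if (t.getName().contains("Okio Watchdog")) {
                System.out.printf("%s alive=%b daemon=%b definingLoader=%s%n",
                    t.getName(), t.isAlive(), t.isDaemon(),
                    t.getClass().getClassLoader());
            }
        }
    }
}
```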

snazy added a commit to snazy/nessie that referenced this issue Mar 30, 2023
@brunobat (Contributor)

What memory settings are you using, @snazy?

@snazy (Contributor, Author) commented Mar 30, 2023

At max 1.5G heap IIRC, maybe less

@brunobat (Contributor)

I cannot run the tests; I get a "too many open files" error before I reach any OOM.
I'm on a locked-down Mac, with no way to raise that limit.
Is there a specific test where the OOM happens more often?
(I'll find a different, slower machine.)

@snazy (Contributor, Author) commented Mar 30, 2023

There are not that many test classes; it OOMs pretty much "at the end".

snazy added a commit to snazy/nessie that referenced this issue Mar 30, 2023
@brunobat (Contributor) commented Mar 31, 2023

I asked @alesj to please take a look.

snazy added a commit to snazy/nessie that referenced this issue Apr 2, 2023
@alesj (Contributor) commented Apr 3, 2023

I already got this on the main branch (which I ran by mistake ...)

@alesj (Contributor) commented Apr 3, 2023

Whereas this never stopped: https://gist.github.com/alesj/60d7467f85644dd639869f3c23833818 ... 42 min and it would go on ...

@snazy any easier way to reproduce this -- one that doesn't need the whole Nessie project?

@snazy (Contributor, Author) commented Apr 3, 2023

Probably - you could create a test project with many test classes, each producing an OTel event.

@alesj (Contributor) commented Apr 3, 2023

@snazy what if you set this

for the okio library (see the sketch below)? That way okio classes would only be loaded once for all tests.
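
The referenced setting didn't survive in the thread above; presumably it is Quarkus' parent-first class-loading config, along these lines (okio's artifact coordinates are an assumption):

```properties
# Sketch of the suggested parent-first override (the exact line was not
# captured above; okio's coordinates are an assumption):
quarkus.class-loading.parent-first-artifacts=com.squareup.okio:okio
```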

@alesj (Contributor) commented Apr 5, 2023

@snazy any progress / luck with this parent-first approach?
Or an easier reproducer ...

snazy added a commit to snazy/nessie that referenced this issue Apr 7, 2023
snazy added a commit to snazy/nessie that referenced this issue Apr 7, 2023
snazy added a commit to snazy/nessie that referenced this issue Apr 7, 2023
snazy added a commit to snazy/nessie that referenced this issue Apr 8, 2023
@snazy (Contributor, Author) commented Apr 8, 2023

Sorry, no luck so far. It's pretty clear that it's caused by okio; this is already known and has been hit before. I think the current idea is to remove okio. It's "only" a test issue (prod-like setups don't reload stuff).

snazy added a commit to snazy/nessie that referenced this issue Jul 5, 2023
@geoand (Contributor) commented Jul 7, 2023

So I assume the workaround we need is to have our own implementation of io.opentelemetry.exporter.internal.grpc.GrpcExporter, which would be used instead of io.opentelemetry.exporter.internal.grpc.OkHttpGrpcExporter?

@brunobat (Contributor) commented Jul 7, 2023

Yes, and not just for this reason. We need to drop OkHttp for support reasons.

@geoand (Contributor) commented Jul 7, 2023

Good point

@geoand (Contributor) commented Jul 7, 2023

So I assume this is one for @alesj, due to the need for gRPC.

@brunobat (Contributor) commented Jul 7, 2023

Yes, it would be nice

@geoand (Contributor) commented Jul 7, 2023

I might be misreading OkHttpGrpcExporter, but it seems to simply be making an HTTP call. Is that correct, or am I missing something?

@brunobat (Contributor) commented Jul 7, 2023

The OkHttp lib is able to do both gRPC and HTTP; there is a config for that.

@geoand (Contributor) commented Jul 7, 2023

Yeah, I see that. My point is that in reality, OkHttpGrpcExporter doesn't really need to know much about gRPC.

And that is actually true, as the description of the original PR that introduced the class says.

@brunobat (Contributor) commented Jul 7, 2023

Correct.
Also, beware of this: open-telemetry/opentelemetry-java#5557
We might not need to implement much.

@geoand (Contributor) commented Jul 7, 2023

I saw that too, yeah. But from a quick read through it, it doesn't seem to change any gRPC-related stuff.

@geoand (Contributor) commented Jul 7, 2023

According to this, it's not possible to use the JDK HTTP Client because there is no trailer support.

@cescoffier @alesj I assume we could leverage the Vert.x GrpcClient or even the HttpClient here, as all we need to do is shove some bytes over HTTP/2 (see what they are currently doing with OkHttp). WDYT?

@brunobat (Contributor) commented Jul 7, 2023

That's actually my preference as well.
Before this JDK HTTP Client implementation there was no API to plug in a different client; everything was tied to OkHttp.
Now there is one.

@cescoffier (Member)

What do you call trailing headers? Trailers? gRPC often uses trailers, and we do support that.

@geoand (Contributor) commented Jul 7, 2023

@cescoffier yeah, trailers

> and we do support that.

so it should be possible to use the plain GrpcClient from Vert.x to perform the write operation we need?

@cescoffier (Member)

Yes, it should be fine.

@geoand (Contributor) commented Jul 7, 2023

Cool, thanks

@geoand (Contributor) commented Jul 7, 2023

I assume https://vertx.io/docs/vertx-grpc/java/#_message_level_api_2 is what we're after. I'll try to have a look next week.
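
For a rough idea of the shape this could take (a sketch, not the eventual Quarkus implementation; the port and the OTLP method name follow the OTLP spec, everything else is illustrative):

```java
import io.vertx.core.Vertx;
import io.vertx.core.buffer.Buffer;
import io.vertx.core.net.SocketAddress;
import io.vertx.grpc.client.GrpcClient;

public class OtlpOverVertxSketch {

    // Ships one protobuf-encoded ExportTraceServiceRequest to an OTLP/gRPC
    // endpoint via the Vert.x gRPC message-level API (unary call).
    public static void export(Vertx vertx, byte[] encodedExportRequest) {
        GrpcClient client = GrpcClient.client(vertx);
        SocketAddress otlp = SocketAddress.inetSocketAddress(4317, "localhost");

        client.request(otlp).onSuccess(request -> {
            // gRPC path = service/method, as defined by the OTLP spec
            request.fullMethodName(
                "opentelemetry.proto.collector.trace.v1.TraceService/Export");
            request.response().onSuccess(response -> {
                // Each buffer is an encoded ExportTraceServiceResponse
                response.handler(msg -> System.out.println(
                    "OTLP response: " + msg.length() + " bytes"));
                response.endHandler(v -> System.out.println("export done"));
            });
            // Write the single request message and half-close the stream
            request.end(Buffer.buffer(encodedExportRequest));
        });
    }
}
```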

@geoand (Contributor) commented Jul 10, 2023

I have a prototype here. I think it's a pretty good starting point, and if you agree I can move it forward; it should definitely be reviewed / improved by Vert.x / gRPC / OTel folks.

@geoand geoand self-assigned this Jul 10, 2023
@geoand (Contributor) commented Jul 10, 2023

I got native to work, so now I think it's more a matter of hardening.

snazy added a commit to snazy/nessie that referenced this issue Jul 10, 2023
snazy added a commit to projectnessie/nessie that referenced this issue Jul 11, 2023
Includes a workaround for Quarkus3/OTel/okio OOM in tests

See quarkusio/quarkus#32238
@geoand geoand moved this from Todo to In Progress in Quarkus Roadmap/Planning Jul 12, 2023
geoand added a commit that referenced this issue Jul 18, 2023
Replace OkHttp tracing backend with Vert.x
@github-project-automation github-project-automation bot moved this from In Progress to Done in Quarkus Roadmap/Planning Jul 18, 2023
@quarkus-bot quarkus-bot bot added this to the 3.3 - main milestone Jul 18, 2023