[BUG] Jackson 2.17.0 LockFreePool causes memory issues #4729

Closed
JannikBrand opened this issue Jul 11, 2024 · 10 comments

Comments

@JannikBrand
Contributor

JannikBrand commented Jul 11, 2024

Describe the bug

Data Prepper runs into heap OOM issues. This was observed when ingesting OTel metrics via Data Prepper into OpenSearch (~400 metric data points per second).

image

(The picture shows the summed up heap memory of 2 Data Prepper instances. The instances do not crash, since circuit breakers are configured and constantly open.)
The memory is consumed by objects sitting in the Old Gen space.

Possible trigger: The issue started to occur when updating from DP version 2.7.0 to 2.8.0.

I created a heap dump:

image

The org.opensearch.dataprepper.pipeline.Pipeline object retains almost all of the memory. Within the dominator tree, I can trace the memory consumption back to the Jackson LockFreePool:

image

There are some known issues with the LockFreePool, e.g. see

I discovered a memory leak while performing performance test in a multi threaded environment. This seems to be due to the switch to the LockFreePool introduced in 2.17.0.

if you search around, you'll see that this pool is not working well. Stick with Jackson 2.16 or override the recycler pool to use the thread local one. 2.17.1 goes back to that pool as the default.

I am not sure exactly which Jackson version is used within the OpenSearch sink, but at least we can see that the LockFreePool is used.
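For reference, the workaround quoted above (overriding the recycler pool with the thread-local one) could look roughly like the following configuration fragment. It is a minimal sketch, assuming Jackson 2.16 or later on the classpath and assuming that the code creating the ObjectMapper is under your control; the class name ThreadLocalPoolMapper is hypothetical, and whether the OpenSearch sink exposes a hook to pass in such a mapper is not something this thread establishes.

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.util.JsonRecyclerPools;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.Map;

// Hypothetical helper class, not part of Data Prepper or Jackson.
public class ThreadLocalPoolMapper {

    /** Builds an ObjectMapper whose JsonFactory uses the thread-local buffer recycler pool. */
    public static ObjectMapper create() {
        JsonFactory factory = JsonFactory.builder()
                // Explicitly select the thread-local pool instead of relying on the
                // version-dependent default (LockFreePool in 2.17.0).
                .recyclerPool(JsonRecyclerPools.threadLocalPool())
                .build();
        return new ObjectMapper(factory);
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = create();
        // Serialize a small value to show that the mapper is usable as normal.
        System.out.println(mapper.writeValueAsString(Map.of("metric", "cpu")));
    }
}
```

This only affects mappers created through this factory; any ObjectMapper constructed elsewhere with the default JsonFactory would still use the default pool of the bundled Jackson version.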

To Reproduce
Steps to reproduce the behavior:

  1. Set up Data Prepper with the OTel metrics source and processor and the OpenSearch sink.
  2. Ingest OTel metrics.
  3. Wait. (I could not reproduce it reliably in my dev setup; however, for some Data Prepper instances in our environment it happens frequently.)

Expected behavior
For comparison this is how the heap utilization looks without this issue (same ingestion workload):
image

Environment (please complete the following information):

  • Data Prepper 2.8.0
@JannikBrand JannikBrand added bug Something isn't working untriaged labels Jul 11, 2024
@JannikBrand
Contributor Author

JannikBrand commented Jul 11, 2024

I think I found the reason why it started to occur for version 2.8.0:
The LockFreePool has the parent node OpenSearchClientRefresher (see dominator tree above). The client refresher was added with #4283.

@KarstenSchnitter
Collaborator

I analysed the issue together with @JannikBrand. We also pulled a thread dump while the circuit breaker was active and no data was being ingested. We could verify that all threads were waiting for data, either from the network or from a queue within Data Prepper. The retained memory is therefore not held by in-flight work, which underlines the issue with the _recyclerPool from Jackson.

@KarstenSchnitter
Collaborator

This bug might have been introduced by a transitive dependency of armeria v1.28.2, which depends on Jackson 2.17.0. This would explain why the issue with the LockFreePool for the _recyclerPool arises even though the explicit Jackson dependency is specified as 2.16.2. I am going to verify which version is actually bundled into Data Prepper 2.8.0. The upgrade of armeria to v1.29.0 upgrades Jackson to the fixed version 2.17.1. Therefore, the problem should not be reproducible on the main branch.

@KarstenSchnitter
Collaborator

KarstenSchnitter commented Jul 15, 2024

I downloaded the Linux distribution of DataPrepper 2.8.0 and found the vulnerable Jackson version 2.17.0 in the libs folder:

dataprepper_2 8 0_jackson-dependencies

This indicates a conflict with the explicit jackson-bom 2.16.1 in

implementation platform('com.fasterxml.jackson:jackson-bom:2.16.1')

As a fix, the armeria version needs to be upgraded to at least 1.29.0. This has already been done by @dlvenable on the main branch. I suggest backporting #4629 to the 2.8 release. Furthermore, the mismatch between the version declared in build.gradle and the Jackson version actually bundled should be addressed.
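The manual check of the libs folder described above could be automated along these lines. This is only a sketch under stated assumptions: the class name JacksonJarScan, the default directory name "lib", and the file-name pattern (artifact name, dash, dotted version, ".jar") are all assumptions made for illustration and are not taken from the Data Prepper distribution or its build.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Hypothetical helper: lists the Jackson jars found in a distribution's library folder.
public class JacksonJarScan {

    // Assumed naming scheme, e.g. jackson-core-2.17.0.jar
    private static final Pattern JAR =
            Pattern.compile("(jackson-[a-z0-9-]+?)-(\\d+(?:\\.\\d+)*)\\.jar");

    /** Returns {artifact, version} for a Jackson jar file name, or null for any other file. */
    static String[] parse(String fileName) {
        Matcher m = JAR.matcher(fileName);
        return m.matches() ? new String[] {m.group(1), m.group(2)} : null;
    }

    /** Maps each Jackson artifact found directly inside libDir to its version. */
    static Map<String, String> scan(Path libDir) throws IOException {
        Map<String, String> found = new TreeMap<>();
        try (Stream<Path> files = Files.list(libDir)) {
            files.map(p -> parse(p.getFileName().toString()))
                 .filter(Objects::nonNull)
                 .forEach(a -> found.put(a[0], a[1]));
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // The directory is passed as the first argument; "lib" is just a placeholder default.
        Path libDir = Path.of(args.length > 0 ? args[0] : "lib");
        scan(libDir).forEach((artifact, version) ->
                System.out.println(artifact + " " + version));
    }
}
```

Comparing this output against the version declared in the jackson-bom line would make a mismatch like the one found here visible without opening a heap dump.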

@KarstenSchnitter KarstenSchnitter changed the title [BUG] Jackson LockFreePool causes memory issues [BUG] Jackson 2.17.0 LockFreePool causes memory issues Jul 15, 2024
@dlvenable dlvenable self-assigned this Jul 16, 2024
@dlvenable
Member

@KarstenSchnitter , @JannikBrand , Thank you for reporting this issue and the fantastic analysis!

It does appear that Jackson 2.17.1 fixes this. I'm putting together some backport PRs to support a 2.8.1 release to fix this.

Would you be able to test this using a locally-built Data Prepper on the 2.8 branch to see if it resolves the issue?

dlvenable added a commit to dlvenable/data-prepper that referenced this issue Jul 16, 2024
@dlvenable dlvenable added this to the v2.8.1 milestone Jul 16, 2024
@KarstenSchnitter
Collaborator

KarstenSchnitter commented Jul 17, 2024

@dlvenable: I talked to @JannikBrand about testing your change. In principle, we are able to verify whether the upgrade is effective. But we both have a few days off, so we can only look into it next week.

dlvenable added a commit that referenced this issue Jul 17, 2024
opensearch-trigger-bot bot pushed a commit that referenced this issue Jul 17, 2024
Signed-off-by: David Venable
(cherry picked from commit 418a2a5)
@JannikBrand
Contributor Author

I checked out the backport/backport-4744-to-2.8 branch from this PR to verify the change. I built a Docker image on that branch and let it run. Then I created a heap dump and could confirm that there is no longer a LockFreePool reference within openSearchClientRefresher > currentClient > transport > mapper > innerMapper > jsonProvider > jsonFactory:

image

So, I think the change took effect. I did not perform an actual performance test, since I first would have to reproduce this locally with 2.8.0 and afterwards again with the patched 2.8.1 version. I could still do it next week, or instead I could also just confirm that the memory issues do not reoccur in our environment after upgrading to the patched 2.8.1 version.

dlvenable added a commit that referenced this issue Jul 19, 2024
Signed-off-by: David Venable
(cherry picked from commit 418a2a5)

Co-authored-by: David Venable
@dlvenable
Member

@JannikBrand , We just released Data Prepper 2.8.1 if you'd like to try to verify that the issue is resolved.

@JannikBrand
Contributor Author

JannikBrand commented Aug 2, 2024

@dlvenable I verified that the same aspect from my last comment is true for the released version:

could confirm, that there is no LockFreePool reference anymore

From our side the issue can be closed. Thanks for processing and fixing it so quickly!

@dlvenable
Member

You're welcome @JannikBrand. And thank you for the great analysis that helped us resolve this so quickly.
