[Bug] [Broker] Entry digest does not match #22103
This might be related to apache/bookkeeper#4196. Which BookKeeper version (client & server) do you have?
Do you have TLS enabled between Pulsar & BookKeeper? It has an impact on the checksum calculation on the bookkeeper server side, since Netty's TLS implementation would use CompositeByteBufs in some cases.
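The CompositeByteBuf concern can be illustrated without Netty. Below is a minimal sketch using only the JDK's `java.util.zip.CRC32C` (the digest type and all class/method names here are assumptions for illustration, not Pulsar or BookKeeper code): a checksum computed over a payload split into components only matches the single-pass checksum when every component is fed to the digest in order, so a composite-buffer code path that skips bytes produces exactly this kind of "entry digest does not match" failure.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32C;

public class CompositeChecksum {
    // Checksum computed over the whole payload in one pass.
    static long singlePass(byte[] payload) {
        CRC32C crc = new CRC32C();
        crc.update(payload, 0, payload.length);
        return crc.getValue();
    }

    // Checksum computed component by component, as a composite-buffer
    // code path must do: every component, in order, with no bytes skipped.
    static long perComponent(byte[][] components) {
        CRC32C crc = new CRC32C();
        for (byte[] c : components) {
            crc.update(c, 0, c.length);
        }
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] payload = "entry-payload-split-across-components"
                .getBytes(StandardCharsets.UTF_8);
        byte[] head = Arrays.copyOfRange(payload, 0, 10);
        byte[] tail = Arrays.copyOfRange(payload, 10, payload.length);

        long expected = singlePass(payload);
        long correct = perComponent(new byte[][] { head, tail });
        long buggy = perComponent(new byte[][] { head }); // a skipped component

        System.out.println(expected == correct); // true: all components included
        System.out.println(expected == buggy);   // false: digest mismatch
    }
}
```

CRC32C is incremental, so feeding the components in order is equivalent to one pass over the whole payload; the bug class arises when a code path assumes a single contiguous backing array and silently misses part of a composite buffer.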
Version 4.14.1. We are studying this PR too, but we are not yet sure whether the problem is related to it.
No, we don't set any TLS config.
apache/bookkeeper#2701 is included in 4.14.1, so there's a clear problem in checksum calculation in 4.14.1. The problem is fixed by apache/bookkeeper#4196.
Yes, there is a bug that needs to be fixed by apache/bookkeeper#4196.
@thetumbled One possible reference for using TLS between the Pulsar broker & BookKeeper is to check how the Apache Pulsar Helm Chart handles the TLS configuration. Regarding TLS, do you use TLS between clients and the broker? I guess that's the common configuration. I think that CompositeByteBufs could be created occasionally by Netty TLS handling in that case too, even without having BrokerEntryMetadata.
This might be related to #21421 |
There's also #21933 |
A "double release" of a Netty ByteBuf, or a thread safety issue, could also cause such problems: the buffer's memory space is already reused for other purposes, causing the checksum calculation to fail. Netty ByteBufs aren't thread safe on their own, and it's possible that there could be an unsafe publication of the ByteBuf instance across thread boundaries (the safe publication concept is explained here). In the "double release" case, the ByteBuf instance gets released twice, so its memory may already be in use by another thread when it is released the second time. So there's a high chance that checksum errors could happen even after the apache/bookkeeper#4196 fix has been applied. When looking into the "double release" problem, I did find bad code in the Pulsar code base that hides the problem:
Line 52 in 82237d3
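To make the "double release" failure mode concrete, here is a hypothetical, heavily simplified reference-counting sketch using only the JDK (this is not Netty's actual implementation; Netty's real ByteBuf throws IllegalReferenceCountException and returns memory to pooled arenas). Once the count reaches zero the memory may be handed to another thread, so a second release, or any use after it, touches memory that no longer belongs to this buffer.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of reference counting, not Netty's actual code.
public class RefCountedBuffer {
    private final AtomicInteger refCnt = new AtomicInteger(1);
    private byte[] memory = new byte[64]; // stands in for pooled memory

    public RefCountedBuffer retain() {
        int old = refCnt.getAndIncrement();
        if (old <= 0) {
            refCnt.getAndDecrement();
            throw new IllegalStateException("retain on released buffer, refCnt was " + old);
        }
        return this;
    }

    // Returns true when this call deallocated the buffer.
    public boolean release() {
        int remaining = refCnt.decrementAndGet();
        if (remaining == 0) {
            memory = null; // in a real pool, the memory is handed to another user here
            return true;
        }
        if (remaining < 0) {
            // A "double release": the memory may already belong to another
            // thread, so any further use (e.g. a checksum calculation over
            // this buffer) would read someone else's bytes.
            throw new IllegalStateException("double release detected, refCnt=" + remaining);
        }
        return false;
    }

    public int refCnt() {
        return refCnt.get();
    }
}
```

The dangerous variant is when the double release is *not* detected: the pool hands the freed memory to another writer, the original holder still computes a checksum over it, and the digest no longer matches the stored payload.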
I have a draft PR to address the
I added some logs to detect whether the buffer is of type CompositeByteBuf.
And no log above was printed: we don't use a CompositeByteBuf object to calculate the checksum, so the problem may not be related to apache/bookkeeper#4196.
@thetumbled Is there a specific reason to continue to use Pulsar 2.9.x ? The support for 2.9.x has ended a long time ago. Apache Pulsar PMC currently only supports officially 3.0.x and 3.2.x at the moment: https://pulsar.apache.org/contribute/release-policy/#supported-versions |
We upgraded to Pulsar 2.9.x in 2022 and don't upgrade the version frequently; instead we cherry-pick useful PRs to fix the problems we meet. 😂
@thetumbled is there a specific reason to follow this practice? |
Upgrading the Pulsar version requires a lot of verification work; there are multiple types of clients and a complicated production environment, so we can't upgrade frequently. Some clients may even use the 2.8 version of the SDK.
Is there a way that Pulsar upgrades could be made easier? This is something that has been discussed on Pulsar Slack too. |
We found that the corrupted entry is composed of two corrupted messages; apache/bookkeeper#4196 may not be able to produce such a phenomenon.
@thetumbled I think @poorbarcode was looking into a similar problem recently and worked on some fixes. |
Search before asking
Version
2.9
Minimal reproduce step
We use the transactional feature and start some tasks.
What did you expect to see?
.
What did you see instead?
One entry can't be read out because the entry digest does not match. We have met this problem several times, but the frequency of occurrence is pretty low.
The replication factor is 3, 3, 2, so there are 3 copies of this entry. Further, I found that all three copies are the same, which excludes the possibility of disk error. There must be something wrong with the code.
Anything else?
No response
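A note on the symptom reported above: since all three replicas are byte-identical, the corruption (or a digest computed over different bytes than were written) most likely happened before the entry was fanned out to the bookies, so every copy stores the same inconsistent (payload, digest) pair. A hedged sketch of read-time digest verification, with CRC32C assumed purely for illustration (class and method names below are hypothetical, not BookKeeper's actual API):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32C;

public class DigestCheck {
    // Recompute the digest over the stored payload and compare it with the
    // digest stored alongside the entry; a mismatch is reported to the reader
    // as "entry digest does not match".
    static boolean verify(byte[] storedPayload, long storedDigest) {
        CRC32C crc = new CRC32C();
        crc.update(storedPayload, 0, storedPayload.length);
        return crc.getValue() == storedDigest;
    }

    public static void main(String[] args) {
        byte[] payload = "entry".getBytes(StandardCharsets.UTF_8);
        CRC32C crc = new CRC32C();
        crc.update(payload, 0, payload.length);
        long digest = crc.getValue();

        // If the payload is corrupted AFTER the digest is computed but BEFORE
        // the write is replicated, every replica stores the same corrupted
        // bytes with the same stale digest, and every read fails identically,
        // matching the "all three copies are the same" observation.
        payload[0] ^= 0x01; // corrupt the payload after digest computation
        System.out.println(verify(payload, digest)); // false on every replica
    }
}
```

By contrast, a disk error would corrupt each replica independently, so the three copies would differ; identical failures on all replicas point at the write path rather than the storage media.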
Are you willing to submit a PR?