Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] [Broker] Entry digest does not match #22103

Open
1 of 2 tasks
thetumbled opened this issue Feb 23, 2024 · 19 comments
Open
1 of 2 tasks

[Bug] [Broker] Entry digest does not match #22103

thetumbled opened this issue Feb 23, 2024 · 19 comments

Comments

@thetumbled
Copy link
Member

thetumbled commented Feb 23, 2024

Search before asking

  • I searched in the issues and found nothing similar.

Version

2.9

Minimal reproduce step

We use transactinal feature and start some tasks.

What did you expect to see?

.

What did you see instead?

One entry can't be read out because entry digest does not match, we have meet this problem serveral times, but the frequency of occurrency is pretty low.
image

The replication factor is 3 3 2, so there are 3 copy of this entry. Further, i find that all of these three copy are same, which exclude the possibility of disk error. There must be something wrong with the code.

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@lhotari
Copy link
Member

lhotari commented Feb 23, 2024

This might be related to apache/bookkeeper#4196 . Which BookKeeper version (client & server) do you have?

@lhotari
Copy link
Member

lhotari commented Feb 23, 2024

Do you have TLS enabled between Pulsar & BookKeeper? In the checksum calculation on the bookkeeper server side it has an impact since Netty's TLS implementation would use CompositeByteBufs in some cases.

@thetumbled
Copy link
Member Author

thetumbled commented Feb 23, 2024

This might be related to apache/bookkeeper#4196 . Which BookKeeper version (client & server) do you have?

version 4.14.1, we are studying this pr too, but we are not sure the problem is related with it yet.

@thetumbled
Copy link
Member Author

Do you have TLS enabled between Pulsar & BookKeeper? In the checksum calculation on the bookkeeper server side it has an impact since Netty's TLS implementation would use CompositeByteBufs in some cases.

No, we don't set any tls config.

@lhotari
Copy link
Member

lhotari commented Feb 23, 2024

This might be related to apache/bookkeeper#4196 . Which BookKeeper version (client & server) do you have?

version 4.14.1, we are studying this pr too, but we are not sure the problem is related with it yet.

apache/bookkeeper#2701 is included in 4.14.1 so there's a clear problem in checksum calculation in 4.14.1 . The problem is fixed by apache/bookkeeper#4196 .

@thetumbled
Copy link
Member Author

thetumbled commented Feb 23, 2024

This might be related to apache/bookkeeper#4196 . Which BookKeeper version (client & server) do you have?

version 4.14.1, we are studying this pr too, but we are not sure the problem is related with it yet.

apache/bookkeeper#2701 is included in 4.14.1 so there's a clear problem in checksum calculation in 4.14.1 . The problem is fixed by apache/bookkeeper#4196 .

Yes, there is the bug need to be fixed by apache/bookkeeper#4196.
I have a quetion, which config is used to enable the tls feature between Pulsar & BookKeeper?

@lhotari
Copy link
Member

lhotari commented Feb 23, 2024

Yes, there is the bug need to be fixed by apache/bookkeeper#4196. I have a quetion, which config is used to enable the tls feature between Pulsar & BookKeeper?

@thetumbled One possible reference to using TLS between Pulsar Broker & BookKeeper is to check how the Apache Pulsar Helm Chart handles the TLS configuration.

Regarding TLS, do you use TLS between clients and the broker? I guess that's the common configuration. I think that CompositeByteBufs could be created occasionally by Netty TLS handling also in that case, even without having BrokerEntryMetadata.

@lhotari
Copy link
Member

lhotari commented Feb 23, 2024

This might be related to #21421

@lhotari
Copy link
Member

lhotari commented Feb 23, 2024

There's also #21933

@lhotari
Copy link
Member

lhotari commented Feb 23, 2024

A "double release" of a Netty ByteBuf or a thread safety issue could also cause such issues that the buffer's memory space is already used for other purposes causing the checksum calculation to fail. Netty ByteBufs aren't thread safe alone and it's possible that there could be an unsafe publication of the ByteBuf instance across thread boundaries (safe publication concept is explained here).
This type of issues are hard to debug and it's very hard to reason about the correctness in the Pulsar code base.
Passing an mutable object instance across thread boundaries is fine as long as it happens with one of the methods listed in "safe publication" (which is a concept to make it easier to reason about Java Memory Model's "happens before" rules).

In the "double release" issue the ByteBuf instance would get released twice so that it's already in use by another thread when it is released the second time. There's io.netty.util.ReferenceCountUtil#safeRelease(java.lang.Object) which should be called "unsafe release". It hides the problems, although it does log issues.
In the case of "double release" issues there should be either "Failed to release a message" or IllegalReferenceCountException in logs at some point when the double release results in a failure.

So there's a high chance that checksum errors could happen even after the apache/bookkeeper#4196 fix has been applied.

When looking into the "double release" problem, I did find bad code in the Pulsar code base that hides the problem:

I have a draft PR to address the setRefCnt(1) and adding a solution to catch the double release bugs.
#22110

@thetumbled
Copy link
Member Author

thetumbled commented Feb 26, 2024

Yes, there is the bug need to be fixed by apache/bookkeeper#4196. I have a quetion, which config is used to enable the tls feature between Pulsar & BookKeeper?

@thetumbled One possible reference to using TLS between Pulsar Broker & BookKeeper is to check how the Apache Pulsar Helm Chart handles the TLS configuration.

Regarding TLS, do you use TLS between clients and the broker? I guess that's the common configuration. I think that CompositeByteBufs could be created occasionally by Netty TLS handling also in that case, even without having BrokerEntryMetadata.

I add some log to detect whether the buffer is of type CompositeByteBuf:

        if (unwrapped instanceof CompositeByteBuf) {
            if (logger.isDebugEnabled()) {
                logger.debug("test Digesting a composite buffer for ledger-id: {}, entry-id: {}",
                        ledgerId, entryId);
            }
            ((CompositeByteBuf) unwrapped).forEach(this::update);
        } else {
            update(unwrapped);
        }

And no log above printed, we don't use a CompositeByteBuf object to calculate checksum, so the problem may not related with apache/bookkeeper#4196.

@lhotari
Copy link
Member

lhotari commented Feb 26, 2024

@thetumbled Is there a specific reason to continue to use Pulsar 2.9.x ? The support for 2.9.x has ended a long time ago. Apache Pulsar PMC currently only supports officially 3.0.x and 3.2.x at the moment: https://pulsar.apache.org/contribute/release-policy/#supported-versions

@thetumbled
Copy link
Member Author

thetumbled commented Feb 26, 2024

@thetumbled Is there a specific reason to continue to use Pulsar 2.9.x ? The support for 2.9.x has ended a long time ago. Apache Pulsar PMC currently only supports officially 3.0.x and 3.2.x at the moment: https://pulsar.apache.org/contribute/release-policy/#supported-versions

We upgrade to Pulsar 2.9.x in 2022, and don't upgrade the version frequently but cherry pick some useful pr to fix the problem we meet.😂

@lhotari
Copy link
Member

lhotari commented Feb 26, 2024

@thetumbled Is there a specific reason to continue to use Pulsar 2.9.x ? The support for 2.9.x has ended a long time ago. Apache Pulsar PMC currently only supports officially 3.0.x and 3.2.x at the moment: https://pulsar.apache.org/contribute/release-policy/#supported-versions

We upgrade to Pulsar 2.9.x in 2022, and don't upgrade the version frequently but cherry pick some useful pr to fix the problem we meet.😂

@thetumbled is there a specific reason to follow this practice?

@thetumbled
Copy link
Member Author

@thetumbled Is there a specific reason to continue to use Pulsar 2.9.x ? The support for 2.9.x has ended a long time ago. Apache Pulsar PMC currently only supports officially 3.0.x and 3.2.x at the moment: https://pulsar.apache.org/contribute/release-policy/#supported-versions

We upgrade to Pulsar 2.9.x in 2022, and don't upgrade the version frequently but cherry pick some useful pr to fix the problem we meet.😂

@thetumbled is there a specific reason to follow this practice?

Upgrading pulsar version require lots of verifying work, there are multiple types of client and complcated production enviroment, we can't upgrade the version frequently. Some client may even use 2.8 version sdk.

@lhotari
Copy link
Member

lhotari commented Feb 26, 2024

Upgrading pulsar version require lots of verifying work, there are multiple types of client and complcated production enviroment, we can't upgrade the version frequently. Some client may even use 2.8 version sdk.

Is there a way that Pulsar upgrades could be made easier? This is something that has been discussed on Pulsar Slack too.
It's not a sustainable solution that the Apache Pulsar project tries to support users on huge variations of software. It would be much more productive for everyone if the amount of configurations could be reduced so that we work on stabilizing and improving Pulsar.

@zhanglistar
Copy link

@thetumbled
Copy link
Member Author

We found that the corrupted entry is composed by two corrupted messages, apache/bookkeeper#4196 may not be able to result into such phenomenon.
Do you have any other clues about this problem? thanks. @lhotari

@lhotari
Copy link
Member

lhotari commented Apr 3, 2024

We found that the corrupted entry is composed by two corrupted messages, apache/bookkeeper#4196 may not be able to result into such phenomenon. Do you have any other clues about this problem? thanks. @lhotari

@thetumbled I think @poorbarcode was looking into a similar problem recently and worked on some fixes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants