[Bug] [Broker] Entry digest does not match #22103

thetumbled · 2024-02-23T06:57:23Z

Search before asking

I searched in the issues and found nothing similar.

Version

2.9

Minimal reproduce step

We use transactinal feature and start some tasks.

What did you expect to see?

.

What did you see instead?

One entry can't be read out because entry digest does not match, we have meet this problem serveral times, but the frequency of occurrency is pretty low.

The replication factor is 3 3 2, so there are 3 copy of this entry. Further, i find that all of these three copy are same, which exclude the possibility of disk error. There must be something wrong with the code.

Anything else?

No response

Are you willing to submit a PR?

I'm willing to submit a PR!

The text was updated successfully, but these errors were encountered:

lhotari · 2024-02-23T08:59:54Z

This might be related to apache/bookkeeper#4196 . Which BookKeeper version (client & server) do you have?

lhotari · 2024-02-23T09:01:40Z

Do you have TLS enabled between Pulsar & BookKeeper? In the checksum calculation on the bookkeeper server side it has an impact since Netty's TLS implementation would use CompositeByteBufs in some cases.

thetumbled · 2024-02-23T09:10:14Z

This might be related to apache/bookkeeper#4196 . Which BookKeeper version (client & server) do you have?

version 4.14.1, we are studying this pr too, but we are not sure the problem is related with it yet.

thetumbled · 2024-02-23T09:10:47Z

Do you have TLS enabled between Pulsar & BookKeeper? In the checksum calculation on the bookkeeper server side it has an impact since Netty's TLS implementation would use CompositeByteBufs in some cases.

No, we don't set any tls config.

lhotari · 2024-02-23T09:56:37Z

This might be related to apache/bookkeeper#4196 . Which BookKeeper version (client & server) do you have?

version 4.14.1, we are studying this pr too, but we are not sure the problem is related with it yet.

apache/bookkeeper#2701 is included in 4.14.1 so there's a clear problem in checksum calculation in 4.14.1 . The problem is fixed by apache/bookkeeper#4196 .

thetumbled · 2024-02-23T10:14:27Z

This might be related to apache/bookkeeper#4196 . Which BookKeeper version (client & server) do you have?

version 4.14.1, we are studying this pr too, but we are not sure the problem is related with it yet.

apache/bookkeeper#2701 is included in 4.14.1 so there's a clear problem in checksum calculation in 4.14.1 . The problem is fixed by apache/bookkeeper#4196 .

Yes, there is the bug need to be fixed by apache/bookkeeper#4196.
I have a quetion, which config is used to enable the tls feature between Pulsar & BookKeeper?

lhotari · 2024-02-23T13:02:07Z

Yes, there is the bug need to be fixed by apache/bookkeeper#4196. I have a quetion, which config is used to enable the tls feature between Pulsar & BookKeeper?

@thetumbled One possible reference to using TLS between Pulsar Broker & BookKeeper is to check how the Apache Pulsar Helm Chart handles the TLS configuration.

Regarding TLS, do you use TLS between clients and the broker? I guess that's the common configuration. I think that CompositeByteBufs could be created occasionally by Netty TLS handling also in that case, even without having BrokerEntryMetadata.

lhotari · 2024-02-23T13:04:37Z

This might be related to #21421

lhotari · 2024-02-23T13:05:38Z

There's also #21933

lhotari · 2024-02-23T16:01:27Z

A "double release" of a Netty ByteBuf or a thread safety issue could also cause such issues that the buffer's memory space is already used for other purposes causing the checksum calculation to fail. Netty ByteBufs aren't thread safe alone and it's possible that there could be an unsafe publication of the ByteBuf instance across thread boundaries (safe publication concept is explained here).
This type of issues are hard to debug and it's very hard to reason about the correctness in the Pulsar code base.
Passing an mutable object instance across thread boundaries is fine as long as it happens with one of the methods listed in "safe publication" (which is a concept to make it easier to reason about Java Memory Model's "happens before" rules).

In the "double release" issue the ByteBuf instance would get released twice so that it's already in use by another thread when it is released the second time. There's io.netty.util.ReferenceCountUtil#safeRelease(java.lang.Object) which should be called "unsafe release". It hides the problems, although it does log issues.
In the case of "double release" issues there should be either "Failed to release a message" or IllegalReferenceCountException in logs at some point when the double release results in a failure.

So there's a high chance that checksum errors could happen even after the apache/bookkeeper#4196 fix has been applied.

When looking into the "double release" problem, I did find bad code in the Pulsar code base that hides the problem:

pulsar/pulsar-common/src/main/java/org/apache/pulsar/common/api/raw/ReferenceCountedMessageMetadata.java

Line 52 in 82237d3

ref.setRefCnt(1);

I have a draft PR to address the setRefCnt(1) and adding a solution to catch the double release bugs.
#22110

thetumbled · 2024-02-26T07:46:49Z

Yes, there is the bug need to be fixed by apache/bookkeeper#4196. I have a quetion, which config is used to enable the tls feature between Pulsar & BookKeeper?

@thetumbled One possible reference to using TLS between Pulsar Broker & BookKeeper is to check how the Apache Pulsar Helm Chart handles the TLS configuration.

Regarding TLS, do you use TLS between clients and the broker? I guess that's the common configuration. I think that CompositeByteBufs could be created occasionally by Netty TLS handling also in that case, even without having BrokerEntryMetadata.

I add some log to detect whether the buffer is of type CompositeByteBuf:

        if (unwrapped instanceof CompositeByteBuf) {
            if (logger.isDebugEnabled()) {
                logger.debug("test Digesting a composite buffer for ledger-id: {}, entry-id: {}",
                        ledgerId, entryId);
            }
            ((CompositeByteBuf) unwrapped).forEach(this::update);
        } else {
            update(unwrapped);
        }

And no log above printed, we don't use a CompositeByteBuf object to calculate checksum, so the problem may not related with apache/bookkeeper#4196.

lhotari · 2024-02-26T08:10:52Z

@thetumbled Is there a specific reason to continue to use Pulsar 2.9.x ? The support for 2.9.x has ended a long time ago. Apache Pulsar PMC currently only supports officially 3.0.x and 3.2.x at the moment: https://pulsar.apache.org/contribute/release-policy/#supported-versions

thetumbled · 2024-02-26T08:18:33Z

@thetumbled Is there a specific reason to continue to use Pulsar 2.9.x ? The support for 2.9.x has ended a long time ago. Apache Pulsar PMC currently only supports officially 3.0.x and 3.2.x at the moment: https://pulsar.apache.org/contribute/release-policy/#supported-versions

We upgrade to Pulsar 2.9.x in 2022, and don't upgrade the version frequently but cherry pick some useful pr to fix the problem we meet.😂

lhotari · 2024-02-26T08:21:05Z

@thetumbled Is there a specific reason to continue to use Pulsar 2.9.x ? The support for 2.9.x has ended a long time ago. Apache Pulsar PMC currently only supports officially 3.0.x and 3.2.x at the moment: https://pulsar.apache.org/contribute/release-policy/#supported-versions

We upgrade to Pulsar 2.9.x in 2022, and don't upgrade the version frequently but cherry pick some useful pr to fix the problem we meet.😂

@thetumbled is there a specific reason to follow this practice?

thetumbled · 2024-02-26T08:28:50Z

@thetumbled Is there a specific reason to continue to use Pulsar 2.9.x ? The support for 2.9.x has ended a long time ago. Apache Pulsar PMC currently only supports officially 3.0.x and 3.2.x at the moment: https://pulsar.apache.org/contribute/release-policy/#supported-versions

We upgrade to Pulsar 2.9.x in 2022, and don't upgrade the version frequently but cherry pick some useful pr to fix the problem we meet.😂

@thetumbled is there a specific reason to follow this practice?

Upgrading pulsar version require lots of verifying work, there are multiple types of client and complcated production enviroment, we can't upgrade the version frequently. Some client may even use 2.8 version sdk.

lhotari · 2024-02-26T10:03:11Z

Upgrading pulsar version require lots of verifying work, there are multiple types of client and complcated production enviroment, we can't upgrade the version frequently. Some client may even use 2.8 version sdk.

Is there a way that Pulsar upgrades could be made easier? This is something that has been discussed on Pulsar Slack too.
It's not a sustainable solution that the Apache Pulsar project tries to support users on huge variations of software. It would be much more productive for everyone if the amount of configurations could be reduced so that we work on stabilizing and improving Pulsar.

zhanglistar · 2024-03-18T08:16:50Z

apache/bookkeeper#4194
apache/bookkeeper#4196
Note

thetumbled · 2024-04-03T10:34:56Z

We found that the corrupted entry is composed by two corrupted messages, apache/bookkeeper#4196 may not be able to result into such phenomenon.
Do you have any other clues about this problem? thanks. @lhotari

lhotari · 2024-04-03T10:46:09Z

We found that the corrupted entry is composed by two corrupted messages, apache/bookkeeper#4196 may not be able to result into such phenomenon. Do you have any other clues about this problem? thanks. @lhotari

@thetumbled I think @poorbarcode was looking into a similar problem recently and worked on some fixes.

lhotari mentioned this issue Apr 11, 2024

[Bug] Bookkeeper returns mac mismatch while reading #21421

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] [Broker] Entry digest does not match #22103

[Bug] [Broker] Entry digest does not match #22103

thetumbled commented Feb 23, 2024 •

edited

Loading

lhotari commented Feb 23, 2024

lhotari commented Feb 23, 2024

thetumbled commented Feb 23, 2024 •

edited

Loading

thetumbled commented Feb 23, 2024

lhotari commented Feb 23, 2024

thetumbled commented Feb 23, 2024 •

edited

Loading

lhotari commented Feb 23, 2024 •

edited

Loading

lhotari commented Feb 23, 2024

lhotari commented Feb 23, 2024

lhotari commented Feb 23, 2024

thetumbled commented Feb 26, 2024 •

edited

Loading

lhotari commented Feb 26, 2024

thetumbled commented Feb 26, 2024 •

edited

Loading

lhotari commented Feb 26, 2024

thetumbled commented Feb 26, 2024

lhotari commented Feb 26, 2024

zhanglistar commented Mar 18, 2024

thetumbled commented Apr 3, 2024

lhotari commented Apr 3, 2024

[Bug] [Broker] Entry digest does not match #22103

[Bug] [Broker] Entry digest does not match #22103

Comments

thetumbled commented Feb 23, 2024 • edited Loading

Search before asking

Version

Minimal reproduce step

What did you expect to see?

What did you see instead?

Anything else?

Are you willing to submit a PR?

lhotari commented Feb 23, 2024

lhotari commented Feb 23, 2024

thetumbled commented Feb 23, 2024 • edited Loading

thetumbled commented Feb 23, 2024

lhotari commented Feb 23, 2024

thetumbled commented Feb 23, 2024 • edited Loading

lhotari commented Feb 23, 2024 • edited Loading

lhotari commented Feb 23, 2024

lhotari commented Feb 23, 2024

lhotari commented Feb 23, 2024

thetumbled commented Feb 26, 2024 • edited Loading

lhotari commented Feb 26, 2024

thetumbled commented Feb 26, 2024 • edited Loading

lhotari commented Feb 26, 2024

thetumbled commented Feb 26, 2024

lhotari commented Feb 26, 2024

zhanglistar commented Mar 18, 2024

thetumbled commented Apr 3, 2024

lhotari commented Apr 3, 2024

thetumbled commented Feb 23, 2024 •

edited

Loading

thetumbled commented Feb 23, 2024 •

edited

Loading

thetumbled commented Feb 23, 2024 •

edited

Loading

lhotari commented Feb 23, 2024 •

edited

Loading

thetumbled commented Feb 26, 2024 •

edited

Loading

thetumbled commented Feb 26, 2024 •

edited

Loading