[fix][broker] Fix data corruption issues when TLS is enabled and optimize TLS between Pulsar client and brokers #22810

lhotari · 2024-05-30T16:22:29Z

Fixes #22601 #21892 #19460
This PR replaces #22760

Motivation

In Pulsar, there are multiple reported issues where the transferred output gets corrupted and fails with exceptions around invalid reader and writer index. One source of these issues are the ones which occur only when TLS is enabled between clients and Pulsar broker or between Pulsar broker and bookies.

In Pulsar, the sharing of ByteBuf instance happens in this case at least via the broker cache (RangeEntryCacheManagerImpl) and the pending reads manager (PendingReadsManager).

The SslHandler related issue was originally reported in Pulsar in 2018 with #2401 . The fix that time was #2464.
The ByteBuf .copy() method was used to copy the ByteBuf. There hasn't been a similar solution in Bookkeeper or Bookkeeper client to address corruption that is caused by Netty SslHandler.
One of the problems with .copy() is that it's unefficient. I have also created a PR to Netty to make SslHandler not mutate input buffers. The PR is netty/netty#14086 .

The Failed to peek sticky key from the message metadata java.lang.IllegalArgumentException: Invalid unknonwn tag type: 4 exceptions are caused by the SslHandler mutation issue between broker and bookies. It also corrupts the data that gets written to bookkeeper since bookkeeper doesn't check the checksum at writing time, only when it's retrieved from the storage.
java.lang.IndexOutOfBoundsException: readerIndex: 31215, writerIndex: 21324 (expected: 0 <= readerIndex <= writerIndex <= capacity(65536)) type of exceptions on the broker side are also symptoms of the same problem.

The root cause of such exceptions could also be different. A shared Netty ByteBuf must have at least have an independent view created with duplicate, slice or retainedDuplicate if the readerIndex is mutated.
The ByteBuf instance must also be properly shared in a thread safe way. Failing to do that could result in similar symptoms and this PR doesn't fix that.

Modifications

Remove the ByteBufPair.CopyingEncoder and make the ByteBufPair.Encoder suitable for both use cases
- A .retainedSlice() ByteBuf needs to be passed to SslHandler so that it doesn't get mutated. A deep copy isn't required.
- This solution is also more performant since there will be less memory copies.
Workaround an IntelliJ issue where it shows a red mark in PulsarChannelInitializer classes (casting to ChannelHandler fixes the issue).
Address the data corruption issue between broker and bookies by using .retainedSlice() for the input ByteBuf.
- refactor the code to reduce duplication and unnecessary methods

Documentation

doc
doc-required
doc-not-needed
doc-complete

…bled

lhotari · 2024-05-30T17:39:58Z

This PR doesn't yet address the root cause although I was already assuming that it does.

lhotari · 2024-05-30T23:02:56Z

It's possible that netty/netty#14086 is needed to address the issues.
I haven't found a workaround to the problems yet.

Tested with with Pulsar v3.2.3 patched to use Netty 4.1.111.Final-SNAPSHOT (without macos or arm64 native libraries)
v3.2.3...lhotari:pulsar:lh-pulsar-3.2.3-netty-4.1.111.Final-SNAPSHOT

git clone -b lh-dont-mutate-sslhandler-input-buffers-by-default https://github.com/lhotari/netty
cd netty
mvn clean install -DskipTests -Dcheckstyle.skip=true
mvn -pl transport-native-unix-common,transport-native-epoll install -DskipTests

cd ..
git clone -b lh-pulsar-3.2.3-netty-4.1.111.Final-SNAPSHOT https://github.com/lhotari/pulsar
cd pulsar
mvn -Pcore-modules,-main clean install -DskipTests -Dspotbugs.skip=true -Dcheckstyle.skip=true
# this is where the distribution is:
cp distribution/server/target/apache-pulsar-3.2.3-bin.tar.gz ~

The reproducer https://github.com/lhotari/pulsar-playground/tree/master/issues/issue22601/standalone_env doesn't reproduce the problem with Netty PR 14086 changes.

lhotari · 2024-06-01T19:10:46Z

Finally found the root cause of the TLS data corruption and IndexOutOfBounds issues: netty/netty#14086 (comment)

lhotari · 2024-06-03T14:55:16Z

I'll resume this PR after Netty 4.1.111.Final has released with the required fixes.

lhotari · 2024-06-11T17:45:29Z

Closing this PR since Netty 4.1.111.Final will address the problematic issue: #22892 . I'll create a separate PR to remove the obsolete CopyingEncoder once that is merged.

[fix][broker] Fix multiple transfer corruption issues when TLS is ena…

db0362d

…bled

lhotari added type/bug The PR fixed a bug or issue reported a bug ready-to-test release/2.10.7 release/2.11.5 release/3.1.5 release/3.3.1 release/3.0.6 release/3.2.4 labels May 30, 2024

lhotari requested review from merlimat, eolivelli, codelipenghui, dao-jun and poorbarcode May 30, 2024 16:22

github-actions bot added the doc-not-needed Your PR changes do not impact docs label May 30, 2024

This was referenced May 30, 2024

[fix][broker] Fix multiple transfer corruption issues when TLS is enabled #22760

Closed

Fix TLS stability issues with V2 protocol that caused data corruption apache/bookkeeper#4404

Merged

lhotari marked this pull request as draft May 30, 2024 16:52

This was referenced May 31, 2024

[Bug] Messages seems broken when got SSL exception in the producer #21933

Closed

[Bug] Msg backlog & unack msg remains when using acknowledgeAsync #21958

Open

[Bug] Cannot determine whether the message is a duplicate at this time #21892

Closed

lhotari closed this Jun 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[fix][broker] Fix data corruption issues when TLS is enabled and optimize TLS between Pulsar client and brokers #22810

[fix][broker] Fix data corruption issues when TLS is enabled and optimize TLS between Pulsar client and brokers #22810

lhotari commented May 30, 2024

lhotari commented May 30, 2024 •

edited

Loading

lhotari commented May 30, 2024

lhotari commented Jun 1, 2024

lhotari commented Jun 3, 2024

lhotari commented Jun 11, 2024

[fix][broker] Fix data corruption issues when TLS is enabled and optimize TLS between Pulsar client and brokers #22810

[fix][broker] Fix data corruption issues when TLS is enabled and optimize TLS between Pulsar client and brokers #22810

Conversation

lhotari commented May 30, 2024

Motivation

Modifications

Documentation

lhotari commented May 30, 2024 • edited Loading

lhotari commented May 30, 2024

lhotari commented Jun 1, 2024

lhotari commented Jun 3, 2024

lhotari commented Jun 11, 2024

lhotari commented May 30, 2024 •

edited

Loading