[fix][broker] Fix data corruption issues when TLS is enabled and optimize TLS between Pulsar client and brokers #22810
Fixes #22601 #21892 #19460
This PR replaces #22760
Motivation
In Pulsar, there are multiple reported issues where the transferred output gets corrupted and fails with exceptions around invalid reader and writer index. One group of these issues occurs only when TLS is enabled between clients and the Pulsar broker or between the Pulsar broker and bookies.
In this case, ByteBuf instances are shared in Pulsar at least via the broker cache (RangeEntryCacheManagerImpl) and the pending reads manager (PendingReadsManager).
The SslHandler-related issue was originally reported in Pulsar in 2018 with #2401. The fix at that time was #2464.
The ByteBuf .copy() method was used to copy the ByteBuf. There hasn't been a similar solution in BookKeeper or the BookKeeper client to address corruption that is caused by Netty's SslHandler.
One of the problems with .copy() is that it's inefficient. I have also created a PR to Netty to make SslHandler not mutate input buffers. The PR is netty/netty#14086.
The "Failed to peek sticky key from the message metadata java.lang.IllegalArgumentException: Invalid unknonwn tag type: 4" exceptions are caused by the SslHandler mutation issue between broker and bookies. It also corrupts the data that gets written to BookKeeper, since BookKeeper doesn't check the checksum at write time, only when the data is retrieved from storage.
Exceptions of the type "java.lang.IndexOutOfBoundsException: readerIndex: 31215, writerIndex: 21324 (expected: 0 <= readerIndex <= writerIndex <= capacity(65536))" on the broker side are also symptoms of the same problem.
The root cause of such exceptions could also be different. A shared Netty ByteBuf must at least have an independent view created with duplicate, slice or retainedDuplicate if the readerIndex is mutated.
The ByteBuf instance must also be properly shared in a thread-safe way. Failing to do that could result in similar symptoms, and this PR doesn't fix that.
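The "independent view" requirement can be illustrated with a minimal sketch. To stay runnable without Netty on the classpath, it uses java.nio.ByteBuffer from the JDK as an analogy, not Netty's ByteBuf: ByteBuffer.duplicate() shares the underlying bytes but has its own position, much as ByteBuf.duplicate(), slice() and retainedSlice() share content while keeping their own readerIndex and writerIndex. The class name SharedBufferViewDemo and the drain helper are hypothetical, and the reference counting that the retained variants add in Netty has no counterpart here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class SharedBufferViewDemo {

    // Hypothetical consumer: reads every remaining byte, advancing the
    // position of whatever buffer it is handed (comparable to a handler
    // that advances readerIndex on its input).
    static int drain(ByteBuffer buf) {
        int consumed = 0;
        while (buf.hasRemaining()) {
            buf.get();
            consumed++;
        }
        return consumed;
    }

    // The shared buffer itself is handed to the consumer, so its
    // position is mutated and later readers see nothing left to read.
    public static int remainingAfterDirectUse(byte[] data) {
        ByteBuffer shared = ByteBuffer.wrap(data);
        drain(shared);
        return shared.remaining();
    }

    // Only an independent view is handed to the consumer, so the
    // shared buffer's position is untouched.
    public static int remainingAfterViewUse(byte[] data) {
        ByteBuffer shared = ByteBuffer.wrap(data);
        drain(shared.duplicate());
        return shared.remaining();
    }

    public static void main(String[] args) {
        byte[] data = "entry-payload".getBytes(StandardCharsets.US_ASCII);
        System.out.println("direct use, remaining: " + remainingAfterDirectUse(data));
        System.out.println("view use, remaining:   " + remainingAfterViewUse(data));
    }
}
```

A view isolates only the indices. The bytes are still shared, so this is not a substitute for a deep copy when a consumer writes into the buffer's content.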
Modifications
A .retainedSlice() ByteBuf needs to be passed to SslHandler so that the original doesn't get mutated. A deep copy isn't required.
.retainedSlice() is used for the input ByteBuf.
Documentation
doc
doc-required
doc-not-needed
doc-complete