[improve][pip] PIP-393: Improve performance of Negative Acknowledgement #23601

Merged
merged 11 commits into from
Dec 5, 2024

@thetumbled thetumbled commented Nov 15, 2024

PIP: 393
Implementation PR: #23600.

Motivation

The current implementation of Negative Acknowledgement in Pulsar has several issues:

  • memory usage is high.
  • code execution efficiency is low.
  • the redelivery time is not accurate.
  • multiple negative acks for messages in the same entry (batch) interfere with each other.

All of these problems are severe and need to be solved.

Modifications

Refactor the NegativeAcksTracker to solve the above problems.

Space complexity of new data structure

The theoretical space complexity analysis below shows how much the new data structure improves on the current one.

Space complexity of ConcurrentLongLongPairHashMap

Before analyzing the new data structure, we need to know how much space the existing one takes before this PIP. For each entry we need to store four long fields (ledgerId, entryId, partitionIndex, timestamp), which takes 4*8=32 byte.
ConcurrentLongLongPairHashMap uses open addressing with linear probing to handle hash collisions, so it reserves redundant space to keep the collision rate low. Two configurations control how much redundant space is reserved: the fill factor and the idle factor. When the space utilization rises to the fill factor, the backing array doubles in size; when it drops to the idle factor, the backing array shrinks by half.
The default fill factor is 0.66 and the default idle factor is 0.15, which means the minimum space occupied by ConcurrentLongLongPairHashMap is 32/0.66*N byte ≈ 48N byte and the maximum is 32/0.15*N byte ≈ 213N byte, where N is the number of entries.
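
The two bounds above can be reproduced with a few lines of arithmetic. The following is only a sketch of that calculation, not code from the implementation PR; the class and method names are made up for illustration.

```java
final class HashMapSpaceBounds {
    // Four long fields per entry: ledgerId, entryId, partitionIndex, timestamp.
    static final int BYTES_PER_ENTRY = 4 * Long.BYTES;

    /** Backing-array bytes per stored entry at a given utilization (0 < utilization <= 1). */
    static double bytesPerEntry(double utilization) {
        return BYTES_PER_ENTRY / utilization;
    }

    public static void main(String[] args) {
        // Utilization equals the fill factor just before the array doubles (densest state)
        // and the idle factor just before it shrinks (sparsest state).
        System.out.printf("fill factor 0.66: %.1f byte/entry%n", bytesPerEntry(0.66));
        System.out.printf("idle factor 0.15: %.1f byte/entry%n", bytesPerEntry(0.15));
    }
}
```

The model only counts the backing array; object headers and per-section bookkeeping of the real map are ignored.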

Test data to verify this:
The map holds 1,000,000 entries, whose raw payload is 32*1000000/1024/1024 byte ≈ 30MB. The space utilization is 30/64 ≈ 0.47, which is in the range [0.15, 0.66].

Space complexity of new data structure

New data structure:

// timestamp -> ledgerId -> entryId
Long2ObjectSortedMap<Long2ObjectMap<Roaring64Bitmap>> map2 = new Long2ObjectAVLTreeMap<>();
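
To make the three-level nesting concrete, here is a minimal stand-in written against the JDK only. It is an assumption-laden sketch: TreeMap, HashMap and BitSet replace Long2ObjectAVLTreeMap, Long2ObjectOpenHashMap and Roaring64Bitmap, and the drainExpired method is a hypothetical illustration of how the sorted outer map could be used to collect due buckets, not the tracker's actual logic.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** JDK-only stand-in for the nested map: timestamp -> ledgerId -> entryIds. */
final class NackBucketsSketch {
    private final TreeMap<Long, Map<Long, BitSet>> buckets = new TreeMap<>();

    /** Records one entry under its (already bucketed) redelivery timestamp. */
    void add(long redeliverAtMs, long ledgerId, int entryId) {
        buckets.computeIfAbsent(redeliverAtMs, k -> new HashMap<>())
               .computeIfAbsent(ledgerId, k -> new BitSet())
               .set(entryId);
    }

    /** Removes every bucket with timestamp <= nowMs and returns its "ledgerId:entryId" pairs. */
    List<String> drainExpired(long nowMs) {
        List<String> due = new ArrayList<>();
        Map<Long, Map<Long, BitSet>> expired = buckets.headMap(nowMs, true);
        for (Map<Long, BitSet> byLedger : expired.values()) {
            byLedger.forEach((ledgerId, entries) ->
                    entries.stream().forEach(entryId -> due.add(ledgerId + ":" + entryId)));
        }
        expired.clear(); // headMap is a live view, so this removes the buckets from the TreeMap
        return due;
    }

    int bucketCount() {
        return buckets.size();
    }
}
```

Because the outer map is sorted by timestamp, all buckets that are due form a prefix of the map, which is the reason a sorted map is a natural fit for the first level.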

The space used by the new data structure depends on several factors: the message rate, the delivery time deviation the user accepts, and the maximum number of entries written to one ledger.

  • The Pulsar configuration managedLedgerMaxEntriesPerLedger=50000 determines the maximum number of entries that can be written to one ledger; we use the default value in this analysis.
  • The time deviation the user accepts: when the user accepts a 1024ms delivery time deviation, we can trim the lower 10 bits of the timestamp in ms, which groups 1024 consecutive timestamps into one bucket.

We will analyze the space used by one bucket and calculate the average space used by one entry.
Assume that the message rate is x msg/ms and that we trim y bits of the timestamp. One bucket then covers 2**y ms and contains M=x*2**y messages.
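
The bucketing arithmetic can be sketched as follows. With a rate of x msg/ms and y trimmed bits, one bucket spans 2**y ms and holds M=x*2**y messages, which matches the worked examples later in this description (for instance M=1*2**10=1024). The helper names are illustrative only.

```java
final class TimestampBuckets {
    /** Clears the lowest `bits` bits, so 2^bits consecutive milliseconds share one key. */
    static long trimLowerBits(long timestampMs, int bits) {
        return timestampMs & (-1L << bits);
    }

    /** Bucket width in milliseconds when y low bits are trimmed. */
    static long bucketWidthMs(int y) {
        return 1L << y;
    }

    /** Messages per bucket, M = x * 2^y, for a rate of x msg/ms. */
    static long messagesPerBucket(long x, int y) {
        return x * bucketWidthMs(y);
    }
}
```

For example, with y=10 the timestamps 1024 through 2047 all map to the key 1024, and 2048 starts the next bucket.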

  • For a single bucket, we only need to store one timestamp, which takes 8 byte.
  • Then we need to store the ledgerId. When M is greater than 50000 (managedLedgerMaxEntriesPerLedger), the ledger will switch. There are L=ceil(M/50000) ledgers, which take 8*L byte.
  • Finally, we analyze how much space the entry ids take. As there are L=ceil(M/50000) ledgers, there are L bitmaps to store, which take L*size(bitmap). The total space consumed by the new data structure is 8 byte + 8L byte + L*size(bitmap).

Because size(bitmap) is far greater than 8 byte, we can ignore the first two terms. This gives the formula for the space consumed by one bucket: D=L*size(bitmap)=ceil(M/50000)*size(bitmap).

Entry ids are stored in a Roaring64Bitmap; for simplicity we can analyze it as a RoaringBitmap, because the max entry id is 49999, far below 4294967296 (2^32, roughly 2 * Integer.MAX_VALUE), the number of distinct values a RoaringBitmap can store. The space consumed by a RoaringBitmap depends on how many elements it contains: when the bitmap holds fewer than 4096 elements, it takes 4N byte, where N is the number of elements; when it holds more than 4096 elements, it takes a fixed 8KB.
Then we get the final result:

  • when M>50000, D = ceil(M/50000)*size(bitmap) ~= M/50000 * 8KB = M/50000 * 8 * 1024 byte = 0.163*M byte, so each entry takes 0.163 byte on average.
  • when 4096<M<50000, D = ceil(M/50000)*size(bitmap) = 1 * 8KB = 8KB, so each entry takes 8*1024/M=8192/M byte on average.
  • when M<4096, D = ceil(M/50000)*size(bitmap) = 1 * 4*M byte = 4*M byte, so each entry takes 4 byte on average.
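
The piecewise result can be written as a small function. This is a sketch of the simplified model stated in this description (4 byte per element below 4096 elements, a fixed 8KB otherwise, timestamp and ledgerId overhead ignored), not a measurement of real RoaringBitmap memory; all names are hypothetical.

```java
final class BucketSpaceModel {
    static final long MAX_ENTRIES_PER_LEDGER = 50_000; // managedLedgerMaxEntriesPerLedger
    static final long ARRAY_LIMIT = 4096;              // below this: 4 byte per element
    static final long FIXED_BITMAP_BYTES = 8 * 1024;   // otherwise: fixed 8KB per bitmap

    /** size(bitmap) as modeled in the text, for a bucket holding m messages. */
    static long bitmapBytes(long m) {
        return m < ARRAY_LIMIT ? 4 * m : FIXED_BITMAP_BYTES;
    }

    /** D = ceil(M / 50000) * size(bitmap). */
    static long bucketBytes(long m) {
        long ledgers = (m + MAX_ENTRIES_PER_LEDGER - 1) / MAX_ENTRIES_PER_LEDGER;
        return ledgers * bitmapBytes(m);
    }

    /** Average bytes per entry, D / M. */
    static double bytesPerEntry(long m) {
        return (double) bucketBytes(m) / m;
    }
}
```

Evaluating the model gives 4.0 byte per entry for M=1024, 0.32 for M=51200 and 0.176 for M=512000. The 0.163 figure comes from approximating ceil(M/50000) by M/50000; when the ceiling is kept exact, a bucket that only slightly exceeds 50000 messages is charged for two bitmaps, which roughly doubles its per-entry cost.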

Conclusion

  • The space complexity of ConcurrentLongLongPairHashMap is 48N byte in the best case and 213N byte in the worst case, where N is the number of entries.
  • The space complexity of the new data structure is determined by the total number of messages in one bucket, M.
    • when M>50000, the space complexity is 0.163N byte.
    • when 4096<M<50000, the space complexity is 8192/M * N byte.
    • when M<4096, the space complexity is 4N byte.

Test data

The following experiments verify the analysis above.
Test code:

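import java.io.IOException;

import it.unimi.dsi.fastutil.longs.Long2ObjectAVLTreeMap;
import it.unimi.dsi.fastutil.longs.Long2ObjectMap;
import it.unimi.dsi.fastutil.longs.Long2ObjectOpenHashMap;
import it.unimi.dsi.fastutil.longs.Long2ObjectSortedMap;
import org.apache.pulsar.common.util.collections.ConcurrentLongLongPairHashMap;
import org.roaringbitmap.longlong.Roaring64Bitmap;

// The wrapper class name is arbitrary; it only makes the snippet compilable.
public class NegativeAckSpaceTest {
    // Clear the lowest `bits` bits so that 2^bits consecutive timestamps share one bucket.
    static long trimLowerBit(long timestamp, int bits) {
        return timestamp & (-1L << bits);
    }

    public static void main(String[] args) throws IOException {
        // Old structure: (ledgerId, entryId) -> (partitionIndex, timestamp)
        ConcurrentLongLongPairHashMap map1 = ConcurrentLongLongPairHashMap.newBuilder()
                .autoShrink(true)
                .concurrencyLevel(16)
                .build();
        // New structure: timestamp -> ledgerId -> entryId. The batch index is not needed; if
        // different messages have different timestamps, there will be multiple entries in the map.
        // AVL Tree -> LongOpenHashMap -> Roaring64Bitmap
        // There are many timestamps, a few ledgerIds and many entryIds.
        Long2ObjectSortedMap<Long2ObjectMap<Roaring64Bitmap>> map2 = new Long2ObjectAVLTreeMap<>();

        int trimLowerBits = 10;
        long numMessages = 1000000, entriesPerLedger = 1000, numLedgers = numMessages / entriesPerLedger;
        long ledgerId, entryId, timestamp = System.currentTimeMillis(), tmp = 0;
        for (long i = 0; i < numLedgers; i++) {
            ledgerId = 10000 + i;
            for (long j = 0; j < entriesPerLedger; j++) {
                entryId = j;
                // 1ms per message
                timestamp++;
                map1.put(ledgerId, entryId, 0L, timestamp);

                tmp = trimLowerBit(timestamp, trimLowerBits);
                map2.computeIfAbsent(tmp, k -> new Long2ObjectOpenHashMap<>())
                    .computeIfAbsent(ledgerId, k -> new Roaring64Bitmap())
                    .add(entryId);
            }
        }
        // The memory measurement of map1 and map2 is not part of this snippet.
    }
}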

x=1, y=10

Let x=1 (1 msg/ms) and y=10 (trim 10 bits of the timestamp). Then M=1*2**10=1024<4096. According to the result above, we predict that 1,000,000 entries consume 4*1000000/1024/1024=3.81MB.

The actual space consumed is 3.35MB, which is close to the theoretical value.

x=50, y=10

Here we try to reach the best space complexity. M=50*2**10=51200>50000, so we predict that each entry consumes 0.163 byte on average.

int trimLowerBits = 10, messagePerMs = 50, tick=0;
long numMessages = 1000000, entriesPerLedger = 50000, numLedgers = numMessages / entriesPerLedger;

However, the measured result is 0.33*1024*1024/1000000 ≈ 0.35 byte per entry, about twice the theoretical value of 0.163.

Printing the size of each bitmap explains why: many bitmaps still hold far fewer than 50000 entries, which lowers the space utilization.
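
The effect can be reproduced without the real collections by counting how many entries fall into each (bucket, ledger) pair, which is the cardinality each bitmap would reach. The sketch below is a JDK-only simulation under one loud assumption: the clock starts at 0, so bucket boundaries are aligned differently than in the original run, which started from System.currentTimeMillis(). Names are illustrative.

```java
import java.util.HashMap;
import java.util.IntSummaryStatistics;
import java.util.Map;

final class BitmapCardinalitySim {
    /** Counts entries per (bucket, ledger) pair, i.e. the cardinality each bitmap would reach. */
    static IntSummaryStatistics simulate(long numMessages, long messagesPerMs,
                                         long entriesPerLedger, int trimLowerBits) {
        Map<Long, Map<Long, Integer>> counts = new HashMap<>();
        for (long i = 0; i < numMessages; i++) {
            long timestampMs = i / messagesPerMs;        // assumption: clock starts at 0
            long bucket = timestampMs >> trimLowerBits;  // same grouping as trimming the low bits
            long ledger = i / entriesPerLedger;
            counts.computeIfAbsent(bucket, k -> new HashMap<>()).merge(ledger, 1, Integer::sum);
        }
        return counts.values().stream()
                .flatMap(perLedger -> perLedger.values().stream())
                .mapToInt(Integer::intValue)
                .summaryStatistics();
    }

    public static void main(String[] args) {
        // x=50, y=10, 1,000,000 messages, 50000 entries per ledger
        System.out.println(simulate(1_000_000, 50, 50_000, 10));
    }
}
```

Under these assumptions the x=50 configuration produces 39 bitmaps with cardinalities between 1200 and 50000: each 51200-message bucket straddles a ledger boundary, so it is split into one large and one small bitmap, and both are charged the fixed 8KB in the model above.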

x=500, y=10

int trimLowerBits = 10, messagePerMs = 500, tick=0;
long numMessages = 1000000, entriesPerLedger = 50000, numLedgers = numMessages / entriesPerLedger;

All bitmaps contain almost 50000 entries.
Each entry takes 0.18*1024*1024/1000000 ≈ 0.19 byte, which is close to the theoretical value.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@github-actions github-actions bot added PIP doc-not-needed Your PR changes do not impact docs labels Nov 15, 2024
@thetumbled

This comment was marked as outdated.

@thetumbled thetumbled added doc-required Your PR changes impact docs and you will update later. release/4.0.1 labels Nov 22, 2024
@github-actions github-actions bot removed the doc-required Your PR changes impact docs and you will update later. label Nov 22, 2024
@thetumbled
Member Author

thetumbled commented Nov 26, 2024

I have added the space complexity analysis of the new data structure; please review it again, thanks. @lhotari @nodece @BewareMyPower @poorbarcode @codelipenghui @dao-jun

@lhotari
Member

lhotari commented Nov 26, 2024

> I have added the space complexity analysis of the new data structure; please review it again, thanks. @lhotari @nodece @BewareMyPower @poorbarcode @codelipenghui @dao-jun

Great analysis @thetumbled . Please move the analysis from the PR description to the PIP document itself.

One small detail (which doesn't impact the analysis or solution): "Entry id is stored in a Roaring64Bitmap, for simplicity we can replace it with RoaringBitmap, as the max entry id is 49999, which is smaller than 65535."
Isn't the value 65535 irrelevant since RoaringBitmap supports storing 4294967296 (2 * Integer.MAX_VALUE) integers, explained in https://github.com/RoaringBitmap/RoaringBitmap/blob/cca90c986d5c0096bbeabb5f968833bf12c28c0e/roaringbitmap/src/main/java/org/roaringbitmap/RoaringBitmap.java#L46-L49 . Roaring64Bitmap can store up to 9223372036854775807 long integers (2 * Long.MAX_VALUE).

@lhotari lhotari changed the title [improve][client] PIP-393: Improve performance of Negative Acknowledgement [improve][pip] PIP-393: Improve performance of Negative Acknowledgement Nov 26, 2024
@lhotari
Member

lhotari commented Nov 26, 2024

@thetumbled The title of any PR containing PIP documentation should include [pip] to distinguish it from other types of PRs. I made that change to the title.

Member

@lhotari lhotari left a comment

The PIP-393 document should include the high-level plan for avoiding an increase in the size of the Pulsar client by the size of the fastutil jar file. The fastutil jar file is very large, 23MB. We use only a few classes of fastutil. There's the fastutil-core library, which is smaller, about 6MB. However, that is also relatively large, and using fastutil-core will introduce another problem on the broker side, since there's already the fastutil jar, which also includes the fastutil-core jar classes. It's necessary to design a proper shading solution as part of this PIP design and implementation.
More details in the thread #23600 (comment)

@thetumbled
Member Author

> > I have added the space complexity analysis of the new data structure; please review it again, thanks. @lhotari @nodece @BewareMyPower @poorbarcode @codelipenghui @dao-jun
>
> Great analysis @thetumbled . Please move the analysis from the PR description to the PIP document itself.
>
> One small detail (which doesn't impact the analysis or solution): "Entry id is stored in a Roaring64Bitmap, for simplicity we can replace it with RoaringBitmap, as the max entry id is 49999, which is smaller than 65535." Isn't the value 65535 irrelevant since RoaringBitmap supports storing 4294967296 (2 * Integer.MAX_VALUE) integers, explained in https://github.com/RoaringBitmap/RoaringBitmap/blob/cca90c986d5c0096bbeabb5f968833bf12c28c0e/roaringbitmap/src/main/java/org/roaringbitmap/RoaringBitmap.java#L46-L49 . Roaring64Bitmap can store up to 9223372036854775807 long integers (2 * Long.MAX_VALUE).

You are right, not 65535, but 4294967296 (2 * Integer.MAX_VALUE).

@thetumbled
Member Author

> The PIP-393 document should include the high-level plan for avoiding an increase in the size of the Pulsar client by the size of the fastutil jar file. The fastutil jar file is very large, 23MB. We use only a few classes of fastutil. There's the fastutil-core library, which is smaller, about 6MB. However, that is also relatively large, and using fastutil-core will introduce another problem on the broker side, since there's already the fastutil jar, which also includes the fastutil-core jar classes. It's necessary to design a proper shading solution as part of this PIP design and implementation. More details in the thread #23600 (comment)

Thanks for the review; I have added it to the high-level design.

@thetumbled thetumbled requested a review from lhotari December 2, 2024 08:08
@thetumbled
Member Author

The vote is complete; please review this PR again, thanks. @lhotari @nodece @eolivelli

@thetumbled thetumbled requested a review from lhotari December 2, 2024 12:10
Member

@lhotari lhotari left a comment

LGTM, thanks for driving the effort @thetumbled. Great work!

@lhotari lhotari merged commit 04cec0f into apache:master Dec 5, 2024
20 checks passed
@lhotari
Member

lhotari commented Dec 6, 2024

Release labels shouldn't be added to PIP document PRs, since we only maintain PIP documents in the master branch. Release labels are used for cherry-picking.

Labels
doc-not-needed Your PR changes do not impact docs PIP

2 participants