From d3f58ead15ddc1b24b0df67c486808eea086b13c Mon Sep 17 00:00:00 2001
From: thetumbled <843221020@qq.com>
Date: Fri, 15 Nov 2024 15:05:17 +0800
Subject: [PATCH 01/11] add pip-393.

---
 pip/pip-393.md | 161 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 161 insertions(+)
 create mode 100644 pip/pip-393.md
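
Note on the "TimeStamp Bucket" section added by this patch: the bucketing idea
can be sketched with plain JDK collections. This is a minimal sketch, not the
proposed implementation: `TreeMap`, `HashMap` and `TreeSet` are stand-ins for the
fastutil types named in the PIP, and the class name `NackBucketSketch` and the
method `add` are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class NackBucketSketch {
    // Sorted by bucketed timestamp: timestamp -> ledgerId -> entry ids.
    // JDK collections are used here only as stand-ins for the fastutil types.
    final TreeMap<Long, Map<Long, Set<Long>>> nacked = new TreeMap<>();

    // Clear the lowest `bits` bits so that nearby timestamps share one bucket.
    static long trimLowerBit(long timestamp, int bits) {
        return timestamp & (-1L << bits);
    }

    // Hypothetical helper: record one nacked entry under its bucket.
    void add(long redeliverAtMs, int bits, long ledgerId, long entryId) {
        long bucket = trimLowerBit(redeliverAtMs, bits);
        nacked.computeIfAbsent(bucket, k -> new HashMap<>())
              .computeIfAbsent(ledgerId, k -> new TreeSet<>())
              .add(entryId);
    }

    public static void main(String[] args) {
        NackBucketSketch s = new NackBucketSketch();
        s.add(0b1000L, 1, 7L, 1L);
        s.add(0b1001L, 1, 7L, 2L); // same bucket as 0b1000 after trimming 1 bit
        System.out.println(s.nacked); // prints {8={7=[1, 2]}}
    }
}
```

Both timestamps end up under the single key 8, so the outer map holds one
bucket instead of two.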

diff --git a/pip/pip-393.md b/pip/pip-393.md
new file mode 100644
index 0000000000000..c15782a997019
--- /dev/null
+++ b/pip/pip-393.md
@@ -0,0 +1,161 @@
+
+# PIP-393: Improve performance of Negative Acknowledgement
+
+# Background knowledge
+
+Negative Acknowledgement is a feature in Pulsar that allows consumers to trigger the redelivery 
+of a message after some time when they fail to process it. The redelivery delay is determined by
+the `negativeAckRedeliveryDelay` configuration.
+
+When user calls `negativeAcknowledge` method, `NegativeAcksTracker` in `ConsumerImpl` will add an entry
+into the map `NegativeAcksTracker.nackedMessages`, mapping the message ID to the redelivery time.
+When the redelivery time comes, `NegativeAcksTracker` will send a redelivery request to the broker to
+redeliver the message.
+
+# Motivation
+
+There are many issues with the current implementation of Negative Acknowledgement in Pulsar:
+- the memory occupation is high.
+- the code execution efficiency is low.
+- the redelivery time is not accurate.
+- multiple negative acks for messages in the same entry (batch) interfere with each other.
+All of these problems are severe and need to be solved.
+
+## Memory occupation is high
+After the improvement of https://github.com/apache/pulsar/pull/23582, we have reduced the memory occupation
+of `NegativeAcksTracker` by more than half by replacing `HashMap` with `ConcurrentLongLongPairHashMap`. With 1 million entries,
+the memory occupation decreases from 178MB to 64MB. With 10 million entries, it decreases from 1132MB to 512MB.
+The average memory occupation per entry decreases from 1132MB/10000000=118 bytes to 512MB/10000000=53 bytes.
+
+But it is not enough. Assuming that we negatively acknowledge 10,000 messages per second with a 1h redelivery delay for each message,
+the memory occupation of `NegativeAcksTracker` will be `3600*10000*53/1024/1024/1024=1.77GB`. If the delay is 5h,
+the required memory is `3600*10000*53/1024/1024/1024*5=8.88GB`, which grows too fast.
+
+## Code execution efficiency is low
+Currently, each time the timer task is triggered, it will iterate all the entries in `NegativeAcksTracker.nackedMessages`, 
+which is unnecessary. We can sort entries by timestamp and only iterate the entries that need to be redelivered.
+
+## Redelivery time is not accurate
+Currently, the redelivery time is controlled by the `timerIntervalNanos`, which is 1/3 of the `negativeAckRedeliveryDelay`.
+That means, if the `negativeAckRedeliveryDelay` is 1h, the redelivery time will be 20min, which is unacceptable.
+
+## Multiple negative ack for messages in the same entry(batch) will interfere with each other
+Currently, `NegativeAcksTracker#nackedMessages` map `(ledgerId, entryId)` to `timestamp`, which means multiple nacks from messages 
+in the same batch share a single timestamp.
+If we ask for msg1 to be redelivered 10s later and then for msg2 to be redelivered 20s later, both messages are redelivered together 20s later.
+msg1 will not be redelivered 10s later, as the timestamp recorded in `NegativeAcksTracker#nackedMessages` is overwritten by the second
+nack call.
+
+
+# Goals
+
+Refactor the `NegativeAcksTracker` to solve the above problems.
+
+
+To avoid interation of all entries in `NegativeAcksTracker.nackedMessages`, we use a sorted map to store the entries.
+To reduce memory occupation, we use util class provided by fastutil(https://fastutil.di.unimi.it/docs/), and design 
+a new algorithm to store the entries, reduce the memory occupation to 1/4 of the current implementation, that is 
+each entry only occupy 13 bytes in my test.
+
+
+# Detailed Design
+
+## Design & Implementation Details
+
+### New Data Structure
+Use following data structure to store the entries:
+```java
+Long2ObjectSortedMap<Long2ObjectMap<LongSet>> nackedMessages = new Long2ObjectAVLTreeMap<>();
+```
+mapping `timestamp -> ledgerId -> entryId`.
+We need to sort timestamp in ascending order, so we use a sorted map to map timestamp to `ledgerId -> entryId` map.
+As there will many entries in the map, we use `Long2ObjectAVLTreeMap` instead of `Long2ObjectRBTreeMap`.
+As for the inner map, we use `Long2ObjectMap` to map `ledgerId` to `entryId` because we don't need to keep the order of `ledgerId`.
+`Long2ObjectOpenHashMap` will be satisfied.
+All entry id for the same ledger id will be stored in a `LongSet`, which is a `LongOpenHashSet`, which is a type-specific set 
+provided by fastutil and eliminate the overhead of `Long` boxing.
+
+## TimeStamp Bucket
+Timestamp in ms is used as the key of the map. As most of the use cases don't require that the precision of the delay time is 1ms,
+we can make the timestamp bucketed, that is, we can trim the lower bit of the timestamp to map the timestamp to a bucket.
+For example, if we trim the lower 1 bit of the timestamp, the timestamp 0b1000 and 0b1001 will be mapped to the same bucket 0b1000.
+Then all messages in the same bucket will be redelivered at the same time.
+If the user can accept a 1024ms deviation of the redelivery time, we can trim the lower 10 bits of the timestamp, which groups a lot
+of entries into the same bucket and reduces the memory occupation.
+
+following code snippet will be helpful to understand the design:
+```java
+    static long trimLowerBit(long timestamp, int bits) {
+        return timestamp & (-1L << bits);
+    }
+```
+
+```java
+Long2ObjectSortedMap<Long2ObjectMap<LongSet>> map = new Long2ObjectAVLTreeMap<>();
+Long2ObjectMap<LongSet> ledgerMap = new Long2ObjectOpenHashMap<>();
+LongSet entrySet = new LongOpenHashSet();
+entrySet.add(entryId);
+ledgerMap.put(ledgerId, entrySet);
+map.put(timestamp, ledgerMap);
+```
+
+With such kind of design, we can reduce the 
+
+
+
+## Public-facing Changes
+
+<!--
+Describe the additions you plan to make for each public facing component. 
+Remove the sections you are not changing.
+Clearly mark any changes which are BREAKING backward compatability.
+-->
+
+### Configuration
+
+
+
+# Security Considerations
+<!--
+A detailed description of the security details that ought to be considered for the PIP. This is most relevant for any new HTTP endpoints, new Pulsar Protocol Commands, and new security features. The goal is to describe details like which role will have permission to perform an action.
+
+An important aspect to consider is also multi-tenancy: Does the feature I'm adding have the permissions / roles set in such a way that prevent one tenant accessing another tenant's data/configuration? For example, the Admin API to read a specific message for a topic only allows a client to read messages for the target topic. However, that was not always the case. CVE-2021-41571 (https://github.com/apache/pulsar/wiki/CVE-2021-41571) resulted because the API was incorrectly written and did not properly prevent a client from reading another topic's messages even though authorization was in place. The problem was missing input validation that verified the requested message was actually a message for that topic. The fix to CVE-2021-41571 was input validation. 
+
+If there is uncertainty for this section, please submit the PIP and request for feedback on the mailing list.
+-->
+
+# Backward & Forward Compatibility
+
+## Upgrade
+
+<!--
+Specify the list of instructions, if there are such, needed to perform before/after upgrading to Pulsar version containing this feature.
+-->
+
+## Downgrade / Rollback
+
+<!--
+Describe a cookbook detailing the steps required to rollback Pulsar to previous version *without* this feature.
+-->
+
+## Pulsar Geo-Replication Upgrade & Downgrade/Rollback Considerations
+
+<!--
+Describe what needs to be considered in Pulsar Geo-Replication in the upgrade and possible downgrade/rollback of this feature.
+-->
+
+# Alternatives
+
+<!--
+If there are alternatives that were already considered by the authors or, after the discussion, by the community, and were rejected, please list them here along with the reason why they were rejected.
+-->
+
+# General Notes
+
+# Links
+
+<!--
+Updated afterwards
+-->
+* Mailing List discussion thread:
+* Mailing List voting thread:

From ab9e6aad0af3b842282403c81f785d575f6b2a59 Mon Sep 17 00:00:00 2001
From: thetumbled <843221020@qq.com>
Date: Fri, 15 Nov 2024 15:51:20 +0800
Subject: [PATCH 02/11] add doc.

---
 pip/pip-393.md | 73 +++++++++++++++++---------------------------------
 1 file changed, 25 insertions(+), 48 deletions(-)

diff --git a/pip/pip-393.md b/pip/pip-393.md
index c15782a997019..8d31473f22cb8 100644
--- a/pip/pip-393.md
+++ b/pip/pip-393.md
@@ -51,12 +51,10 @@ nack call.
 
 Refactor the `NegativeAcksTracker` to solve the above problems.
 
-
 To avoid interation of all entries in `NegativeAcksTracker.nackedMessages`, we use a sorted map to store the entries.
 To reduce memory occupation, we use util class provided by fastutil(https://fastutil.di.unimi.it/docs/), and design 
-a new algorithm to store the entries, reduce the memory occupation to 1/4 of the current implementation, that is 
-each entry only occupy 13 bytes in my test.
-
+a new algorithm to store the entries, which can reduce the memory occupation to less than 1% of that of the current implementation
+(the actual effect depends on the configuration and the throughput).
 
 # Detailed Design
 
@@ -65,17 +63,17 @@ each entry only occupy 13 bytes in my test.
 ### New Data Structure
 Use following data structure to store the entries:
 ```java
-Long2ObjectSortedMap<Long2ObjectMap<LongSet>> nackedMessages = new Long2ObjectAVLTreeMap<>();
+Long2ObjectSortedMap<Long2ObjectMap<Roaring64Bitmap>> nackedMessages = new Long2ObjectAVLTreeMap<>();
 ```
 mapping `timestamp -> ledgerId -> entryId`.
 We need to sort timestamp in ascending order, so we use a sorted map to map timestamp to `ledgerId -> entryId` map.
 As there will many entries in the map, we use `Long2ObjectAVLTreeMap` instead of `Long2ObjectRBTreeMap`.
 As for the inner map, we use `Long2ObjectMap` to map `ledgerId` to `entryId` because we don't need to keep the order of `ledgerId`.
 `Long2ObjectOpenHashMap` will be satisfied.
-All entry id for the same ledger id will be stored in a `LongSet`, which is a `LongOpenHashSet`, which is a type-specific set 
-provided by fastutil and eliminate the overhead of `Long` boxing.
+All entry ids for the same ledger id are stored in a bitmap, as we only care about whether an entry id is present.
 
-## TimeStamp Bucket
+
+### TimeStamp Bucket
 Timestamp in ms is used as the key of the map. As most of the use cases don't require that the precision of the delay time is 1ms,
 we can make the timestamp bucketed, that is, we can trim the lower bit of the timestamp to map the timestamp to a bucket.
 For example, if we trim the lower 1 bit of the timestamp, the timestamp 0b1000 and 0b1001 will be mapped to the same bucket 0b1000.
@@ -91,7 +89,7 @@ following code snippet will be helpful to understand the design:
 ```
 
 ```java
-Long2ObjectSortedMap<Long2ObjectMap<LongSet>> map = new Long2ObjectAVLTreeMap<>();
+Long2ObjectSortedMap<Long2ObjectMap<Roaring64Bitmap>> map = new Long2ObjectAVLTreeMap<>();
 Long2ObjectMap<LongSet> ledgerMap = new Long2ObjectOpenHashMap<>();
 LongSet entrySet = new LongOpenHashSet();
 entrySet.add(entryId);
@@ -99,58 +97,37 @@ ledgerMap.put(ledgerId, entrySet);
 map.put(timestamp, ledgerMap);
 ```
 
-With such kind of design, we can reduce the 
-
+With such kind of design, we can reduce the memory occupation of `NegativeAcksTracker` to 1% less than the current implementation.
+The detailed test result will be provided in the PR.
 
 
-## Public-facing Changes
-
-<!--
-Describe the additions you plan to make for each public facing component. 
-Remove the sections you are not changing.
-Clearly mark any changes which are BREAKING backward compatability.
--->
-
 ### Configuration
 
+Add a new configuration `negativeAckPrecisionBitCnt` to control the precision of the redelivery time.
+```
+@ApiModelProperty(
+            name = "negativeAckPrecisionBitCnt",
+            value = "The redelivery time precision bit count. The lower bits of the redelivery time will be\n" + 
+                "trimmed to reduce the memory occupation. The default value is 8, which means the redelivery time\n" +
+                "will be bucketed by 256ms. In worst cases, the redelivery time will be 512ms earlier(no later)\n" +
+                "than the expected time. If the value is 0, the redelivery time will be accurate to ms.".
+    )
+    private long negativeAckPrecisionBitCnt = 8;
+```
+The higher the value, the more entries will be grouped into the same bucket, the less memory occupation, the less accurate the redelivery time.
+Default value is 8, which means the redelivery time will be bucketed by 256ms. In worst cases, the redelivery time will be 512ms earlier(no later)
+than the expected time.
 
 
-# Security Considerations
-<!--
-A detailed description of the security details that ought to be considered for the PIP. This is most relevant for any new HTTP endpoints, new Pulsar Protocol Commands, and new security features. The goal is to describe details like which role will have permission to perform an action.
-
-An important aspect to consider is also multi-tenancy: Does the feature I'm adding have the permissions / roles set in such a way that prevent one tenant accessing another tenant's data/configuration? For example, the Admin API to read a specific message for a topic only allows a client to read messages for the target topic. However, that was not always the case. CVE-2021-41571 (https://github.com/apache/pulsar/wiki/CVE-2021-41571) resulted because the API was incorrectly written and did not properly prevent a client from reading another topic's messages even though authorization was in place. The problem was missing input validation that verified the requested message was actually a message for that topic. The fix to CVE-2021-41571 was input validation. 
-
-If there is uncertainty for this section, please submit the PIP and request for feedback on the mailing list.
--->
-
 # Backward & Forward Compatibility
 
 ## Upgrade
 
-<!--
-Specify the list of instructions, if there are such, needed to perform before/after upgrading to Pulsar version containing this feature.
--->
+User can upgrade to the new version without any compatibility issue.
 
 ## Downgrade / Rollback
 
-<!--
-Describe a cookbook detailing the steps required to rollback Pulsar to previous version *without* this feature.
--->
-
-## Pulsar Geo-Replication Upgrade & Downgrade/Rollback Considerations
-
-<!--
-Describe what needs to be considered in Pulsar Geo-Replication in the upgrade and possible downgrade/rollback of this feature.
--->
-
-# Alternatives
-
-<!--
-If there are alternatives that were already considered by the authors or, after the discussion, by the community, and were rejected, please list them here along with the reason why they were rejected.
--->
-
-# General Notes
+User can downgrade to the old version without any compatibility issue.
 
 # Links
 

From 61898ae7f2e0fe80afaa641335a2973bd29529e6 Mon Sep 17 00:00:00 2001
From: thetumbled <843221020@qq.com>
Date: Fri, 15 Nov 2024 16:48:29 +0800
Subject: [PATCH 03/11] add discussion link.

---
 pip/pip-393.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pip/pip-393.md b/pip/pip-393.md
index 8d31473f22cb8..dbb1ab372e859 100644
--- a/pip/pip-393.md
+++ b/pip/pip-393.md
@@ -134,5 +134,5 @@ User can downgrade to the old version without any compatibility issue.
 <!--
 Updated afterwards
 -->
-* Mailing List discussion thread:
+* Mailing List discussion thread: https://lists.apache.org/thread/yojl7ylk7cyjxktq3cn8849hvmyv0fg8
 * Mailing List voting thread:

From 38c7dcae6d7d793cd4148a7b67ef477cb6bbe081 Mon Sep 17 00:00:00 2001
From: thetumbled <843221020@qq.com>
Date: Fri, 15 Nov 2024 18:00:23 +0800
Subject: [PATCH 04/11] add doc.

---
 pip/pip-393.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/pip/pip-393.md b/pip/pip-393.md
index dbb1ab372e859..afd3349d599bf 100644
--- a/pip/pip-393.md
+++ b/pip/pip-393.md
@@ -36,8 +36,9 @@ Currently, each time the timer task is triggered, it will iterate all the entrie
 which is unnecessary. We can sort entries by timestamp and only iterate the entries that need to be redelivered.
 
 ## Redelivery time is not accurate
-Currently, the redelivery time is controlled by the `timerIntervalNanos`, which is 1/3 of the `negativeAckRedeliveryDelay`.
-That means, if the `negativeAckRedeliveryDelay` is 1h, the redelivery time will be 20min, which is unacceptable.
+Currently, the interval of the redelivery check is controlled by `timerIntervalNanos`, which is 1/3 of the `negativeAckRedeliveryDelay`.
+That means, if the `negativeAckRedeliveryDelay` is 1h, the check task runs every 20min, so the redelivery time
+can deviate by up to 20min, which is unacceptable.
 
 ## Multiple negative ack for messages in the same entry(batch) will interfere with each other
 Currently, `NegativeAcksTracker#nackedMessages` map `(ledgerId, entryId)` to `timestamp`, which means multiple nacks from messages 

From 6dc7daaa401ec17640f4c9aa24f4fddec6aa07af Mon Sep 17 00:00:00 2001
From: thetumbled <843221020@qq.com>
Date: Fri, 15 Nov 2024 19:14:01 +0800
Subject: [PATCH 05/11] add effect.

---
 pip/pip-393.md | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)
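
Note on the "Effect" section added by this patch: the two claims "only iterate
the entries that need to be redelivered" and "nacks with different delays do not
interfere" can be illustrated as follows. This is a sketch under assumptions:
messages are represented by string labels instead of `ledgerId -> entryId`, a JDK
`TreeMap` stands in for the sorted fastutil map, and the names `DueBucketSketch`,
`nack` and `pollDue` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DueBucketSketch {
    // Bucketed redelivery timestamp (ms) -> message labels, sorted ascending.
    final TreeMap<Long, List<String>> buckets = new TreeMap<>();

    static long trimLowerBit(long timestamp, int bits) {
        return timestamp & (-1L << bits);
    }

    // Record a nack whose redelivery is due `delayMs` after `nowMs`.
    void nack(String msg, long nowMs, long delayMs, int bits) {
        long bucket = trimLowerBit(nowMs + delayMs, bits);
        buckets.computeIfAbsent(bucket, k -> new ArrayList<>()).add(msg);
    }

    // Remove and return only the messages whose bucket is due at `nowMs`;
    // buckets with a later timestamp are never visited.
    List<String> pollDue(long nowMs) {
        List<String> due = new ArrayList<>();
        Map<Long, List<String>> head = buckets.headMap(nowMs, true);
        for (List<String> msgs : head.values()) {
            due.addAll(msgs);
        }
        head.clear(); // headMap is a view, so this removes the due buckets
        return due;
    }

    public static void main(String[] args) {
        DueBucketSketch s = new DueBucketSketch();
        s.nack("msg1", 0L, 10_000L, 8); // bucket 9984
        s.nack("msg2", 0L, 20_000L, 8); // bucket 19968
        System.out.println(s.pollDue(10_000L)); // prints [msg1]
        System.out.println(s.pollDue(20_000L)); // prints [msg2]
    }
}
```

Because msg1 and msg2 land in different buckets, the later nack does not
overwrite the earlier one's redelivery time.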

diff --git a/pip/pip-393.md b/pip/pip-393.md
index afd3349d599bf..1b1dc0737a753 100644
--- a/pip/pip-393.md
+++ b/pip/pip-393.md
@@ -98,9 +98,25 @@ ledgerMap.put(ledgerId, entrySet);
 map.put(timestamp, ledgerMap);
 ```
 
+### Effect
+
+#### Memory occupation is high
 With such kind of design, we can reduce the memory occupation of `NegativeAcksTracker` to 1% less than the current implementation.
 The detailed test result will be provided in the PR.
 
+#### Code execution efficiency is low
+With the new design, we can avoid the iteration of all entries in `NegativeAcksTracker.nackedMessages`, and only iterate the entries
+that need to be redelivered.
+
+#### Redelivery time is not accurate
+With the new design, we avoid the fixed interval of the redelivery check time. We can control the precision of the redelivery time 
+by trimming the lower bits of the timestamp. If user can accept 1024ms deviation of the redelivery time, we can trim the lower
+10 bits of the timestamp, which can group a lot
+
+#### Multiple negative ack for messages in the same entry(batch) will interfere with each other
+With the new design, if we let msg1 redelivered 10s later, then let msg2 redelivered 20s later, these two nacks will not interfere
+with each other, as they are stored in different buckets.
+
 
 ### Configuration
 

From b023eedd3a14fc0c7a07892fc202cdfcb68f5a53 Mon Sep 17 00:00:00 2001
From: thetumbled <843221020@qq.com>
Date: Tue, 19 Nov 2024 11:06:05 +0800
Subject: [PATCH 06/11] update doc.

---
 pip/pip-393.md | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/pip/pip-393.md b/pip/pip-393.md
index 1b1dc0737a753..1086e039fd258 100644
--- a/pip/pip-393.md
+++ b/pip/pip-393.md
@@ -4,8 +4,12 @@
 # Background knowledge
 
 Negative Acknowledgement is a feature in Pulsar that allows consumers to trigger the redelivery 
-of a message after some time when they fail to process it. The redelivery delay is determined by
-the `negativeAckRedeliveryDelay` configuration.
+of a message after some time when they fail to process it. we can also use it as the consumer-side
+`delayed queue` feature. The `delayed queue` feature provided by Pulsar actually is a producer-side
+feature, which means delay time is determined by the producer. However, in many cases, we need the
+delay time to be determined by the consumer. In this case, we can use the `Negative Acknowledgement`
+as the `delayed queue` feature. When the consumer receives a message, it can negative acknowledge the
+message and set the redelivery time. The message will be redelivered after the redelivery time.
 
 When user calls `negativeAcknowledge` method, `NegativeAcksTracker` in `ConsumerImpl` will add an entry
 into the map `NegativeAcksTracker.nackedMessages`, mapping the message ID to the redelivery time.

From b414f6f774bdeb977a835567964a68ff1f7ae014 Mon Sep 17 00:00:00 2001
From: thetumbled <843221020@qq.com>
Date: Tue, 26 Nov 2024 15:35:33 +0800
Subject: [PATCH 07/11] update doc.

---
 pip/pip-393.md | 15 ++++-----------
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/pip/pip-393.md b/pip/pip-393.md
index 1086e039fd258..712c05b3e41ae 100644
--- a/pip/pip-393.md
+++ b/pip/pip-393.md
@@ -4,17 +4,10 @@
 # Background knowledge
 
 Negative Acknowledgement is a feature in Pulsar that allows consumers to trigger the redelivery 
-of a message after some time when they fail to process it. we can also use it as the consumer-side
-`delayed queue` feature. The `delayed queue` feature provided by Pulsar actually is a producer-side
-feature, which means delay time is determined by the producer. However, in many cases, we need the
-delay time to be determined by the consumer. In this case, we can use the `Negative Acknowledgement`
-as the `delayed queue` feature. When the consumer receives a message, it can negative acknowledge the
-message and set the redelivery time. The message will be redelivered after the redelivery time.
-
-When user calls `negativeAcknowledge` method, `NegativeAcksTracker` in `ConsumerImpl` will add an entry
-into the map `NegativeAcksTracker.nackedMessages`, mapping the message ID to the redelivery time.
-When the redelivery time comes, `NegativeAcksTracker` will send a redelivery request to the broker to
-redeliver the message.
+of a message after some time when they fail to process it. When a user calls the `negativeAcknowledge` method,
+`NegativeAcksTracker` in `ConsumerImpl` adds an entry to the map `NegativeAcksTracker.nackedMessages`,
+mapping the message ID to the redelivery time. When the redelivery time arrives, `NegativeAcksTracker`
+sends a redelivery request to the broker to redeliver the message.
 
 # Motivation
 

From dd322031761352c858548377222d562c44c02195 Mon Sep 17 00:00:00 2001
From: thetumbled <843221020@qq.com>
Date: Tue, 26 Nov 2024 17:37:23 +0800
Subject: [PATCH 08/11] add space complexity analysis.

---
 pip/pip-393.md | 60 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)
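
Note on the space complexity analysis added by this patch: the per-bucket
formula `D=ceil(M/50000)*size(bitmap)` can be evaluated with a few lines of
code. This is a sketch of the PIP's own cost model only; the bitmap sizes
(4 bytes per element below 4096 elements, a fixed 8KB above) are taken from the
text as given, and the class and method names are hypothetical.

```java
class NackSpaceSketch {
    static final long MAX_ENTRIES_PER_LEDGER = 50_000; // managedLedgerMaxEntriesPerLedger default
    static final long DENSE_BITMAP_BYTES = 8 * 1024;   // fixed size assumed by the PIP
    static final long SPARSE_LIMIT = 4096;             // threshold used by the PIP

    // M: messages per bucket = rate (msg/ms) * bucket width (2^bits ms).
    static long messagesPerBucket(long msgPerMs, int bits) {
        return msgPerMs << bits;
    }

    // D = ceil(M/50000) * size(bitmap), ignoring the timestamp and ledger id keys.
    static long bytesPerBucket(long m) {
        long ledgers = (m + MAX_ENTRIES_PER_LEDGER - 1) / MAX_ENTRIES_PER_LEDGER;
        long bitmapBytes = m < SPARSE_LIMIT ? 4 * m : DENSE_BITMAP_BYTES;
        return ledgers * bitmapBytes;
    }

    // Average bytes per nacked entry for a bucket holding m messages.
    static double bytesPerEntry(long m) {
        return (double) bytesPerBucket(m) / m;
    }

    public static void main(String[] args) {
        long m = messagesPerBucket(10, 10); // 10 msg/ms, 1024ms buckets -> 10240 msgs
        System.out.println(m + " msgs/bucket, " + bytesPerEntry(m) + " byte/entry");
        System.out.println("M=1000: " + bytesPerEntry(1000) + " byte/entry");
        System.out.println("M=100000: " + bytesPerEntry(100_000) + " byte/entry");
    }
}
```

For comparison, the same text puts `ConcurrentLongLongPairHashMap` at between
48 and 213 bytes per entry.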

diff --git a/pip/pip-393.md b/pip/pip-393.md
index 712c05b3e41ae..28f0fd59dfdb7 100644
--- a/pip/pip-393.md
+++ b/pip/pip-393.md
@@ -95,6 +95,66 @@ ledgerMap.put(ledgerId, entrySet);
 map.put(timestamp, ledgerMap);
 ```
 
+### Space complexity analysis
+#### Space complexity of `ConcurrentLongLongPairHashMap`
+Before analyzing the new data structure, we need to know how much space it take before this pip.
+
+We need to store 4 long field for `(ledgerId, entryId, partitionIndex, timestamp)` for each entry, which takes `4*8=32byte`.
+As `ConcurrentLongLongPairHashMap` uses open addressing with linear probing to handle hash conflicts, some
+redundant space is reserved to avoid a high conflict rate. Two configurations control how much redundant space to reserve:
+`fill factor` and `idle factor`. When the space utilization rises to the `fill factor`, the size of the backing array
+doubles; when the space utilization drops to the `idle factor`, the size of the backing array is halved.
+
+The default value of `fill factor` is 0.66 and that of `idle factor` is 0.15, which means the min space occupation of
+`ConcurrentLongLongPairHashMap` is `32N/0.66 byte = 48N byte` and the max space occupation is `32N/0.15 byte = 213N byte`,
+where N is the number of entries.
+
+In the experiment showed in the PR, there are 100w entries in the map, taking up `32*1000000/1024/1024byte=30MB`,
+the space utility rate is 30/64=0.46, in the range of `[0.15, 0.66]`.
+
+
+#### Space complexity of the new data structure
+The space used by new data structure is related to several factors: `message rate`, `the time deviation user accepted`,
+`the max entries written in one ledger`.
+- Pulsar conf `managedLedgerMaxEntriesPerLedger=50000` determine the max entries can be written into one ledger,
+we use this default value in the analysis.
+- `the time deviation user accepted`: when the user accepts a 1024ms deviation of the delivery time, we can trim the lower 10 bits
+of the timestamp in ms, which groups 1024 consecutive timestamps into one bucket.
+
+Below, we analyze the space used by one bucket and calculate the average space used by one entry.
+
+Assuming that the message rate is `x msg/ms` and we trim `y bit` of the timestamp, one bucket covers `2**y` ms and contains
+`M=2**y*x` msgs.
+- For one single bucket, we only need to store one timestamp, which takes `8byte`.
+- Then, we need to store the ledgerId. When M is greater than 50000 (`managedLedgerMaxEntriesPerLedger`), the ledger rolls over.
+There are `L=ceil(M/50000)` ledgers, which take `8*L` byte.
+- Further, we analyze how much space the entry ids take. As there are `L=ceil(M/50000)` ledgers, there are `L` bitmaps to store,
+which take `L*size(bitmap)`. The total space consumed by the new data structure is `8byte + 8L byte + L*size(bitmap)`.
+
+As `size(bitmap)` is far greater than `8byte`, we can ignore the first two terms. Then we get the formula for the space
+consumed by **one bucket**: `D=L*size(bitmap)=ceil(M/50000)*size(bitmap)`.
+
+Entry ids are stored in a `Roaring64Bitmap`. For simplicity we can replace it with `RoaringBitmap` in the analysis, as the max entry id is 49999,
+which is far smaller than `4294967296 (2**32)`, the number of distinct values that can be stored in a `RoaringBitmap`. The space consumed
+by a `RoaringBitmap` depends on how many elements it contains: when the bitmap holds fewer than 4096 elements, the space is `4N byte`
+(N being the number of elements in the bitmap), and when it holds more than 4096 elements, the consumed space is a fixed value `8KB`.
+
+Then we get the final result:
+- when M>50000, `D = ceil(M/50000)*size(bitmap) ~= M/50000 * 8KB = M/50000 * 8 * 1024 byte = 0.163M byte`, 
+each entry takes `0.163byte` by average.
+- when 4096<M<50000, `D = ceil(M/50000)*size(bitmap) = 1 * 8KB = 8KB`, each entry takes `8*1024/M=8192/M byte` by average.
+- when M<4096, `D = ceil(M/50000)*size(bitmap) = 1 * 4M byte = 4M byte`, each entry take `4 byte` by average.
+
+#### Conclusion
+Assuming N is the number of entries, M is the number of messages in one bucket.
+- `ConcurrentLongLongPairHashMap`: `48N` byte in best case, `213N byte` in worst case.
+- New data structure:
+    - when M>50000, `0.163N byte`.
+    - when 4096<M<50000, `8192/M * N byte` .
+    - when M<4096, `4N byte`.
+
+Some experiment results are showed in the PR, we can fine tune the configuration to get the best performance.
+
 ### Effect
 
 #### Memory occupation is high

From e08f5eab500cd57664796607f8133f96221400f0 Mon Sep 17 00:00:00 2001
From: thetumbled <843221020@qq.com>
Date: Tue, 26 Nov 2024 17:48:08 +0800
Subject: [PATCH 09/11] add High-Level Design to handle dependency.

---
 pip/pip-393.md | 64 ++++++++++++++++++++++++++++++--------------------
 1 file changed, 39 insertions(+), 25 deletions(-)

diff --git a/pip/pip-393.md b/pip/pip-393.md
index 28f0fd59dfdb7..59b450dc84b66 100644
--- a/pip/pip-393.md
+++ b/pip/pip-393.md
@@ -95,8 +95,26 @@ ledgerMap.put(ledgerId, entrySet);
 map.put(timestamp, ledgerMap);
 ```
 
-### Space complexity analysis
-#### Space complexity of `ConcurrentLongLongPairHashMap`
+### Configuration
+
+Add a new configuration `negativeAckPrecisionBitCnt` to control the precision of the redelivery time.
+```java
+@ApiModelProperty(
+            name = "negativeAckPrecisionBitCnt",
+            value = "The redelivery time precision bit count. The lower bits of the redelivery time will be\n" +
+                "trimmed to reduce the memory occupation. The default value is 8, which means the redelivery time\n" +
+                "will be bucketed by 256ms. In worst cases, the redelivery time will be 512ms earlier(no later)\n" +
+                "than the expected time. If the value is 0, the redelivery time will be accurate to ms."
+    )
+    private long negativeAckPrecisionBitCnt = 8;
+```
+The higher the value, the more entries are grouped into the same bucket, which lowers the memory occupation but makes the redelivery time less accurate.
+The default value is 8, which means the redelivery time is bucketed by 256ms. In the worst case, the redelivery time is 512ms earlier (never later)
+than the expected time.
+
+
+## Space complexity analysis
+### Space complexity of `ConcurrentLongLongPairHashMap`
 Before analyzing the new data structure, we need to know how much space it take before this pip.
 
 We need to store 4 long field for `(ledgerId, entryId, partitionIndex, timestamp)` for each entry, which takes `4*8=32byte`.
@@ -113,7 +131,7 @@ In the experiment showed in the PR, there are 100w entries in the map, taking up
 the space utility rate is 30/64=0.46, in the range of `[0.15, 0.66]`.
 
 
-#### Space complexity of the new data structure
+### Space complexity of the new data structure
 The space used by new data structure is related to several factors: `message rate`, `the time deviation user accepted`,
 `the max entries written in one ledger`.
 - Pulsar conf `managedLedgerMaxEntriesPerLedger=50000` determine the max entries can be written into one ledger,
@@ -145,7 +163,7 @@ each entry takes `0.163byte` by average.
 - when 4096<M<50000, `D = ceil(M/50000)*size(bitmap) = 1 * 8KB = 8KB`, each entry takes `8*1024/M=8192/M byte` by average.
 - when M<4096, `D = ceil(M/50000)*size(bitmap) = 1 * 4M byte = 4M byte`, each entry take `4 byte` by average.
 
-#### Conclusion
+### Conclusion
 Assuming N is the number of entries, M is the number of messages in one bucket.
 - `ConcurrentLongLongPairHashMap`: `48N` byte in best case, `213N byte` in worst case.
 - New data structure:
@@ -155,42 +173,38 @@ Assuming N is the number of entries, M is the number of messages in one bucket.
 
 Some experiment results are showed in the PR, we can fine tune the configuration to get the best performance.
 
-### Effect
+## Effect
 
-#### Memory occupation is high
+### Memory occupation is high
 With such kind of design, we can reduce the memory occupation of `NegativeAcksTracker` to 1% less than the current implementation.
-The detailed test result will be provided in the PR.
 
-#### Code execution efficiency is low
+### Code execution efficiency is low
 With the new design, we can avoid the iteration of all entries in `NegativeAcksTracker.nackedMessages`, and only iterate the entries
 that need to be redelivered.
 
-#### Redelivery time is not accurate
+### Redelivery time is not accurate
 With the new design, we avoid the fixed interval of the redelivery check time. We can control the precision of the redelivery time 
 by trimming the lower bits of the timestamp. If user can accept 1024ms deviation of the redelivery time, we can trim the lower
 10 bits of the timestamp, which can group a lot
 
-#### Multiple negative ack for messages in the same entry(batch) will interfere with each other
+### Multiple negative ack for messages in the same entry(batch) will interfere with each other
 With the new design, if we let msg1 redelivered 10s later, then let msg2 redelivered 20s later, these two nacks will not interfere
 with each other, as they are stored in different buckets.
 
 
-### Configuration
+## High-Level Design
+As this pip introduce new dependency `fastutil` into client, which is very large(23MB), while few classes are used, we need to
+reduce the size of the dependency. 
+
+There is an alternative dependency, `fastutil-core`, which is smaller (6MB), but it is still
+relatively large, and using it would introduce another problem on the broker side, since the broker already ships the `fastutil` jar,
+which also includes the `fastutil-core` classes.
+
+The optimal solution would be to include in the shaded pulsar-client and pulsar-client-all only those fastutil classes
+that are actually used. This could be achieved in many ways. One possible solution is to introduce an intermediate
+module for shaded pulsar-client and pulsar-client-all that isn't published to Maven Central at all.
+It would be used to minimize fastutil and include only the classes required by pulsar-client shading.
 
-Add a new configuration `negativeAckPrecisionBitCnt` to control the precision of the redelivery time.
-```
-@ApiModelProperty(
-            name = "negativeAckPrecisionBitCnt",
-            value = "The redelivery time precision bit count. The lower bits of the redelivery time will be\n" + 
-                "trimmed to reduce the memory occupation. The default value is 8, which means the redelivery time\n" +
-                "will be bucketed by 256ms. In worst cases, the redelivery time will be 512ms earlier(no later)\n" +
-                "than the expected time. If the value is 0, the redelivery time will be accurate to ms.".
-    )
-    private long negativeAckPrecisionBitCnt = 8;
-```
-The higher the value, the more entries will be grouped into the same bucket, the less memory occupation, the less accurate the redelivery time.
-Default value is 8, which means the redelivery time will be bucketed by 256ms. In worst cases, the redelivery time will be 512ms earlier(no later)
-than the expected time.
 
 
 # Backward & Forward Compatibility

From 22e51d6c05c96aa3d4d2359993273c8841040922 Mon Sep 17 00:00:00 2001
From: thetumbled <843221020@qq.com>
Date: Thu, 28 Nov 2024 11:28:30 +0800
Subject: [PATCH 10/11] add vote link.

---
 pip/pip-393.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pip/pip-393.md b/pip/pip-393.md
index 59b450dc84b66..a7afd316f1394 100644
--- a/pip/pip-393.md
+++ b/pip/pip-393.md
@@ -223,4 +223,4 @@ User can downgrade to the old version without any compatibility issue.
 Updated afterwards
 -->
 * Mailing List discussion thread: https://lists.apache.org/thread/yojl7ylk7cyjxktq3cn8849hvmyv0fg8
-* Mailing List voting thread:
+* Mailing List voting thread: https://lists.apache.org/thread/hyc1r2s9chowdhck53lq07tznopt50dy

From f725d42e2a4c6edfc78c54460cf142ccee8e5522 Mon Sep 17 00:00:00 2001
From: thetumbled <843221020@qq.com>
Date: Mon, 2 Dec 2024 20:08:31 +0800
Subject: [PATCH 11/11] fix unit.

---
 pip/pip-393.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/pip/pip-393.md b/pip/pip-393.md
index a7afd316f1394..646c2beb5fe40 100644
--- a/pip/pip-393.md
+++ b/pip/pip-393.md
@@ -20,11 +20,11 @@ All of these problem is severe and need to be solved.
 
 ## Memory occupation is high
 After the improvement of https://github.com/apache/pulsar/pull/23582, we have reduce half more memory occupation
-of `NegativeAcksTracker` by replacing `HashMap` with `ConcurrentLongLongPairHashMap`. With 100w entry, the memory
-occupation decrease from 178Mb to 64Mb. With 1kw entry, the memory occupation decrease from 1132Mb to 512Mb.
+of `NegativeAcksTracker` by replacing `HashMap` with `ConcurrentLongLongPairHashMap`. With 1 million entries, the memory
+occupation decreases from 178MB to 64MB. With 10 million entries, the memory occupation decreases from 1132MB to 512MB.
 The average memory occupation of each entry decrease from 1132MB/10000000=118byte to 512MB/10000000=53byte.
 
-But it is not enough. Assuming that we negative ack message 1w/s, assigning 1h redelivery delay for each message,
+But it is not enough. Assuming that we negatively acknowledge 10k messages per second, assigning a 1h redelivery delay to each message,
 the memory occupation of `NegativeAcksTracker` will be `3600*10000*53/1024/1024/1024=1.77GB`, if the delay is 5h,
 the required memory is `3600*10000*53/1024/1024/1024*5=8.88GB`, which increase too fast.
 
@@ -127,7 +127,7 @@ The default value of `fill factor` is 0.66, `idle factor` is 0.15, which means t
 `ConcurrentLongLongPairHashMap` is `32/0.66N byte = 48N byte`, the max space occupation is `32/0.15N byte=213N byte`, 
 where N is the number of entries.
 
-In the experiment showed in the PR, there are 100w entries in the map, taking up `32*1000000/1024/1024byte=30MB`,
+In the experiment shown in the PR, there are 1 million entries in the map, taking up `32*1000000 byte/1024/1024=30MB`,
 the space utility rate is 30/64=0.46, in the range of `[0.15, 0.66]`.