Improved efficiency in DigestManager.verify() #3810
Codecov Report
@@ Coverage Diff @@
## master #3810 +/- ##
============================================
- Coverage 60.47% 60.41% -0.07%
+ Complexity 5847 5844 -3
============================================
Files 473 473
Lines 40933 40937 +4
Branches 5235 5234 -1
============================================
- Hits 24756 24731 -25
- Misses 13976 13991 +15
- Partials 2201 2215 +14
Please rebase onto master, @merlimat.
526d640 to 855d68d
private static final FastThreadLocal<ByteBuf> DIGEST_BUFFER = new FastThreadLocal<ByteBuf>() {
    @Override
    protected ByteBuf initialValue() throws Exception {
        return PooledByteBufAllocator.DEFAULT.directBuffer(1024);
The required ByteBuf size for each digest manager:
CRC32CDigestManager -> 4
CRC32DigestManager -> 8
DummyDigestManager -> 0
MacDigestManager -> 20
Do we need to allocate 1024 bytes?
Just in case. Though this is only used for MAC at this point, and one per thread. It shouldn't be any noticeable waste.
I actually was referring to my next change in the DigestManager :)
### Motivation

In `DigestManager` there are several accesses to thread-local variables for each entry processed. This is mainly due to the `DigestManager` API, which exposes a stateful `update()` method that can be invoked multiple times and keeps the current checksum in a thread-local variable.

If we exclude the MAC digest, which is 20 bytes, the other digests can instead keep the current checksum in a local variable and pass it on each call, avoiding all the thread-locals and also the need to write the checksum result into a buffer.

### Benchmarks

#### Before #3810

```
Benchmark                            (entrySize)   Mode  Cnt    Score   Error   Units
DigestManagerBenchmark.verifyDigest           64  thrpt    3   13.450 ± 3.634  ops/us
DigestManagerBenchmark.verifyDigest         1024  thrpt    3    7.908 ± 2.637  ops/us
DigestManagerBenchmark.verifyDigest         4086  thrpt    3    3.233 ± 0.882  ops/us
DigestManagerBenchmark.verifyDigest         8192  thrpt    3    1.846 ± 0.047  ops/us
```

#### After #3810

```
Benchmark                            (entrySize)   Mode  Cnt    Score   Error   Units
DigestManagerBenchmark.verifyDigest           64  thrpt    3   46.312 ± 7.414  ops/us
DigestManagerBenchmark.verifyDigest         1024  thrpt    3   13.379 ± 1.069  ops/us
DigestManagerBenchmark.verifyDigest         4086  thrpt    3    3.787 ± 0.059  ops/us
DigestManagerBenchmark.verifyDigest         8192  thrpt    3    1.956 ± 0.052  ops/us
```

#### After this change

```
Benchmark                            (entrySize)   Mode  Cnt    Score   Error   Units
DigestManagerBenchmark.verifyDigest           64  thrpt    3  130.108 ± 4.854  ops/us
DigestManagerBenchmark.verifyDigest         1024  thrpt    3   17.744 ± 0.238  ops/us
DigestManagerBenchmark.verifyDigest         4086  thrpt    3    4.104 ± 0.181  ops/us
DigestManagerBenchmark.verifyDigest         8192  thrpt    3    2.050 ± 0.066  ops/us
```
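To illustrate the idea of carrying the running checksum in a local variable instead of a thread-local, here is a minimal sketch. It is not BookKeeper's code: the class `RunningCrc32c` and its methods `update()` and `digest()` are hypothetical names, plain `byte[]` arrays stand in for Netty `ByteBuf`s, and a simple table-driven CRC32C replaces the optimized backends.

```java
// Hypothetical illustration only; these names are not part of the BookKeeper API.
final class RunningCrc32c {
    // Lookup table for the reflected CRC32C (Castagnoli) polynomial.
    private static final int[] TABLE = new int[256];

    static {
        for (int i = 0; i < 256; i++) {
            int c = i;
            for (int k = 0; k < 8; k++) {
                c = (c & 1) != 0 ? (c >>> 1) ^ 0x82F63B78 : c >>> 1;
            }
            TABLE[i] = c;
        }
    }

    private RunningCrc32c() {
    }

    /**
     * Folds data[offset, offset + len) into a previously returned checksum
     * and returns the new checksum. No state is kept between calls: the
     * caller owns the running value as a plain int.
     */
    static int update(int previous, byte[] data, int offset, int len) {
        int crc = ~previous; // undo the final inversion to resume
        for (int i = offset; i < offset + len; i++) {
            crc = TABLE[(crc ^ data[i]) & 0xFF] ^ (crc >>> 8);
        }
        return ~crc;
    }

    /** Checksums a header followed by a payload, threading the value through a local. */
    static int digest(byte[] header, byte[] payload) {
        int crc = 0; // 0 means "nothing checksummed yet"
        crc = update(crc, header, 0, header.length);
        crc = update(crc, payload, 0, payload.length);
        return crc;
    }
}
```

Because the running value is an ordinary `int`, each call needs no thread-local lookup and no intermediate buffer to hold the result. A 20-byte MAC does not fit in a primitive, which is why the description above excludes it.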
### Motivation

In #3810 the signature of `Crc32cIntChecksum.resumeChecksum()` was changed to accept `offset` and `len` in the buffer. Since this method is also used externally (in Pulsar), we should also keep the old method signature to avoid breaking the API when upgrading BK.
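A sketch of how both signatures could coexist, under stated assumptions: the class `CompatChecksum` is hypothetical, `java.nio.ByteBuffer` stands in for Netty's `ByteBuf`, and a bitwise CRC32C replaces the real backend. The only point is that the old two-argument overload delegates to the new one using the buffer's current readable region.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in; not the actual Crc32cIntChecksum class.
final class CompatChecksum {
    private CompatChecksum() {
    }

    /** New-style signature: checksum an explicit region of the buffer. */
    static int resumeChecksum(int previousChecksum, ByteBuffer payload, int offset, int len) {
        int crc = ~previousChecksum;
        for (int i = offset; i < offset + len; i++) {
            crc ^= payload.get(i) & 0xFF; // absolute read, position is untouched
            for (int k = 0; k < 8; k++) {
                // Reflected CRC32C polynomial, applied only when the low bit is set.
                crc = (crc >>> 1) ^ (0x82F63B78 & -(crc & 1));
            }
        }
        return ~crc;
    }

    /** Old-style signature, kept so that existing external callers still compile. */
    static int resumeChecksum(int previousChecksum, ByteBuffer payload) {
        return resumeChecksum(previousChecksum, payload, payload.position(), payload.remaining());
    }
}
```

Keeping the old overload as a thin delegate means there is a single implementation to maintain, while callers that pass only the buffer see no change in behavior.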
Motivation
In DigestManager.verifyDigestAndReturnData() there are a couple of simple improvements:
Benchmarks:
Before
Allocations: 230 bytes per operation
After
Allocations: 64 bytes per operation
Note: the remaining 64 bytes are due only to the Java 9 reflection access. They would not be present with the JNI-based CRC32c backend (though that does not work on Mac M1).