Remove State allocs in RedisStateMachine and reduce allocs in ByteArrayCodec #2768

Closed
shikharid wants to merge 2 commits

Conversation

shikharid
Contributor

@shikharid shikharid commented Feb 23, 2024

Two changes:

  1. In RedisStateMachine, use pre-allocated State objects
  2. For ByteArrayCodec (or any codec that can report its exact encoded byte size), encode directly into the target ByteBuf instead of first creating a temporary single-use copy
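
To illustrate the first change, here is a minimal sketch of a stack that hands out pre-allocated frames instead of creating a new object on every push. It is not the Lettuce source: the class name `PreallocatedStateStack`, the fields `type` and `count`, and the fixed maximum depth are all assumptions made for the example.

```java
/** Illustrative sketch only; names and fields are assumptions, not Lettuce source. */
final class PreallocatedStateStack {

    /** Mutable parser frame that is reset and reused instead of re-created. */
    static final class State {
        int type;       // hypothetical marker for the reply element being parsed
        int count = -1; // remaining child elements, -1 while unknown

        void reset() {
            type = 0;
            count = -1;
        }
    }

    private final State[] frames;
    private int size;

    PreallocatedStateStack(int maxDepth) {
        frames = new State[maxDepth];
        for (int i = 0; i < maxDepth; i++) {
            frames[i] = new State(); // all allocation happens once, here
        }
    }

    /** Returns the next pre-allocated frame, reset to its initial values. */
    State push() {
        if (size == frames.length) {
            throw new IllegalStateException("nesting deeper than " + frames.length);
        }
        State state = frames[size++];
        state.reset();
        return state;
    }

    State peek() {
        if (size == 0) {
            throw new IllegalStateException("stack is empty");
        }
        return frames[size - 1];
    }

    void pop() {
        if (size == 0) {
            throw new IllegalStateException("stack is empty");
        }
        size--;
    }

    public static void main(String[] args) {
        PreallocatedStateStack stack = new PreallocatedStateStack(32);
        State first = stack.push();
        first.count = 3;
        stack.pop();
        State second = stack.push();
        System.out.println(first == second); // the same instance is handed out again
        System.out.println(second.count);    // and it was reset before reuse
    }
}
```

The trade-off in a design like this is a fixed upper bound on nesting depth in exchange for no per-push allocation on the hot path.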

Benchmarks are included (results are in the comment below).

Refers to #2610

  • You have read the contribution guidelines.
  • You have created a feature request first to discuss your contribution intent. Please reference the feature request ticket number in the pull request.
  • You use the code formatters provided here and have them applied to your changes. Don’t submit any formatting related changes.
  • You submit test cases (unit or integration tests) that back your changes. (Already existed, added benchmark tests)

…te object allocs redis#2610

* adds gc and thrpt profiling in RedisStateMachine benchmark
* fixes a stale benchmark which caused compilation errors ClusterDistributionChannelWriterBenchmark
…redis#2610

* adds benchmarks to show perf gains
* about 10x improvement in perf, with no added gc overhead
@shikharid
Contributor Author

shikharid commented Feb 23, 2024

Summary

  1. ~25% faster decoding in RedisStateMachine (before 130 ns/op, after 104 ns/op), with heap allocation dropping from 96 B/op to about 10⁻⁵ B/op
  2. ~9x faster key/value encoding for ByteArrayCodec (before 310 ns/op, after 34 ns/op)
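
The two encoding paths compared in the second item can be sketched roughly as follows. This is a simplified illustration, not the code from the PR: it uses `java.nio.ByteBuffer` as a stand-in for Netty's `ByteBuf`, writes a Redis bulk string (`$<length>\r\n<bytes>\r\n`), and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Illustrative sketch; ByteBuffer stands in for Netty's ByteBuf, names are hypothetical. */
final class ExactSizeEncodingSketch {

    /** Estimated-size path: encode into a temporary buffer to learn the length, then copy. */
    static void writeViaTemporary(ByteBuffer target, byte[] key) {
        ByteBuffer temp = ByteBuffer.allocate(key.length); // single-use allocation
        temp.put(key);
        temp.flip();
        writeHeader(target, temp.remaining());
        target.put(temp); // second copy of the same bytes
        writeCrlf(target);
    }

    /** Exact-size path: the length is known up front, so write straight into the target. */
    static void writeDirect(ByteBuffer target, byte[] key) {
        writeHeader(target, key.length);
        target.put(key);
        writeCrlf(target);
    }

    private static void writeHeader(ByteBuffer target, int length) {
        target.put((byte) '$');
        target.put(Integer.toString(length).getBytes(StandardCharsets.US_ASCII));
        writeCrlf(target);
    }

    private static void writeCrlf(ByteBuffer target) {
        target.put((byte) '\r');
        target.put((byte) '\n');
    }

    public static void main(String[] args) {
        byte[] key = "foo".getBytes(StandardCharsets.US_ASCII);
        ByteBuffer viaTemp = ByteBuffer.allocate(64);
        ByteBuffer direct = ByteBuffer.allocate(64);
        writeViaTemporary(viaTemp, key);
        writeDirect(direct, key);
        viaTemp.flip();
        direct.flip();
        System.out.println(viaTemp.equals(direct)); // both paths produce the same bytes
    }
}
```

Both paths emit identical output; the direct path simply skips the intermediate buffer and the extra copy.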

Benchmark Setup

  • JMH version: 1.21
  • VM version: JDK 1.8.0_342, OpenJDK 64-Bit Server VM, 25.342-b07
  • OS: MacOS Ventura 13.2.1
  • Arch: Apple M1 Max (32 gb, aarch64)
  • Warmup: 5 iterations, 10 s each
  • Measurement: 5 iterations, 10 s each
  • Timeout: 2 s per iteration
  • Threads: 1 thread, will synchronize iterations
RedisStateMachine Benchmark Results

Before

Benchmark                                                                  Mode  Cnt    Score    Error   Units
RedisStateMachineBenchmark.measureDecode                                   avgt    5  130.654 ± 11.039   ns/op
RedisStateMachineBenchmark.measureDecode:·gc.alloc.rate                    avgt    5  667.181 ± 55.258  MB/sec
RedisStateMachineBenchmark.measureDecode:·gc.alloc.rate.norm               avgt    5   96.000 ±  0.001    B/op
RedisStateMachineBenchmark.measureDecode:·gc.churn.PS_Eden_Space           avgt    5  667.349 ± 56.391  MB/sec
RedisStateMachineBenchmark.measureDecode:·gc.churn.PS_Eden_Space.norm      avgt    5   96.024 ±  0.745    B/op
RedisStateMachineBenchmark.measureDecode:·gc.churn.PS_Survivor_Space       avgt    5    0.106 ±  0.084  MB/sec
RedisStateMachineBenchmark.measureDecode:·gc.churn.PS_Survivor_Space.norm  avgt    5    0.015 ±  0.011    B/op
RedisStateMachineBenchmark.measureDecode:·gc.count                         avgt    5  715.000           counts
RedisStateMachineBenchmark.measureDecode:·gc.time                          avgt    5  345.000               ms

After

Benchmark                                                     Mode  Cnt    Score    Error   Units
RedisStateMachineBenchmark.measureDecode                      avgt    5  104.293 ±  0.309   ns/op
RedisStateMachineBenchmark.measureDecode:·gc.alloc.rate       avgt    5   ≈ 10⁻⁴           MB/sec
RedisStateMachineBenchmark.measureDecode:·gc.alloc.rate.norm  avgt    5   ≈ 10⁻⁵             B/op
RedisStateMachineBenchmark.measureDecode:·gc.count            avgt    5      ≈ 0           counts

ByteArrayCodec Benchmark Results

Before

Benchmark                                                                                     Mode  Cnt     Score    Error   Units
ExactVsEstimatedSizeCodecBenchmark.encodeKeyEstimatedSize                                     avgt    5   310.842 ±  1.404   ns/op
ExactVsEstimatedSizeCodecBenchmark.encodeKeyEstimatedSize:·gc.alloc.rate                      avgt    5   972.380 ±  3.934  MB/sec
ExactVsEstimatedSizeCodecBenchmark.encodeKeyEstimatedSize:·gc.alloc.rate.norm                 avgt    5   332.994 ±  0.010    B/op
ExactVsEstimatedSizeCodecBenchmark.encodeKeyEstimatedSize:·gc.churn.PS_Eden_Space             avgt    5   971.755 ± 12.683  MB/sec
ExactVsEstimatedSizeCodecBenchmark.encodeKeyEstimatedSize:·gc.churn.PS_Eden_Space.norm        avgt    5   332.779 ±  3.172    B/op
ExactVsEstimatedSizeCodecBenchmark.encodeKeyEstimatedSize:·gc.churn.PS_Survivor_Space         avgt    5     0.134 ±  0.091  MB/sec
ExactVsEstimatedSizeCodecBenchmark.encodeKeyEstimatedSize:·gc.churn.PS_Survivor_Space.norm    avgt    5     0.046 ±  0.031    B/op
ExactVsEstimatedSizeCodecBenchmark.encodeKeyEstimatedSize:·gc.count                           avgt    5   759.000           counts
ExactVsEstimatedSizeCodecBenchmark.encodeKeyEstimatedSize:·gc.time                            avgt    5   370.000               ms

After

Benchmark                                                                                     Mode  Cnt     Score    Error   Units
ExactVsEstimatedSizeCodecBenchmark.encodeKeyExactSize                                         avgt    5    34.032 ±  0.191   ns/op
ExactVsEstimatedSizeCodecBenchmark.encodeKeyExactSize:·gc.alloc.rate                          avgt    5  4054.084 ± 20.099  MB/sec
ExactVsEstimatedSizeCodecBenchmark.encodeKeyExactSize:·gc.alloc.rate.norm                     avgt    5   152.000 ±  0.001    B/op
ExactVsEstimatedSizeCodecBenchmark.encodeKeyExactSize:·gc.churn.PS_Eden_Space                 avgt    5  4047.560 ± 97.328  MB/sec
ExactVsEstimatedSizeCodecBenchmark.encodeKeyExactSize:·gc.churn.PS_Eden_Space.norm            avgt    5   151.755 ±  3.222    B/op
ExactVsEstimatedSizeCodecBenchmark.encodeKeyExactSize:·gc.churn.PS_Survivor_Space             avgt    5     0.147 ±  0.085  MB/sec
ExactVsEstimatedSizeCodecBenchmark.encodeKeyExactSize:·gc.churn.PS_Survivor_Space.norm        avgt    5     0.006 ±  0.003    B/op
ExactVsEstimatedSizeCodecBenchmark.encodeKeyExactSize:·gc.count                               avgt    5   780.000           counts
ExactVsEstimatedSizeCodecBenchmark.encodeKeyExactSize:·gc.time                                avgt    5   427.000               ms
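
A note on reading these tables: `gc.alloc.rate` rises from about 972 to about 4054 MB/sec even though `gc.alloc.rate.norm` falls from about 333 to 152 B/op. That is expected, because the rate is roughly bytes per operation multiplied by operations per second, and the operations got much faster. The sketch below redoes that arithmetic from the scores above. The helper names are hypothetical, and the estimates only approximate the reported rates, since JMH measures them over wall-clock iteration time and may count megabytes differently.

```java
/** Back-of-the-envelope arithmetic on the reported scores; helper names are hypothetical. */
final class BenchmarkArithmetic {

    /** How many times faster the "after" run is, given average time per operation. */
    static double speedup(double beforeNsPerOp, double afterNsPerOp) {
        return beforeNsPerOp / afterNsPerOp;
    }

    /** Rough allocation rate in MB/sec (10^6 bytes), derived from B/op and ns/op. */
    static double estimatedAllocRateMbPerSec(double bytesPerOp, double nsPerOp) {
        double opsPerSec = 1e9 / nsPerOp;
        return bytesPerOp * opsPerSec / 1e6;
    }

    public static void main(String[] args) {
        // Scores copied from the tables above.
        System.out.printf("RedisStateMachine speedup: %.2fx%n", speedup(130.654, 104.293));
        System.out.printf("ByteArrayCodec speedup:    %.2fx%n", speedup(310.842, 34.032));
        System.out.printf("Estimated alloc rate before: %.0f MB/sec%n",
                estimatedAllocRateMbPerSec(332.994, 310.842));
        System.out.printf("Estimated alloc rate after:  %.0f MB/sec%n",
                estimatedAllocRateMbPerSec(152.000, 34.032));
    }
}
```

The per-operation figure (`gc.alloc.rate.norm`) is the one to compare when judging allocation pressure per encoded key.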

@shikharid shikharid changed the title perf: remove State allocs in RedisStateMachine and reduce allocs in ByteArrayCodec #2610 perf: remove State allocs in RedisStateMachine and reduce allocs in ByteArrayCodec Feb 23, 2024
@mp911de mp911de changed the title perf: remove State allocs in RedisStateMachine and reduce allocs in ByteArrayCodec Remove State allocs in RedisStateMachine and reduce allocs in ByteArrayCodec Feb 26, 2024
@mp911de mp911de added the type: feature A new feature label Feb 26, 2024
@mp911de mp911de added this to the 6.3.2.RELEASE milestone Feb 26, 2024
mp911de pushed a commit that referenced this pull request Feb 26, 2024
…tate object allocs #2610

* adds gc and thrpt profiling in RedisStateMachine benchmark
* fixes a stale benchmark which caused compilation errors ClusterDistributionChannelWriterBenchmark

Original pull request: #2768
mp911de pushed a commit that referenced this pull request Feb 26, 2024
…zes #2610

* adds benchmarks to show perf gains
* about 10x improvement in perf, with no added gc overhead

Original pull request: #2768
mp911de added a commit that referenced this pull request Feb 26, 2024
Reduce code duplications. Add exact optimization to ASCII StringCodec. Tweak Javadoc.

Original pull request: #2768
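
As background on why an ASCII codec can qualify for the exact-size path: a string containing only ASCII characters encodes to exactly one byte per character, whereas UTF-8 needs between one and three bytes per Java `char`, so only an upper bound is known before encoding. The sketch below illustrates that difference. It is not the StringCodec implementation, and the method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative sketch; not the StringCodec implementation, names are hypothetical. */
final class AsciiSizeSketch {

    /** Exact encoded size for a string that contains only ASCII characters. */
    static int exactAsciiSize(String asciiOnly) {
        return asciiOnly.length(); // one byte per character
    }

    /** Upper bound for UTF-8: at most three bytes per Java char. */
    static int utf8UpperBound(String value) {
        return value.length() * 3;
    }

    public static void main(String[] args) {
        String ascii = "hello";
        String accented = "h\u00e9llo"; // contains one non-ASCII character

        System.out.println(exactAsciiSize(ascii)
                == ascii.getBytes(StandardCharsets.US_ASCII).length);
        System.out.println(utf8UpperBound(accented)
                >= accented.getBytes(StandardCharsets.UTF_8).length);
    }
}
```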
@mp911de
Collaborator

mp911de commented Feb 26, 2024

Thank you for your contribution. That's merged, polished, and backported now.

@mp911de mp911de closed this Feb 26, 2024
Development

Successfully merging this pull request may close these issues.

Performance: Encoding of keys/values in CommandArgs when using a codec that implements ToByteBufEncoder
2 participants