
Performance: Encoding of keys/values in CommandArgs when using a codec that implements ToByteBufEncoder #2610

Closed · shikharid opened this issue Jan 24, 2024 · 13 comments
Labels: type: feature (A new feature)

@shikharid (Contributor) commented Jan 24, 2024

Feature Request

Remove unwanted allocations when encoding CommandArgs with a codec that implements ToByteBufEncoder.

Is your feature request related to a problem? Please describe

I was analysing CPU/memory usage for a JVM server that uses Redis heavily (on the order of tens of thousands of HMGET/sec across ~10 connections) and noticed that a significant share of the VM's CPU usage (~30% of the total, and ~70% of the event loops' CPU time) comes from CommandArgs.KeyArgument.encode.

[Profiler screenshot: CPU time dominated by CommandArgs.KeyArgument.encode]

All the connections use a ByteArrayCodec. The majority of the time is spent in Netty's ByteBuf GC/pooling logic.
The cause seems to be our specific use case: every HMGET call carries hundreds of keys, and this code path runs for each key.

I looked through the implementation, and it seems these allocations can be avoided entirely in the special case where the user handles all requests/responses as byte arrays (e.g. users of ByteArrayCodec), or more generally wherever the codec can compute the exact size of a key/value.

Two disclaimers:

  • The JFR sample used for this analysis is small (5 minutes during peak load).
  • I have not spent much time in the Lettuce codebase, so my suggestion below may rest on a naive assumption.

Describe the solution you'd like

The relevant code is in CommandArgs.KeyArgument.encode() and CommandArgs.ValueArgument.encode():

            if (codec instanceof ToByteBufEncoder) {

                ToByteBufEncoder<K, V> toByteBufEncoder = (ToByteBufEncoder<K, V>) codec;
                // Allocates a temporary buffer per key/value, sized from the estimate.
                ByteBuf temporaryBuffer = target.alloc().buffer(toByteBufEncoder.estimateSize(val) + 6);

                try {
                    toByteBufEncoder.encodeValue(val, temporaryBuffer);
                    // Copies the encoded bytes, prefixed with the now-known length, into the target.
                    ByteBufferArgument.writeByteBuf(target, temporaryBuffer);
                } finally {
                    temporaryBuffer.release();
                }

                return;
            }

The solution would be to replace the above with something like:

            if (codec instanceof ToByteBufEncoder) {
                ToByteBufEncoder<K, V> toByteBufEncoder = (ToByteBufEncoder<K, V>) codec;
                // The lines below are essentially what ByteBufferArgument.writeByteBuf does.
                target.writeByte('$');
                IntegerArgument.writeInteger(target, toByteBufEncoder.estimateSize(val));
                target.writeBytes(CRLF);
                toByteBufEncoder.encodeValue(val, target);
                target.writeBytes(CRLF);
                return;
            }

Now this has a caveat, which is why I'm creating this issue: it assumes estimateSize returns not an "estimate" but the exact size.

Maybe we can give the user control by adding another method to ToByteBufEncoder that tells us whether the codec can predict exact sizes.
ByteArrayCodec always can.
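
As an illustration, here is a minimal sketch of what that could look like, assuming the new method is called isEstimateExact() (a placeholder name, not a settled API):

    import io.netty.buffer.ByteBuf;

    public interface ToByteBufEncoder<K, V> {

        void encodeKey(K key, ByteBuf target);

        void encodeValue(V value, ByteBuf target);

        int estimateSize(Object keyOrValue);

        /**
         * Hypothetical addition: report whether estimateSize returns the exact
         * encoded size. Defaults to false so existing codecs keep the current
         * copy-through-a-temporary-buffer behaviour.
         */
        default boolean isEstimateExact() {
            return false;
        }

    }

The encode path would then take the direct-write branch above only when isEstimateExact() returns true, and fall back to the temporary buffer otherwise.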

This ensures no additional allocations and makes the encode path essentially garbage-free.
Happy to contribute and implement the solution.

Describe alternatives you've considered

  • Use a thread-local on-heap ByteBuf as the temporary buffer, since we know its lifecycle is method-scoped (resizing it as needed and occasionally discarding its bytes after encoding a very large key); a rough sketch follows below.
  • Somehow encode keys/values in "batches" so that we allocate and copy into the target only once. Seems tricky to do.
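
A rough sketch of the first alternative, assuming strictly method-scoped use (ScratchBuffers and SCRATCH are hypothetical names, not Lettuce code):

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.Unpooled;

    final class ScratchBuffers {

        // Hypothetical per-thread scratch buffer; on-heap, grows as needed.
        private static final ThreadLocal<ByteBuf> SCRATCH =
                ThreadLocal.withInitial(() -> Unpooled.buffer(256));

        static ByteBuf scratch() {
            ByteBuf buf = SCRATCH.get();
            buf.clear(); // reset reader/writer indices before each encode
            return buf;
        }
    }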

Teachability, Documentation, Adoption, Migration Strategy

NA

@shikharid (Contributor, Author) commented Jan 24, 2024

Also noticed this:

[Profiler screenshot: allocation profile dominated by RedisStateMachine.State]

That is, a good chunk of the memory allocated in the Lettuce threads came from creating RedisStateMachine.State objects.
Since it is a very simple class and its entire usage is inside RedisStateMachine, why not do what Netty does with such hot objects and cache them in a pool?
We can use Netty's ObjectPool (used extensively across Netty).

Seems pretty simple:

  • Take an instance from the pool wherever we currently do new State().
  • Reset its state and release it back when it is no longer needed (reset() and remove()).

If we somehow miss releasing an instance to the Netty ObjectPool (i.e. reset is not called before the RedisStateMachine object is garbage collected), it won't cause a memory leak, as the pool itself keeps no reference (see io.netty.util.Recycler). A sketch of the idea follows below.
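
A minimal sketch of the pooling idea using io.netty.util.Recycler, assuming State carries only a type and a count (PooledState and its fields are illustrative, not a faithful copy of RedisStateMachine.State):

    import io.netty.util.Recycler;

    final class PooledState {

        private static final Recycler<PooledState> RECYCLER = new Recycler<PooledState>() {
            @Override
            protected PooledState newObject(Handle<PooledState> handle) {
                return new PooledState(handle);
            }
        };

        private final Recycler.Handle<PooledState> handle;

        Object type; // stand-in for RedisStateMachine.State.Type
        int count;

        private PooledState(Recycler.Handle<PooledState> handle) {
            this.handle = handle;
        }

        // Take an instance from the pool instead of calling new State().
        static PooledState take() {
            return RECYCLER.get();
        }

        // Reset the fields and return the instance to the pool.
        void recycle() {
            type = null;
            count = 0;
            handle.recycle(this);
        }
    }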

Again, happy to contribute; I'll create a separate issue if this optimisation makes sense.

Setup Details

Node: m6gd.xlarge EC2 instances
Pod: JVM running with 4 cores/8 GB and 2 GB max direct memory (JDK 17)
Lettuce:

  • 8 connections used by 8 different threads (1 connection/thread)
  • auto-flush disabled; commands pipelined and flushed at fixed intervals/queue size/max command age (see the sketch below)
  • dedicated epoll event loop, with no other compute in its threads
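
For context, a minimal sketch of the pipelining setup described above, using Lettuce's async API (the endpoint, key/field contents, and the flush policy are placeholders):

    import io.lettuce.core.RedisClient;
    import io.lettuce.core.api.StatefulRedisConnection;
    import io.lettuce.core.api.async.RedisAsyncCommands;
    import io.lettuce.core.codec.ByteArrayCodec;

    public class PipelinedHmget {

        public static void main(String[] args) {
            RedisClient client = RedisClient.create("redis://localhost"); // placeholder endpoint
            StatefulRedisConnection<byte[], byte[]> conn = client.connect(ByteArrayCodec.INSTANCE);
            conn.setAutoFlushCommands(false); // disable auto-flush, batch manually

            RedisAsyncCommands<byte[], byte[]> async = conn.async();
            byte[] key = "hash".getBytes();
            byte[][] fields = { "f1".getBytes(), "f2".getBytes() }; // real workload: hundreds of fields
            async.hmget(key, fields); // queued, not yet written to the socket

            // In the real setup this runs at a fixed interval / queue size / max command age.
            conn.flushCommands();

            conn.close();
            client.shutdown();
        }
    }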

@shikharid (Contributor, Author) commented:

I have done a crude, untested version of the changes needed; if the project owners agree, I will follow up with more polished versions in separate PRs:

  1. Zero-copy encoding/decoding of keys/values
  2. RedisStateMachine.State object pool

@mp911de (Collaborator) commented Jan 25, 2024

Thanks for looking into this. We follow this approach to ensure protocol compliance: we create the encoded representation first and then write the correct byte size to the argument buffer. estimateSize is implemented on a best-effort basis. For the byte-array codec, estimateSize is precise. For all other implementations it is a guesstimate, especially if you look at the StringCodec implementation with averageBytesPerChar * value.length().

It could make sense to introduce the optimization by having ToByteBufEncoder report whether the estimate is the exact size (boolean isEstimateExact() or the like) and then, following your proposal, write onto the output buffer directly.

Regarding the State stack in RedisStateMachine, it could be worth considering making the array allocation-free by preinitializing it with 32 instances of State. Instead of pooling instances per thread, a release would just clear type and count; see the sketch below.
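
A minimal sketch of that preinitialized array, again under the assumption that State carries only a type and a count (names are illustrative, not Lettuce's actual fields):

    // Illustrative only: a fixed-depth stack of reusable State instances.
    final class PreallocatedStateStack {

        private static final int MAX_DEPTH = 32; // estimated maximum nesting depth of Redis replies

        private final State[] states = new State[MAX_DEPTH];
        private int size;

        PreallocatedStateStack() {
            for (int i = 0; i < MAX_DEPTH; i++) {
                states[i] = new State(); // allocated once, up front
            }
        }

        State push() {
            return states[size++]; // reuse the preallocated instance
        }

        void pop() {
            State top = states[--size];
            top.type = null; // "releasing" just clears the fields
            top.count = 0;
        }

        static final class State {
            Object type; // stand-in for RedisStateMachine.State.Type
            int count;
        }
    }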

In any case, before we proceed with code changes, it would be good to have some measurements before and after such changes to learn how much of an impact such a change provides.

mp911de added the type: feature (A new feature) and status: waiting-for-feedback (We need additional information before we can continue) labels on Jan 25, 2024
@shikharid (Contributor, Author) commented:

Thanks, the preallocated State stack makes good sense.
I'll write some JMH benchmarks, make the changes, and share a PR soon.
Should I combine the changes in one PR or create two separate ones?

@mp911de (Collaborator) commented Jan 25, 2024

Let's go with a single pull request that has two commits for easier reviewing.

@shikharid (Contributor, Author) commented:

Just curious though: where does the number 32 come from, i.e. why won't we ever need more than that?
I couldn't figure it out from the commit messages or from skimming RedisStateMachine.

@mp911de (Collaborator) commented Jan 25, 2024

32 is an estimate of the depth of Redis responses (array-in-array-in-object-…). Most key/value commands have single-level depth, while some extended Stream commands (such as XPENDING) have a nesting level of 6. I think even 16 should be fine, but 32 puts us more on the safe side.

@shikharid (Contributor, Author) commented:

Got it, thanks.

@shikharid (Contributor, Author) commented:

Hey, I have been caught up with other work for the last two weeks. I will get to this during the coming week and send a PR around the end of it.

shikharid added a commit to shikharid/lettuce-core that referenced this issue Feb 23, 2024
…te object allocs redis#2610

* adds gc and thrpt profiling in RedisStateMachine benchmark
* fixes a stale benchmark (ClusterDistributionChannelWriterBenchmark) that caused compilation errors
shikharid added a commit to shikharid/lettuce-core that referenced this issue Feb 23, 2024
…redis#2610

* adds benchmarks to show perf gains
* about 10x improvement in perf, with no added gc overhead
@shikharid (Contributor, Author) commented:

@mp911de I've added the changes and benchmarks in #2768.

Good gains per the benchmarks; have a look.

mp911de added this to the 6.3.2.RELEASE milestone on Feb 26, 2024
mp911de pushed a commit that referenced this issue Feb 26, 2024
…tate object allocs #2610

* adds gc and thrpt profiling in RedisStateMachine benchmark
* fixes a stale benchmark (ClusterDistributionChannelWriterBenchmark) that caused compilation errors

Original pull request: #2768
mp911de pushed a commit that referenced this issue Feb 26, 2024
…zes #2610

* adds benchmarks to show perf gains
* about 10x improvement in perf, with no added gc overhead

Original pull request: #2768
mp911de added a commit that referenced this issue Feb 26, 2024
Reduce code duplications. Add exact optimization to ASCII StringCodec. Tweak Javadoc.

Original pull request: #2768
mp911de removed the status: waiting-for-feedback label on Feb 26, 2024
@mp911de (Collaborator) commented Feb 26, 2024

Thanks a lot, this was a decent improvement.

mp911de closed this as completed on Feb 26, 2024
@shikharid (Contributor, Author) commented:

Thanks! Any chance of a release soon that includes these changes?

@mp911de (Collaborator) commented Mar 1, 2024

6.3.2.RELEASE is scheduled for March 15.
