
PubSubEndpoint.channels and patterns contain duplicate binary channel/pattern names #911

Closed
lwiddershoven opened this issue Nov 2, 2018 · 15 comments
Labels
type: bug A general bug

Comments

@lwiddershoven

Bug Report

When using spring-boot 2 (we've seen it in 2.0.6 and 2.1.0) with spring-session backed by spring-data-redis backed by lettuce, we see application crashes due to out-of-memory problems. These issues seem to be triggered by reconnection attempts.

Memory profiling does show a huge increase in old-gen space usage in a frequent reconnection scenario.

Current Behavior

We use spring-session with Redis through Lettuce, and our applications die about once a day due to memory issues. They did not when they were still Spring Boot 1 / Jedis based.

We tracked the issue down to Lettuce, and there is a very likely link to the reconnects our HAProxy forces when a connection is unused for a while. It does not happen on every reconnect, though; it seems to occur only after some number of reconnects.

We have been able to reproduce it with a stock-standard Spring Boot application and a Docker-based HAProxy/Redis configuration.

Input Code

https://github.com/lwiddershoven/lettuce-memory-issue

It contains the application, the docker-compose file for HAProxy/Redis, some logging from the error, screenshots from the YourKit profiles, and the pom for a Jedis-based application to compare behaviour.

Expected behavior/code

I expect that memory usage, after gc, is the same before and after a reconnection cycle.

Environment

  • Lettuce version(s): io.lettuce:lettuce-core:jar:5.1.2.RELEASE
  • Redis version: 3

Possible Solution

n/a

Additional context

It seems that the PubSubEndpoint disconnectedBuffer is pretty large, and there appear to be a lot of AsyncCommands in memory for a system that is, from an end-user perspective, doing nothing.

@lwiddershoven
Author

When the Reliability in DefaultEndpoint is set to Reliability.AT_MOST_ONCE, this issue does not occur, which points to something in the lifecycle of the RetryListener.

@mp911de
Collaborator

mp911de commented Nov 6, 2018

Thanks a lot for the report.

The issue can be related to overall disconnects and Lettuce's retry queue. Commands that are issued while Redis is not available are buffered. The request queue is unbounded by default, which can consume a lot of memory given sufficient load and Redis downtime. You can do two things:

  1. Set ClientOptions.disconnectedBehavior to REJECT_COMMANDS: Commands fail immediately if Redis is not available.
  2. Limit ClientOptions.requestQueueSize: A reasonable limit allows some fault tolerance while Redis is down and still protects your application (see the sketch below).

Care to give it a try with the most applicable setting?
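For reference, a minimal sketch of configuring both options on a plain Lettuce client (the values are illustrative, not recommendations):

import io.lettuce.core.ClientOptions;
import io.lettuce.core.RedisClient;

public class ClientOptionsExample {

    public static void main(String[] args) {
        ClientOptions options = ClientOptions.builder()
                // (1) fail fast instead of buffering commands while disconnected
                .disconnectedBehavior(ClientOptions.DisconnectedBehavior.REJECT_COMMANDS)
                // (2) cap the number of commands that may be queued (illustrative value)
                .requestQueueSize(1000)
                .build();

        RedisClient client = RedisClient.create("redis://localhost:6379");
        client.setOptions(options);
    }
}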

@mp911de mp911de added status: waiting-for-feedback We need additional information before we can continue status: waiting-for-triage labels Nov 6, 2018
@lwiddershoven
Author

ClientOptions.requestQueueSize

This does not work. I have verified that the request queue size actually is 10, and it even shows up in an error message after the OutOfMemoryError.
It may also be that 10 is still too much, but I interpret this setting as the number of Redis commands on the backlog, and 10 seems small to me in that context.

Test setup:

mvn spring-boot:run -Dspring-boot.run.jvmArguments="-Xmx512m"

And results in:

java.lang.OutOfMemoryError: GC overhead limit exceeded 

preceded by

io.lettuce.core.RedisException: Request queue size exceeded: 10. Commands are not accepted until the queue size drops.

Some thoughts

What I find interesting (but may be completely expected) is that after every reconnect an extra SUBSCRIBE and PSUBSCRIBE command seems to be executed. I would think a single subscription should be sufficient (after a reconnect).

Over time, I see that the following 2 log lines occur more often with every reconnect:

2018-11-07 11:16:12.712 DEBUG 43460 --- [ioEventLoop-4-6] i.lettuce.core.protocol.DefaultEndpoint  : [channel=0xb0d94030, /127.0.0.1:55590 -> localhost/127.0.0.1:6379, epid=0x2] write() channelWrite command AsyncCommand [type=SUBSCRIBE, output=PubSubOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]

2018-11-07 11:16:14.283 DEBUG 43460 --- [ioEventLoop-4-6] i.lettuce.core.protocol.DefaultEndpoint  : [channel=0xb0d94030, /127.0.0.1:55590 -> localhost/127.0.0.1:6379, epid=0x2] writeToDisconnectedBuffer() buffering (disconnected) command AsyncCommand [type=SUBSCRIBE, output=PubSubOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]

What is interesting is that they are all on the same channel:

2018-11-07 11:16:19.747  INFO 43460 --- [ioEventLoop-4-8] i.l.core.protocol.ReconnectionHandler    : Reconnected to localhost:6379
2018-11-07 11:16:19.770 DEBUG 43460 --- [ioEventLoop-4-8] i.lettuce.core.protocol.DefaultEndpoint  : [channel=0x6b001372, /127.0.0.1:55592 -> localhost/127.0.0.1:6379, epid=0x2] writeToDisconnectedBuffer() buffering (disconnected) command AsyncCommand [type=SUBSCRIBE, output=PubSubOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]
2018-11-07 11:16:19.770 DEBUG 43460 --- [ioEventLoop-4-8] i.lettuce.core.protocol.DefaultEndpoint  : [channel=0x6b001372, /127.0.0.1:55592 -> localhost/127.0.0.1:6379, epid=0x2] write() done
2018-11-07 11:16:19.770 DEBUG 43460 --- [ioEventLoop-4-8] i.lettuce.core.protocol.DefaultEndpoint  : [channel=0x6b001372, /127.0.0.1:55592 -> localhost/127.0.0.1:6379, epid=0x2] writeToDisconnectedBuffer() buffering (disconnected) command AsyncCommand [type=PSUBSCRIBE, output=PubSubOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]
2018-11-07 11:16:19.770 DEBUG 43460 --- [ioEventLoop-4-8] i.lettuce.core.protocol.DefaultEndpoint  : [channel=0x6b001372, /127.0.0.1:55592 -> localhost/127.0.0.1:6379, epid=0x2] write() done
2018-11-07 11:16:19.770 DEBUG 43460 --- [ioEventLoop-4-8] i.lettuce.core.protocol.DefaultEndpoint  : [channel=0x6b001372, /127.0.0.1:55592 -> localhost/127.0.0.1:6379, epid=0x2] writeToDisconnectedBuffer() buffering (disconnected) command AsyncCommand [type=SUBSCRIBE, output=PubSubOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]
2018-11-07 11:16:19.770 DEBUG 43460 --- [ioEventLoop-4-8] i.lettuce.core.protocol.DefaultEndpoint  : [channel=0x6b001372, /127.0.0.1:55592 -> localhost/127.0.0.1:6379, epid=0x2] write() done
2018-11-07 11:16:19.770 DEBUG 43460 --- [ioEventLoop-4-8] i.lettuce.core.protocol.DefaultEndpoint  : [channel=0x6b001372, /127.0.0.1:55592 -> localhost/127.0.0.1:6379, epid=0x2] writeToDisconnectedBuffer() buffering (disconnected) command AsyncCommand [type=PSUBSCRIBE, output=PubSubOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]
2018-11-07 11:16:19.770 DEBUG 43460 --- [ioEventLoop-4-8] i.lettuce.core.protocol.DefaultEndpoint  : [channel=0x6b001372, /127.0.0.1:55592 -> localhost/127.0.0.1:6379, epid=0x2] write() done
2018-11-07 11:16:19.770 DEBUG 43460 --- [ioEventLoop-4-8] i.lettuce.core.protocol.DefaultEndpoint  : [channel=0x6b001372, /127.0.0.1:55592 -> localhost/127.0.0.1:6379, epid=0x2] writeToDisconnectedBuffer() buffering (disconnected) command AsyncCommand [type=SUBSCRIBE, output=PubSubOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]

Another fun fact: it seems that old-gen heap space increases in steps, exponentially. Every step is significantly larger than the previous one:

  • 27 MB -> 49 MB
  • 49 MB -> 123 MB
  • 123 MB -> 341 MB

ClientOptions.disconnectedBehavior

This is not an option that is appropriate for me, but I have tested it.

It gives a very irregular memory profile, and it generates a lot of objects on old-gen heap space, but it does not seem to crash the JVM with an OOM.
As expected, every reconnect issues a single SUBSCRIBE and PSUBSCRIBE.

While this does not seem to generate an OOM in an unused application, I do find it strange that the amount of old-gen heap space used increases so dramatically.

Code

ClientOptions were configured using the spring-data-redis facilities:

import io.lettuce.core.ClientOptions;
import org.springframework.boot.autoconfigure.data.redis.LettuceClientConfigurationBuilderCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class RedisConfig {

    @Bean
    public LettuceClientConfigurationBuilderCustomizer customizer() {
        // Reject commands while disconnected instead of buffering them.
        return clientConfigurationBuilder -> clientConfigurationBuilder.clientOptions(
                ClientOptions.builder()
                        .disconnectedBehavior(ClientOptions.DisconnectedBehavior.REJECT_COMMANDS)
                        .build());
    }
}
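For the requestQueueSize test described earlier, the limit can presumably be set through the same kind of customizer; a sketch (not the exact code used in that test):

@Bean
public LettuceClientConfigurationBuilderCustomizer queueLimitCustomizer() {
    // Hypothetical variant of the customizer above: only cap the request queue,
    // using the limit of 10 from the earlier experiment.
    return builder -> builder.clientOptions(
            ClientOptions.builder()
                    .requestQueueSize(10)
                    .build());
}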

@mp911de
Collaborator

mp911de commented Nov 7, 2018

Regarding resubscription: Lettuce re-subscribes to patterns and channels for which there have been previous registrations, in order to restore the subscription state.

Thanks a lot for testing the options. I'm puzzled that disconnectedBehavior increases old-gen usage. Rejected commands result in short-lived command objects that are immediately discarded when using the synchronous API.

Looking at the profiler graph, there's an excessive number of KeyArguments while the number of commands is reasonably small. Can you inspect the commands to see why there are so many keys?

@lwiddershoven
Author

Checked with a queue size of 3. That stabilizes at 240 MB of old-gen used. No crash.

When profiling the memory dump I do see that a significant amount of memory is held by the lettuce Command.

When checking the paths from GC roots, I find that there are 5 distinct trees: 3 with type = SUBSCRIBE and 2 with type = PSUBSCRIBE.

Command type=SUBSCRIBE

The data in the Command is retained mostly in args.singularArguments, an ArrayList with 1.3 million entries and capacity for 1.8 million.
That list contains KeyArguments, each of which is small and has '__keyevent@1__' as the key. I did not check all the KeyArguments though :)
The KeyArguments also hold a ByteArrayCodec.

This is the case for the Command entry points that are strongly reachable, suggesting to me that these are actually the same.

Calculating the paths for these 3 entry points:

  • command = AsyncCommand.command = ArrayList.elementData = stack local var in FastThreadLocalThread (lettuce eventloop thread)
  • command = AsyncCommand.command = RetryListener.sentCommand = DefaultFutureListeners.listeners = DefaultChannelPromise.listeners = ChannelOutboundBuffer.promise = ChannelOutboundEntry.flushedEntry = NioSocketChannel.outboundBuffer = stack local var in FastThreadLocalThread (lettuce eventloop thread)
  • command = stack local var in FastThreadLocalThread (lettuce eventloop thread)

It does look like there are, directly or indirectly, multiple paths to the retained commands.

Command type=PSUBSCRIBE

This is very similar, except that the key is 'spring:session:'

  • command = AsyncCommand.command = RetryListener.sentCommand = DefaultFutureListeners.listeners = DefaultChannelPromise.listeners = ChannelOutboundBuffer.promise = ChannelOutboundEntry.flushedEntry = NioSocketChannel.outboundBuffer = stack local var in FastThreadLocalThread (lettuce eventloop thread)
  • command = AsyncCommand.command = ArrayList.elementData = stack local var in FastThreadLocalThread (lettuce eventloop thread)

I have no clue why I don't see a stack local entry for this one, as the data is just as stack local as the SUBSCRIBE.

@lwiddershoven
Author

Maybe the short version: There appear to be over a million rows in CommandArgs.singularArguments.

@mp911de
Collaborator

mp911de commented Nov 7, 2018

How many subscriptions are there? It seems to me that you might be subscribed to over a million channels/patterns. StatefulRedisPubSubConnectionImpl.resubscribe() is the place to set a breakpoint.

@lwiddershoven
Author

lwiddershoven commented Nov 7, 2018

With a connection reset every 15 seconds, after a minute or two there are:

  • endpoint.channels.size: 16
  • endpoint.patterns.size: 8

And a few minutes later:

  • endpoint.channels.size: 512
  • endpoint.patterns.size: 192

and after 7 minutes we're at 300,000 channels.

@mp911de
Collaborator

mp911de commented Nov 7, 2018

Okay, so I think I found what's happening here. PubSubEndpoint uses a Set of channel and pattern names to keep a unique list of names. Monitoring Redis with redis-cli reveals that the following command is sent:

541597058.630707 [0 127.0.0.1:62571] "SUBSCRIBE" "__keyevent@1__:expired" "__keyevent@1__:del" "__keyevent@1__:expired" "__keyevent@1__:del" "__keyevent@1__:expired" "__keyevent@1__:expired" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:expired" "__keyevent@1__:del" "__keyevent@1__:expired" "__keyevent@1__:expired" "__keyevent@1__:expired" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:expired" "__keyevent@1__:expired" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:expired" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:expired" "__keyevent@1__:expired" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:expired" "__keyevent@1__:del" "__keyevent@1__:expired" "__keyevent@1__:del" "__keyevent@1__:expired" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:expired" "__keyevent@1__:expired" "__keyevent@1__:expired" "__keyevent@1__:del"

As you can see, __keyevent@1__:del repeats all over. This is because Spring Data Redis uses a binary representation of channel names (byte[]), and comparing arrays via equals() and hashCode() does not compare their content. So patterns and channels pile up over time.

I'm not exactly sure who is really causing this problem; however, we should place a fix in Lettuce to properly compare channel and pattern names.
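A standalone illustration of the underlying Java behaviour (plain JDK, not Lettuce code): two byte[] instances with the same content are not deduplicated by a HashSet, because arrays inherit identity-based equals()/hashCode().

import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

public class ByteArraySetDemo {

    public static void main(String[] args) {
        Set<byte[]> channels = new HashSet<>();
        channels.add("__keyevent@1__:del".getBytes(StandardCharsets.UTF_8));
        // Same content, different array instance:
        channels.add("__keyevent@1__:del".getBytes(StandardCharsets.UTF_8));

        // Prints 2: the Set cannot deduplicate byte[] by content.
        System.out.println(channels.size());
    }
}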

@lwiddershoven
Author

Awesome. If you want me to test/verify a fix I'd be happy to.

@mp911de
Collaborator

mp911de commented Nov 7, 2018

Thanks a lot for running these kinds of tests. These issues typically go unnoticed.

mp911de added commits that referenced this issue Nov 7, 2018
Lettuce now stores the pattern and channel names to which a connection is subscribed with a comparison wrapper. This allows computing hashCode and checking equality for built-in channel name types that do not support equals/hashCode for their actual content, in particular byte arrays.

Using byte[] for channel names prevented a proper equality check of the binary content and caused duplicates in the channel list. With every subscription, channel names were added in a quadratic amount at excessive memory cost.
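A minimal sketch of the comparison-wrapper idea (a hypothetical class, not the actual Lettuce implementation): equality and hash code are derived from the byte content rather than from array identity, so a Set of wrappers deduplicates by content.

import java.util.Arrays;

// Hypothetical comparison wrapper for binary channel/pattern names.
final class ByteArrayWrapper {

    private final byte[] name;

    ByteArrayWrapper(byte[] name) {
        this.name = name;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof ByteArrayWrapper)) {
            return false;
        }
        return Arrays.equals(name, ((ByteArrayWrapper) o).name);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(name);
    }
}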
@mp911de mp911de added type: bug A general bug and removed status: waiting-for-feedback We need additional information before we can continue status: waiting-for-triage labels Nov 7, 2018
@mp911de mp911de changed the title Possible memory issue with reconnects PubSubEndpoint.channels and patterns contain duplicate binary channel/pattern names Nov 7, 2018
@mp911de
Collaborator

mp911de commented Nov 7, 2018

I pushed a change and a new build is available (5.1.3.BUILD-SNAPSHOT) from https://oss.sonatype.org/content/repositories/snapshots/. Care to upgrade and give the fix a try?

@lwiddershoven
Author

Works like a charm. Thanks!

@mp911de
Collaborator

mp911de commented Nov 7, 2018

Thanks for verifying the fix.

@mp911de mp911de closed this as completed Nov 7, 2018
@mp911de mp911de added this to the 5.1.3 milestone Nov 7, 2018
@MonkeySuncen

Hi, I ran into this problem too. We use spring-boot-starter-data-redis 2.4.1 to connect to Redis 5.0.7.

My behavior:
1. Start my code; it runs fine.
2. Shut down a Redis node (Redis is in cluster mode).
3. Check the log and find a great many reconnect attempts to the Redis node.
4. Two hours later, take a memory dump.

Dump message:

468 instances of "reactor.core.publisher.FluxFlatMap$FlatMapMain", loaded by "sun.misc.Launcher$AppClassLoader @ 0x6f1b4b0f8" occupy 522,742,480 (82.04%) bytes. These instances are referenced from one instance of "java.util.concurrent.ConcurrentLinkedDeque", loaded by ""

This leads to old-gen GC pressure and an OutOfMemoryError.

I found that this causes many reconnect commands to be added to the disconnectedBuffer without restriction.

I'm not sure whether this is a configuration mistake on my side or a Lettuce issue.
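Not an authoritative answer, but the options discussed earlier in this thread may also apply here, assuming ClusterClientOptions (which extends ClientOptions) exposes the same settings; a sketch with illustrative values:

import io.lettuce.core.ClientOptions;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.RedisClusterClient;

public class ClusterOptionsSketch {

    public static void main(String[] args) {
        // Bound command buffering during node outages (illustrative values).
        ClusterClientOptions options = ClusterClientOptions.builder()
                .disconnectedBehavior(ClientOptions.DisconnectedBehavior.REJECT_COMMANDS)
                .requestQueueSize(1000)
                .build();

        RedisClusterClient client = RedisClusterClient.create("redis://localhost:7000");
        client.setOptions(options);
    }
}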
