
PubSubEndpoint.channels and patterns contain duplicate binary channel/pattern names #911

Closed
lwiddershoven opened this issue Nov 2, 2018 · 15 comments
Labels
type: bug A general bug

Comments

@lwiddershoven

Bug Report

When using spring-boot 2 (we've seen it in 2.0.6 and 2.1.0) with spring-session backed by spring-data-redis backed by lettuce, we see application crashes due to out-of-memory problems. These issues seem to be triggered by reconnection attempts.

Memory profiling does show a huge increase in old-gen space usage in a frequent reconnection scenario.

Current Behavior

We use spring-session with Redis through Lettuce, and our applications die about once a day due to memory issues. They did not when they were still Spring Boot 1 / Jedis based.

We tracked the issue down to Lettuce, and there is a very likely link to the reconnects our HAProxy forces when a connection is unused for a while. It does not happen on every reconnect, though; it seems to occur only after some number of reconnects.

We have been able to reproduce it with a stock-standard Spring Boot application and a Docker-based HAProxy/Redis configuration.

Input Code

https://github.com/lwiddershoven/lettuce-memory-issue

It contains the application, the docker-compose file for HAProxy/Redis, some logging from the error, screenshots from the YourKit profiles, and the pom for a Jedis-based application to compare behaviour.

Expected behavior/code

I expect that memory usage, after gc, is the same before and after a reconnection cycle.

Environment

  • Lettuce version(s): io.lettuce:lettuce-core:jar:5.1.2.RELEASE
  • Redis version: 3

Possible Solution

n/a

Additional context

It seems that the PubSubEndpoint disconnectedBuffer is pretty large, and there appear to be a lot of AsyncCommands in memory for a system that is, from an end-user perspective, doing nothing.

@lwiddershoven
Author

When the Reliability in DefaultEndpoint is set to Reliability.AT_MOST_ONCE, this issue does not occur, which points to something in the lifecycle of the RetryListener.

@mp911de
Collaborator

mp911de commented Nov 6, 2018

Thanks a lot for the report.

The issue can be related to overall disconnects and Lettuce's retry queue. Commands that are issued while Redis is not available are buffered. The request queue is unbounded by default, which can consume a lot of memory given sufficient load and Redis downtime. You can do two things:

  1. Set ClientOptions.disconnectedBehavior to REJECT_COMMANDS: Commands fail immediately if Redis is not available.
  2. Limit ClientOptions.requestQueueSize: A reasonable limit allows some fault tolerance while Redis is down and still protects your application (see the sketch below).

Care to give it a try with the most applicable setting?
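For reference, a minimal sketch of configuring both options on a plain Lettuce client (the values are illustrative, not recommendations):

import io.lettuce.core.ClientOptions;
import io.lettuce.core.RedisClient;

public class ClientOptionsExample {

    public static void main(String[] args) {
        ClientOptions options = ClientOptions.builder()
                // (1) fail fast instead of buffering commands while disconnected
                .disconnectedBehavior(ClientOptions.DisconnectedBehavior.REJECT_COMMANDS)
                // (2) cap the number of commands that may be queued (illustrative value)
                .requestQueueSize(1000)
                .build();

        RedisClient client = RedisClient.create("redis://localhost:6379");
        client.setOptions(options);
    }
}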

@mp911de mp911de added status: waiting-for-feedback We need additional information before we can continue status: waiting-for-triage labels Nov 6, 2018
@lwiddershoven
Author

ClientOptions.requestQueueSize

This does not work. I have verified that the request queue size actually is 10, and it even shows up in an error message after the OutOfMemoryError.
It may also be that 10 is still too much, but I interpret this setting as the number of Redis commands on the backlog, and 10 seems small to me in that context.

Test setup:

mvn spring-boot:run -Dspring-boot.run.jvmArguments="-Xmx512m"

And results in:

java.lang.OutOfMemoryError: GC overhead limit exceeded 

preceded by

io.lettuce.core.RedisException: Request queue size exceeded: 10. Commands are not accepted until the queue size drops.

Some thoughts

What I find interesting (but may be completely expected) is that after every reconnect an extra SUBSCRIBE and PSUBSCRIBE command seems to be executed. I would think a single subscription should be sufficient (after a reconnect).

Over time, I see that the following 2 log lines occur more often with every reconnect:

2018-11-07 11:16:12.712 DEBUG 43460 --- [ioEventLoop-4-6] i.lettuce.core.protocol.DefaultEndpoint  : [channel=0xb0d94030, /127.0.0.1:55590 -> localhost/127.0.0.1:6379, epid=0x2] write() channelWrite command AsyncCommand [type=SUBSCRIBE, output=PubSubOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]

2018-11-07 11:16:14.283 DEBUG 43460 --- [ioEventLoop-4-6] i.lettuce.core.protocol.DefaultEndpoint  : [channel=0xb0d94030, /127.0.0.1:55590 -> localhost/127.0.0.1:6379, epid=0x2] writeToDisconnectedBuffer() buffering (disconnected) command AsyncCommand [type=SUBSCRIBE, output=PubSubOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]

What is interesting is that they are all on the same channel:

2018-11-07 11:16:19.747  INFO 43460 --- [ioEventLoop-4-8] i.l.core.protocol.ReconnectionHandler    : Reconnected to localhost:6379
2018-11-07 11:16:19.770 DEBUG 43460 --- [ioEventLoop-4-8] i.lettuce.core.protocol.DefaultEndpoint  : [channel=0x6b001372, /127.0.0.1:55592 -> localhost/127.0.0.1:6379, epid=0x2] writeToDisconnectedBuffer() buffering (disconnected) command AsyncCommand [type=SUBSCRIBE, output=PubSubOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]
2018-11-07 11:16:19.770 DEBUG 43460 --- [ioEventLoop-4-8] i.lettuce.core.protocol.DefaultEndpoint  : [channel=0x6b001372, /127.0.0.1:55592 -> localhost/127.0.0.1:6379, epid=0x2] write() done
2018-11-07 11:16:19.770 DEBUG 43460 --- [ioEventLoop-4-8] i.lettuce.core.protocol.DefaultEndpoint  : [channel=0x6b001372, /127.0.0.1:55592 -> localhost/127.0.0.1:6379, epid=0x2] writeToDisconnectedBuffer() buffering (disconnected) command AsyncCommand [type=PSUBSCRIBE, output=PubSubOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]
2018-11-07 11:16:19.770 DEBUG 43460 --- [ioEventLoop-4-8] i.lettuce.core.protocol.DefaultEndpoint  : [channel=0x6b001372, /127.0.0.1:55592 -> localhost/127.0.0.1:6379, epid=0x2] write() done
2018-11-07 11:16:19.770 DEBUG 43460 --- [ioEventLoop-4-8] i.lettuce.core.protocol.DefaultEndpoint  : [channel=0x6b001372, /127.0.0.1:55592 -> localhost/127.0.0.1:6379, epid=0x2] writeToDisconnectedBuffer() buffering (disconnected) command AsyncCommand [type=SUBSCRIBE, output=PubSubOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]
2018-11-07 11:16:19.770 DEBUG 43460 --- [ioEventLoop-4-8] i.lettuce.core.protocol.DefaultEndpoint  : [channel=0x6b001372, /127.0.0.1:55592 -> localhost/127.0.0.1:6379, epid=0x2] write() done
2018-11-07 11:16:19.770 DEBUG 43460 --- [ioEventLoop-4-8] i.lettuce.core.protocol.DefaultEndpoint  : [channel=0x6b001372, /127.0.0.1:55592 -> localhost/127.0.0.1:6379, epid=0x2] writeToDisconnectedBuffer() buffering (disconnected) command AsyncCommand [type=PSUBSCRIBE, output=PubSubOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]
2018-11-07 11:16:19.770 DEBUG 43460 --- [ioEventLoop-4-8] i.lettuce.core.protocol.DefaultEndpoint  : [channel=0x6b001372, /127.0.0.1:55592 -> localhost/127.0.0.1:6379, epid=0x2] write() done
2018-11-07 11:16:19.770 DEBUG 43460 --- [ioEventLoop-4-8] i.lettuce.core.protocol.DefaultEndpoint  : [channel=0x6b001372, /127.0.0.1:55592 -> localhost/127.0.0.1:6379, epid=0x2] writeToDisconnectedBuffer() buffering (disconnected) command AsyncCommand [type=SUBSCRIBE, output=PubSubOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]

Another fun fact: it seems that old-gen heap space increases in steps, exponentially. Every step is significantly larger than the previous one:

  • 27 MB -> 49 MB
  • 49 MB -> 123 MB
  • 123 MB -> 341 MB

ClientOptions.disconnectedBehavior

This is not an option that is appropriate for me, but I have tested it.

It gives a very irregular memory profile, and it generates a lot of objects on old-gen heap space, but it does not seem to crash the JVM with an OOM.
As expected, every reconnect issues a single SUBSCRIBE and PSUBSCRIBE.

While this does not seem to generate an OOM in an unused application, I do find it strange that the amount of old-gen heap space used increases so dramatically.

Code

ClientOptions were configured using the spring-data-redis facilities:

import io.lettuce.core.ClientOptions;
import org.springframework.boot.autoconfigure.data.redis.LettuceClientConfigurationBuilderCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class RedisConfig {

    @Bean
    public LettuceClientConfigurationBuilderCustomizer customizer() {
        // Reject commands while disconnected instead of buffering them.
        return clientConfigurationBuilder -> clientConfigurationBuilder.clientOptions(
                ClientOptions.builder()
                        .disconnectedBehavior(ClientOptions.DisconnectedBehavior.REJECT_COMMANDS)
                        .build());
    }
}
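For the requestQueueSize test described earlier, the limit can presumably be set through the same kind of customizer; a sketch (not the exact code used in that test):

@Bean
public LettuceClientConfigurationBuilderCustomizer queueLimitCustomizer() {
    // Hypothetical variant of the customizer above: only cap the request queue,
    // using the limit of 10 from the earlier experiment.
    return builder -> builder.clientOptions(
            ClientOptions.builder()
                    .requestQueueSize(10)
                    .build());
}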

@mp911de
Collaborator

mp911de commented Nov 7, 2018

Regarding resubscription: Lettuce re-subscribes to patterns and channels for which there have been previous registrations, in order to restore the subscription state.

Thanks a lot for testing the options. I'm puzzled that disconnectedBehavior increases old-gen usage. Rejected commands result in short-lived command objects that are immediately discarded when using the synchronous API.

Looking at the profiler graph, there's an excessive number of KeyArguments while the number of commands is reasonably small. Can you inspect the commands to see why there are so many keys?

@lwiddershoven
Author

Checked with a queue size of 3. That stabilizes at 240 MB of old-gen used. No crash.

When profiling the memory dump I do see that a significant amount of memory is held by the lettuce Command.

When checking the paths from GC roots, I find that there are 5 distinct trees: 3 with type = SUBSCRIBE and 2 with type = PSUBSCRIBE.

Command type=SUBSCRIBE

The data in the Command is retained mostly in args.singularArguments, an ArrayList with 1.3 million entries and capacity for 1.8 million.
That list contains KeyArguments, each of which is small and has '__keyevent@1__' as the key. I did not check all the KeyArguments though :)
The KeyArguments also hold a ByteArrayCodec.

This is the case for the Command entry points that are strongly reachable, suggesting to me that these are actually the same.

Calculating the paths for these 3 entry points:

  • command = AsyncCommand.command = ArrayList.elementData = stack local var in FastThreadLocalThread (lettuce eventloop thread)
  • command = AsyncCommand.command = RetryListener.sentCommand = DefaultFutureListeners.listeners = DefaultChannelPromise.listeners = ChannelOutboundBuffer.promise = ChannelOutboundEntry.flushedEntry = NioSocketChannel.outboundBuffer = stack local var in FastThreadLocalThread (lettuce eventloop thread)
  • command = stack local var in FastThreadLocalThread (lettuce eventloop thread)

It does look like there are, directly or indirectly, multiple paths to the retained commands.

Command type=PSUBSCRIBE

This is very similar, except that the key is 'spring:session:'

  • command = AsyncCommand.command = RetryListener.sentCommand = DefaultFutureListeners.listeners = DefaultChannelPromise.listeners = ChannelOutboundBuffer.promise = ChannelOutboundEntry.flushedEntry = NioSocketChannel.outboundBuffer = stack local var in FastThreadLocalThread (lettuce eventloop thread)
  • command = AsyncCommand.command = ArrayList.elementData = stack local var in FastThreadLocalThread (lettuce eventloop thread)

I have no clue why I don't see a stack local entry for this one, as the data is just as stack local as the SUBSCRIBE.

@lwiddershoven
Author

Maybe the short version: There appear to be over a million rows in CommandArgs.singularArguments.

@mp911de
Collaborator

mp911de commented Nov 7, 2018

How many subscriptions are there? It seems to me that you might be subscribed to over a million channels/patterns. StatefulRedisPubSubConnectionImpl.resubscribe() is the place to set a breakpoint.

@lwiddershoven
Author

lwiddershoven commented Nov 7, 2018

With a connection reset every 15 seconds, after a minute or two there are:

  • endpoint.channels.size: 16
  • endpoint.patterns.size: 8

And a few minutes later:

  • endpoint.channels.size: 512
  • endpoint.patterns.size: 192

and after 7 minutes we're at 300,000 channels.

@mp911de
Collaborator

mp911de commented Nov 7, 2018

Okay, so I think I found what's happening here. PubSubEndpoint uses a Set of channel and pattern names to keep a unique list of names. Monitoring Redis with redis-cli reveals that the following command is sent:

541597058.630707 [0 127.0.0.1:62571] "SUBSCRIBE" "__keyevent@1__:expired" "__keyevent@1__:del" "__keyevent@1__:expired" "__keyevent@1__:del" "__keyevent@1__:expired" "__keyevent@1__:expired" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:expired" "__keyevent@1__:del" "__keyevent@1__:expired" "__keyevent@1__:expired" "__keyevent@1__:expired" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:expired" "__keyevent@1__:expired" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:expired" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:expired" "__keyevent@1__:expired" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:expired" "__keyevent@1__:del" "__keyevent@1__:expired" "__keyevent@1__:del" "__keyevent@1__:expired" "__keyevent@1__:del" "__keyevent@1__:del" "__keyevent@1__:expired" "__keyevent@1__:expired" "__keyevent@1__:expired" "__keyevent@1__:del"

As you can see, __keyevent@1__:del repeats all over. This is because Spring Data Redis uses a binary representation of channel names (byte[]), and comparing arrays via equals() and hashCode() does not compare their content. So patterns and channels pile up over time.

I'm not exactly sure who is really causing this problem; however, we should place a fix in Lettuce to properly compare channel and pattern names.
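A standalone illustration of the underlying Java behaviour (plain JDK, not Lettuce code): two byte[] instances with the same content are not deduplicated by a HashSet, because arrays inherit identity-based equals()/hashCode().

import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

public class ByteArraySetDemo {

    public static void main(String[] args) {
        Set<byte[]> channels = new HashSet<>();
        channels.add("__keyevent@1__:del".getBytes(StandardCharsets.UTF_8));
        // Same content, different array instance:
        channels.add("__keyevent@1__:del".getBytes(StandardCharsets.UTF_8));

        // Prints 2: the Set cannot deduplicate byte[] by content.
        System.out.println(channels.size());
    }
}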

@lwiddershoven
Author

Awesome. If you want me to test/verify a fix I'd be happy to.

@mp911de
Collaborator

mp911de commented Nov 7, 2018

Thanks a lot for running these kinds of tests. These issues typically go unnoticed.

mp911de added commits that referenced this issue Nov 7, 2018
Lettuce now stores the pattern and channel names to which a connection is subscribed with a comparison wrapper. This allows computing hashCode and checking equality for built-in channel name types that do not support equals/hashCode for their actual content, in particular byte arrays.

Using byte[] for channel names prevented a proper equality check of the binary content and caused duplicates in the channel list. With every subscription, channel names were added in a quadratic amount at excessive memory cost.
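A minimal sketch of the comparison-wrapper idea (a hypothetical class, not the actual Lettuce implementation): equality and hash code are derived from the byte content rather than from array identity, so a Set of wrappers deduplicates by content.

import java.util.Arrays;

// Hypothetical comparison wrapper for binary channel/pattern names.
final class ByteArrayWrapper {

    private final byte[] name;

    ByteArrayWrapper(byte[] name) {
        this.name = name;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof ByteArrayWrapper)) {
            return false;
        }
        return Arrays.equals(name, ((ByteArrayWrapper) o).name);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(name);
    }
}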
@mp911de mp911de added type: bug A general bug and removed status: waiting-for-feedback We need additional information before we can continue status: waiting-for-triage labels Nov 7, 2018
@mp911de mp911de changed the title Possible memory issue with reconnects PubSubEndpoint.channels and patterns contain duplicate binary channel/pattern names Nov 7, 2018
@mp911de
Collaborator

mp911de commented Nov 7, 2018

I pushed a change and a new build is available (5.1.3.BUILD-SNAPSHOT) from https://oss.sonatype.org/content/repositories/snapshots/. Care to upgrade and give the fix a try?

@lwiddershoven
Author

Works like a charm. Thanks!

@mp911de
Collaborator

mp911de commented Nov 7, 2018

Thanks for verifying the fix.

@mp911de mp911de closed this as completed Nov 7, 2018
@mp911de mp911de added this to the 5.1.3 milestone Nov 7, 2018
@MonkeySuncen

Hi, I ran into this problem too. We use spring-boot-starter-data-redis 2.4.1 to connect to Redis 5.0.7.

My behavior:
1. Start my code; it runs fine.
2. Shut down a Redis node (Redis is in cluster mode).
3. Check the log and find a great many reconnect attempts to the Redis node.
4. Two hours later, take a memory dump.

Dump message:

468 instances of "reactor.core.publisher.FluxFlatMap$FlatMapMain", loaded by "sun.misc.Launcher$AppClassLoader @ 0x6f1b4b0f8" occupy 522,742,480 (82.04%) bytes. These instances are referenced from one instance of "java.util.concurrent.ConcurrentLinkedDeque", loaded by ""

This leads to old-gen GC pressure and an OutOfMemoryError.

I found that this causes many reconnect commands to be added to the disconnectedBuffer without restriction.

I'm not sure whether this is a configuration mistake on my side or a Lettuce issue.
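Not an authoritative answer, but the options discussed earlier in this thread may also apply here, assuming ClusterClientOptions (which extends ClientOptions) exposes the same settings; a sketch with illustrative values:

import io.lettuce.core.ClientOptions;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.RedisClusterClient;

public class ClusterOptionsSketch {

    public static void main(String[] args) {
        // Bound command buffering during node outages (illustrative values).
        ClusterClientOptions options = ClusterClientOptions.builder()
                .disconnectedBehavior(ClientOptions.DisconnectedBehavior.REJECT_COMMANDS)
                .requestQueueSize(1000)
                .build();

        RedisClusterClient client = RedisClusterClient.create("redis://localhost:7000");
        client.setOptions(options);
    }
}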
