StackOverflowError in RedisPublisher #1140
It looks like the recursive implementation, combined with the fact that data flows chunk by chunk, causes the stack overflow. We need to change the implementation to avoid recursion and favor a drain loop instead.
RedisSubscription.onDataAvailable(…) and RedisSubscription.read(…) now use a drain-loop instead of recursive reads for element emission. Recursive emission is error prone if the response contains many response elements. onDataAvailable(…) calls read(…) if there is demand and if data is available.
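For illustration, here is a minimal sketch of the drain-loop pattern described above. It is not the actual RedisSubscription code; the class and field names are invented.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: a single loop drains buffered elements while there is demand,
// instead of read() recursively calling itself once per element (which grows
// the call stack when a response contains many elements).
class DrainLoopSketch<T> {

    private final Queue<T> buffer = new ConcurrentLinkedQueue<>();
    private final AtomicLong demand = new AtomicLong();
    private final AtomicInteger wip = new AtomicInteger();

    void request(long n) {
        demand.addAndGet(n);
        drain();
    }

    void onDataAvailable(T element) {
        buffer.offer(element);
        drain();
    }

    private void drain() {
        if (wip.getAndIncrement() != 0) {
            return; // another thread is already running the drain loop
        }
        int missed = 1;
        do {
            T element;
            while (demand.get() > 0 && (element = buffer.poll()) != null) {
                demand.decrementAndGet();
                emit(element); // subscriber.onNext(...) in a real subscription
            }
            missed = wip.addAndGet(-missed);
        } while (missed != 0);
    }

    private void emit(T element) {
        System.out.println("emit: " + element);
    }
}
```

The stack depth stays constant no matter how many elements the response contains, which is the property the recursive variant lacked.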
I pushed a fix to address the issue. Care to upgrade to 5.2.1.BUILD-SNAPSHOT and verify the fix?
Thanks @mp911de. I'll get it out and let you know in a few days if we don't see any occurrences of the issue running the new version.
No, I wasn't able to reproduce a stack overflow. I was mostly concerned about not breaking existing usage patterns. It turned out that reproducing a race condition where processing gets stuck is incredibly hard, so I raised the number of iterations to increase the probability of hitting it.
onNext now uses a decrement instead of compareAndSet to avoid races with request(n) calls, ensuring that all elements get emitted. This allows simplifying readAndPublish() and eliminates another race where completion could be dropped. Completion now also goes through the READING guard instead of completing from any state, to protect against active drain loops.
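A rough sketch of the demand-accounting difference (invented names, not the actual RedisPublisher internals):

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: contrasts a CAS-based demand check with a plain decrement.
class DemandSketch {

    private final AtomicLong demand = new AtomicLong();

    void request(long n) {
        demand.addAndGet(n);
    }

    // CAS variant: if request(n) lands between get() and compareAndSet, the CAS
    // fails; without a careful retry loop the element the subscriber just asked
    // for may never be emitted.
    boolean tryConsumeWithCas() {
        long current = demand.get();
        return current > 0 && demand.compareAndSet(current, current - 1);
    }

    // Decrement variant: every successful decrement corresponds to exactly one
    // emitted element, regardless of concurrent request(n) calls.
    boolean tryConsumeWithDecrement() {
        if (demand.decrementAndGet() >= 0) {
            return true;
        }
        demand.incrementAndGet(); // undo the speculative decrement when there was no demand
        return false;
    }
}
```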
@csunwold
The exact behavior with the stack overflow has not occurred. But similar behavior, where the app randomly stops making successful requests to Redis, continues to happen. I haven't been able to find any logs that give more details.
On Mon, Apr 27, 2020, 5:29 PM Alex Lai wrote:
@csunwold After upgrading, did the same issue occur again? I encountered the same problem with version 5.2.0.
@csunwold
No. There was no log output.
On Mon, Apr 27, 2020, 7:56 PM Alex Lai wrote:
@csunwold Thanks for the information. When the behavior you mentioned occurred, was an error-level log emitted? Our monitoring system only alerts on error-level logs, so if the log level was warn we cannot take action immediately.
Bug Report
Occasionally our application is getting a StackOverflowError and then is unable to process any further commands. I've used the CommandLatencyEvent to validate that no commands are being processed.
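For anyone wanting to reproduce the monitoring setup: a minimal sketch of listening for CommandLatencyEvent on the event bus, assuming command latency publishing is enabled on the ClientResources and using a placeholder connection URI.

```java
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.event.metrics.CommandLatencyEvent;

// Sketch only: observes command latency metrics via the event bus. In a real
// application this subscription stays alive for the lifetime of the client.
public class LatencyMonitor {

    public static void main(String[] args) {
        RedisClusterClient client = RedisClusterClient.create("redis://localhost");

        client.getResources().eventBus().get()
                .filter(event -> event instanceof CommandLatencyEvent)
                .cast(CommandLatencyEvent.class)
                .subscribe(event -> System.out.println(event.getLatencies()));
    }
}
```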
Current Behavior
Stack trace
Input Code
So far I can't consistently reproduce it. I have logs showing it happen in production, but I haven't been able to capture the exact set of circumstances that lead to this behavior. If you have any suggestions for additional logging I could add to help narrow down the root cause, I would be happy to do so. It happens often enough (about once every 3-4 hours) that I should be able to help validate any potential fix within about 24 hours of pushing it to production.
Expected behavior/code
No StackOverflowError should occur, or the error should be caught by the onError handler of the Flux.
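To make that expectation concrete, here is a sketch of a reactive MGET subscription with an explicit error callback (placeholder URI and keys; whether a fatal error such as StackOverflowError can actually be delivered to onError is exactly what is in question here).

```java
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.api.reactive.RedisAdvancedClusterReactiveCommands;

// Sketch only: the failure is expected to reach the error callback instead of
// killing an I/O thread and leaving the connection unusable.
public class ExpectedErrorHandling {

    public static void main(String[] args) {
        RedisClusterClient client = RedisClusterClient.create("redis://localhost");
        RedisAdvancedClusterReactiveCommands<String, String> reactive = client.connect().reactive();

        reactive.mget("key1", "key2", "key3")
                .subscribe(
                        keyValue -> System.out.println(keyValue),             // onNext
                        error -> System.err.println("MGET failed: " + error), // onError
                        () -> System.out.println("MGET completed"));          // onComplete
    }
}
```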
Environment
Possible Solution
None known at this time.
Additional context
than 1500 keys per call.
More information about when this began
The issue only began to occur after changing the structure of our keys. Previously our keys were structured like this:
This way all keys in space "p" and "c" would end up on the same keyslot. In practice this led to a very even distribution of keys across the cluster. Now the keys are formatted like this:
This was done because most of the time we execute an MGET call, all keys have the same id2, id3, and id4 values but different id1 values. Changing the hash tag puts keys that are frequently queried together into the same keyslot. This led to a large decrease in total MGETs (because each MGET through the reactive API gets split into a new MGET per shard), which we observed in metrics collected with the CommandLatencyEvent. In theory this should mean that while the total number of commands sent over the network went down, the average number of keys sent to a single node at a time went up. A hypothetical illustration of the hash-tag effect follows below.
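The actual key formats are not shown in the issue text, so the keys below are purely hypothetical; they only illustrate the hash-tag mechanism using Lettuce's SlotHash helper.

```java
import io.lettuce.core.cluster.SlotHash;

// Purely hypothetical key layouts to illustrate the hash-tag mechanism discussed
// above; the report's real key formats are not shown in the issue text.
public class HashTagSlots {

    public static void main(String[] args) {
        // Old-style idea: the hash tag spans the full id tuple, so a "p" key and a
        // "c" key with the same ids map to the same keyslot.
        String oldP = "p:{1:2:3:4}";
        String oldC = "c:{1:2:3:4}";

        // New-style idea: the hash tag spans id2..id4, so keys that differ only in
        // id1 (and are typically fetched together in one MGET) share a keyslot.
        String newA = "p:1:{2:3:4}";
        String newB = "p:9:{2:3:4}";

        // Redis Cluster hashes only the text inside the first {...} when present,
        // so keys with the same tag always land on the same slot and a single
        // MGET covering them is not split across shards.
        System.out.println(SlotHash.getSlot(oldP) == SlotHash.getSlot(oldC)); // true
        System.out.println(SlotHash.getSlot(newA) == SlotHash.getSlot(newB)); // true
    }
}
```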