Fix Issues in Netty4MessageChannelHandler (#75861) #76531

Merged: 1 commit merged into elastic:7.14 from the 75861-7.14 branch on Aug 14, 2021

Conversation

original-brownbear (Member):

Fixes a few rough edges in this class:

  • We need to always pass a flush call down the pipeline, not just conditionally
    when it applies to this message handler; otherwise we lose flushes, e.g. when a
    channel becomes not-writable due to a write from off the event loop that exceeds
    the outbound buffer size (see the first sketch below).
    • This is suspected of causing the recently observed, intermittent and unexplained
      slow message writes (logged by the outbound slow logger) where a message became
      stuck until a subsequent message was sent (e.g. during periodic leader checks).
  • Pass size `0` messages down the pipeline instead of just resolving their promise,
    to avoid unexpected behavior (though we don't make use of `0`-length writes as of today).
  • Avoid unnecessary flushes in the queued-writes loop and only flush if the channel
    stops being writable.
  • Release the buffers of queued writes that we fail on channel close (see the second
    sketch below). Not doing this wasn't causing bugs today because we release the
    underlying bytes elsewhere, but it could cause trouble later.
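
For the flush and writability points, here is a minimal sketch of the intended behaviour using plain Netty types. The `FlushForwardingHandler` name and the `writeQueued()` helper are illustrative stand-ins, not the actual `Netty4MessageChannelHandler` code:

```java
import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;

public class FlushForwardingHandler extends ChannelDuplexHandler {

    @Override
    public void flush(ChannelHandlerContext ctx) {
        // Drain whatever queued writes currently fit into the outbound buffer ...
        writeQueued(ctx);
        // ... but always propagate the flush, even if this handler wrote nothing.
        // Swallowing it can leave bytes written from off the event loop sitting in
        // the outbound buffer until the next message happens to trigger a flush.
        ctx.flush();
    }

    @Override
    public void channelWritabilityChanged(ChannelHandlerContext ctx) {
        if (ctx.channel().isWritable()) {
            // The channel became writable again: try to drain queued writes.
            writeQueued(ctx);
        }
        ctx.fireChannelWritabilityChanged();
    }

    private void writeQueued(ChannelHandlerContext ctx) {
        // Illustrative stand-in: write queued messages while the channel stays
        // writable, flushing only when it stops being writable so the outbound
        // buffer gets drained, instead of flushing after every single write.
    }
}
```

The key point is that `ctx.flush()` runs unconditionally at the end of `flush`, regardless of whether this handler had anything queued.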

Unfortunately, I was not able to reproduce the issue in the first point reliably, as the timing is really tricky. I therefore tried to keep this PR as short and uncontroversial as possible. I think there are further possible improvements here, and this should have been caught by a test, but it's not yet clear to me how to design a reliable reproducer.
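
For the last bullet, failing and releasing still-queued writes when the channel closes could look roughly like the sketch below. The `QueuedWriteReleasingHandler`, `WriteOperation`, and `queuedWrites` names are hypothetical, and the real handler's queueing logic is more involved:

```java
import java.nio.channels.ClosedChannelException;
import java.util.ArrayDeque;
import java.util.Queue;

import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelPromise;
import io.netty.util.ReferenceCountUtil;

public class QueuedWriteReleasingHandler extends ChannelDuplexHandler {

    // Illustrative holder for a message that was queued because the channel
    // was not writable when the write was submitted.
    private static final class WriteOperation {
        final Object msg;
        final ChannelPromise promise;

        WriteOperation(Object msg, ChannelPromise promise) {
            this.msg = msg;
            this.promise = promise;
        }
    }

    private final Queue<WriteOperation> queuedWrites = new ArrayDeque<>();

    @Override
    public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) {
        if (ctx.channel().isWritable()) {
            ctx.write(msg, promise);
        } else {
            // Channel is over its outbound buffer limit: queue the write for later.
            queuedWrites.add(new WriteOperation(msg, promise));
        }
    }

    @Override
    public void close(ChannelHandlerContext ctx, ChannelPromise promise) throws Exception {
        // Fail and release everything that never made it onto the wire so the
        // buffers' reference counts do not leak when the channel goes away.
        ClosedChannelException cause = new ClosedChannelException();
        WriteOperation queued;
        while ((queued = queuedWrites.poll()) != null) {
            queued.promise.tryFailure(cause);
            ReferenceCountUtil.release(queued.msg);
        }
        super.close(ctx, promise);
    }
}
```

The sketch fails each pending promise before releasing its message, so callers waiting on the write are notified that the channel closed.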

Backport of #75861.

original-brownbear added the `:Distributed Coordination/Network` (Http and internode communication implementations) and `backport` labels on Aug 14, 2021
elasticmachine added the `Team:Distributed` label on Aug 14, 2021
elasticmachine (Collaborator):

Pinging @elastic/es-distributed (Team:Distributed)

original-brownbear merged commit 10b8919 into elastic:7.14 on Aug 14, 2021
original-brownbear deleted the 75861-7.14 branch on August 14, 2021 at 15:31
Labels
backport · :Distributed Coordination/Network · Team:Distributed · v7.14.1

3 participants