Fix Issues in Netty4MessageChannelHandler #75861
Conversation
Fixes a few rough edges in this class:

* We need to always pass a flush call down the pipeline, not just conditionally when it applies to the message handler; otherwise we lose flushes, e.g. when a channel becomes not-writable due to a write from off the event loop that exceeds the outbound buffer size.
  * This is suspected of causing recently observed intermittent and unexplained slow message writes (logged by the outbound slow logger) where a message became stuck until a subsequent message was sent (e.g. during periodic leader checks or so).
* Pass size `0` messages down the pipeline instead of just resolving their promise, to avoid unexpected behavior (though we don't make use of `0`-length writes as of today).
* Avoid unnecessary flushes in the queued-writes loop and only flush if the channel stops being writable.
* Release buffers on queued writes that we fail on channel close (not doing this wasn't causing bugs today because we release the underlying bytes elsewhere, but it could cause trouble later).

Unfortunately, I was not able to reproduce the issue in the first point reliably as the timing is really tricky. I therefore tried to make this PR as short and uncontroversial as possible. I think there are possible further improvements here, and this should have been caught by a test, but it's not yet clear to me how to design a reliable reproducer.
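For illustration, here is a minimal sketch of the write/flush handling these bullets describe. It is not the actual `Netty4MessageChannelHandler` source; the class and member names (`FlushForwardingHandler`, `QueuedWrite`, `queuedWrites`) are hypothetical, and the queue-draining loop is sketched separately in the review thread below.

```java
// Illustrative sketch only, not the actual Netty4MessageChannelHandler source.
import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelPromise;
import io.netty.util.ReferenceCountUtil;

import java.nio.channels.ClosedChannelException;
import java.util.ArrayDeque;
import java.util.Queue;

public class FlushForwardingHandler extends ChannelDuplexHandler {

    // writes queued while the channel is not writable; the record type is illustrative
    private record QueuedWrite(Object msg, ChannelPromise promise) {}

    private final Queue<QueuedWrite> queuedWrites = new ArrayDeque<>();

    @Override
    public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) {
        if (ctx.channel().isWritable() && queuedWrites.isEmpty()) {
            // zero-length messages are written down the pipeline like any other
            // message instead of having their promise completed here
            ctx.write(msg, promise);
        } else {
            // drained later by a doFlush-style loop (see the sketch in the review thread)
            queuedWrites.add(new QueuedWrite(msg, promise));
        }
    }

    @Override
    public void flush(ChannelHandlerContext ctx) {
        // always forward the flush; swallowing it here loses flushes triggered by
        // writes from off the event loop that push the channel over its buffer limit
        ctx.flush();
    }

    @Override
    public void channelInactive(ChannelHandlerContext ctx) {
        failQueuedWrites();
        ctx.fireChannelInactive();
    }

    private void failQueuedWrites() {
        QueuedWrite queued;
        while ((queued = queuedWrites.poll()) != null) {
            // release the buffer so a write failed on channel close does not leak
            ReferenceCountUtil.release(queued.msg());
            queued.promise().tryFailure(new ClosedChannelException());
        }
    }
}
```

A handler of this shape would sit near the head of the outbound pipeline so that every flush and every queued write still passes through the rest of the channel handlers.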
Pinging @elastic/es-distributed (Team:Distributed)
I left some (naïve) questions
    if (channel.isActive() == false) {
        failQueuedWrites();
        return;
    }
    if (channel.isWritable()) {
I'm suspicious about checking this flag twice (once here and once in the loop condition); can we reduce it to one read?
Indeed, given that the channel can become unwritable at any time (on a different thread), do we need to check this at all? We could become unwritable after this check but before calling ctx.flush() on the next line.
> I'm suspicious about checking this flag twice (once here and once in the loop condition), can we reduce it to one read?
We do need to read twice in case we try to flush inside the loop to figure out if the flush made us writable again or not.
We could technically skip the second check in the case where the channel was still writable when we checked inside the loop, but I couldn't find a non-confusing way to do so, and since it's just a cheap volatile read I figured it doesn't really matter. Since we're no longer missing a flush, a concurrent change to not-writable between the two checks just means that the next flush task enqueued after the currently executing one will make us writable again; then the writability-changed hook will trigger another round of doFlush and deal with the write, and we're all good I think :)
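For readers following the thread, here is a simplified sketch of the loop shape being discussed. It is illustrative only, not the actual Elasticsearch code, and it reuses the hypothetical `queuedWrites`/`QueuedWrite` names from the earlier sketch:

```java
// Illustrative doFlush sketch: runs on the event loop and drains queued writes
// while the channel stays writable.
private void doFlush(ChannelHandlerContext ctx) {
    final Channel channel = ctx.channel();
    if (channel.isActive() == false) {
        failQueuedWrites();
        return;
    }
    if (channel.isWritable()) {              // the check the review is questioning
        QueuedWrite queued;
        while (channel.isWritable()          // re-read: did an in-loop flush below
                && (queued = queuedWrites.poll()) != null) {   // make us writable again?
            ctx.write(queued.msg(), queued.promise());
            if (channel.isWritable() == false) {
                // only flush once the channel stops being writable; the loop
                // condition then decides whether draining can continue
                ctx.flush();
            }
        }
    }
    // always pass a final flush down the pipeline so completed writes are not lost;
    // if the channel is still unwritable, channelWritabilityChanged re-runs doFlush
    ctx.flush();
}

@Override
public void channelWritabilityChanged(ChannelHandlerContext ctx) {
    // the "writability changed hook" mentioned above: another round of doFlush
    if (ctx.channel().isWritable()) {
        doFlush(ctx);
    }
    ctx.fireChannelWritabilityChanged();
}
```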
LGTM
Thanks David!