Avoid Needless Forking when Closing Transports #66834
Conversation
No need to fork off in the changed spots if we block the calling thread anyway. Also, some other minor cleanups.
Pinging @elastic/es-distributed (Team:Distributed)
@@ -173,10 +170,6 @@ public ThreadPool getThreadPool() {
        return () -> circuitBreakerService.getBreaker(CircuitBreaker.IN_FLIGHT_REQUESTS);
    }

    @Override
    protected void doStart() {
This was overridden by all but the adjusted test implementation.
@@ -279,8 +272,8 @@ public void openConnection(DiscoveryNode node, ConnectionProfile profile, Action
        }
    }

    private List<TcpChannel> initiateConnection(DiscoveryNode node, ConnectionProfile connectionProfile,
return was never used
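A minimal sketch of that cleanup (an assumption based on the comment: the body stays the same and only the unused return value goes away):

```java
// sketch only: the list of opened channels was never consumed by any caller,
// so the method can simply be void
private void initiateConnection(DiscoveryNode node, ConnectionProfile connectionProfile,
                                ActionListener<Transport.Connection> listener) {
    // ... open the channels and complete the listener as before, just without returning them
}
```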
closeLock.writeLock().lock();
try {
    keepAlive.close();
assert Transports.assertNotTransportThread("Must not block transport thread that might be needed for closing channels below"); |
I added this check since we might deadlock when coming from a transport thread (more of a docs thing; we couldn't call this from a transport thread anyway due to other assertions). Other than that, there is no point in forking off here as far as I can tell if we then block anyway. It just makes ITs use more threads and could theoretically deadlock when called from an already maxed-out generic pool.
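Roughly the shape being described, as a sketch (the method name and surrounding fields are assumptions based on the quoted snippet, not the exact TcpTransport code):

```java
@Override
protected void doStop() {
    // No fork to GENERIC: the calling thread blocks on shutdown anyway, so do the close
    // work directly. Just assert we are not on a transport thread, because the transport
    // threads may themselves be needed to complete the channel closes below.
    assert Transports.assertNotTransportThread(
        "Must not block transport thread that might be needed for closing channels below");
    closeLock.writeLock().lock();
    try {
        keepAlive.close();
        // ... close server channels and accepted channels, waiting for them to complete ...
    } finally {
        closeLock.writeLock().unlock();
    }
}
```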
        holderToNotify.handler().handleException(ex);
    }
});
holderToNotify.handler().handleException(ex);
No point in forking off potentially multiple times here when we do all kinds of slow+blocking operations when closing the connection manager etc. above. All this does is potentially have tests fail when these tasks don't run before the threadpool is shut down and they do in fact release resources (which currently just works by accident).
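In other words, a simplified sketch of the change (with `holderToNotify` and `ex` as in the quoted snippet):

```java
// before (roughly): one GENERIC task per pending handler
//   threadPool.generic().execute(() -> holderToNotify.handler().handleException(ex));
// after: the stop path already blocks on slow, blocking work above, so notify inline
holderToNotify.handler().handleException(ex);
```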
responseHandlers.prune(h -> h.connection().getCacheKey().equals(connection.getCacheKey()));
if (pruned.isEmpty() == false) { |
Mostly we don't have any open handlers here on close, no point forking off for an empty list.
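A sketch of the early return being suggested (simplified from the connection-close notification path; variable names follow the quoted snippet and the exact generics are approximate):

```java
final List<Transport.ResponseContext<?>> pruned =
    responseHandlers.prune(h -> h.connection().getCacheKey().equals(connection.getCacheKey()));
if (pruned.isEmpty()) {
    return; // common case on close: nothing to notify, so don't fork for an empty list
}
// only dispatch to GENERIC when there is actually work to do
threadPool.generic().execute(() -> {
    for (Transport.ResponseContext<?> holder : pruned) {
        holder.handler().handleException(
            new NodeDisconnectedException(connection.getNode(), holder.action()));
    }
});
```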
Looks good, I left a handful of small comments.
// callback that an exception happened, but on a different thread since we don't
// want handlers to worry about stack overflows
getExecutorService().execute(new Runnable() {
threadPool.generic().execute(new Runnable() { |
Related to the previous comment: if we know that the transport service is closed before the threadpool, then can this be rejected? If not, we should `assert false` in the rejection handler. Also, while we're here, let's use an `AbstractRunnable` rather than catching the `EsRejectedExecutionException` ourselves, and assert that handling the `NodeDisconnectedException` doesn't throw, I guess.
Actually, it's impossible for this to throw because the generic pool never rejects. Let's just `assert false` on all exceptions via an `AbstractRunnable`, then we cover everything :)
> the generic pool never rejects

... except if shut down (hence the relationship to the previous comment)
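A sketch of what that could look like (assuming the same `pruned` list, `connection`, and `logger` are in scope; this is the suggested shape, not necessarily the exact code that landed):

```java
threadPool.generic().execute(new AbstractRunnable() {
    @Override
    protected void doRun() {
        for (Transport.ResponseContext<?> holder : pruned) {
            holder.handler().handleException(
                new NodeDisconnectedException(connection.getNode(), holder.action()));
        }
    }

    @Override
    public void onFailure(Exception e) {
        // handling a NodeDisconnectedException is not expected to throw
        assert false : e;
        logger.warn("failed to notify response handlers on connection close", e);
    }

    @Override
    public void onRejection(Exception e) {
        // only reachable if the GENERIC pool has already been shut down
        assert false : e;
        logger.warn("rejected notification of response handlers on connection close", e);
    }
});
```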
if (pruned.isEmpty()) {
    return;
👍
holderToNotify.handler().handleException(new SendRequestTransportException(holderToNotify.connection().getNode(),
    holderToNotify.action(), new NodeClosedException(localNode)));
} catch (Exception e) {
logger.warn(() -> new ParameterizedMessage("failed to notify response handler on exception, action: {}", |
Can we `assert false` here too?
Yea lets do it, I could see us running into this in some spots but if we do we can+should fix them :)
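With that agreed, the shutdown-time notification would look roughly like this (a sketch; `holderToNotify`, `localNode`, and `logger` as in the quoted snippet):

```java
try {
    holderToNotify.handler().handleException(new SendRequestTransportException(
        holderToNotify.connection().getNode(), holderToNotify.action(), new NodeClosedException(localNode)));
} catch (Exception e) {
    // handlers should not throw here: fail loudly in tests, log in production
    assert false : e;
    logger.warn(() -> new ParameterizedMessage(
        "failed to notify response handler on exception, action: {}", holderToNotify.action()), e);
}
```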
    }
});
try {
holderToNotify.handler().handleException(new SendRequestTransportException(holderToNotify.connection().getNode(), |
I was wondering if we should respect `handler().executor()` but then I looked at other call sites and it seems that we almost never do. Except sometimes. That might bite us one day.
Yea this one is a mess (it's similar to the threading discussion we had around the internal node client usage I guess) but this seems like the place where we might specifically not want to respect the executor to make the shutdown as safe as possible.
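For illustration, respecting the handler's executor would mean something like the following (hypothetical; as discussed, the shutdown path deliberately does not do this):

```java
// dispatch the failure on the executor the handler asked for, rather than the calling thread
threadPool.executor(holderToNotify.handler().executor()).execute(() ->
    holderToNotify.handler().handleException(new SendRequestTransportException(
        holderToNotify.connection().getNode(), holderToNotify.action(), new NodeClosedException(localNode))));
```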
Thanks David, all points addressed I think :)
Just one comment, otherwise looking good.
try {
latch.await(30, TimeUnit.SECONDS); |
I wonder if this adds risk that a slow shutdown of networking results in a delayed termination of the host in production?
I'm curious what could be slow in this process. Closing a channel should complete fairly promptly: we should just be dispatching a `close()` call to the event loop and waiting for that to run. I didn't dig into Netty to verify, and in particular to check that the `close()` is passed down to the channel even if the channel is otherwise blocked. We're only closing inbound channels and server channels here, so there are no response handlers to notify. And then stopping the event loop itself has its own 5-second timeout:
Lines 96 to 100 in bfcc93a
Future<?> shutdownFuture = eventLoopGroup.shutdownGracefully(0, 5, TimeUnit.SECONDS);
shutdownFuture.awaitUninterruptibly();
if (shutdownFuture.isSuccess() == false) {
    logger.warn("Error closing netty event loop group", shutdownFuture.cause());
}
I also pondered whether we might block on the `closeLock` for any length of time, and I think the answer to that is also no, but I found other potential issues. They're not really relevant to this change though, so I spun them out into #77539.
I think the 30s here was a fairly random choice a long time ago. The closing here should be almost instant; as David points out, there's an internal 5s timeout there and nothing that could really run for an extended period of time.
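For context, the wait that the 30 seconds bounds looks roughly like this (a simplified sketch of the pattern, with `channelsToClose` as an assumed in-scope list, not the exact TcpTransport code):

```java
// count down once per channel when its close completes, then wait with a bounded timeout
final CountDownLatch latch = new CountDownLatch(channelsToClose.size());
for (TcpChannel channel : channelsToClose) {
    channel.addCloseListener(ActionListener.wrap(latch::countDown));
}
CloseableChannel.closeChannels(channelsToClose, false); // dispatches close() to the event loop
try {
    latch.await(30, TimeUnit.SECONDS); // bounded, so a stuck channel cannot hang node shutdown forever
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}
```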
LGTM
Jenkins run elasticsearch-ci/rest-compatibility
@elasticmachine update branch
Thanks David & Henning!
No need to fork off in the changed spots if we block the calling thread anyway. All this does is make tests less deterministic and (in the case of the fire-and-forget forking to `GENERIC`) introduce potential resource leaks if things are released in transport handlers. Also, some other minor cleanups of dead code.