Avoid Needless Forking when Closing Transports #66834

Conversation

original-brownbear
Member

No need to fork off in the changed spots if we block the calling thread anyway. All this does is make tests less deterministic and (in the case of the fire-and-forget forking to GENERIC) introduce potential resource leaks if things are released in transport handlers.

Also, some other minor cleanups of dead code.
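For illustration, the pattern being removed looks roughly like the following (a hypothetical before/after sketch, not the actual Elasticsearch code; closeChannels stands in for the real close work):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

class ForkThenBlockSketch {
    // Before: fork the close work off to another pool, then block the caller on a latch.
    static void closeByForking(ExecutorService generic, Runnable closeChannels) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        generic.execute(() -> {
            try {
                closeChannels.run();
            } finally {
                latch.countDown();
            }
        });
        latch.await(30, TimeUnit.SECONDS); // the calling thread blocks anyway
    }

    // After: since the caller blocks until the work is done anyway, just run it inline.
    static void closeInline(Runnable closeChannels) {
        closeChannels.run();
    }
}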

@elasticmachine added the Team:Distributed (Obsolete) label (meta label for the distributed team, replaced by Distributed Indexing/Coordination) on Dec 28, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@@ -173,10 +170,6 @@ public ThreadPool getThreadPool() {
return () -> circuitBreakerService.getBreaker(CircuitBreaker.IN_FLIGHT_REQUESTS);
}

@Override
protected void doStart() {
Member Author


This was overridden by all but the adjusted test implementation.

@@ -279,8 +272,8 @@ public void openConnection(DiscoveryNode node, ConnectionProfile profile, Action
}
}

private List<TcpChannel> initiateConnection(DiscoveryNode node, ConnectionProfile connectionProfile,
Member Author


The return value was never used.

closeLock.writeLock().lock();
try {
keepAlive.close();
assert Transports.assertNotTransportThread("Must not block transport thread that might be needed for closing channels below");
Member Author


I added this check since we might deadlock when coming from a transport thread (more of a docs thing; we couldn't call this from a transport thread anyway due to other assertions). Other than that, there is no point in forking off here as far as I can tell if we then block anyway. It just makes ITs use more threads and could theoretically deadlock when called from an already maxed-out generic pool.
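As a generic illustration of that theoretical deadlock (not Elasticsearch code): if the only free worker of a fully occupied pool forks the follow-up task back onto the same pool and then blocks waiting for it, neither can ever make progress.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PoolDeadlockSketch {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(1); // stands in for a maxed-out generic pool
        CountDownLatch done = new CountDownLatch(1);
        pool.execute(() -> {
            pool.execute(done::countDown); // the forked task waits for a free worker...
            try {
                done.await(); // ...but the only worker blocks here: deadlock
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }
}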

holderToNotify.handler().handleException(ex);
}
});
holderToNotify.handler().handleException(ex);
Member Author


No point in forking off potentially multiple times here when we do all kinds of slow, blocking operations when closing the connection manager etc. above. All this does is potentially make tests fail when these tasks don't run before the threadpool is shut down and they do in fact release resources (which currently only works by accident).

responseHandlers.prune(h -> h.connection().getCacheKey().equals(connection.getCacheKey()));
if (pruned.isEmpty() == false) {
Member Author


Mostly we don't have any open handlers here on close; no point in forking off for an empty list.

Contributor

@DaveCTurner DaveCTurner left a comment


Looks good, I left a handful of small comments.

// callback that an exception happened, but on a different thread since we don't
// want handlers to worry about stack overflows
getExecutorService().execute(new Runnable() {
threadPool.generic().execute(new Runnable() {
Contributor


Related to the previous comment: if we know that the transport service is closed before the threadpool, then can this be rejected? If not, we should assert false in the rejection handler. Also, while we're here, let's use an AbstractRunnable rather than catching the EsRejectedExecutionException ourselves, and assert that handling the NodeDisconnectedException doesn't throw, I guess.

Member Author


Actually, it's impossible for this to throw because the generic pool never rejects. Let's just assert false on all exceptions via an AbstractRunnable; then we cover everything :)
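Roughly the shape being discussed (a sketch using Elasticsearch's AbstractRunnable; the surrounding names such as threadPool, holderToNotify and exception are illustrative, not the exact final diff):

threadPool.generic().execute(new AbstractRunnable() {
    @Override
    protected void doRun() {
        // notify the handler on a different thread so handlers don't have to
        // worry about stack overflows
        holderToNotify.handler().handleException(exception);
    }

    @Override
    public void onFailure(Exception e) {
        // by default rejections also route through onFailure, so a shut-down pool
        // and a throwing handler are both caught by this assertion
        assert false : e;
        logger.warn("failed to notify response handler on exception", e);
    }
});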

Contributor


the generic pool never rejects

... except if shut down (hence the relationship to the previous comment)

Comment on lines +998 to +999
if (pruned.isEmpty()) {
return;
Contributor


👍

holderToNotify.handler().handleException(new SendRequestTransportException(holderToNotify.connection().getNode(),
holderToNotify.action(), new NodeClosedException(localNode)));
} catch (Exception e) {
logger.warn(() -> new ParameterizedMessage("failed to notify response handler on exception, action: {}",
Contributor


Can we assert false here too?

Member Author


Yeah, let's do it. I could see us running into this in some spots, but if we do we can and should fix them :)
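In other words, something like the following in the catch block shown above (a sketch of the agreed change, not necessarily the exact final diff):

} catch (Exception e) {
    assert false : e; // notifying the handler is not expected to throw; surface it in tests
    logger.warn(() -> new ParameterizedMessage(
        "failed to notify response handler on exception, action: {}", holderToNotify.action()), e);
}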

}
});
try {
holderToNotify.handler().handleException(new SendRequestTransportException(holderToNotify.connection().getNode(),
Contributor


I was wondering if we should respect handler().executor() but then I looked at other call sites and it seems that we almost never do. Except sometimes. That might bite us one day.

Member Author


Yeah, this one is a mess (it's similar to the threading discussion we had around the internal node client usage, I guess), but this seems like the place where we might specifically not want to respect the executor, to make the shutdown as safe as possible.
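For context, the trade-off is roughly the one sketched below (hypothetical types; ResponseHandler stands in for the real transport response handler interface):

import java.util.concurrent.Executor;

class HandlerExecutorSketch {
    interface ResponseHandler {
        Executor executor();                // the executor the handler asks to be notified on
        void handleException(Exception e);
    }

    // In general we could respect the handler's declared executor ...
    static void notifyOnDeclaredExecutor(ResponseHandler handler, Exception e) {
        handler.executor().execute(() -> handler.handleException(e));
    }

    // ... but during shutdown, calling the handler directly avoids depending on a
    // pool that may no longer accept work, which is the choice made here.
    static void notifyInline(ResponseHandler handler, Exception e) {
        handler.handleException(e);
    }
}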

@original-brownbear
Member Author

Thanks David, all points addressed I think :)

@henningandersen henningandersen self-requested a review September 10, 2021 05:34
Contributor

@henningandersen henningandersen left a comment


Just one comment, otherwise looking good.


try {
latch.await(30, TimeUnit.SECONDS);
Contributor


I wonder if this adds risk that a slow shutdown of networking results in a delayed termination of the host in production?

Contributor


I'm curious what could be slow in this process. Closing a channel should complete fairly promptly: we should just be dispatching a close() call to the event loop and waiting for that to run. I didn't dig into Netty to verify this, and in particular to check that the close() is passed down to the channel even if the channel is otherwise blocked. We're only closing inbound channels and server channels here, so there are no response handlers to notify. And then stopping the event loop itself has its own 5-second timeout:

Future<?> shutdownFuture = eventLoopGroup.shutdownGracefully(0, 5, TimeUnit.SECONDS);
shutdownFuture.awaitUninterruptibly();
if (shutdownFuture.isSuccess() == false) {
logger.warn("Error closing netty event loop group", shutdownFuture.cause());
}

I also pondered whether we might block on the closeLock for any length of time, and I think the answer to that is also no, but I found other potential issues. They're not really relevant to this change though, so I spun them out into #77539.

Member Author


I think the 30s here was a fairly random choice a long time ago. The closing here should be almost instant: as David points out, there's an internal 5s timeout there and nothing that could really run for an extended period of time.
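So the wait is bounded either way. A minimal sketch (not the actual code) of how a bounded wait like this might log and continue if the 30s were ever exceeded:

if (latch.await(30, TimeUnit.SECONDS) == false) {
    logger.warn("transport channels did not close within 30s; continuing with shutdown");
}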

Contributor

@DaveCTurner DaveCTurner left a comment


LGTM

@original-brownbear
Member Author

Jenkins run elasticsearch-ci/rest-compatibility

@original-brownbear
Member Author

@elasticmachine update branch

@original-brownbear
Member Author

Thanks David & Henning!

@original-brownbear original-brownbear merged commit 0d54332 into elastic:master Sep 10, 2021
@original-brownbear original-brownbear deleted the less-forking-transport-close branch September 10, 2021 09:58
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Sep 10, 2021
No need to fork off in the changed spots if we block the calling thread anyway.
Also, some other minor cleanups.
original-brownbear added a commit that referenced this pull request Sep 10, 2021
No need to fork off in the changed spots if we block the calling thread anyway.
Also, some other minor cleanups.
Labels
:Distributed Coordination/Network (Http and internode communication implementations), >non-issue, Team:Distributed (Obsolete), v7.16.0, v8.0.0-alpha2