Avoid Needless Forking when Closing Transports #66834
Conversation
No need to fork off in the changed spots if we block the calling thread anyway. Also, some other minor cleanups.
Pinging @elastic/es-distributed (Team:Distributed)
@@ -173,10 +170,6 @@ public ThreadPool getThreadPool() {
        return () -> circuitBreakerService.getBreaker(CircuitBreaker.IN_FLIGHT_REQUESTS);
    }

    @Override
    protected void doStart() {
This was overridden by all but the adjusted test implementation.
@@ -279,8 +272,8 @@ public void openConnection(DiscoveryNode node, ConnectionProfile profile, Action
        }
    }

    private List<TcpChannel> initiateConnection(DiscoveryNode node, ConnectionProfile connectionProfile,
return was never used
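A minimal sketch of that cleanup (an assumption based on the comment: the body stays the same and only the unused return value goes away):

```java
// sketch only: the list of opened channels was never consumed by any caller,
// so the method can simply be void
private void initiateConnection(DiscoveryNode node, ConnectionProfile connectionProfile,
                                ActionListener<Transport.Connection> listener) {
    // ... open the channels and complete the listener as before, just without returning them
}
```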
closeLock.writeLock().lock();
try {
    keepAlive.close();
assert Transports.assertNotTransportThread("Must not block transport thread that might be needed for closing channels below"); |
I added this check since we might deadlock when coming from a transport thread (more of a docs thing; we couldn't call this from a transport thread anyway due to other assertions). Other than that, there is no point in forking off here as far as I can tell if we then block anyway. It just makes ITs use more threads and could theoretically deadlock when called from an already maxed-out generic pool.
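Roughly the shape being described, as a sketch (the method name and surrounding fields are assumptions based on the quoted snippet, not the exact TcpTransport code):

```java
@Override
protected void doStop() {
    // No fork to GENERIC: the calling thread blocks on shutdown anyway, so do the close
    // work directly. Just assert we are not on a transport thread, because the transport
    // threads may themselves be needed to complete the channel closes below.
    assert Transports.assertNotTransportThread(
        "Must not block transport thread that might be needed for closing channels below");
    closeLock.writeLock().lock();
    try {
        keepAlive.close();
        // ... close server channels and accepted channels, waiting for them to complete ...
    } finally {
        closeLock.writeLock().unlock();
    }
}
```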
        holderToNotify.handler().handleException(ex);
    }
});
holderToNotify.handler().handleException(ex);
No point in forking off potentially multiple times here when we do all kinds of slow+blocking operations when closing the connection manager etc. above. All this does is potentially have tests fail when these tasks don't run before the threadpool is shut down and they do in fact release resources (which currently just works by accident).
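In other words, a simplified sketch of the change (with `holderToNotify` and `ex` as in the quoted snippet):

```java
// before (roughly): one GENERIC task per pending handler
//   threadPool.generic().execute(() -> holderToNotify.handler().handleException(ex));
// after: the stop path already blocks on slow, blocking work above, so notify inline
holderToNotify.handler().handleException(ex);
```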
responseHandlers.prune(h -> h.connection().getCacheKey().equals(connection.getCacheKey()));
if (pruned.isEmpty() == false) { |
Mostly we don't have any open handlers here on close, no point forking off for an empty list.
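A sketch of the early return being suggested (simplified from the connection-close notification path; variable names follow the quoted snippet and the exact generics are approximate):

```java
final List<Transport.ResponseContext<?>> pruned =
    responseHandlers.prune(h -> h.connection().getCacheKey().equals(connection.getCacheKey()));
if (pruned.isEmpty()) {
    return; // common case on close: nothing to notify, so don't fork for an empty list
}
// only dispatch to GENERIC when there is actually work to do
threadPool.generic().execute(() -> {
    for (Transport.ResponseContext<?> holder : pruned) {
        holder.handler().handleException(
            new NodeDisconnectedException(connection.getNode(), holder.action()));
    }
});
```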
Looks good, I left a handful of small comments.
// callback that an exception happened, but on a different thread since we don't
// want handlers to worry about stack overflows
getExecutorService().execute(new Runnable() {
threadPool.generic().execute(new Runnable() { |
Related to the previous comment: if we know that the transport service is closed before the threadpool, then can this be rejected? If not, we should `assert false` in the rejection handler. Also, while we're here, let's use an `AbstractRunnable` rather than catching the `EsRejectedExecutionException` ourselves, and assert that handling the `NodeDisconnectedException` doesn't throw, I guess.
Actually, it's impossible for this to throw because the generic pool never rejects. Let's just `assert false` on all exceptions via an `AbstractRunnable`, then we cover everything :)
> the generic pool never rejects

... except if shut down (hence the relationship to the previous comment)
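A sketch of what that could look like (assuming the same `pruned` list, `connection`, and `logger` are in scope; this is the suggested shape, not necessarily the exact code that landed):

```java
threadPool.generic().execute(new AbstractRunnable() {
    @Override
    protected void doRun() {
        for (Transport.ResponseContext<?> holder : pruned) {
            holder.handler().handleException(
                new NodeDisconnectedException(connection.getNode(), holder.action()));
        }
    }

    @Override
    public void onFailure(Exception e) {
        // handling a NodeDisconnectedException is not expected to throw
        assert false : e;
        logger.warn("failed to notify response handlers on connection close", e);
    }

    @Override
    public void onRejection(Exception e) {
        // only reachable if the GENERIC pool has already been shut down
        assert false : e;
        logger.warn("rejected notification of response handlers on connection close", e);
    }
});
```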
if (pruned.isEmpty()) {
    return;
👍
holderToNotify.handler().handleException(new SendRequestTransportException(holderToNotify.connection().getNode(),
    holderToNotify.action(), new NodeClosedException(localNode)));
} catch (Exception e) {
logger.warn(() -> new ParameterizedMessage("failed to notify response handler on exception, action: {}", |
Can we `assert false` here too?
Yea lets do it, I could see us running into this in some spots but if we do we can+should fix them :)
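With that agreed, the shutdown-time notification would look roughly like this (a sketch; `holderToNotify`, `localNode`, and `logger` as in the quoted snippet):

```java
try {
    holderToNotify.handler().handleException(new SendRequestTransportException(
        holderToNotify.connection().getNode(), holderToNotify.action(), new NodeClosedException(localNode)));
} catch (Exception e) {
    // handlers should not throw here: fail loudly in tests, log in production
    assert false : e;
    logger.warn(() -> new ParameterizedMessage(
        "failed to notify response handler on exception, action: {}", holderToNotify.action()), e);
}
```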
    }
});
try {
holderToNotify.handler().handleException(new SendRequestTransportException(holderToNotify.connection().getNode(), |
I was wondering if we should respect `handler().executor()` but then I looked at other call sites and it seems that we almost never do. Except sometimes. That might bite us one day.
Yea this one is a mess (it's similar to the threading discussion we had around the internal node client usage I guess) but this seems like the place where we might specifically not want to respect the executor to make the shutdown as safe as possible.
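For illustration, respecting the handler's executor would mean something like the following (hypothetical; as discussed, the shutdown path deliberately does not do this):

```java
// dispatch the failure on the executor the handler asked for, rather than the calling thread
threadPool.executor(holderToNotify.handler().executor()).execute(() ->
    holderToNotify.handler().handleException(new SendRequestTransportException(
        holderToNotify.connection().getNode(), holderToNotify.action(), new NodeClosedException(localNode))));
```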
Thanks David, all points addressed I think :)
Just one comment, otherwise looking good.
try {
latch.await(30, TimeUnit.SECONDS); |
I wonder if this adds risk that a slow shutdown of networking results in a delayed termination of the host in production?
I'm curious what could be slow in this process. Closing a channel should complete fairly promptly: we should just be dispatching a `close()` call to the event loop and waiting for that to run. I didn't dig into Netty to verify, and in particular to check that the `close()` is passed down to the channel even if the channel is otherwise blocked. We're only closing inbound channels and server channels here, so there are no response handlers to notify. And then stopping the event loop itself has its own 5-second timeout:
Lines 96 to 100 in bfcc93a
Future<?> shutdownFuture = eventLoopGroup.shutdownGracefully(0, 5, TimeUnit.SECONDS);
shutdownFuture.awaitUninterruptibly();
if (shutdownFuture.isSuccess() == false) {
    logger.warn("Error closing netty event loop group", shutdownFuture.cause());
}
I also pondered whether we might block on the `closeLock` for any length of time, and I think the answer to that is also no, but I found other potential issues. They're not really relevant to this change though, so I spun them out into #77539.
I think the 30s here was a fairly random choice a long time ago. The closing here should be almost instant; as David points out, there's an internal 5s timeout there and nothing that could really run for an extended period of time.
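For context, the wait that the 30 seconds bounds looks roughly like this (a simplified sketch of the pattern, with `channelsToClose` as an assumed in-scope list, not the exact TcpTransport code):

```java
// count down once per channel when its close completes, then wait with a bounded timeout
final CountDownLatch latch = new CountDownLatch(channelsToClose.size());
for (TcpChannel channel : channelsToClose) {
    channel.addCloseListener(ActionListener.wrap(latch::countDown));
}
CloseableChannel.closeChannels(channelsToClose, false); // dispatches close() to the event loop
try {
    latch.await(30, TimeUnit.SECONDS); // bounded, so a stuck channel cannot hang node shutdown forever
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}
```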
LGTM
Jenkins run elasticsearch-ci/rest-compatibility
@elasticmachine update branch
Thanks David & Henning!
No need to fork off in the changed spots if we block the calling thread anyway. All this does is make tests less deterministic and (in the case of the fire-and-forget forking to `GENERIC`) introduce potential resource leaks if things are released in transport handlers. Also, some other minor cleanups of dead code.