Dump Stacktrace on Slow IO-Thread Operations #42000

Conversation

original-brownbear
Member

* Follow up to elastic#39729, extending the functionality to actually dump the
stack while the thread is blocked, not afterwards
  * Logging the stacktrace after the thread has become unblocked is of only
limited use, because it doesn't tell us what happened in the slow callback
(only whether we were blocked on a read, write, connect, etc.)
* Relates elastic#41745
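
For illustration, here is a minimal sketch of the idea (the class, method names, and wiring are hypothetical, not the code in this PR): the I/O thread registers itself before invoking a handler, and a periodic check logs the stack trace of any thread that has been inside a handler for longer than the threshold, while the thread is still blocked. Only the 150 ms threshold and 2 s interval mirror the values discussed below.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

/**
 * Minimal sketch of a watchdog that logs an I/O thread's stack trace while it
 * is still stuck in a slow handler, rather than after the handler returns.
 * All names here are made up for illustration.
 */
public final class SlowIoWatchdog {

    private static final long WARN_THRESHOLD_NANOS = TimeUnit.MILLISECONDS.toNanos(150);
    private static final long CHECK_INTERVAL_MILLIS = 2_000;

    // thread -> nanoTime at which it entered the current handler
    private final Map<Thread, Long> inFlight = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start() {
        scheduler.scheduleAtFixedRate(this::checkForSlowThreads,
            CHECK_INTERVAL_MILLIS, CHECK_INTERVAL_MILLIS, TimeUnit.MILLISECONDS);
    }

    public void stop() {
        scheduler.shutdownNow();
    }

    /** Called by the I/O thread right before it invokes a handler. */
    public void register() {
        inFlight.put(Thread.currentThread(), System.nanoTime());
    }

    /** Called by the I/O thread right after the handler returns. */
    public void unregister() {
        inFlight.remove(Thread.currentThread());
    }

    private void checkForSlowThreads() {
        final long now = System.nanoTime();
        for (Map.Entry<Thread, Long> entry : inFlight.entrySet()) {
            final long elapsed = now - entry.getValue();
            if (elapsed > WARN_THRESHOLD_NANOS) {
                final Thread thread = entry.getKey();
                // the thread is still inside the slow handler, so this stack
                // trace shows what it is blocked on right now
                final String stack = Arrays.stream(thread.getStackTrace())
                    .map(Object::toString)
                    .collect(Collectors.joining("\n"));
                System.err.printf("Slow execution on network thread [%s] [%d milliseconds]:%n%s%n",
                    thread.getName(), TimeUnit.NANOSECONDS.toMillis(elapsed), stack);
            }
        }
    }
}
```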
@original-brownbear added the >test (Issues or PRs that are addressing/adding tests), :Distributed Coordination/Network (Http and internode communication implementations), v8.0.0, and v7.2.0 labels on May 9, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@original-brownbear
Member Author

Jenkins run elasticsearch-ci/docbldesx

@cbuescher (Member) left a comment


@original-brownbear thanks, I took a look since this would add great diagnostics in cases like the CI failure I am currently looking at. I don't know if the suggested 2s check interval and warning threshold (150ms, it seems) are long enough to not disturb normal test execution. My guess is that the mock transport should be fast, but this is something somebody else like @tbrooks8 should comment on. I'd definitely like to have this kind of warning logging if it doesn't spam regular test execution error logs too much.

@original-brownbear
Member Author

@cbuescher

> I don't know if the suggested 2s check interval and warning threshold (150ms, it seems) are long enough to not disturb normal test execution. My guess is that the mock transport should be fast

The warning threshold was 150ms before as well, and I think it was a fine choice then by @tbrooks8.
I figured the 2s interval was short enough to give a little more insight in case of a long-running action on that transport_worker thread (from multiple changing stack traces being printed), but wouldn't completely flood the logs for a 30s or 60s request timeout with a dead-locked callback. Admittedly, the choice of 2s was pretty arbitrary. If others disagree, we could also put more effort into this (I'm not 100% convinced it's worth it, but I'm not opposed either) and lower the 2s check interval and deduplicate log messages?
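
As a purely illustrative sketch of the deduplication idea mentioned above (not something this PR implements), the check task could remember a hash of the last stack trace it logged per thread and skip the warning when nothing has changed:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch only: remember, per thread, a hash of the last stack
 * trace that was logged and only log again when it changes.
 */
final class StackTraceDeduplicator {

    private final Map<Thread, Integer> lastLoggedHash = new ConcurrentHashMap<>();

    /** Returns true if this thread's stack trace differs from the one logged last time. */
    boolean shouldLog(Thread thread, StackTraceElement[] stackTrace) {
        final int hash = Arrays.hashCode(stackTrace);
        final Integer previous = lastLoggedHash.put(thread, hash);
        return previous == null || previous != hash;
    }

    /** Forget a thread once its handler has completed. */
    void clear(Thread thread) {
        lastLoggedHash.remove(thread);
    }
}
```

This would keep the 2s interval but collapse repeated identical dumps into a single log line per change.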

@original-brownbear
Member Author

@tbrooks8 can you take a look here whenever you have a sec? I think you're the only one here comfortable reviewing this thing :)

@Tim-Brooks
Contributor

Yes I’ll take a look tomorrow.

final Thread thread = entry.getKey();
logger.warn("Slow execution on network thread [{}] [{} milliseconds]: \n{}", thread.getName(),
TimeUnit.NANOSECONDS.toMillis(elapsedTime),
Arrays.stream(thread.getStackTrace()).map(Object::toString).collect(Collectors.joining("\n")));
Contributor


Why does calling getStackTrace here not cause issues with the security manager?

Member Author


I think it's because we grant

  // needed for top threads handling
  permission org.elasticsearch.secure_sm.ThreadPermission "modifyArbitraryThreadGroup";

to codeBase "${codebase.randomizedtesting-runner}" in test-framework.policy, right?

We actually use the same code to get stack traces in other tests too, so I'm sure it works fine with the SM (I also verified it manually).

Arrays.stream(thread.getStackTrace()).map(Object::toString).collect(Collectors.joining("\n")));
}
}
if (registered.get() > 0) {
Contributor


There is kind of a race here. It feels like we should just have the thread always run and always reschedule the task. It's not like a 2-second periodic task is going to cause performance issues for the integration tests.

The race is:

  1. The generic thread pool task checks and sees 0 registered.
  2. A new thread calls register, increments registered, does the other stuff, then fails on the running compare-and-set (because running is still true).
  3. The generic thread pool task finishes setting running to false.

Obviously the next register call will fix that, so things should eventually work out. But I don't see significant value in making the concurrency handling this complicated when a periodic task seems fine.
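
To make that interleaving concrete, here is a hypothetical sketch of the racy shape being described (illustrative names only, not the PR's actual code):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical sketch of the race: the check task stops itself when it sees
 * zero registered threads, while a registering thread only restarts it when
 * it wins the running compare-and-set.
 */
final class RacyChecker {

    private final AtomicInteger registered = new AtomicInteger();
    private final AtomicBoolean running = new AtomicBoolean();

    void register() {
        registered.incrementAndGet();
        // (2) this CAS fails if the checker has not yet set running to false,
        // so no check task gets scheduled even though a thread is registered
        if (running.compareAndSet(false, true)) {
            scheduleCheckTask();
        }
    }

    void unregister() {
        registered.decrementAndGet();
    }

    void onCheckTaskFinished() {
        // (1) sees 0 registered and decides to stop ...
        if (registered.get() == 0) {
            // (3) ... and only now flips running back to false, after the
            // register() call above has already given up on rescheduling
            running.set(false);
        } else {
            scheduleCheckTask();
        }
    }

    private void scheduleCheckTask() {
        // scheduling of the periodic check elided in this sketch
    }
}
```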

Member Author


Fair point, I was too paranoid here I guess :) I've significantly simplified this now to always reschedule and just added a flag to make it stop.
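
A hypothetical sketch of that simplified shape (illustrative names, not the actual change): the check task unconditionally reschedules itself, and a single stop flag ends the loop, so there is no registered-count handshake left to race on.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Illustrative only: always reschedule the check, stop via a flag. */
final class AlwaysRescheduledChecker {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final AtomicBoolean stopped = new AtomicBoolean(false);
    private final Runnable check;
    private final long intervalMillis;

    AlwaysRescheduledChecker(Runnable check, long intervalMillis) {
        this.check = check;
        this.intervalMillis = intervalMillis;
    }

    void start() {
        scheduler.schedule(this::runAndReschedule, intervalMillis, TimeUnit.MILLISECONDS);
    }

    void stop() {
        // flipping the flag prevents the next reschedule; an in-flight run may finish
        stopped.set(true);
        scheduler.shutdown();
    }

    private void runAndReschedule() {
        if (stopped.get()) {
            return;
        }
        check.run();
        // always reschedule; registration counts never gate the next run
        scheduler.schedule(this::runAndReschedule, intervalMillis, TimeUnit.MILLISECONDS);
    }
}
```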

@Tim-Brooks (Contributor) left a comment


LGTM

@original-brownbear
Member Author

Thanks @tbrooks8!

@original-brownbear merged commit 94848d8 into elastic:master on May 22, 2019
@original-brownbear deleted the stronger-blocked-transport-monitoring branch on May 22, 2019 at 13:31
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request May 27, 2019
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
original-brownbear added a commit that referenced this pull request May 27, 2019
cbuescher pushed a commit that referenced this pull request Jun 13, 2019
@original-brownbear restored the stronger-blocked-transport-monitoring branch on August 6, 2020 at 18:56