Dump Stacktrace on Slow IO-Thread Operations #42000

Conversation

original-brownbear
Member

* Follow up to elastic#39729, extending the functionality to actually dump the
stack while the thread is blocked, not afterwards
  * Logging the stacktrace after the thread has become unblocked is of only
limited use, because it doesn't tell us what happened in the slow callback
(only whether we were blocked on a read, write, connect, etc.)
* Relates elastic#41745
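
For illustration, here is a minimal sketch of the idea (the class, method names, and wiring are hypothetical, not the code in this PR): the I/O thread registers itself before invoking a handler, and a periodic check logs the stack trace of any thread that has been inside a handler for longer than the threshold, while the thread is still blocked. Only the 150 ms threshold and 2 s interval mirror the values discussed below.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

/**
 * Minimal sketch of a watchdog that logs an I/O thread's stack trace while it
 * is still stuck in a slow handler, rather than after the handler returns.
 * All names here are made up for illustration.
 */
public final class SlowIoWatchdog {

    private static final long WARN_THRESHOLD_NANOS = TimeUnit.MILLISECONDS.toNanos(150);
    private static final long CHECK_INTERVAL_MILLIS = 2_000;

    // thread -> nanoTime at which it entered the current handler
    private final Map<Thread, Long> inFlight = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start() {
        scheduler.scheduleAtFixedRate(this::checkForSlowThreads,
            CHECK_INTERVAL_MILLIS, CHECK_INTERVAL_MILLIS, TimeUnit.MILLISECONDS);
    }

    public void stop() {
        scheduler.shutdownNow();
    }

    /** Called by the I/O thread right before it invokes a handler. */
    public void register() {
        inFlight.put(Thread.currentThread(), System.nanoTime());
    }

    /** Called by the I/O thread right after the handler returns. */
    public void unregister() {
        inFlight.remove(Thread.currentThread());
    }

    private void checkForSlowThreads() {
        final long now = System.nanoTime();
        for (Map.Entry<Thread, Long> entry : inFlight.entrySet()) {
            final long elapsed = now - entry.getValue();
            if (elapsed > WARN_THRESHOLD_NANOS) {
                final Thread thread = entry.getKey();
                // the thread is still inside the slow handler, so this stack
                // trace shows what it is blocked on right now
                final String stack = Arrays.stream(thread.getStackTrace())
                    .map(Object::toString)
                    .collect(Collectors.joining("\n"));
                System.err.printf("Slow execution on network thread [%s] [%d milliseconds]:%n%s%n",
                    thread.getName(), TimeUnit.NANOSECONDS.toMillis(elapsed), stack);
            }
        }
    }
}
```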
@original-brownbear added the >test (Issues or PRs that are addressing/adding tests), :Distributed Coordination/Network (Http and internode communication implementations), v8.0.0, and v7.2.0 labels on May 9, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@original-brownbear
Member Author

Jenkins run elasticsearch-ci/docbldesx

@cbuescher (Member) left a comment


@original-brownbear thanks, I took a look since this would add great diagnostics in cases like the CI failure I am currently looking at. I don't know if the suggested 2s check interval and warning threshold (150ms, it seems) are long enough to not disturb normal test execution. My guess is that the mock transport should be fast, but this is something somebody else like @tbrooks8 should comment on. I'd definitely like to have this kind of warning logging if it doesn't spam regular test execution error logs too much.

@original-brownbear
Member Author

@cbuescher

> I don't know if the suggested 2s check interval and warning threshold (150ms, it seems) are long enough to not disturb normal test execution. My guess is that the mock transport should be fast

The warning threshold was 150ms before as well, and I think it was a fine choice then by @tbrooks8.
I figured the 2s interval was short enough to give a little more insight in case of a long-running action on that transport_worker thread (from multiple changing stack traces being printed), but wouldn't completely flood the logs for a 30s or 60s request timeout with a dead-locked callback. Admittedly, the choice of 2s was pretty arbitrary. If others disagree, we could also put more effort into this (I'm not 100% convinced it's worth it, but I'm not opposed either) and lower the 2s check interval and deduplicate log messages?
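
As a purely illustrative sketch of the deduplication idea mentioned above (not something this PR implements), the check task could remember a hash of the last stack trace it logged per thread and skip the warning when nothing has changed:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch only: remember, per thread, a hash of the last stack
 * trace that was logged and only log again when it changes.
 */
final class StackTraceDeduplicator {

    private final Map<Thread, Integer> lastLoggedHash = new ConcurrentHashMap<>();

    /** Returns true if this thread's stack trace differs from the one logged last time. */
    boolean shouldLog(Thread thread, StackTraceElement[] stackTrace) {
        final int hash = Arrays.hashCode(stackTrace);
        final Integer previous = lastLoggedHash.put(thread, hash);
        return previous == null || previous != hash;
    }

    /** Forget a thread once its handler has completed. */
    void clear(Thread thread) {
        lastLoggedHash.remove(thread);
    }
}
```

This would keep the 2s interval but collapse repeated identical dumps into a single log line per change.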

@original-brownbear
Member Author

@tbrooks8 can you take a look here whenever you have a sec? I think you're the only one here comfortable reviewing this thing :)

@Tim-Brooks
Contributor

Yes I’ll take a look tomorrow.

final Thread thread = entry.getKey();
logger.warn("Slow execution on network thread [{}] [{} milliseconds]: \n{}", thread.getName(),
TimeUnit.NANOSECONDS.toMillis(elapsedTime),
Arrays.stream(thread.getStackTrace()).map(Object::toString).collect(Collectors.joining("\n")));
Contributor


Why does calling getStackTrace here not cause issues with the security manager?

Member Author


I think it's because we grant

  // needed for top threads handling
  permission org.elasticsearch.secure_sm.ThreadPermission "modifyArbitraryThreadGroup";

to codeBase "${codebase.randomizedtesting-runner}" in test-framework.policy, right?

We actually use the same code to get stack traces in other tests too, so I'm sure it works fine with the SM (I also verified it manually).

Arrays.stream(thread.getStackTrace()).map(Object::toString).collect(Collectors.joining("\n")));
}
}
if (registered.get() > 0) {
Contributor


There is kind of a race here. It feels like we should just have the thread always run and always reschedule the task. It's not like a 2-second periodic task is going to cause performance issues for the integration tests.

The race is:

  1. The generic thread pool task checks and sees 0 registered.
  2. A new thread calls register, increments registered, does the other stuff, then fails on the running compare-and-set (because running is still true).
  3. The generic thread pool task finishes setting running to false.

Obviously the next register call will fix that, so things should eventually work out. But I don't see significant value in making the concurrency handling this complicated when a periodic task seems fine.
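
To make that interleaving concrete, here is a hypothetical sketch of the racy shape being described (illustrative names only, not the PR's actual code):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical sketch of the race: the check task stops itself when it sees
 * zero registered threads, while a registering thread only restarts it when
 * it wins the running compare-and-set.
 */
final class RacyChecker {

    private final AtomicInteger registered = new AtomicInteger();
    private final AtomicBoolean running = new AtomicBoolean();

    void register() {
        registered.incrementAndGet();
        // (2) this CAS fails if the checker has not yet set running to false,
        // so no check task gets scheduled even though a thread is registered
        if (running.compareAndSet(false, true)) {
            scheduleCheckTask();
        }
    }

    void unregister() {
        registered.decrementAndGet();
    }

    void onCheckTaskFinished() {
        // (1) sees 0 registered and decides to stop ...
        if (registered.get() == 0) {
            // (3) ... and only now flips running back to false, after the
            // register() call above has already given up on rescheduling
            running.set(false);
        } else {
            scheduleCheckTask();
        }
    }

    private void scheduleCheckTask() {
        // scheduling of the periodic check elided in this sketch
    }
}
```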

Member Author


Fair point, I was too paranoid here I guess :) I've significantly simplified this now to always reschedule and just added a flag to make it stop.
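
A hypothetical sketch of that simplified shape (illustrative names, not the actual change): the check task unconditionally reschedules itself, and a single stop flag ends the loop, so there is no registered-count handshake left to race on.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Illustrative only: always reschedule the check, stop via a flag. */
final class AlwaysRescheduledChecker {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final AtomicBoolean stopped = new AtomicBoolean(false);
    private final Runnable check;
    private final long intervalMillis;

    AlwaysRescheduledChecker(Runnable check, long intervalMillis) {
        this.check = check;
        this.intervalMillis = intervalMillis;
    }

    void start() {
        scheduler.schedule(this::runAndReschedule, intervalMillis, TimeUnit.MILLISECONDS);
    }

    void stop() {
        // flipping the flag prevents the next reschedule; an in-flight run may finish
        stopped.set(true);
        scheduler.shutdown();
    }

    private void runAndReschedule() {
        if (stopped.get()) {
            return;
        }
        check.run();
        // always reschedule; registration counts never gate the next run
        scheduler.schedule(this::runAndReschedule, intervalMillis, TimeUnit.MILLISECONDS);
    }
}
```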

@Tim-Brooks (Contributor) left a comment


LGTM

@original-brownbear
Member Author

Thanks @tbrooks8!

@original-brownbear merged commit 94848d8 into elastic:master on May 22, 2019
@original-brownbear deleted the stronger-blocked-transport-monitoring branch on May 22, 2019 at 13:31
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request May 27, 2019
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
original-brownbear added a commit that referenced this pull request May 27, 2019
cbuescher pushed a commit that referenced this pull request Jun 13, 2019
@original-brownbear restored the stronger-blocked-transport-monitoring branch on August 6, 2020 at 18:56