Dump Stacktrace on Slow IO-Thread Operations #42000
Conversation
* Follow-up to elastic#39729, extending the functionality to dump the stack while the thread is blocked, not afterwards.
* Logging the stack trace after the thread became unblocked is only of limited use, because it does not tell us what happened in the slow callback (only whether we were blocked on a read, write, connect, etc.).
* Relates elastic#41745
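The mechanism the PR describes can be sketched roughly as follows. This is a hypothetical, simplified illustration (all names here are made up, not the actual Elasticsearch classes): IO threads register on entering a handler and deregister on leaving, while a periodic checker dumps the stack of any thread that has been inside a handler longer than the warn threshold — so the stack is captured *while* the thread is still blocked.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: IO threads register before running a handler and
// deregister afterwards; a checker task periodically dumps the stacks of
// threads that have been inside a handler past the warn threshold.
public class SlowHandlerWatchdog {
    private final Map<Thread, Long> registry = new ConcurrentHashMap<>();
    private final long warnThresholdNanos;

    public SlowHandlerWatchdog(long warnThresholdMillis) {
        this.warnThresholdNanos = TimeUnit.MILLISECONDS.toNanos(warnThresholdMillis);
    }

    // Called by an IO thread on entering a handler.
    public void register() {
        registry.put(Thread.currentThread(), System.nanoTime());
    }

    // Called by the same IO thread on leaving the handler.
    public void deregister() {
        registry.remove(Thread.currentThread());
    }

    // Called periodically (e.g. every 2s) from a checker task; returns how
    // many threads were found over the threshold.
    public int checkAndLog() {
        int slow = 0;
        final long now = System.nanoTime();
        for (Map.Entry<Thread, Long> entry : registry.entrySet()) {
            final long elapsed = now - entry.getValue();
            if (elapsed > warnThresholdNanos) {
                slow++;
                final Thread thread = entry.getKey();
                final StringBuilder stack = new StringBuilder();
                for (StackTraceElement e : thread.getStackTrace()) {
                    stack.append(e).append('\n');
                }
                // The blocked thread's stack is captured here, while it is
                // still inside the slow handler.
                System.err.printf("Slow execution on network thread [%s] [%d milliseconds]:%n%s",
                        thread.getName(), TimeUnit.NANOSECONDS.toMillis(elapsed), stack);
            }
        }
        return slow;
    }
}
```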
Pinging @elastic/es-distributed
Jenkins run elasticsearch-ci/docbldesx
@original-brownbear thanks, I took a look, since this would add great diagnostics in cases like the CI failure I am currently looking at. I don't know whether the suggested 2s interval and warning threshold (seemingly 150ms) are long enough not to disturb normal test execution; my guess is that the mock transport should be fast, but this is something somebody else like @tbrooks8 should comment on. I'd definitely like to have this kind of warning logging if it doesn't spam regular test execution error logs too much.
The warning threshold was
@tbrooks8 can you take a look here whenever you have a sec? I think you're the only one here comfortable reviewing this thing :)
Yes I'll take a look tomorrow.
```java
final Thread thread = entry.getKey();
logger.warn("Slow execution on network thread [{}] [{} milliseconds]: \n{}", thread.getName(),
    TimeUnit.NANOSECONDS.toMillis(elapsedTime),
    Arrays.stream(thread.getStackTrace()).map(Object::toString).collect(Collectors.joining("\n")));
```
Why does calling getStackTrace here not cause issues with the security manager?
I think it's because we grant

```
// needed for top threads handling
permission org.elasticsearch.secure_sm.ThreadPermission "modifyArbitraryThreadGroup";
```

to `codeBase "${codebase.randomizedtesting-runner}"` (in test-framework.policy), right?
We actually use the same code to get stack traces in other tests too, so I'm sure it works fine with the SM (I also verified it manually).
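For reference, capturing another thread's stack trace the way the diff above does is a standard JDK call; under a security manager it requires RuntimePermission("getStackTrace"), which test policies typically grant. A minimal standalone demonstration (the thread name and sleep duration here are illustrative, not from the PR):

```java
// Minimal demonstration of capturing another thread's stack trace, the same
// primitive the PR's watchdog uses on blocked IO threads.
public class StackCapture {

    // Join a thread's stack frames into one newline-separated string.
    static String dump(Thread t) {
        StringBuilder sb = new StringBuilder();
        for (StackTraceElement e : t.getStackTrace()) {
            sb.append(e).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Thread sleeper = new Thread(() -> {
            try {
                Thread.sleep(5_000); // simulate a thread stuck in a slow operation
            } catch (InterruptedException ignored) {
            }
        }, "blocked-io-thread");
        sleeper.start();
        Thread.sleep(100); // give it time to reach sleep()
        System.out.println(dump(sleeper)); // stack shows it parked in Thread.sleep
        sleeper.interrupt();
        sleeper.join();
    }
}
```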
```java
        Arrays.stream(thread.getStackTrace()).map(Object::toString).collect(Collectors.joining("\n")));
    }
}
if (registered.get() > 0) {
```
There is kind of a race here. It feels like we should just have the thread always run and always reschedule the task? It's not like a 2-second periodic task is going to cause performance issues for the integration tests.
The race is:
- The generic thread pool task checks and sees 0 registered.
- A new thread calls register, increments registered, does the other stuff, then fails on the running compare-and-set (because running is still true).
- The generic thread pool task finishes, setting running to false.
Obviously the next register call will fix that, so things should eventually work out. But I don't see significant value in making the concurrency handling this complicated when a periodic task seems fine.
Fair point, I was too paranoid here I guess :) Significantly simplified this now to always reschedule, and just added a flag to make it stop.
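The simplification agreed on above — drop the registered-count start/stop dance and just have the check task unconditionally reschedule itself until a stop flag is set — could be sketched like this (hypothetical class and method names, not the PR's actual code):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the simplified scheme: the check task always reschedules itself
// at a fixed interval; a single stop flag terminates the loop. No racy
// compare-and-set against a registered-count is needed.
public class SelfReschedulingChecker {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final AtomicBoolean stopped = new AtomicBoolean(false);
    private final Runnable check;
    private final long intervalMillis;

    public SelfReschedulingChecker(Runnable check, long intervalMillis) {
        this.check = check;
        this.intervalMillis = intervalMillis;
    }

    public void start() {
        schedule();
    }

    private void schedule() {
        scheduler.schedule(() -> {
            if (stopped.get()) {
                return;      // the flag is the only stop condition
            }
            check.run();     // e.g. dump stacks of slow network threads
            schedule();      // unconditionally reschedule
        }, intervalMillis, TimeUnit.MILLISECONDS);
    }

    public void stop() {
        stopped.set(true);
        scheduler.shutdown();
    }
}
```

Because the task runs whether or not any thread is registered, the checks-sees-zero race described above simply cannot occur, at the cost of one cheap periodic no-op.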
LGTM
thanks @tbrooks8 !