
add functionality for an automatic thread dump upon thread pool exhaustion #137

Open · wants to merge 2 commits into base: 2.4

Conversation

@aogburn (Contributor) commented Oct 12, 2023

No description provided.

@dmlloyd (Contributor) left a comment

As I said in a comment, the pool filling up is not an error condition. But I wonder if maybe we should instead have a periodic deadlock detection task, or something like that, if deadlocks are the problem being detected here?
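(For reference, a minimal sketch of what such a periodic deadlock detection task might look like, using only the standard ThreadMXBean API; the class name, scheduling interval, and output target are illustrative and not part of this PR.)

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class DeadlockDetector {

    // Polls for deadlocked threads every 30 seconds and prints their info if any are found.
    static void scheduleChecks() {
        final ThreadMXBean mxBean = ManagementFactory.getThreadMXBean();
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            final long[] deadlockedIds = mxBean.findDeadlockedThreads(); // null when no deadlock exists
            if (deadlockedIds != null) {
                for (ThreadInfo info : mxBean.getThreadInfo(deadlockedIds, Integer.MAX_VALUE)) {
                    System.err.println(info);
                }
            }
        }, 30, 30, TimeUnit.SECONDS);
    }
}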

@@ -1514,6 +1520,9 @@ int tryAllocateThread(final float growthResistance) {
            oldSize = currentSizeOf(oldStat);
            if (oldSize >= maxSizeOf(oldStat)) {
                // max threads already reached
+               if (reachedMax.compareAndSet(false, true)) {
+                   ThreadDumpUtil.threadDump(maxSizeOf(oldStat));
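(ThreadDumpUtil itself is not shown in this hunk. A rough sketch of what a helper with this signature could look like, assuming a java.lang.management-based approach; the output format and destination here are illustrative only.)

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

final class ThreadDumpUtil {

    private ThreadDumpUtil() {}

    // Collects stack traces for all live threads and writes them to stderr;
    // a real utility would route this through the project's logger instead.
    static void threadDump(final int maxThreads) {
        final ThreadMXBean mxBean = ManagementFactory.getThreadMXBean();
        final ThreadInfo[] infos = mxBean.dumpAllThreads(true, true); // include locked monitors/synchronizers
        final StringBuilder dump = new StringBuilder();
        dump.append("Thread pool has reached ").append(maxThreads).append(" max threads in use. Thread dump:\n");
        for (ThreadInfo info : infos) {
            dump.append('"').append(info.getThreadName()).append("\" ").append(info.getThreadState()).append('\n');
            for (StackTraceElement element : info.getStackTrace()) {
                dump.append("\tat ").append(element).append('\n');
            }
            dump.append('\n');
        }
        System.err.print(dump);
    }
}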
Contributor:

This shouldn't be unconditional. A thread pool reaching capacity is not an error condition; it just means that the behavior changes from submitting tasks to new or waiting threads to instead adding the task to a queue that the worker threads read from.

Contributor Author:

What should the condition be? This has a boolean so it only logs on the first occurrence of exhaustion. But if such a thread dump utility is accepted, I think there should be some further configuration required to enable it per pool. That would then prompt another change, primarily in XNIO, to enable this thread dumping for its task pool where such a dump is desired. Such a dump would likely not be desired from the MSC thread pool upon its exhaustion, for instance.

While queuing upon exhaustion is the expected behavior, the reality for customers using JBoss and the support engineers assisting them is that this is universally a critical issue state for the XNIO task thread pool used by Undertow: it immediately begins to degrade response times, if not lead to full request unresponsiveness, when already-processing requests are not completing quickly. We often see multiple cases a day requiring diagnosis of such thread exhaustion, induced by many varied issues and most often not by a conventional deadlock. Further details are recorded at https://issues.redhat.com/browse/EAPSUP-1325.

Contributor:

The condition, as you suggest, should be a configuration switch. I'd also propose either two switches or a single three-state switch: one setting for whether a warning is logged when the thread pool queues, and another for whether that warning includes the thread dump.
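(A minimal sketch of the three-state switch idea; the names and trigger point are hypothetical, not what this PR necessarily settles on.)

enum ExhaustionLogging {
    NONE,          // stay silent when the pool reaches max threads and begins queueing
    WARN,          // log a one-time warning
    WARN_AND_DUMP  // log the warning and append a full thread dump
}

final class ExhaustionPolicy {

    private final ExhaustionLogging mode;

    ExhaustionPolicy(final ExhaustionLogging mode) {
        this.mode = mode;
    }

    // Invoked the first time the pool is observed at max size.
    void onPoolExhausted(final int maxThreads) {
        if (mode == ExhaustionLogging.NONE) {
            return;
        }
        System.err.printf("Thread pool has reached %d max threads in use%n", maxThreads);
        if (mode == ExhaustionLogging.WARN_AND_DUMP) {
            ThreadDumpUtil.threadDump(maxThreads); // a thread-dump helper like the one sketched above
        }
    }
}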

Contributor Author:

Improved this to make it configurable. Let me know if you have any suggestions.

@@ -97,6 +97,25 @@ interface Messages extends BasicLogger {
    @LogMessage(level = Logger.Level.ERROR)
    void taskSubmitFailed(@Cause RejectedExecutionException e, Runnable task);

+   @Message(id = 15, value = "Thread pool has reached %d max threads in use. Performance may be impacted. Thread dump:\n"+
Contributor:

These messages probably shouldn't have IDs like this; a thread dump should probably just go directly to the output, or be formatted otherwise like a normal JDK thread dump.

Contributor Author:

For reference, IDs have previously been assigned to auto-generated thread dump logging functions, as in EAP/wildfly-core itself:

https://github.com/aogburn/wildfly-core/blob/b3cfcb741c8ced604d8895de48ec920fc130057d/controller/src/main/java/org/jboss/as/controller/logging/ControllerLogger.java#L3710

Or in activemq-artemis:

https://github.com/apache/activemq-artemis/blob/main/artemis-server/src/main/java/org/apache/activemq/artemis/core/server/ActiveMQServerLogger.java#L901

Message IDs can be quite helpful to support, and I and others use them in simple tooling (https://github.com/aogburn/yala) to quickly search for known critical message IDs or summarize key information in logging. So from a usability perspective, I'd request that it have an ID.

I think that to be formatted exactly like a normal JDK thread dump, it would probably have to be generated through the normal JDK utilities, like forking an external process that calls the JDK jstack command if available (in which case a dump could be missed if someone has not installed the jdk-devel package or is running a JRE).
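(A hypothetical illustration of the fork-jstack approach described above, with the caveat already noted: it fails when only a JRE is installed. It uses Java 9+ APIs for brevity and is not part of this PR.)

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class JstackDump {

    // Runs $JAVA_HOME/bin/jstack against the current process and returns its output.
    static String dumpViaJstack() throws IOException, InterruptedException {
        final long pid = ProcessHandle.current().pid(); // Java 9+
        final Path jstack = Paths.get(System.getProperty("java.home"), "bin", "jstack");
        if (!Files.isExecutable(jstack)) {
            throw new IOException("jstack not found at " + jstack); // JRE-only installs land here
        }
        final Process process = new ProcessBuilder(jstack.toString(), Long.toString(pid))
                .redirectErrorStream(true)
                .start();
        final String output = new String(process.getInputStream().readAllBytes());
        process.waitFor();
        return output;
    }
}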

Contributor:

OK, then let's make it one big message with one ID instead of multiple. Loggers can accept a StringBuilder as a string argument, if that is helpful.
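(A sketch of the "one big message with one ID" shape, assuming the jboss-logging annotation style already used in the Messages interface; the id value and wording here are illustrative only.)

    // Inside the existing Messages interface; the whole dump is passed as a single argument,
    // e.g. a StringBuilder, so only one message ID is consumed.
    @LogMessage(level = Logger.Level.WARN)
    @Message(id = 15, value = "Thread pool has reached %d max threads in use. Performance may be impacted. Thread dump:\n%s")
    void threadPoolExhausted(int maxThreads, CharSequence threadDump);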

@aogburn (Contributor Author), Oct 20, 2023:

Changed this to just one message id for the thread dump messages. I did leave one additional id to handle any exception from a dump attempt. Is that suitable?

Contributor Author:

@dmlloyd Just wanted to check if you'd have any further feedback here? Thanks!

Contributor:

I've been ruminating on this. For too long I guess. :) I have a couple more comments.

@dmlloyd (Contributor) commented Dec 6, 2023

I'm trying to reconcile the design issues around what it means to exhaust a pool, with the practical correlation between a pool with queued tasks and performance problems.

There are many, many ways that a thread pool can be expected to work in the presence of a variety of workloads. I'm just not ready to declare that having queued tasks means we've hit a "knee" in the performance graph in all, or even most, cases. And yet there is a definite observation that this correlation does exist in the real world, at least in some cases. So, how to resolve the dilemma? I have a couple of ideas.

  • Could we consider just having a general-purpose dumpAllThreadStacks utility method on EQE? Then any debugging code in any situation can easily trigger a thread dump, and we could have variations for writing to a file, an output stream, or the log.
  • As a part of WildFly configuration, could we configure a monitor (perhaps via a dedicated thread) to poll the thread pool state periodically and trigger the thread stack dump based on the configured conditions? (For example it could be a certain queue depth, a certain pool size, etc.)

WDYT?
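(A rough sketch of the dedicated-monitor idea from the second bullet above; the queue-depth supplier, threshold, and 10-second period are illustrative placeholders for whatever metrics and configuration the real executor would expose.)

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.IntSupplier;

final class PoolMonitor {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                final Thread t = new Thread(r, "pool-monitor");
                t.setDaemon(true);
                return t;
            });

    // Polls the supplied queue-depth metric every 10 seconds and runs the dump action
    // whenever the configured threshold is met.
    void start(final IntSupplier queueDepth, final int threshold, final Runnable dumpAction) {
        scheduler.scheduleAtFixedRate(() -> {
            if (queueDepth.getAsInt() >= threshold) {
                dumpAction.run(); // e.g. a thread-dump helper like the one sketched earlier
            }
        }, 10, 10, TimeUnit.SECONDS);
    }
}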

@carterkozak (Contributor):
I've been following this conversation from the sidelines, figured I'd share my thoughts:

When a webserver worker pool becomes fully saturated and queues work, a thread dump is often helpful for debugging. However, there are many scenarios outside the webserver use case where this executor is used and is designed to expect queueing in order to limit concurrent task execution. I think part of the issue is that we have a specific failure mode in which the queueing predicate is met and a thread dump is helpful, but we don't have the data to understand how often queueing occurs when we're not in a degenerate state that requires debugging.

As an aside, if queueing suggests a serious problem, would it be better not to allow any queueing? The queue allows us to linearize spiky workloads at the cost of predictable response times -- in some cases we can better distribute that latency cost by increasing the maximum thread count instead (at the limit, to the previous max-threads + queue-size values, and removing the queue) such that the scheduler may allow all work to progress (though thread resource costs must be considered).
Integrating with Wildfly instead of the executor itself allows for more interesting experimentation due to greater context into the specific webserver use case: it may not be problematic if a large spike of near-instantaneous requests is received, queued, and processed quickly, so perhaps monitoring the queue with a temporal bound would produce a stronger signal than a predicate on any queueing at all.
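(A small sketch of the temporal-bound idea: only treat queueing as a problem once the queue has stayed non-empty past a grace period. The class and method names are illustrative.)

import java.util.concurrent.TimeUnit;

final class QueueingWatchdog {

    private final long gracePeriodNanos;
    private long queueingSinceNanos = -1L; // -1 means "not currently queueing"

    QueueingWatchdog(final long gracePeriod, final TimeUnit unit) {
        this.gracePeriodNanos = unit.toNanos(gracePeriod);
    }

    // Called periodically (e.g. from a monitor thread) with the current queue size.
    // Returns true only once the queue has stayed non-empty past the grace period.
    boolean shouldDump(final int queueSize) {
        final long now = System.nanoTime();
        if (queueSize == 0) {
            queueingSinceNanos = -1L; // the spike drained quickly; no dump warranted
            return false;
        }
        if (queueingSinceNanos == -1L) {
            queueingSinceNanos = now; // queueing just started; start the clock
            return false;
        }
        return now - queueingSinceNanos >= gracePeriodNanos;
    }
}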

@aogburn (Contributor Author) commented Dec 6, 2023

Thanks @dmlloyd.

There are many many ways that a thread pool can be expected to work in the presence of a variety of workloads. I'm just not ready to declare that having queued tasks means that we've hit a "knee" in the performance graph in all or even most cases.

Which is why it can be disabled by default and enabled in an appropriate context where needed (like being enabled by default by Undertow/XNIO for easier web request slowness RCAs). It could also have other configurable capture points if we wish (a certain queue depth, a duration spent at a queued amount, a time gap since the last task was fetched from the queue...). But my request on behalf of support and customers is really just limited to the context of an automatic thread exhaustion dump for the vast majority of the web request slowness RCA cases we see. I certainly understand that, from a general development standpoint, a thread pool can have varied uses and behaviors that affect whether an automatic thread dump is ever desired. But I'm not aware of anyone else in any other context requesting automatic dumps (at least not that I come across in the day-to-day work with lots of JBoss customers). So I think trying to expound too much on automatic dump capabilities and trigger points for other, potentially varied workloads outside of my web request context may ultimately lead to overthinking solutions to problems that don't exist or aren't significant in comparison.

Could we consider just having a general-purpose dumpAllThreadStacks utility method on EQE?

We could, but this would only have the desired impact if it is ultimately hooked into something that automatically invokes the dump with default EAP configurations, without requiring any other manual commands or config changes (like Undertow/XNIO turning on the dump for its task pool by default). And I think thread exhaustion would be the most universally beneficial default capture point for web requests. If this enhancement gets back to requiring manual configuration or manual intervention at issue time, then data would still be missed in issue states and it defeats the purpose of implementing it.

As a part of WildFly configuration, could we configure a monitor (perhaps via a dedicated thread) to poll the thread pool state periodically and trigger the thread stack dump based on the configured conditions?

I suppose, but the WildFly/EAP/XNIO/Undertow teams are already happy to absorb such an auto-dumping feature if it is in the EQE itself. So if we really want to push this back to some additional WildFly-level feature, then it would have to go back through the WildFly/EAP teams for review and would likely take more overall Eng/QE resources to get out the door, for likely the same end result (capturing an automatic dump for typical web request thread exhaustion RCAs).

As an aside, if queueing suggests a serious problem, would it be better not to allow any queueing?

The problem is not specifically the queuing; it's the unsuccessful web requests once the thread pool is exhausted. We can already disable the queuing, but that just trades the symptom of indeterminately growing response times in the queue (EAP 7's default) for immediately rejected requests (EAP 6's default). Neither symptom is really tolerable for the majority of typical web request workloads, so it leads to an outage-type RCA either way. For a possibly temporary thread exhaustion issue, queuing can lead to impacts that are smaller or less visible, with possibly small response time impacts, compared to easily visible immediate rejections. But a lot of the nominal-state web request workloads we see do not oscillate often between exhaustion/queuing and no queuing. They're generally tuned to avoid thread exhaustion and any queuing, so once queuing occurs it becomes an obvious point of degraded response times compared to the nominal no-queuing state.
