This repository has been archived by the owner on Nov 14, 2024. It is now read-only.
Avoid creating thousands of get-ranges threads #5224
Merged
Our metrics show services with thousands of threads in the
`serializabletransactionmanager-get-ranges`
pool; however, the executor was not instrumented with Tritium, so it's not clear
how saturated it is, or whether it is used at all. Threads are incredibly
expensive, and it's generally a sign of failure when a service reaches
1000 total threads.
Using PTExecutors factories we get tracing and execution metrics
for free, as well as resource utilization improvements: each instance
shares a slice of an underlying cached executor, so threads are only
created as needed. The provided
`numThreads`
is still an upper limit for the ExecutorService instance, but idle threads can be
used elsewhere.
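To illustrate the idea of a bounded "view" over a shared cached executor, here is a minimal sketch in plain `java.util.concurrent`. This is not the actual PTExecutors implementation; the class name, semaphore-based cap, and queue handling are all illustrative. Tasks only borrow a thread from the shared pool while running, so capacity that one view isn't using stays available to others.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.Semaphore;

// Hypothetical sketch: a fixed-size view over a shared executor.
// At most numThreads tasks from this view run concurrently, but no
// threads are dedicated to the view while it is idle.
final class BoundedExecutorView {
    private final Executor delegate;
    private final Semaphore permits; // caps this view's concurrency
    private final Queue<Runnable> pending = new ConcurrentLinkedQueue<>();

    BoundedExecutorView(Executor delegate, int numThreads) {
        this.delegate = delegate;
        this.permits = new Semaphore(numThreads);
    }

    void execute(Runnable task) {
        pending.add(task);
        trySchedule();
    }

    private void trySchedule() {
        while (permits.tryAcquire()) {
            Runnable next = pending.poll();
            if (next == null) {
                permits.release();
                // A task may have arrived between poll() and release();
                // retry in that case, otherwise we're done.
                if (pending.isEmpty()) {
                    return;
                }
            } else {
                delegate.execute(() -> {
                    try {
                        next.run();
                    } finally {
                        permits.release();
                        trySchedule(); // drain any queued work
                    }
                });
            }
        }
    }
}
```

With this shape, ten views capped at 100 threads each can share a single cached pool that, in practice, grows only as large as the actual concurrent demand, rather than pinning 1000 threads up front.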
The existing queue-size warning logic is preserved using a counter
rather than instrumenting the queue itself, in much the same way
Tritium estimates ExecutorService queue size.
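The counter approach can be sketched as a thin delegating executor: increment an `AtomicInteger` on submission, decrement when the task leaves the queue and begins running, and warn past a threshold. The class name, getter, and threshold below are illustrative, not the PR's actual code; the point is that no access to the underlying queue is needed.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: estimate queue depth with a counter instead of
// calling getQueue().size(), which would require owning the queue.
final class QueueWarningExecutor implements Executor {
    private static final int WARN_THRESHOLD = 1000; // assumed limit
    private final Executor delegate;
    private final AtomicInteger queued = new AtomicInteger();

    QueueWarningExecutor(Executor delegate) {
        this.delegate = delegate;
    }

    @Override
    public void execute(Runnable task) {
        int size = queued.incrementAndGet();
        if (size > WARN_THRESHOLD) {
            System.err.println("Task queue length " + size
                    + " exceeds " + WARN_THRESHOLD + "; possible saturation");
        }
        delegate.execute(() -> {
            // The task has left the queue and is about to run.
            queued.decrementAndGet();
            task.run();
        });
    }

    int queuedTaskCount() {
        return queued.get();
    }
}
```

Because the counter is maintained at submit/start boundaries, it works regardless of which queue (or shared pool) backs the delegate.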
Goals (and why):
Vastly reduce memory overhead for several services.
Implementation Description (bullets):
Use the standard PTExecutors factory with a wrapper to support queue-size warnings. Ideally this would move to Hyperion instead, but that's out of scope here.
Testing (What was existing testing like? What have you done to improve it?):
No behavior change, only a reduction in resource utilization.
Concerns (what feedback would you like?):
Where should we start reviewing?:
Priority (whenever / two weeks / yesterday):