Track histogram of transport handling times #80581
Conversation
Adds to the transport node stats a record of the distribution of the times for which a transport thread was handling a message, represented as a histogram. Closes elastic#80428
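For readers skimming the thread, here is a minimal sketch of the general technique, using hypothetical names and bucket values rather than the actual HandlingTimeTracker added by this PR: fixed bucket upper bounds growing in powers of two, with one lock-free counter per bucket.

import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch of a handling-time histogram; the real HandlingTimeTracker
// in this PR may use different bucket bounds and internals.
class HandlingTimeHistogramSketch {
    private static final int BUCKET_COUNT = 18;
    private final AtomicLongArray frequencies = new AtomicLongArray(BUCKET_COUNT);

    // Upper bounds in milliseconds: 1, 2, 4, ... doubling per bucket (illustrative values);
    // the final bucket is unbounded and catches everything larger.
    static int[] getBucketUpperBounds() {
        final int[] bounds = new int[BUCKET_COUNT - 1];
        for (int i = 0; i < bounds.length; i++) {
            bounds[i] = 1 << i;
        }
        return bounds;
    }

    void addHandlingTime(long millis) {
        final int[] bounds = getBucketUpperBounds();
        for (int i = 0; i < bounds.length; i++) {
            if (millis < bounds[i]) {
                frequencies.incrementAndGet(i);
                return;
            }
        }
        frequencies.incrementAndGet(BUCKET_COUNT - 1); // overflow bucket
    }
}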
Pinging @elastic/es-distributed (Team:Distributed)
I left some questions
public class HandlingTimeTracker {

    public static int[] getBucketUpperBounds() {
        // Default clock resolution is 200ms so we have buckets for the 0-tick and 1-tick cases, then go up in powers of two
Should we use the raw time rather than the cached time so that we get better resolution?
Hmm, we'd have to if we actually want fine-grained resolution. I'd say it's ok: having fine-grained information here is quite valuable/interesting, and relative to the cost of reading from a channel etc., System.nanoTime should be trivial.
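For illustration, a raw-clock measurement at the call site might look roughly like the following (a sketch only, assuming java.util.concurrent.TimeUnit is imported; the PR as posted uses the cached threadPool.relativeTimeInMillis()).

// Sketch: measure with the raw clock instead of the ~200ms cached clock.
final long startNanos = System.nanoTime();
// ... handle the message ...
final long tookMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
handlingTimeTracker.addHandlingTime(tookMillis);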
@@ -199,6 +209,7 @@ private void maybeLogSlowMessage(boolean success) {
    final long logThreshold = slowLogThresholdMs;
    if (logThreshold > 0) {
        final long took = threadPool.relativeTimeInMillis() - startTime;
        handlingTimeTracker.addHandlingTime(took);
This may be spurious since it counts time spent waiting for the channel to become writeable (cf #77838). Should we track it separately from the inbound time tracking?
I don't think it's really going to be possible to differentiate between waiting for the channel to become writable and time spent actually writing (I know I promised differently a while back, sorry about that). Definitely not easily. If you think about it, you could run into a non-writable channel and start counting from there, but once it becomes writable again you might not be the first in line to get your bytes flushed: some other write may come before you and turn the channel not-writable again. And that other write takes CPU for TLS and such, making it really hard to cleanly define what time was spent waiting.
I really like the current number for the simple fact that it gives an indication of the overall latency on a transport thread (while the inbound handler check indicates individual per-message slowness). I don't really see how we could cleanly identify that a channel wasn't writable for an extended period of time and pin that on the network.
Yeah acking that we can't easily compute exactly what we want, but I still worry that we're putting two different numbers into the one histogram. Should we have two histograms, one for inbound things (which is purely handling-time) and one for outbound things (which potentially includes channel-blocked time)?
Should we have two histograms, one for inbound things (which is purely handling-time) and one for outbound things (which potentially includes channel-blocked time)?
Yea that would be optimal actually. Can we do that here? :)
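Roughly, the shape of that change could be as follows (hypothetical field names; not necessarily what the later commit does).

// Sketch: one histogram per direction, fed from the respective call sites.
private final HandlingTimeTracker inboundHandlingTimeTracker = new HandlingTimeTracker();
private final HandlingTimeTracker outboundHandlingTimeTracker = new HandlingTimeTracker();

// Inbound path: pure per-message handling time on the transport thread.
inboundHandlingTimeTracker.addHandlingTime(tookMillis);

// Outbound path: may also include time spent waiting for the channel to become writable.
outboundHandlingTimeTracker.addHandlingTime(tookMillis);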
👍 done in c775e4e.
Another related observation is that we're not tracking outbound time for HTTP responses AFAICT. Should we? Can we?
Should we? Can we?
We could. I guess it would be nice to have but probably not all that valuable: the distribution on the outbound side for HTTP will be the same as that for sending transport messages. For REST outbound I'd almost rather have the distribution of serialization times, because those have historically been the problem and it would give us more information when hunting down why we have a bad distribution of outbound handling times.
IMO that'd be a worthwhile follow-up.
I think I'd rather have a finer resolution here and eat the cost of uncached time lookups, but other than that LGTM (just some random questions/comments).
A finer resolution seems really valuable when it comes to debugging performance issues. If we're just over-under 200ms, then that wouldn't even allow us to cover current Authz issues much (they'd all fall under 200ms but still be an issue), nor would it allow measuring improvements. WDYT?
public int[] getHandlingTimeBucketBounds() {
    final int[] bounds = new int[handlingTimeBucketBounds.length];
    System.arraycopy(handlingTimeBucketBounds, 0, bounds, 0, handlingTimeBucketBounds.length);
    return bounds;
NIT: I suppose you could use Arrays.copyOf or clone to shorten these to a single line :)
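For example, either of these one-liners makes the same defensive copy (a sketch, assuming java.util.Arrays is imported).

public int[] getHandlingTimeBucketBounds() {
    return Arrays.copyOf(handlingTimeBucketBounds, handlingTimeBucketBounds.length);
    // or equivalently: return handlingTimeBucketBounds.clone();
}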
@@ -42,6 +57,14 @@ public TransportStats(StreamInput in) throws IOException {
    rxSize = in.readVLong();
    txCount = in.readVLong();
    txSize = in.readVLong();
    if (in.getVersion().onOrAfter(Version.V_8_1_0)) {
        handlingTimeBucketFrequencies = in.readLongArray();
        handlingTimeBucketBounds = in.readIntArray();
Why write these over the wire when they're hard coded per version anyway? :) I guess we need it to make BwC safe?
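For reference, the write side presumably mirrors the same version gate; a hedged sketch of that shape (assumed, not copied from the PR) is below. Only nodes new enough to read the arrays receive them, which is what keeps a mixed-version cluster safe even though the bounds are effectively hard-coded per version.

@Override
public void writeTo(StreamOutput out) throws IOException {
    // ... rx/tx counters written as before ...
    if (out.getVersion().onOrAfter(Version.V_8_1_0)) {
        // sketch: older nodes simply don't receive the histogram fields
        out.writeLongArray(handlingTimeBucketFrequencies);
        out.writeIntArray(handlingTimeBucketBounds);
    }
}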
Everything addressed I think
LGTM, I'm assuming the BwC test failure is unrelated.
…1-05-blocked-time-histogram
SLES workers are broken it seems. @elasticmachine please run elasticsearch-ci/rest-compatibility
SLES workers still failing to report successful builds. @elasticmachine please run elasticsearch-ci/part-1
* upstream/master: (150 commits)
  Fix ComposableIndexTemplate equals when composed_of is null (elastic#80864)
  Optimize DLS bitset building for matchAll query (elastic#81030)
  URL option for BaseRunAsSuperuserCommand (elastic#81025)
  Less Verbose Serialization of Snapshot Failure in SLM Metadata (elastic#80942)
  Fix shadowed vars pt7 (elastic#80996)
  Fail shards early when we can detect a type missmatch (elastic#79869)
  Delegate Ref Counting to ByteBuf in Netty Transport (elastic#81096)
  Clarify `unassigned.reason` docs (elastic#81017)
  Strip blocks from settings for reindex targets (elastic#80887)
  Split off the values supplier for ScriptDocValues (elastic#80635)
  [ML] Switch message and detail for model snapshot deprecations (elastic#81108)
  [DOCS] Update xrefs for snapshot restore docs (elastic#81023)
  [ML] Updates visiblity of validate API (elastic#81061)
  Track histogram of transport handling times (elastic#80581)
  [ML] Fix datafeed preview with remote indices (elastic#81099)
  [ML] Fix acceptable model snapshot versions in ML deprecation checker (elastic#81060)
  [ML] Add logging for failing PyTorch test (elastic#81044)
  Extending the timeout waiting for snapshot to be ready (elastic#81018)
  [ML] Fix incorrect logging of unexpected model size error (elastic#81089)
  [ML] Make inference timeout test more reliable (elastic#81094)
  ...

# Conflicts:
#	server/src/main/java/org/elasticsearch/index/mapper/NumberFieldMapper.java
* upstream/master: (55 commits)
  Fix ComposableIndexTemplate equals when composed_of is null (elastic#80864)
  Optimize DLS bitset building for matchAll query (elastic#81030)
  URL option for BaseRunAsSuperuserCommand (elastic#81025)
  Less Verbose Serialization of Snapshot Failure in SLM Metadata (elastic#80942)
  Fix shadowed vars pt7 (elastic#80996)
  Fail shards early when we can detect a type missmatch (elastic#79869)
  Delegate Ref Counting to ByteBuf in Netty Transport (elastic#81096)
  Clarify `unassigned.reason` docs (elastic#81017)
  Strip blocks from settings for reindex targets (elastic#80887)
  Split off the values supplier for ScriptDocValues (elastic#80635)
  [ML] Switch message and detail for model snapshot deprecations (elastic#81108)
  [DOCS] Update xrefs for snapshot restore docs (elastic#81023)
  [ML] Updates visiblity of validate API (elastic#81061)
  Track histogram of transport handling times (elastic#80581)
  [ML] Fix datafeed preview with remote indices (elastic#81099)
  [ML] Fix acceptable model snapshot versions in ML deprecation checker (elastic#81060)
  [ML] Add logging for failing PyTorch test (elastic#81044)
  Extending the timeout waiting for snapshot to be ready (elastic#81018)
  [ML] Fix incorrect logging of unexpected model size error (elastic#81089)
  [ML] Make inference timeout test more reliable (elastic#81094)
  ...
Transport handling times were added in elastic#80581 (8.1), so we don't need assertions for versions prior to that in 9.0.