
Track histogram of transport handling times #80581

Merged

Conversation

@DaveCTurner (Contributor):

Adds to the transport node stats a record of the distribution of the
times for which a transport thread was handling a message, represented
as a histogram.

Closes #80428

@DaveCTurner added the >enhancement, :Distributed Coordination/Network (Http and internode communication implementations), and v8.1.0 labels on Nov 10, 2021.
@elasticmachine added the Team:Distributed (Obsolete) label (meta label for the distributed team, since replaced by Distributed Indexing/Coordination) on Nov 10, 2021.
@elasticmachine (Collaborator):

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner (Contributor, Author) left a comment:

I left some questions

```java
public class HandlingTimeTracker {

    public static int[] getBucketUpperBounds() {
        // Default clock resolution is 200ms so we have buckets for the 0-tick and 1-tick cases, then go up in powers of two
```
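For orientation, here is a minimal sketch of how such a power-of-two histogram could work. The bucket count, the LongAdder counters, and the bit-trick bucket selection are illustrative assumptions, not necessarily what the PR ships:

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch of a power-of-two handling-time histogram.
class HandlingTimeHistogramSketch {
    private static final int BUCKET_COUNT = 18; // assumed: 17 bounds plus one overflow bucket

    private final LongAdder[] buckets = new LongAdder[BUCKET_COUNT];

    HandlingTimeHistogramSketch() {
        for (int i = 0; i < BUCKET_COUNT; i++) {
            buckets[i] = new LongAdder();
        }
    }

    static int[] getBucketUpperBounds() {
        // Exclusive upper bounds at 1, 2, 4, ... ms, so a zero measurement
        // (sub-tick on the cached clock) gets its own bucket and longer times
        // spread over powers of two.
        int[] bounds = new int[BUCKET_COUNT - 1];
        for (int i = 0; i < bounds.length; i++) {
            bounds[i] = 1 << i;
        }
        return bounds;
    }

    void addHandlingTime(long millis) {
        // Bucket index is the position of the highest set bit, clamped into range.
        int bucket = millis <= 0 ? 0 : 64 - Long.numberOfLeadingZeros(millis);
        buckets[Math.min(bucket, BUCKET_COUNT - 1)].increment();
    }
}
```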
@DaveCTurner (Contributor, Author):

Should we use the raw time rather than the cached time so that we get better resolution?

Member:

Hmm, we'd have to if we actually want fine-grained resolution. I'd say it's OK: having fine-grained information here is quite valuable/interesting, and relative to the cost of reading from a channel etc., System.nanoTime should be trivial.
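To make the trade-off concrete, here is an illustrative sketch (not the PR's code) contrasting the two clocks: the cached clock is refreshed by a background task on a roughly 200ms interval, so reading it is cheap but sub-tick durations measure as zero, while System.nanoTime() pays for a real clock read on every call:

```java
import java.util.concurrent.TimeUnit;

// Illustrative contrast between a cached clock and raw nanoTime measurements.
class ClockResolutionSketch {
    // Stand-in for the cached clock; in Elasticsearch this role is played by
    // threadPool.relativeTimeInMillis(), refreshed on a ~200ms interval by default.
    static volatile long cachedTimeMillis = System.currentTimeMillis();

    static long measureWithCachedClock(Runnable work) {
        long start = cachedTimeMillis;   // just a volatile read
        work.run();
        return cachedTimeMillis - start; // 0 unless the clock ticked while working
    }

    static long measureWithRawClock(Runnable work) {
        long start = System.nanoTime();  // an actual clock read per call
        work.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }
}
```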

@@ -199,6 +209,7 @@ private void maybeLogSlowMessage(boolean success) {
final long logThreshold = slowLogThresholdMs;
if (logThreshold > 0) {
final long took = threadPool.relativeTimeInMillis() - startTime;
handlingTimeTracker.addHandlingTime(took);
@DaveCTurner (Contributor, Author):

This may be spurious since it counts time spent waiting for the channel to become writeable (cf #77838). Should we track it separately from the inbound time tracking?

Member:

I don't think it's going to be possible to differentiate between waiting for the channel to become writable and time spent actually writing (I know I promised differently a while back, sorry about that). Definitely not easily. If you think about it, you could run into a non-writable channel and start counting from there, but once it becomes writable again you might not be first in line to get your bytes flushed: some other write can come before you and turn the channel non-writable again, and that other write takes CPU for TLS and such, making it really hard to cleanly define what time was spent waiting.

I really like the current number for the simple fact that it gives an indication of the overall latency on a transport thread (while the inbound handler check indicates individual per-message slowness). I don't really see how we could cleanly identify that a channel isn't writable for an extended period of time and pin that on the network.

@DaveCTurner (Contributor, Author):

Yeah, acknowledging that we can't easily compute exactly what we want, but I still worry that we're putting two different numbers into the one histogram. Should we have two histograms, one for inbound things (which is purely handling time) and one for outbound things (which potentially includes channel-blocked time)?

Member:

> Should we have two histograms, one for inbound things (which is purely handling-time) and one for outbound things (which potentially includes channel-blocked time)?

Yeah, that would be optimal actually. Can we do that here? :)

@DaveCTurner (Contributor, Author):

👍 done in c775e4e.

Another related observation is that we're not tracking outbound time for HTTP responses AFAICT. Should we? Can we?
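A hedged sketch of what the split might look like at the call sites; the class wrapper, field names, and method names here are illustrative assumptions, not necessarily what commit c775e4e does:

```java
// Illustrative only: separate trackers so channel-blocked outbound time doesn't
// pollute the purely-handling-time inbound distribution.
class SplitHistogramsSketch {
    private final HandlingTimeTracker inboundHandlingTimeTracker = new HandlingTimeTracker();
    private final HandlingTimeTracker outboundHandlingTimeTracker = new HandlingTimeTracker();

    // Inbound path: time spent handling a received message on the transport thread.
    void onInboundHandlingDone(long tookMillis) {
        inboundHandlingTimeTracker.addHandlingTime(tookMillis);
    }

    // Outbound path: may also include time spent waiting for the channel to
    // become writable (cf. #77838), so it goes in its own histogram.
    void onOutboundSendDone(long tookMillis) {
        outboundHandlingTimeTracker.addHandlingTime(tookMillis);
    }
}
```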

Member:

> Should we? Can we?

We could. I guess it would be nice to have, but probably not all that valuable: the distribution on the outbound side for HTTP will be the same as that for sending transport messages. For REST outbound I'd almost rather have the distribution of serialization times, because those have historically been the problem, and it would give us more information when hunting for why we have a bad distribution of outbound handling times.
IMO that'd be a worthwhile follow-up.

@original-brownbear (Member) left a comment:

I think I'd rather have a finer resolution here and eat the cost of uncached time lookups, but other than that LGTM (just some random questions/comments).
A finer resolution seems really valuable when it comes to debugging performance issues: if we can only distinguish over/under 200ms, that wouldn't even let us cover current Authz issues much (they'd all fall under 200ms but still be an issue), nor would it allow measuring improvements. WDYT?


```java
public int[] getHandlingTimeBucketBounds() {
    final int[] bounds = new int[handlingTimeBucketBounds.length];
    System.arraycopy(handlingTimeBucketBounds, 0, bounds, 0, handlingTimeBucketBounds.length);
    return bounds;
}
```
Member:

NIT: I suppose you could use Arrays.copyOf or clone to shorten these to a single line :)
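For example, either one-liner should be equivalent here (a sketch of the suggestion, not the committed code):

```java
public int[] getHandlingTimeBucketBounds() {
    return handlingTimeBucketBounds.clone();
    // or: return Arrays.copyOf(handlingTimeBucketBounds, handlingTimeBucketBounds.length);
}
```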

@@ -42,6 +57,14 @@ public TransportStats(StreamInput in) throws IOException {
rxSize = in.readVLong();
txCount = in.readVLong();
txSize = in.readVLong();
if (in.getVersion().onOrAfter(Version.V_8_1_0)) {
handlingTimeBucketFrequencies = in.readLongArray();
handlingTimeBucketBounds = in.readIntArray();
Member:

Why write these over the wire when they're hard coded per version anyway? :) I guess we need it to make BwC safe?
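Presumably yes: in a mixed-version cluster an 8.1 node can exchange stats with older nodes, so both sides need the version gate. A sketch of the matching write side, with the surrounding fields elided:

```java
// Sketch of the write side mirroring the version gate in the read path above.
@Override
public void writeTo(StreamOutput out) throws IOException {
    // ... existing rx/tx counters written as before ...
    if (out.getVersion().onOrAfter(Version.V_8_1_0)) {
        // Older nodes don't know about the histogram fields, so only send them to 8.1+.
        out.writeLongArray(handlingTimeBucketFrequencies);
        out.writeIntArray(handlingTimeBucketBounds);
    }
}
```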

@DaveCTurner (Contributor, Author) left a comment:

Everything addressed, I think.


@original-brownbear (Member) left a comment:

LGTM, I'm assuming the BwC test failure is unrelated.

@DaveCTurner (Contributor, Author):

SLES workers are broken, it seems. @elasticmachine please run elasticsearch-ci/rest-compatibility

@DaveCTurner (Contributor, Author):

SLES workers still failing to report successful builds.

@elasticmachine please run elasticsearch-ci/part-1
@elasticmachine please run elasticsearch-ci/rest-compatibility

@DaveCTurner DaveCTurner merged commit 54e0370 into elastic:master Nov 29, 2021
@DaveCTurner DaveCTurner deleted the 2021-11-05-blocked-time-histogram branch November 29, 2021 15:41
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Nov 30, 2021
* upstream/master: (150 commits)
  Fix ComposableIndexTemplate equals when composed_of is null (elastic#80864)
  Optimize DLS bitset building for matchAll query (elastic#81030)
  URL option for BaseRunAsSuperuserCommand (elastic#81025)
  Less Verbose Serialization of Snapshot Failure in SLM Metadata (elastic#80942)
  Fix shadowed vars pt7 (elastic#80996)
  Fail shards early when we can detect a type missmatch (elastic#79869)
  Delegate Ref Counting to ByteBuf in Netty Transport (elastic#81096)
  Clarify `unassigned.reason` docs (elastic#81017)
  Strip blocks from settings for reindex targets (elastic#80887)
  Split off the values supplier for ScriptDocValues (elastic#80635)
  [ML] Switch message and detail for model snapshot deprecations (elastic#81108)
  [DOCS] Update xrefs for snapshot restore docs (elastic#81023)
  [ML] Updates visiblity of validate API (elastic#81061)
  Track histogram of transport handling times (elastic#80581)
  [ML] Fix datafeed preview with remote indices (elastic#81099)
  [ML] Fix acceptable model snapshot versions in ML deprecation checker (elastic#81060)
  [ML] Add logging for failing PyTorch test (elastic#81044)
  Extending the timeout waiting for snapshot to be ready (elastic#81018)
  [ML] Fix incorrect logging of unexpected model size error (elastic#81089)
  [ML] Make inference timeout test more reliable (elastic#81094)
  ...

# Conflicts:
#	server/src/main/java/org/elasticsearch/index/mapper/NumberFieldMapper.java
arteam added a commit to arteam/elasticsearch that referenced this pull request Oct 14, 2024:
Transport handling times were added in elastic#80581 (8.1), so we don't need assertions for versions prior to that in 9.0.
Labels
>enhancement; :Distributed Coordination/Network (Http and internode communication implementations); Team:Distributed (Obsolete; replaced by Distributed Indexing/Coordination); v8.1.0
Development
Successfully merging this pull request may close these issues:
Add a Histogram of Transport Worker Time that is Spent per-Message (#80428)
3 participants