Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Record more detailed HTTP stats #99852

Merged
merged 25 commits into from
Oct 12, 2023
Merged

Conversation

ywangd
Copy link
Member

@ywangd ywangd commented Sep 25, 2023

This PR adds more details HTTP stats breaking down by HTTP routes.

Resolves: #95739

@ywangd ywangd added >enhancement :Distributed Coordination/Network Http and internode communication implementations v8.11.0 labels Sep 25, 2023
@elasticsearchmachine
Copy link
Collaborator

Hi @ywangd, I've created a changelog YAML for you.

@ywangd
Copy link
Member Author

ywangd commented Sep 25, 2023

@DaveCTurner This PR is marked as draft because I am seeking high level feedback on the approach. The issue (#95739) says

add to the node/cluster stats some histograms of response times and request/response sizes broken down by the HTTP endpoint

This PR currently has support for the request and response sizes. I'd appreciate to verify whether the proposed changes make sense. I'd also like to verify whether we are only interested in "response" times, i.e. how long it takes for the response to be ready after the HTTP request is dispatched? Thanks!

long[] requestSizeHistogram,
long responseCount,
long totalResponseSize,
long[] responseSizeHistogram
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think this is mentioned in the original issue, but is it possible or beneficial to track response statuses counts as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is relatively easy to track the number of different response status. But we are probably more interested in their recent trends rather than the overall stats from last restart because we want to know whether the node is "currently" experiencing problem. This means we need to compute some moving averages instead of the overall average which is what we are doing here for request/response sizes. I have not yet found an existing example of computing moving averages for stats collection. It's definitely doable. But I also wonder whether it starts getting into the territory of APM and should be handled externally. This might be why we haven't done it? I'll dig a bit more. For the purpose of this PR, I think it's better to keep them separate.

@@ -89,7 +89,8 @@ public void testHttpResponseMapper() {
.map(HttpStats::clientStats)
.map(Collection::stream)
.reduce(Stream.of(), Stream::concat)
.toList()
.toList(),
Map.of()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to the change itself, just surprised to see .map(Collection::stream).reduce(Stream.of(), Stream::concat). Is there any benefit over flatMap(Collection::stream)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. Sometimes a test can be deliberately written to avoid using the same pattern as the production code. But this does not seem to be the case either.

@ywangd
Copy link
Member Author

ywangd commented Sep 26, 2023

This PR is mostly ready. I'd like to get the approval for the overall approach before proceed with a few further polishments including (1) some more tests and (2) potentially merge the two classes of HttpRouteStats and TransportActionStats. Thanks!

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it. Left a few comments & suggestions.

Comment on lines 209 to 213
var result = chunkIterator.hasNext() == false;
if (result && sizeConsumer != null) {
sizeConsumer.accept(size);
sizeConsumer = null;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather we added some logic to DefaultRestChannel to grab the size when the response is closed - for instance we may not get all the way to isDone() before the client closes the channel.

Copy link
Member Author

@ywangd ywangd Sep 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the prompt. I have now moved these logic entirely into RestController and I think it looks better than before. I considered DefaultRestChannel as well, but I think having it done within RestController is better due to easier access to different pieces of information. Plese refer to 5c1fdb7

this.delegate = delegate;
this.circuitBreakerService = circuitBreakerService;
this.contentLength = contentLength;
this.methodHandlers = methodHandlers;
this.startTime = System.currentTimeMillis();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This clock is not monotonic, we should be using System::nanoTime (or, better, ThreadPool#rawRelativeTimeInMillis)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ThreadPool is not available in RestController. So I chose to use System::nanoTime. In fact, I copied what ThreadPool#rawRelativeTimeInMillis does. This method can potentially be static and reused here. But it seems some tests rely on it being an instance method so that it can be overriden. So I decided to just copy it for now since the duplication is minimal.
Please refer to 4a4cd4a

@ywangd ywangd marked this pull request as ready for review September 27, 2023 04:06
@ywangd ywangd requested a review from DaveCTurner September 27, 2023 04:06
@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Sep 27, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@ywangd ywangd requested a review from idegtiarenko September 27, 2023 04:06
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I left a few nits, the only real problem remaining is that we don't handle responses larger than Integer.MAX_VALUE properly yet.

public class HttpRouteStatsTracker {

/*
* default http.max_content_length is 100 MB so that the last histogram bucket is > 64MB (2^26)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the maximum request size but for responses we can return much more (maybe even GiBs) - suggest adding another 4 or 5 buckets at least.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. I added 4 more buckets so that the last bucket is for > 1.0GB.

Comment on lines 366 to 368
return StreamSupport.stream(Spliterators.spliteratorUnknownSize(handlers.allNodeValues(), Spliterator.ORDERED), false)
.filter(mh -> mh.getStats().requestCount() > 0 || mh.getStats().responseCount() > 0)
.collect(Maps.toUnmodifiableSortedMap(MethodHandlers::getPath, MethodHandlers::getStats));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing this with streams seems fairly awkward, I feel a regular loop to build a TreeMap would be simpler.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to use a while loop with the iterator to build a TreeMap.

@@ -89,7 +91,12 @@ public void testHttpResponseMapper() {
.map(HttpStats::clientStats)
.map(Collection::stream)
.reduce(Stream.of(), Stream::concat)
.toList()
.toList(),
nodeStats.stream().map(NodeStats::getHttp).map(HttpStats::httpRouteStats).reduce(Map.of(), (l, r) -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is just test code, but still I think we can simplify the two-maps-and-a-reduce by moving the mapped functions inside the lambda called within the reduce. Also this looks to do an awful lot of copying of maps to convert them between mutable and immutable versions, do we really need that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think moving the two maps inside reduce is more clunky because the reduce needs to change type and Java asks you to supply both accumulator and combinator which do pretty much the same thing for this use case. I ended up moving it outside and build it separately with a for loop (see here).


private final ChunkedRestResponseBody delegate;
private final Consumer<Integer> encodedLengthConsumer;
private int encodedLength = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could end up exceeding Integer.MAX_VALUE

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point. It is now handled with Math.addExact. The value is also capped to Integer.MAX_VALUE in case of overflow. I think it is OK to not track the exact number since (I think?) it is rare and we don't really need the exact number for histogram. The total response size will be off. But I feel that's OK.
Please see code changes here)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm I think that's going to be trappy, we might well want to add another bucket to the end of the histogram later. Why not just use long here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't think it is a case that requires accurate handling. But upgrading to long is also fine. I updated it in 49b95f8

@ywangd
Copy link
Member Author

ywangd commented Oct 3, 2023

@elasticmachine update branch

@ywangd
Copy link
Member Author

ywangd commented Oct 3, 2023

Failure is unrelated and tracked at #96805

@mattc58 mattc58 added v8.12.0 and removed v8.11.0 labels Oct 4, 2023
@ywangd
Copy link
Member Author

ywangd commented Oct 4, 2023

Bump for reviews. Thanks!

Btw, 8.11 branch just got created. I don't think we need this for 8.11 anyway. So 8.12 is just fine.

@ywangd
Copy link
Member Author

ywangd commented Oct 11, 2023

@DaveCTurner Merge conflict for TransportVersions is now fixed.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ywangd ywangd added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Oct 12, 2023
Comment on lines +164 to +185
@Override
public boolean equals(Object o) {
if (this == o) return true;
if (o == null || getClass() != o.getClass()) return false;
HttpRouteStats that = (HttpRouteStats) o;
return requestCount == that.requestCount
&& totalRequestSize == that.totalRequestSize
&& responseCount == that.responseCount
&& totalResponseSize == that.totalResponseSize
&& Arrays.equals(requestSizeHistogram, that.requestSizeHistogram)
&& Arrays.equals(responseSizeHistogram, that.responseSizeHistogram)
&& Arrays.equals(responseTimeHistogram, that.responseTimeHistogram);
}

@Override
public int hashCode() {
int result = Objects.hash(requestCount, totalRequestSize, responseCount, totalResponseSize);
result = 31 * result + Arrays.hashCode(requestSizeHistogram);
result = 31 * result + Arrays.hashCode(responseSizeHistogram);
result = 31 * result + Arrays.hashCode(responseTimeHistogram);
return result;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a record, do we need override equals and hashCode? Or they do not handle arrays?
In such case would it be beneficial to wrap array into Histogram object?
This could also simplify merging them and serializing toXContent

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's because the default does not handle arrays. I can look into wrapping it as a separete Histogram in a follow-up.

@elasticsearchmachine elasticsearchmachine merged commit 6ed4ad5 into elastic:main Oct 12, 2023
@ywangd ywangd deleted the es-95739 branch October 12, 2023 11:27
@@ -52,5 +54,8 @@ interface Dispatcher {
*/
void dispatchBadRequest(RestChannel channel, ThreadContext threadContext, Throwable cause);

default Map<String, HttpRouteStats> getStats() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: I think it is worth renaming to routeStats or perRouteStats to be consistent with stats method. That one could also have some reasonable default now.

Copy link
Contributor

@idegtiarenko idegtiarenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left couple of suggestions around the time the pr was automerged.
They could be applied as a followup if considered worthy.

@ywangd
Copy link
Member Author

ywangd commented Oct 12, 2023

@idegtiarenko Sorry this got merged before your comments. I will address your points in a follow-up. Thanks!

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jan 26, 2024
In elastic#99852 we introduced a layer of wrapping around the `RestResponse` to
capture the length of a chunked-encoded response, but this wrapping does
not preserve the headers of the original response. This commit fixes the
bug.
DaveCTurner added a commit that referenced this pull request Jan 29, 2024
In #99852 we introduced a layer of wrapping around the `RestResponse` to
capture the length of a chunked-encoded response, but this wrapping does
not preserve the headers of the original response. This commit fixes the
bug.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jan 29, 2024
In elastic#99852 we introduced a layer of wrapping around the `RestResponse` to
capture the length of a chunked-encoded response, but this wrapping does
not preserve the headers of the original response. This commit fixes the
bug.

Backport of elastic#104808 to 8.12
elasticsearchmachine pushed a commit that referenced this pull request Jan 29, 2024
* Fix lost headers with chunked responses

In #99852 we introduced a layer of wrapping around the `RestResponse` to
capture the length of a chunked-encoded response, but this wrapping does
not preserve the headers of the original response. This commit fixes the
bug.

Backport of #104808 to 8.12

* Fix compile
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Distributed Coordination/Network Http and internode communication implementations >enhancement Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.12.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

More detailed HTTP stats
6 participants