Record more detailed HTTP stats #99852
Conversation
Hi @ywangd, I've created a changelog YAML for you.
@DaveCTurner This PR is marked as draft because I am seeking high level feedback on the approach. The issue (#95739) says
This PR currently has support for the request and response sizes. I'd appreciate a check on whether the proposed changes make sense. I'd also like to verify whether we are only interested in "response" times, i.e. how long it takes for the response to be ready after the HTTP request is dispatched. Thanks!
long[] requestSizeHistogram,
long responseCount,
long totalResponseSize,
long[] responseSizeHistogram
I do not think this is mentioned in the original issue, but is it possible or beneficial to track response status counts as well?
It is relatively easy to track the number of different response statuses. But we are probably more interested in their recent trends rather than the overall stats since the last restart, because we want to know whether the node is "currently" experiencing problems. This means we need to compute some moving averages instead of the overall average, which is what we are doing here for request/response sizes. I have not yet found an existing example of computing moving averages for stats collection. It's definitely doable, but I also wonder whether it starts getting into APM territory and should be handled externally. This might be why we haven't done it? I'll dig a bit more. For the purpose of this PR, I think it's better to keep them separate.
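For illustration only, here is a generic sketch (not Elasticsearch code, all names and the smoothing factor are made up) of the kind of "recent trend" tracking described above, using an exponentially weighted moving average of the 5xx rate:

```java
// Generic sketch: exponentially weighted moving average of the 5xx response rate.
// Not thread-safe as written; a real implementation would need synchronization or atomics.
class MovingErrorRate {
    private final double alpha = 0.1;   // smoothing factor: higher reacts faster to recent samples
    private double errorRate = 0.0;

    void onResponse(int status) {
        double sample = status >= 500 ? 1.0 : 0.0;
        errorRate = alpha * sample + (1 - alpha) * errorRate;
    }

    double currentErrorRate() {
        return errorRate;
    }
}
```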
@@ -89,7 +89,8 @@ public void testHttpResponseMapper() {
             .map(HttpStats::clientStats)
             .map(Collection::stream)
             .reduce(Stream.of(), Stream::concat)
-            .toList()
+            .toList(),
+            Map.of()
Not related to the change itself, just surprised to see `.map(Collection::stream).reduce(Stream.of(), Stream::concat)`. Is there any benefit over `flatMap(Collection::stream)`?
I don't think so. Sometimes a test can be deliberately written to avoid using the same pattern as the production code. But this does not seem to be the case either.
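For reference, a sketch of the `flatMap` form discussed above (assuming the pipeline starts from the node stats as in the surrounding test):

```java
// Equivalent pipeline using flatMap instead of map(Collection::stream) + reduce
nodeStats.stream()
    .map(NodeStats::getHttp)
    .map(HttpStats::clientStats)
    .flatMap(Collection::stream)
    .toList()
```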
server/src/main/java/org/elasticsearch/rest/ChunkedRestResponseBody.java (outdated review thread, resolved)
This PR is mostly ready. I'd like to get approval for the overall approach before proceeding with a few further polish items, including (1) some more tests and (2) potentially merging the two classes of …
I like it. Left a few comments & suggestions.
var result = chunkIterator.hasNext() == false;
if (result && sizeConsumer != null) {
    sizeConsumer.accept(size);
    sizeConsumer = null;
}
I'd rather we added some logic to `DefaultRestChannel` to grab the size when the response is closed - for instance we may not get all the way to `isDone()` before the client closes the channel.
Thanks for the prompt. I have now moved this logic entirely into RestController and I think it looks better than before. I considered DefaultRestChannel as well, but I think having it done within RestController is better due to easier access to different pieces of information. Please refer to 5c1fdb7
this.delegate = delegate;
this.circuitBreakerService = circuitBreakerService;
this.contentLength = contentLength;
this.methodHandlers = methodHandlers;
this.startTime = System.currentTimeMillis();
This clock is not monotonic, we should be using `System::nanoTime` (or, better, `ThreadPool#rawRelativeTimeInMillis`).
`ThreadPool` is not available in RestController, so I chose to use `System::nanoTime`. In fact, I copied what `ThreadPool#rawRelativeTimeInMillis` does. This method could potentially be static and reused here, but it seems some tests rely on it being an instance method so that it can be overridden. So I decided to just copy it for now since the duplication is minimal.
Please refer to 4a4cd4a
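For context, the copied helper boils down to something like the following sketch (the exact code is in the linked commit):

```java
import java.util.concurrent.TimeUnit;

// Monotonic millisecond clock based on System.nanoTime(), unlike System.currentTimeMillis()
private static long rawRelativeTimeInMillis() {
    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime());
}
```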
Pinging @elastic/es-distributed (Team:Distributed)
Looks good, I left a few nits. The only real problem remaining is that we don't handle responses larger than `Integer.MAX_VALUE` properly yet.
public class HttpRouteStatsTracker {

    /*
     * default http.max_content_length is 100 MB so that the last histogram bucket is > 64MB (2^26)
That's the maximum request size but for responses we can return much more (maybe even GiBs) - suggest adding another 4 or 5 buckets at least.
Good call. I added 4 more buckets so that the last bucket is for > 1.0GB.
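As an illustration of the power-of-two bucketing being discussed (the method name and bucket-count parameter here are assumptions, not the exact code in `HttpRouteStatsTracker`):

```java
// Bucket 0 counts zero-length payloads; bucket i counts values in [2^(i-1), 2^i);
// the last bucket is a catch-all for anything above the largest boundary (e.g. > 1 GB).
static int bucketIndex(long size, int bucketCount) {
    int bucket = Long.SIZE - Long.numberOfLeadingZeros(size); // 0 for size == 0, floor(log2(size)) + 1 otherwise
    return Math.min(bucket, bucketCount - 1);
}
```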
return StreamSupport.stream(Spliterators.spliteratorUnknownSize(handlers.allNodeValues(), Spliterator.ORDERED), false)
    .filter(mh -> mh.getStats().requestCount() > 0 || mh.getStats().responseCount() > 0)
    .collect(Maps.toUnmodifiableSortedMap(MethodHandlers::getPath, MethodHandlers::getStats));
Doing this with streams seems fairly awkward, I feel a regular loop to build a `TreeMap` would be simpler.
I changed it to use a while loop with the iterator to build a TreeMap.
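A minimal sketch of what that loop-based version might look like (assumed shape; the exact code is in the PR):

```java
// Build a sorted, unmodifiable map of per-route stats, skipping routes that saw no traffic
Map<String, HttpRouteStats> stats = new TreeMap<>();
Iterator<MethodHandlers> iterator = handlers.allNodeValues();
while (iterator.hasNext()) {
    MethodHandlers mh = iterator.next();
    HttpRouteStats s = mh.getStats();
    if (s.requestCount() > 0 || s.responseCount() > 0) {
        stats.put(mh.getPath(), s);
    }
}
return Collections.unmodifiableMap(stats);
```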
@@ -89,7 +91,12 @@ public void testHttpResponseMapper() {
             .map(HttpStats::clientStats)
             .map(Collection::stream)
             .reduce(Stream.of(), Stream::concat)
-            .toList()
+            .toList(),
+            nodeStats.stream().map(NodeStats::getHttp).map(HttpStats::httpRouteStats).reduce(Map.of(), (l, r) -> {
I know this is just test code, but still I think we can simplify the two-maps-and-a-reduce by moving the mapped functions inside the lambda called within the reduce. Also this looks to do an awful lot of copying of maps to convert them between mutable and immutable versions, do we really need that?
I think moving the two `map`s inside `reduce` is clunkier because the reduce needs to change type, and Java asks you to supply both an accumulator and a combiner, which do pretty much the same thing for this use case. I ended up moving it outside and building it separately with a for loop (see here).
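A sketch of the for-loop version described above; `mergeRouteStats` is a placeholder for however two `HttpRouteStats` values are actually combined in the test:

```java
// Combine per-route stats from all nodes into one map
Map<String, HttpRouteStats> combined = new HashMap<>();
for (NodeStats ns : nodeStats) {
    for (Map.Entry<String, HttpRouteStats> entry : ns.getHttp().httpRouteStats().entrySet()) {
        combined.merge(entry.getKey(), entry.getValue(), (l, r) -> mergeRouteStats(l, r));
    }
}
```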
private final ChunkedRestResponseBody delegate;
private final Consumer<Integer> encodedLengthConsumer;
private int encodedLength = 0;
This could end up exceeding Integer.MAX_VALUE
This is a good point. It is now handled with `Math.addExact`. The value is also capped to `Integer.MAX_VALUE` in case of overflow. I think it is OK not to track the exact number since (I think?) overflow is rare and we don't really need the exact number for the histogram. The total response size will be off, but I feel that's OK.
Please see the code changes here.
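The capped accumulation described above would look roughly like this (a sketch of the intermediate approach, later replaced by a plain `long` counter; `chunkLength` stands in for the length of the chunk just written):

```java
// Accumulate the encoded chunk length, saturating at Integer.MAX_VALUE instead of overflowing
try {
    encodedLength = Math.addExact(encodedLength, chunkLength);
} catch (ArithmeticException e) {
    encodedLength = Integer.MAX_VALUE;
}
```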
Hm, I think that's going to be trappy, we might well want to add another bucket to the end of the histogram later. Why not just use `long` here?
I didn't think it was a case that requires accurate handling, but upgrading to `long` is also fine. I updated it in 49b95f8
server/src/main/java/org/elasticsearch/rest/RestController.java (outdated review thread, resolved)
@elasticmachine update branch
Failure is unrelated and tracked at #96805
Bump for reviews. Thanks! Btw, the 8.11 branch just got created. I don't think we need this for 8.11 anyway, so 8.12 is just fine.
@DaveCTurner Merge conflict for TransportVersions is now fixed.
LGTM
qa/smoke-test-http/src/javaRestTest/java/org/elasticsearch/http/HttpSmokeTestCase.java (outdated review thread, resolved)
…p/HttpSmokeTestCase.java Co-authored-by: David Turner <[email protected]>
@Override
public boolean equals(Object o) {
    if (this == o) return true;
    if (o == null || getClass() != o.getClass()) return false;
    HttpRouteStats that = (HttpRouteStats) o;
    return requestCount == that.requestCount
        && totalRequestSize == that.totalRequestSize
        && responseCount == that.responseCount
        && totalResponseSize == that.totalResponseSize
        && Arrays.equals(requestSizeHistogram, that.requestSizeHistogram)
        && Arrays.equals(responseSizeHistogram, that.responseSizeHistogram)
        && Arrays.equals(responseTimeHistogram, that.responseTimeHistogram);
}

@Override
public int hashCode() {
    int result = Objects.hash(requestCount, totalRequestSize, responseCount, totalResponseSize);
    result = 31 * result + Arrays.hashCode(requestSizeHistogram);
    result = 31 * result + Arrays.hashCode(responseSizeHistogram);
    result = 31 * result + Arrays.hashCode(responseTimeHistogram);
    return result;
}
This is a record, so do we need to override equals and hashCode? Or do the generated ones not handle arrays?
In that case, would it be beneficial to wrap the arrays in a Histogram object?
This could also simplify merging them and serializing via toXContent.
Yeah, it's because the default does not handle arrays. I can look into wrapping it as a separate Histogram in a follow-up.
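A possible shape for such a wrapper, sketched here as a follow-up idea (not part of this PR):

```java
import java.util.Arrays;

// Wrapping the long[] lets the enclosing record rely on this type's equals/hashCode
record Histogram(long[] buckets) {
    @Override
    public boolean equals(Object o) {
        return o instanceof Histogram other && Arrays.equals(buckets, other.buckets);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(buckets);
    }
}
```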
@@ -52,5 +54,8 @@ interface Dispatcher {
     */
    void dispatchBadRequest(RestChannel channel, ThreadContext threadContext, Throwable cause);

    default Map<String, HttpRouteStats> getStats() {
NIT: I think it is worth renaming to `routeStats` or `perRouteStats` to be consistent with the `stats` method. That one could also have some reasonable default now.
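For example, the rename with a reasonable default might look like this (a sketch of the suggestion, not merged code):

```java
// Per-route HTTP stats for this dispatcher; default to empty for implementations that don't track them
default Map<String, HttpRouteStats> routeStats() {
    return Map.of();
}
```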
Left a couple of suggestions around the time the PR was automerged.
They could be applied as a follow-up if considered worthwhile.
@idegtiarenko Sorry this got merged before your comments. I will address your points in a follow-up. Thanks!
In #99852 we introduced a layer of wrapping around the `RestResponse` to capture the length of a chunked-encoded response, but this wrapping does not preserve the headers of the original response. This commit fixes the bug.
In elastic#99852 we introduced a layer of wrapping around the `RestResponse` to capture the length of a chunked-encoded response, but this wrapping does not preserve the headers of the original response. This commit fixes the bug. Backport of elastic#104808 to 8.12
* Fix lost headers with chunked responses In #99852 we introduced a layer of wrapping around the `RestResponse` to capture the length of a chunked-encoded response, but this wrapping does not preserve the headers of the original response. This commit fixes the bug. Backport of #104808 to 8.12 * Fix compile
This PR adds more detailed HTTP stats, broken down by HTTP route.
Resolves: #95739