
[Profiling] Can we reduce server-side processing time for flamegraphs? #176208

Open
danielmitterdorfer opened this issue Feb 5, 2024 · 9 comments
Labels: bug, needs-refinement, performance, Team:obs-ux-infra_services

Comments

danielmitterdorfer (Member) commented Feb 5, 2024

Kibana version: 8.12.0

Elasticsearch version: 8.12.0

Server OS version: Linux (ESS)

Browser version: N/A (issue is browser-independent)

Browser OS version: N/A (issue is browser-independent)

Original install method (e.g. download page, yum, from source, etc.): ESS

Describe the feature:

When we render a flamegraph in Universal Profiling, the Elasticsearch API sends the response already in a format that is suitable for rendering directly in the browser. Looking at APM, we can see that we still spend some time in the Kibana server even though it basically "only" needs to pass the response from Elasticsearch through to the browser as-is. In the example APM trace below, that time is around 2 seconds (the white gap after POST /_profiling/flamegraph ends):

[image: apm-timeline-flamegraph]

If we profile what the Kibana server is doing, it seems that it deserializes the response from Elasticsearch and then serializes it again. However, I believe we can eliminate that step because the response should already be in the exact format that the browser expects (if not, I propose we adapt the ES API). Below is an annotated flamegraph that shows where the Kibana server spends time serializing and deserializing the response from Elasticsearch:

[image: flamegraph-kibana]

danielmitterdorfer added the Team:obs-ux-infra_services label Feb 5, 2024
elasticmachine (Contributor) commented

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

thomasneirynck (Contributor) commented Apr 3, 2024

EDIT: for more up-to-date feedback, see #176208 (comment)


This is some initial feedback, purely based on a code audit and some offline discussion with @danielmitterdorfer.


The flamegraph server-side code in Kibana uses the ES client:

```ts
return esClient.transport.request(
  {
    method: 'POST',
    path: encodeURI('/_profiling/flamegraph'),
    body: {
      query,
      sample_size: sampleSize,
      requested_duration: durationSeconds,
      co2_per_kwh: co2PerKWH,
      per_core_watt_x86: pervCPUWattX86,
      per_core_watt_arm64: pervCPUWattArm64,
      datacenter_pue: datacenterPUE,
      aws_cost_factor: awsCostDiscountRate,
      cost_per_core_hour: costPervCPUPerHour,
      azure_cost_factor: azureCostDiscountRate,
      indices,
      stacktrace_ids_field: stacktraceIdsField,
    },
  },
  {
    signal: controller.signal,
    meta: true,
  }
);
```

It does not specify the asStream option:

https://github.com/elastic/elasticsearch-js/blob/4aa00e03e1fac5385387dfa7cfd6e811d5307028/docs/examples/asStream.asciidoc

I am pretty sure that when this is not set to true, the client will unpack the result.

When setting it to true, you can stream the response back through the Kibana server.

To do this, you must ensure the content-* headers match the output from Elasticsearch.
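
A minimal sketch of what that could look like for the flamegraph request above (same variables as in the existing snippet; only the request options change):

```ts
// Sketch only: identical to the existing call, except for asStream.
const esResponse = await esClient.transport.request(
  {
    method: 'POST',
    path: encodeURI('/_profiling/flamegraph'),
    body: { /* same body as above */ },
  },
  {
    signal: controller.signal,
    meta: true,
    // With asStream, the client skips JSON parsing and hands back the raw,
    // still-encoded response body as a readable stream.
    asStream: true,
  }
);
```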


To see an example from the current Kibana code-base, please check the tile routes:

- set asStream = true (optionally, you can gzip here)
- pass the raw response on through the Kibana server response object, ensuring the content-* headers match:

```ts
if (tileStream) {
  // use the content-encoding and content-length headers from elasticsearch if they exist
  const { 'content-length': contentLength, 'content-encoding': contentEncoding } = headers;
  return response.ok({
    body: tileStream,
    headers: {
      'content-disposition': 'inline',
      ...(contentLength && { 'content-length': contentLength }),
      ...(contentEncoding && { 'content-encoding': contentEncoding }),
      'Content-Type': 'application/x-protobuf',
      'Cache-Control': cacheControl,
      'Last-Modified': lastModified,
    },
  });
} // else-branch omitted
```

Note: tileStream is the raw ES response body:

```ts
return { stream: tile.body as Stream, headers: tile.headers, statusCode: tile.statusCode };
```


So, again, just from the code audit, I think we should make two changes (see the combined sketch after this list):

- use asStream in the ES client here
- pass on the raw resp.body, but match the content-* headers

(this would be here (????))
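
A sketch of how the flamegraph route handler could pass the streamed body through, modeled on the tile-route snippet above (the function shape and parameter names are illustrative, not the actual Profiling route code):

```ts
import type { Stream } from 'stream';

// Hypothetical helper: `esResponse` is the result of the asStream request
// sketched earlier; `response` is the Kibana route response factory.
function passThroughFlamegraph(
  esResponse: { body: Stream; headers: Record<string, string | undefined> },
  response: any
) {
  const { 'content-length': contentLength, 'content-encoding': contentEncoding } =
    esResponse.headers;

  return response.ok({
    // the raw, still-encoded JSON payload from Elasticsearch
    body: esResponse.body,
    headers: {
      'content-type': 'application/json',
      // forward the encoding headers so the browser can decode the payload
      ...(contentLength && { 'content-length': contentLength }),
      ...(contentEncoding && { 'content-encoding': contentEncoding }),
    },
  });
}
```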

thomasneirynck (Contributor) commented Apr 3, 2024

Following up on the earlier comment.

Problem: the Elasticsearch client re-encodes the response.

The ES-client parses the response and re-encodes it. See here:

https://github.com/elastic/elastic-transport-js/blob/main/src/Transport.ts#L525
https://github.com/elastic/elastic-transport-js/blob/main/src/Serializer.ts#L62

Resolution: use asStream

The ideal solution would be to use the asStream option to stream the result back.

Add an asStream option to the request here. In order for this to work, the handling of the ES response needs to change (see below).

Is totalSeconds really needed?

The current code adds a new property, totalSeconds, wrapping the ES response in a new response object. This would prevent streaming the result back.

However, it is only used for tooltips, and it seems this value is optional (?)

Ensure the response object has the correct content headers

Ensure the content-* headers are transferred from the ES response headers.

Adjust the client in the browser

The client currently expects plain JSON:

```ts
return createFlameGraph(baseFlamegraph, showErrorFrames);
```

This will need to be adjusted to handle a stream.
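
A sketch of what that adjustment could look like, assuming the route now streams the raw payload (the fetch URL and the fetchFlamegraph name are illustrative; only createFlameGraph comes from the existing code):

```ts
// The wire format is still JSON, so the simplest client-side change is to
// consume the stream and parse it once it has fully arrived.
async function fetchFlamegraph(params: URLSearchParams, showErrorFrames: boolean) {
  const res = await fetch(`/internal/profiling/flamechart?${params}`);
  // res.json() drains the response stream and parses the payload
  const baseFlamegraph = await res.json();
  return createFlameGraph(baseFlamegraph, showErrorFrames);
}
```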


^ This is a broad investigation. I may have missed some intricacies, so it would be good to get some additional feedback from the Obs team on this as well.

smith added the bug and performance labels Apr 4, 2024
rockdaboot (Contributor) commented

Thanks for the investigation @thomasneirynck and @danielmitterdorfer!

Ensure content-* encodings are transferred from the ES-response headers.

The ES response content is gzip-compressed while the Kibana response is br (brotli) compressed. We did this to reduce the content size by ~50% and to reduce the transfer time to the Kibana client side. See #142334

(From what I remember, in my hotel room with a slow network I had to wait ~50s for a gzip-compressed flamegraph and only ~25s when brotli-compressed. But FYI, the compression rate also depends on the payload, and the response content format has changed since then.)

Another caveat with simply transferring the headers is that you cannot assume that the browser (or client) negotiates exactly the same content-* fields with the Kibana server as the Kibana server and ES negotiate.
Sure enough, most browsers are lenient enough to accept a gzip content-encoding even when brotli has been negotiated. But "most" is not "all", and we possibly don't want to run into support issues here.

So I would vote for re-compression in case the content-encoding isn't the same on both sides. This is a simple decompression and re-compression without parsing, and it should be fast enough.
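
A minimal sketch of that re-compression path using Node's built-in zlib streams (the function name and shape are hypothetical, not existing Kibana code):

```ts
import { createGunzip, createBrotliCompress } from 'zlib';
import type { Readable } from 'stream';

// Re-encode the raw ES body only when the client negotiated a different
// content-encoding; otherwise pass it through untouched. No JSON parsing.
function recodeBody(esBody: Readable, esEncoding: string | undefined, clientAccepts: string) {
  if (esEncoding === 'gzip' && clientAccepts.includes('br')) {
    // decompress gzip from ES, recompress as brotli for the browser
    const recoded = esBody.pipe(createGunzip()).pipe(createBrotliCompress());
    return { body: recoded, encoding: 'br' };
  }
  return { body: esBody, encoding: esEncoding };
}
```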

A possible future follow-up could be to allow brotli and/or zstd on the ES side (zstd has been included in ES recently, so that's just a matter of time/effort until we can use it).

rockdaboot (Contributor) commented Apr 15, 2024

is totalSeconds really needed?

The totalSeconds value is derived from user settings (UI) only: it is timeTo - timeFrom, both of which are request params. So it can be calculated on the client side as well (alternatively, it could be added to the flamegraph response if that is a better option).
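
A one-line sketch of the client-side calculation, assuming timeFrom and timeTo are the request params mentioned above, as epoch seconds:

```ts
// Hypothetical request params (epoch seconds).
const timeFrom = 1_700_000_000;
const timeTo = 1_700_000_600;
// Derived locally instead of being shipped in the response body.
const totalSeconds = timeTo - timeFrom; // 600
```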

roshan-elastic commented

Hi all,

Thanks for raising this. Is anyone able to summarise the kind of speeds we're seeing and the benefit we could gain?

Trying to get a sense of the impact on customers.

danielmitterdorfer (Member, Author) commented

Thanks for raising this. Is anyone able to summarise the kind of speeds we're seeing and the benefit we could gain?

Hey @roshan-elastic! Sure, Kibana adds around 2 seconds of overhead. So in the scenario that I've tested, we would reduce the response time from roughly 6 to roughly 4 seconds.

roshan-elastic commented

Thanks @danielmitterdorfer

roshan-elastic added the needs-refinement label Aug 19, 2024
roshan-elastic commented

FYI, I've added this as a top-20 issue in the backlog.
