[Research] Data format improvements for charting (arrow) #175695

Closed
Tracked by #166211
thomasneirynck opened this issue Jan 26, 2024 · 13 comments
Assignees
Labels
research Team:Visualizations Visualization editors, elastic-charts and infrastructure

Comments

@thomasneirynck
Contributor

thomasneirynck commented Jan 26, 2024

Currently, the majority of charts use the default JSON output from Elasticsearch. By default, these responses have a row-like layout (in the case of ES|QL or document search) or a nested layout (in the case of aggs).

Internally, Kibana reformats these into something more usable, e.g. a format understood by elastic/charts, nested-array tables for easier ergonomics, etc.

These client-side reformattings introduce an overhead.

Is it possible to have a more efficient pipeline, by reducing network traffic, reducing reconversions, or both?
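To make the reformatting cost concrete, here is a minimal sketch of the kind of client-side flattening a nested aggregation response needs before it can be charted. The aggregation names (`over_time`, `by_host`, `avg_cpu`) and the `FlatRow` shape are hypothetical, and this is not Kibana's actual conversion code; it only illustrates that every bucket has to be visited and copied.

```typescript
// Hypothetical, simplified shapes: a terms aggregation nested in a date_histogram.
interface MetricValue {
  value: number | null;
}
interface TermsBucket {
  key: string;
  doc_count: number;
  avg_cpu: MetricValue;
}
interface HistogramBucket {
  key: number; // epoch milliseconds
  doc_count: number;
  by_host: { buckets: TermsBucket[] };
}
interface AggResponse {
  aggregations: { over_time: { buckets: HistogramBucket[] } };
}

// One flat row per (time bucket, host) pair, the shape a table or chart expects.
interface FlatRow {
  timestamp: number;
  host: string;
  avgCpu: number | null;
}

// Walks every nested bucket and copies its values into a new flat row.
function flattenAggs(response: AggResponse): FlatRow[] {
  const rows: FlatRow[] = [];
  for (const timeBucket of response.aggregations.over_time.buckets) {
    for (const hostBucket of timeBucket.by_host.buckets) {
      rows.push({
        timestamp: timeBucket.key,
        host: hostBucket.key,
        avgCpu: hostBucket.avg_cpu.value,
      });
    }
  }
  return rows;
}

const sampleResponse: AggResponse = {
  aggregations: {
    over_time: {
      buckets: [
        {
          key: 1706227200000,
          doc_count: 3,
          by_host: {
            buckets: [
              { key: "web-1", doc_count: 2, avg_cpu: { value: 0.4 } },
              { key: "web-2", doc_count: 1, avg_cpu: { value: 0.9 } },
            ],
          },
        },
        {
          key: 1706230800000,
          doc_count: 1,
          by_host: {
            buckets: [{ key: "web-1", doc_count: 1, avg_cpu: { value: null } }],
          },
        },
      ],
    },
  },
};

const flatRows = flattenAggs(sampleResponse);
console.log(flatRows);
```

The work grows with the number of buckets, and the flattened rows exist in memory alongside the parsed response until the latter is released.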

Goal

Investigate the impact of data format on Kibana data visualization (specifically, Lens & Dashboard).

Consider both the context of:

  • _search
  • _query (ES|QL)

Consider alternatives:

  • Already supported by Elasticsearch: e.g. (cbor, smile, ..) or column based SQL output
  • Other possibilities (yet unsupported by Elasticsearch) e.g. arrow flight, parquet
@botelastic botelastic bot added the needs-team Issues missing a team label label Jan 26, 2024
@thomasneirynck thomasneirynck added the Team:Visualizations Visualization editors, elastic-charts and infrastructure label Jan 26, 2024
@elasticmachine
Contributor

Pinging @elastic/kibana-visualizations (Team:Visualizations)

@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 26, 2024
@stratoula
Contributor

@nik9000 was working during their ON week on exposing the ES|QL results in Arrow format. I think it is awesome to continue investigations on this front. It can make our visualizations much more performant and dense!

@drewdaemon
Contributor

These client-side reformattings introduce an overhead.

Agree, both in terms of performance and complexity.

@markov00
Member

markov00 commented Apr 2, 2024

linked to #178471

@teresaalvarezsoler teresaalvarezsoler changed the title [Research] Data format improvements for charting [Research] Data format improvements for charting (arrow) Apr 30, 2024
@ppisljar
Member

ppisljar commented May 22, 2024

adding (basic) arrow support to expressions: #183909

This showcases that it is not very hard to convert from arrow to datatable and vice versa, which would allow us to gradually migrate our code to the new format.
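As a rough illustration of what such a conversion involves, the sketch below round-trips between a columnar table and a datatable-like structure. Both types are simplified stand-ins assumed for this example (`ColumnarTable` is not the Arrow API, and `Datatable` only approximates the Kibana datatable), so treat it as a picture of the data movement rather than the code in #183909.

```typescript
// Simplified stand-in for a Kibana datatable (assumed shape, not the real type).
interface DatatableColumn {
  id: string;
  name: string;
  meta: { type: "number" | "string" };
}
interface Datatable {
  type: "datatable";
  columns: DatatableColumn[];
  rows: Array<Record<string, unknown>>;
}

// Simplified stand-in for a columnar table: one array of values per column.
type ColumnarTable = Record<string, Array<number | string>>;

// Columnar -> datatable: every cell is read once and copied into a row object.
function columnarToDatatable(table: ColumnarTable): Datatable {
  const names = Object.keys(table);
  const rowCount = names.length > 0 ? table[names[0]].length : 0;
  const columns: DatatableColumn[] = names.map(
    (name): DatatableColumn => ({
      id: name,
      name,
      meta: { type: typeof table[name][0] === "number" ? "number" : "string" },
    })
  );
  const rows: Array<Record<string, unknown>> = [];
  for (let i = 0; i < rowCount; i++) {
    const row: Record<string, unknown> = {};
    for (const name of names) {
      row[name] = table[name][i];
    }
    rows.push(row);
  }
  return { type: "datatable", columns, rows };
}

// Datatable -> columnar: one pass over the rows per column.
function datatableToColumnar(datatable: Datatable): ColumnarTable {
  const result: ColumnarTable = {};
  for (const column of datatable.columns) {
    result[column.id] = datatable.rows.map((row) => row[column.id] as number | string);
  }
  return result;
}

const columnar: ColumnarTable = {
  host: ["web-1", "web-2"],
  cpu: [0.4, 0.9],
};
const datatable = columnarToDatatable(columnar);
const roundTripped = datatableToColumnar(datatable);
console.log(datatable, roundTripped);
```

The conversion is simple in both directions, but each direction is a full copy of the data.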

@markov00 markov00 removed their assignment May 29, 2024
@thomasneirynck
Contributor Author

Thanks @ppisljar for #175695 (comment). This was super useful.

Below is a follow-up from an offline convo with @markov00 and @ppisljar. Apart from this initial look into Arrow, there are a few more open questions. I think it might also be useful to recap some of the underlying reasons for this research for wider visibility.


1. We should build up our knowledge of Arrow because of its strategic value in contemporary tech stacks

Arrow has strategic value because it is one of the main data-interchange formats for interprocess data analytics (e.g. ML with pandas in Python), GPU-based charting (e.g. dense scatterplots), and client-side analytics in a web context (e.g. duckdb-wasm https://duckdb.org/docs/api/wasm/overview.html).

For that reason alone, it is important to gain a better understanding of this format.

From @ppisljar's initial investigation (#183909), the short-term takeaway seems to be that a "backend swap" of JSON vs Arrow may not be hard technically, but it would not be the right choice in the short term, for two reasons:
(a) poor client support in the browser (e.g. having to use unsafe-eval)
(b) the existing data pipeline in Lens (i.e. "expressions"), which needs to marshal the data into a new table and which does some intermediate data enrichment, requires a full read of the Arrow table, remarshaling everything to JSON anyway. This conversion is slow.

Kibana has a very low investment in GPU technology today (except flamegraph and maps), and introducing a new model of client-side analytics (e.g. one which runs in WASM with duckdb) is not directly on the horizon either. imho it is OK to postpone further investigation into these long-term topics. We can always pick those aspects up once they become more tactically relevant (e.g. when scatterplots are prioritized).

What is not answered though is whether:

  • Arrow has any meaningful space savings in the amount of data sent over the wire. It would still be useful to get some numbers on this size comparison (see Apache arrow support for ES|QL elasticsearch#104877 to test wrt ES|QL).
  • there are any blockers on Kibana server which would prevent streaming Arrow data to the browser without having to unpack it first

2. JSON vs Binary. Is there any low-hanging fruit for saving space in size of data transmitted over the wire?

arrow is just one example of a binary format. Other examples could be cbor or smile, which are supported by Elasticsearch.

Any gains we can make in transfer format can be meaningful, especially since it would get our stack closer to delivering data in a streaming fashion: i.e. an Elasticsearch response stream should just be streamed back as-is to the browser, without further modification, especially if that modification is redundant.

It seems there is some additional processing in Kibana server (specifically for async searches (?)), which would prevent us from doing this.

Whether the Elasticsearch-js client supports formats other than JSON is imo less relevant. Users can always unpack the data manually by using the as-stream option (https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/as_stream_examples.html). If we can demonstrate value, we can always push this support down in the ES-client as well.

So for this, I think we are missing answers to:

  • are there size benefits to using a binary format, like cbor or smile, over JSON?
  • can the footprint on Kibana server be reduced further? Ideally entirely, in the case of sync search (async search can be a special case)?

3. column vs row layout ("visualization friendly" format)

This is more of an orthogonal issue to binary/JSON. This is about data layout.

This would need to be investigated in the context of expressions and elastic/charts.

I believe this is already the default for ES|QL (?).
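For reference, a small sketch of the two layouts in question. It assumes the ES|QL JSON response carries `columns` plus `values`, with `values` row-oriented unless the request sets `columnar: true`; that matches my reading of the ES|QL REST docs but is worth double-checking. The `EsqlResponse` type and `rowsToColumns` helper are hypothetical names used only here.

```typescript
// Assumed, simplified shape of an ES|QL JSON response.
interface EsqlColumn {
  name: string;
  type: string;
}
interface EsqlResponse {
  columns: EsqlColumn[];
  // Row-oriented: values[rowIndex][columnIndex].
  // Column-oriented: values[columnIndex][rowIndex].
  values: unknown[][];
}

// Transposes a row-oriented response into a column-oriented one on the client.
function rowsToColumns(response: EsqlResponse): EsqlResponse {
  const values = response.columns.map((_, columnIndex) =>
    response.values.map((row) => row[columnIndex])
  );
  return { columns: response.columns, values };
}

const rowOriented: EsqlResponse = {
  columns: [
    { name: "host", type: "keyword" },
    { name: "cpu", type: "double" },
  ],
  values: [
    ["web-1", 0.4],
    ["web-2", 0.9],
    ["web-3", 0.1],
  ],
};

const columnOriented = rowsToColumns(rowOriented);
console.log(columnOriented.values);
```

If the server can return the column-oriented form directly, this transpose (and its extra allocation) disappears from the client.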

4. ES|QL versus DSL

Questions (1) and (3) above are only relevant for ES|QL.

  • Arrow is only on the horizon for ES|QL
  • column layouts are only supported by ES|QL (?)

(2) applies to both, and imo is therefore important.

5. The "big picture" - thinning kibana server

The big picture for 1, 2, 3, and 4 is that we should aim to remove as much intermediate, redundant processing of data as possible, especially on Kibana server. Processing in the browser is only felt by one particular user, while load on Kibana server affects all users. Does Lens really need enrichment of data in Kibana server?

e.g.: a performant data-viz architecture
[image: diagram of a performant data-viz architecture]

6. What team does this belong to?

@markov00 raised whether the @elastic/kibana-visualizations should own this research. imho, yes, but with an asterisk.

Yes, because visualizations are the main consumer of Elasticsearch-agg responses, and we would expect changes to be motivated by reducing the time it takes to render data on screen in a chart.

The asterisk is that if there are any resource constraints, we can always see if we can distribute these investigations more broadly (e.g. @elastic/kibana-data-discovery, @davismcphee @kertal @lukasolson)


So to recap, I see the following open questions:

  1. Size comparison of arrow versus current ES|QL (Using Apache arrow support for ES|QL elasticsearch#104877 may be helpful here)
  2. What are the size/performance advantages of cbor/smile? Are they supported by ES|QL?
  3. What are the blockers to adopt cbor/smile? Specifically, what is going on in Kibana-server that requires enrichment of the Elasticsearch-agg response? Is it necessary?
  4. Is there anything more that needs to be done wrt column-based layouts?

@ppisljar
Member

  1. Arrow is a binary format, so it will generally be more efficient than JSON from a size perspective. In some tests I did, I saw around a 30% reduction in size. However, an important note here is that we are using gzip compression, and after compression the file sizes are mostly the same, or the Arrow format actually becomes bigger.

  2. I haven't tested this yet, but from resources on the internet it looks similar to Arrow: there is a significant reduction if you don't gzip, but after gzipping the reduction is less noticeable.

http://zderadicka.eu/comparison-of-json-like-serializations-json-vs-ubjson-vs-messagepack-vs-cbor/
https://gist.github.com/kajuberdut/0191ec20f14253094792cd3c00f06257
https://medium.com/@ayushguptadtu/gzip-smile-json-gives-a-better-size-reduction-over-smile-uncompressed-for-sure-6c5060a670a5
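For anyone wanting to repeat this kind of measurement, here is a minimal sketch of how the four sizes (JSON, binary, and each after gzip) can be compared in Node. The data is synthetic, and the "binary" encoding is just two raw Float64 columns back to back, which is an assumption standing in for Arrow, cbor or smile; the numbers it prints say nothing about real Kibana payloads.

```typescript
import { gzipSync } from "node:zlib";

// Synthetic data: 10,000 timestamp/value pairs (not a real Kibana payload).
const count = 10000;
const timestamps = Array.from({ length: count }, (_, i) => 1700000000000 + i * 60000);
const values = Array.from({ length: count }, (_, i) => Math.round(Math.sin(i / 50) * 1000) / 10);

// Text encoding: the usual JSON document, UTF-8 encoded.
const jsonBytes = new TextEncoder().encode(JSON.stringify({ timestamps, values }));

// Naive binary encoding: two Float64 columns laid out back to back (8 bytes per value).
const binaryBytes = new Uint8Array(count * 8 * 2);
binaryBytes.set(new Uint8Array(new Float64Array(timestamps).buffer), 0);
binaryBytes.set(new Uint8Array(new Float64Array(values).buffer), count * 8);

// Compare raw sizes and sizes after gzip, since responses are gzipped on the wire.
const sizes = {
  json: jsonBytes.byteLength,
  binary: binaryBytes.byteLength,
  jsonGzip: gzipSync(jsonBytes).byteLength,
  binaryGzip: gzipSync(binaryBytes).byteLength,
};
console.log(sizes);
```

The comparison that matters for the wire is the gzipped pair, and it depends heavily on the data, so it is best run against representative responses.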

@vadimkibana
Contributor

vadimkibana commented May 31, 2024

The most performant way would be to request data from ES in CBOR and pass it through the Kibana server without any parsing (or minimal parsing) straight to the client. So this is the key question:

Specifically, what is going on in Kibana-server that requires enrichment of the Elasticsearch-agg response? Is it necessary?

If we can make it such that ES CBOR response is passed-through directly to the client-side we will save on request/response copying, UTF8 decoding, JSON decoding, JSON encoding, UTF8 encoding; and all the memory savings if we don't need to hold those intermediate representations.
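A minimal sketch of the difference between the two relay styles, using JSON bytes as a stand-in since decoding CBOR would need a library. The function names are hypothetical and this is not how Kibana server is structured; it only lists the steps that pass-through avoids.

```typescript
const decoder = new TextDecoder();
const encoder = new TextEncoder();

// Re-encoding relay (simplified): UTF-8 decode, JSON parse, JSON stringify, UTF-8 encode.
// Each step allocates an intermediate representation of the whole response.
function relayWithReencoding(body: Uint8Array): Uint8Array {
  const text = decoder.decode(body);
  const parsed: unknown = JSON.parse(text);
  const reserialized = JSON.stringify(parsed);
  return encoder.encode(reserialized);
}

// Pass-through relay: the bytes received from Elasticsearch are handed on untouched.
function relayPassThrough(body: Uint8Array): Uint8Array {
  return body;
}

// Stand-in for a response body received from Elasticsearch.
const esBody = encoder.encode(JSON.stringify({ took: 3, hits: { total: 2 } }));

const reencoded = relayWithReencoding(esBody);
const passedThrough = relayPassThrough(esBody);

console.log(passedThrough === esBody); // same buffer, no copy
console.log(reencoded === esBody); // a new buffer with equivalent content
```

Pass-through only works if nothing on the server needs to read or change the response, which is why the enrichment question above is the key one.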

@lukasolson
Member

If we can make it such that ES CBOR response is passed-through directly to the client-side we will save on request/response copying, UTF8 decoding, JSON decoding, JSON encoding, UTF8 encoding; and all the memory savings if we don't need to hold those intermediate representations.

Related: #170062

@thomasneirynck
Contributor Author

thx @ppisljar - if arrow is larger gzipped, I think it's another argument against arrow being a pathway for a tactical improvement.

@vadimkibana agreed. The key part of these investigations is whether we can slim down the data pipeline from Elasticsearch all the way to the browser. Reduction in size of the data format (faster delivery, cheaper too), wasted cycles of encoding/decoding (faster), and removing redundant enrichment (wasted processing) are all pathways to get there. Any footprint on kibana-server is particularly bad because it is felt by all users, and any impact from processing doesn't scale favorably due to single threaded execution (e.g. by delaying other requests, and this compounds)

@thomasneirynck
Contributor Author

Let's consider this done.

tl;dr:

  • cbor can show marginal improvements (due to speed up of browser-decoding)
  • arrow will not give benefits in current architecture
  • more impactful improvements will come from moving out all unnecessary encoding/decoding of ES-responses from Kibana-server

@swallez
Member

swallez commented Sep 30, 2024

After the ping from @thomasneirynck in elastic/elasticsearch#109576 (comment) I took a closer look at the experiments done by @ppisljar in PR #193803.

In particular I was surprised by the full copy of the Arrow dataframe into a new array, which obviously isn't ideal, so I went digging 😉

First of all, the Arrow Table type, which represents the dataframe, has a toArray() method that apparently hasn't been evaluated. It is specifically targeted at applications that process arrays of objects to avoid the refactoring needed to use dataframes directly. It builds an array of proxies to the dataframe vectors that make them look like regular objects.

Running some benchmarks showed that using Table.toArray() reduced memory usage for the data table by a factor of 3!

Still, I was surprised by the amount of heap memory used by this method, so looked further and found that we could even eliminate this allocation (see PR apache/arrow#44247). With this change, memory usage for data tables is reduced by a factor of 4.5.

The benchmarks also show reduced performance caused by the indirection layer added by proxying the dataframe. Whether it is acceptable has to be evaluated. But here again, digging in the code showed that it could be improved significantly.

The fact that toArray() and associated code has room for improvement shouldn't be considered as reflecting a poor quality of this library, and as shown in this PR, the maintainers are open to improvements. This isn't the primary intended usage of dataframes, and iterating on table columns obtained using table.getChild() shows performance on par with plain object property access with, as shown above, huge memory savings. Arrow also eliminates the need to parse data sent by ES, as the dataframe just wraps the byte buffer received over the network.
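To illustrate the two access patterns discussed above without depending on the Arrow library, here is a small sketch. `RowView`, `toRowViews` and `sumColumn` are hypothetical names, and the getter-based view is a simplification of the idea of row proxies over column vectors, not Arrow's actual implementation.

```typescript
// Columnar storage: one array per column, as a dataframe would hold it.
interface Columns {
  timestamp: Float64Array;
  host: string[];
  cpu: Float64Array;
}

interface Row {
  readonly timestamp: number;
  readonly host: string;
  readonly cpu: number;
}

// A row "view": it stores only a reference to the columns and an index,
// and reads cell values on demand instead of copying them.
class RowView implements Row {
  private readonly columns: Columns;
  private readonly index: number;

  constructor(columns: Columns, index: number) {
    this.columns = columns;
    this.index = index;
  }
  get timestamp(): number {
    return this.columns.timestamp[this.index];
  }
  get host(): string {
    return this.columns.host[this.index];
  }
  get cpu(): number {
    return this.columns.cpu[this.index];
  }
}

// Row-style access for code that expects an array of objects.
function toRowViews(columns: Columns): Row[] {
  return Array.from({ length: columns.host.length }, (_, i) => new RowView(columns, i));
}

// Column-style access: iterate one column directly, with no per-row objects at all.
function sumColumn(column: Float64Array): number {
  let total = 0;
  for (let i = 0; i < column.length; i++) {
    total += column[i];
  }
  return total;
}

const columns: Columns = {
  timestamp: new Float64Array([1706227200000, 1706227260000, 1706227320000]),
  host: ["a", "b", "c"],
  cpu: new Float64Array([0.5, 0.25, 0.75]),
};

const rowViews = toRowViews(columns);
const cpuTotal = sumColumn(columns.cpu);
console.log(rowViews[1].host, cpuTotal);
```

The views trade a little indirection on every property read for not duplicating the data, while direct column iteration avoids both the copy and the indirection.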

@ppisljar
Member

ppisljar commented Oct 8, 2024

The reason why toArray() was not used in the linked PR is that it produces a generic JS array, which does not match the Kibana datatable structure. The purpose of #183909 was to evaluate conversion between an Arrow table and a Kibana datatable specifically. toArray() produces quite a different structure, and converting that one to a Kibana datatable was not any faster in my tests.

But the main thing keeping us from starting to use the library is not the performance reduction (our actual table sizes at the moment are way smaller than what I was testing in my PR) but the fact that it's using unsafe eval. I haven't looked into how hard it would be to address that in the library, as I was using it as a black box.

9 participants