Enable prefix compression for arrays of values #85893

Closed
rockdaboot opened this issue Apr 14, 2022 · 7 comments
Labels
>enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch

Comments

@rockdaboot
Contributor

Description

For continuous profiling we store a large number of documents (on the order of more than 100 million), so we are looking for better compression to reduce storage size and cost.

We noticed that prefix compression works well for single-valued fields, but not for arrays.
The arrays we store often start with the same values, so prefix compression seems a natural way to reduce storage size.

We tested "flattening" the arrays (concatenating the values and storing a single value) to "forcefully" enable prefix compression. This reduced the storage size by >50% (before: 59.7 bytes/record; after: 28.7 bytes/record).
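To illustrate why flattening helps, here is a minimal sketch of prefix coding over sorted values. It is a simplified model, not how Lucene actually encodes terms: each value stores only the suffix that differs from its predecessor. The `SEPARATOR` constant, the function names, and the sample frame names are all hypothetical and chosen only for illustration.

```python
from os.path import commonprefix

# Hypothetical delimiter; assumed never to occur inside a frame ID.
SEPARATOR = ";"


def flatten(frame_ids):
    """Concatenate an array of values into one single string value."""
    return SEPARATOR.join(frame_ids)


def prefix_compressed_size(sorted_terms):
    """Count characters stored when each term keeps only the suffix
    that differs from the previous term (simplified prefix coding)."""
    total = 0
    previous = ""
    for term in sorted_terms:
        shared = len(commonprefix([previous, term]))
        total += len(term) - shared
        previous = term
    return total


# Made-up stack traces that share leading frames.
traces = [
    ["main", "run", "parse", "read"],
    ["main", "run", "parse", "tokenize"],
    ["main", "run", "emit"],
]

flattened = sorted(flatten(trace) for trace in traces)
raw_size = sum(len(value) for value in flattened)
compressed_size = prefix_compressed_size(flattened)
```

With the arrays flattened, the shared leading frames become a shared string prefix, which is what a prefix-coded dictionary can exploit. Stored as separate array elements, the values are sorted individually and that shared leading structure is not visible to the encoder.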

Example of the field mapping (it's a doc_values field):

      "FrameID": {
        "type": "keyword",
        "index": false
      },
...

cc @jpountz

@rockdaboot rockdaboot added >enhancement needs:triage Requires assignment of a team area label labels Apr 14, 2022
@not-napoleon not-napoleon added the :Search Foundations/Mapping Index mappings, including merging and defining field types label Apr 14, 2022
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Apr 14, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@not-napoleon not-napoleon removed the needs:triage Requires assignment of a team area label label Apr 14, 2022
@not-napoleon
Member

I think it's also worth considering a dedicated mapping type for this, rather than an array field of keywords. That opens up more options for customizing aggregations and finer grained control of the storage, I think.

@jpountz
Contributor

jpountz commented Apr 19, 2022

+1 @not-napoleon to achieve this via a dedicated field type (path?) that treats the whole list as a single value as far as Elasticsearch is concerned

@rockdaboot
Contributor Author

+1. Given the constraints that "synthetic source" places on reconstructing arrays (arrays are reordered and identical values are merged into one), a path type would hopefully let us get identical data back, with the same order and size.
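The array constraint can be modelled in a few lines. This is a sketch of the behaviour described in this comment (values reordered, duplicates merged), not Elasticsearch code; the function names and the `;` separator are hypothetical.

```python
def as_stored_multi_valued(values):
    """Model of a multi-valued keyword field rebuilt from doc values:
    unique values in sorted order, so order and duplicates are lost."""
    return sorted(set(values))


def as_stored_single_value(values, separator=";"):
    """Model of the flattened alternative: one value, kept verbatim.
    Assumes the separator never occurs inside a value."""
    return separator.join(values)


frames = ["main", "worker", "parse", "worker"]

# The array comes back reordered and deduplicated.
roundtrip_array = as_stored_multi_valued(frames)

# The flattened value splits back into the original sequence.
roundtrip_flat = as_stored_single_value(frames).split(";")
```

A field type that treats the whole list as one value would behave like the second function: the order and the number of elements survive the round trip.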

@jpountz
Contributor

jpountz commented May 6, 2022

To provide some more context about this issue, the goal is to store path-like data such as file paths or stack traces, where we expect lots of redundancy across prefixes. There seem to be some users doing this, and I recently found out about this path hierarchy plugin, which gives the ability to create a tree of the file structure from an index that indexes file paths as keywords.

At a high-level, I can think of two main routes to improve support for this use-case:

  • Recommend that users index this sort of information via keyword fields.
  • Introduce a new field type that would take its input as a list (["etc", "hosts"] rather than /etc/hosts).

My intuition is that we'd index data exactly the same way in both cases, so this is really about how we think it would be best exposed?
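As a rough illustration of that intuition, the two input shapes can be converted into each other, so either one could feed the same indexed value. This is only a sketch: the function names are hypothetical, and it assumes `/` as the separator, absolute paths, and segments that never contain `/`.

```python
def path_to_segments(path):
    """Turn '/etc/hosts' into ['etc', 'hosts'], dropping empty segments."""
    return [segment for segment in path.split("/") if segment]


def segments_to_path(segments):
    """Turn ['etc', 'hosts'] into '/etc/hosts'."""
    return "/" + "/".join(segments)


as_list = path_to_segments("/etc/hosts")
as_string = segments_to_path(["etc", "hosts"])
```

Under these assumptions both routes carry the same information, which is why the choice is mostly about how the feature is exposed to users.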

@rockdaboot
Contributor Author

Since I started the discussion, it's only fair to mention that in the meantime we "flatten" the arrays into a single value by concatenating the array values.

There are two reasons to do so:

  • get prefix compression right now
  • prepare for the synthetic source feature coming with the 8.3 release (that feature can't restore arrays)

So this issue doesn't need to be kept open just for us.

@jpountz
Contributor

jpountz commented Jun 13, 2022

Agreed, closing.

@jpountz jpountz closed this as completed Jun 13, 2022
@javanna javanna added Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch and removed Team:Search Meta label for search team labels Jul 16, 2024

5 participants