-
Notifications
You must be signed in to change notification settings - Fork 24.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Estimate segment field usages #112760
Estimate segment field usages #112760
Conversation
e3e5128
to
481f542
Compare
Pinging @elastic/es-storage-engine (Team:StorageEngine) |
if (fi.hasNorms()) { | ||
usages++; | ||
} | ||
if (fi.hasVectors()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can probably skip term vectors: their memory usage does not scale with the number of fields (like stored fields).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I have pushed 5cc4f35 to remove this.
The change looks fine. I wonder if we need to go this granular though, or if we should assume that all fields that exist have an index (either terms, points or vectors) and doc values. Some fields may only have doc values, but then it's fine if we overestimate a bit? |
Thanks, Adrien. I considered this option, but it would require overestimating the current estimate by 20% in all cases, which might prevent us from running 2GB instances. I'll merge this PR and discuss the follow-up changes for serverless. If we find a different solution, I'll revert this change. |
We have introduced a new memory estimation method in serverless, based on the number of segments and the fields within them. This new approach works well overall, but it still falls short in cases where most fields are used more than once - for example, in both doc_values and postings, or doc_values and points. This change exposes the total usage of fields in segments, allowing us to adjust the memory estimate for these cases.
💚 Backport successful
|
We have introduced a new memory estimation method in serverless, based on the number of segments and the fields within them. This new approach works well overall, but it still falls short in cases where most fields are used more than once - for example, in both doc_values and postings, or doc_values and points. This change exposes the total usage of fields in segments, allowing us to adjust the memory estimate for these cases.
…tion-ironbank-ubi * upstream/main: (302 commits) Deduplicate BucketOrder when deserializing (elastic#112707) Introduce test utils for ingest pipelines (elastic#112733) [Test] Account for auto-repairing for shard gen file (elastic#112778) Do not throw in task enqueued by CancellableRunner (elastic#112780) Mute org.elasticsearch.script.StatsSummaryTests testEqualsAndHashCode elastic#112439 Mute org.elasticsearch.repositories.blobstore.testkit.integrity.RepositoryVerifyIntegrityIT testTransportException elastic#112779 Use a dedicated test executor in MockTransportService (elastic#112748) Estimate segment field usages (elastic#112760) (Doc+) Inference Pipeline ignores Mapping Analyzers (elastic#112522) Fix verifyVersions task (elastic#112765) (Doc+) Terminating Exit Codes (elastic#112530) (Doc+) CAT Nodes default columns (elastic#112715) [DOCS] Augment installation warnings (elastic#112756) Mute org.elasticsearch.repositories.blobstore.testkit.integrity.RepositoryVerifyIntegrityIT testCorruption elastic#112769 Bump Elasticsearch to a minimum of JDK 21 (elastic#112252) ESQL: Compute support for filtering ungrouped aggs (elastic#112717) Bump Elasticsearch version to 9.0.0 (elastic#112570) add CDR related data streams to kibana_system priviliges (elastic#112655) Support widening of numeric types in union-types (elastic#112610) Introduce data stream options and failure store configuration classes (elastic#109515) ...
We have introduced a new memory estimation method in serverless, based on the number of segments and the fields within them. This new approach works well overall, but it still falls short in cases where most fields are used more than once - for example, in both doc_values and postings, or doc_values and points. This change exposes the total usage of fields in segments, allowing us to adjust the memory estimate for these cases.
We have introduced a new memory estimation method in serverless, based on the number of segments and the fields within them. This new approach works well overall, but it still falls short in cases where most fields are used more than once - for example, in both doc_values and postings, or doc_values and points. This change exposes the total usage of fields in segments, allowing us to adjust the memory estimate for these cases.