Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Estimate segment field usages #112760

Merged
merged 2 commits into from
Sep 11, 2024
Merged

Estimate segment field usages #112760

merged 2 commits into from
Sep 11, 2024

Conversation

dnhatn
Copy link
Member

@dnhatn dnhatn commented Sep 11, 2024

We have introduced a new memory estimation method in serverless, based on the number of segments and the fields within them. This new approach works well overall, but it still falls short in cases where most fields are used more than once - for example, in both doc_values and postings, or doc_values and points. This change exposes the total usage of fields in segments, allowing us to adjust the memory estimate for these cases.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

if (fi.hasNorms()) {
usages++;
}
if (fi.hasVectors()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can probably skip term vectors: their memory usage does not scale with the number of fields (like stored fields).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have pushed 5cc4f35 to remove this.

@jpountz
Copy link
Contributor

jpountz commented Sep 11, 2024

The change looks fine. I wonder if we need to go this granular though, or if we should assume that all fields that exist have an index (either terms, points or vectors) and doc values. Some fields may only have doc values, but then it's fine if we overestimate a bit?

@dnhatn
Copy link
Member Author

dnhatn commented Sep 11, 2024

Thanks, Adrien. I considered this option, but it would require overestimating the current estimate by 20% in all cases, which might prevent us from running 2GB instances. I'll merge this PR and discuss the follow-up changes for serverless. If we find a different solution, I'll revert this change.

@dnhatn dnhatn merged commit ed41445 into elastic:main Sep 11, 2024
15 checks passed
@dnhatn dnhatn deleted the field-infos-usage branch September 11, 2024 23:18
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Sep 11, 2024
We have introduced a new memory estimation method in serverless, based 
on the number of segments and the fields within them. This new approach
works well overall, but it still falls short in cases where most fields
are used more than once - for example, in both doc_values and postings,
or doc_values and points. This change exposes the total usage of fields
in segments, allowing us to adjust the memory estimate for these cases.
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.x

elasticsearchmachine pushed a commit that referenced this pull request Sep 12, 2024
We have introduced a new memory estimation method in serverless, based 
on the number of segments and the fields within them. This new approach
works well overall, but it still falls short in cases where most fields
are used more than once - for example, in both doc_values and postings,
or doc_values and points. This change exposes the total usage of fields
in segments, allowing us to adjust the memory estimate for these cases.
v1v added a commit to v1v/elasticsearch that referenced this pull request Sep 12, 2024
…tion-ironbank-ubi

* upstream/main: (302 commits)
  Deduplicate BucketOrder when deserializing (elastic#112707)
  Introduce test utils for ingest pipelines (elastic#112733)
  [Test] Account for auto-repairing for shard gen file (elastic#112778)
  Do not throw in task enqueued by CancellableRunner (elastic#112780)
  Mute org.elasticsearch.script.StatsSummaryTests testEqualsAndHashCode elastic#112439
  Mute org.elasticsearch.repositories.blobstore.testkit.integrity.RepositoryVerifyIntegrityIT testTransportException elastic#112779
  Use a dedicated test executor in MockTransportService (elastic#112748)
  Estimate segment field usages (elastic#112760)
  (Doc+) Inference Pipeline ignores Mapping Analyzers (elastic#112522)
  Fix verifyVersions task (elastic#112765)
  (Doc+) Terminating Exit Codes (elastic#112530)
  (Doc+) CAT Nodes default columns (elastic#112715)
  [DOCS] Augment installation warnings (elastic#112756)
  Mute org.elasticsearch.repositories.blobstore.testkit.integrity.RepositoryVerifyIntegrityIT testCorruption elastic#112769
  Bump Elasticsearch to a minimum of JDK 21 (elastic#112252)
  ESQL: Compute support for filtering ungrouped aggs (elastic#112717)
  Bump Elasticsearch version to 9.0.0 (elastic#112570)
  add CDR related data streams to kibana_system priviliges (elastic#112655)
  Support widening of numeric types in union-types (elastic#112610)
  Introduce data stream options and failure store configuration classes (elastic#109515)
  ...
davidkyle pushed a commit that referenced this pull request Sep 12, 2024
We have introduced a new memory estimation method in serverless, based 
on the number of segments and the fields within them. This new approach
works well overall, but it still falls short in cases where most fields
are used more than once - for example, in both doc_values and postings,
or doc_values and points. This change exposes the total usage of fields
in segments, allowing us to adjust the memory estimate for these cases.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants