Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add additional BlobCacheMetrics, expose BlobCacheMetrics via SharedBlobCacheService #111730

Merged

Conversation

nicktindall
Copy link
Contributor

@nicktindall nicktindall commented Aug 9, 2024

Relates: ES-9067

The additional metrics allow us to record throughput, and total time/amount read when populating the cache. We can distinguish between population due to cache-misses and pre-warming, and we can distinguish between populating the cache from the blob store or a peer node.

@elasticsearchmachine elasticsearchmachine added v8.16.0 needs:triage Requires assignment of a team area label labels Aug 9, 2024
@nicktindall nicktindall added :Distributed Indexing/Store Issues around managing unopened Lucene indices. If it touches Store.java, this is a likely label. >enhancement v8.16.0 and removed needs:triage Requires assignment of a team area label v8.16.0 labels Aug 9, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team (obsolete) label Aug 9, 2024
@elasticsearchmachine
Copy link
Collaborator

Hi @nicktindall, I've created a changelog YAML for you.

* When fetching a new commit
*/
LoadCommit
}

This comment was marked as outdated.

Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a quick look and it makes sense to me. Will take a closer read later. Thanks!

Comment on lines 31 to 42
/**
* When warming the cache
*/
Warming,
/**
* When fetching a new commit
*/
LoadCommit,
/**
* When the data we need is not in the cache
*/
CacheMiss
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warming is good. But I am not sure about the other names. In theory, they are just non-warming triggered by all sorts of activities, such as opening engine, indexing, search etc. Just tossing some random idea, what about IndexInput?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I will update these, I had LoadCommit in there for the downloadCommit use-case (which, as you pointed out, is disabled).

Am I right in thinking the other loads you mentioned are all triggered by a cache-miss? i.e. the caller asks the SearchIndexInput for the bytes and it loads any missing bytes from the blob-store? (that's why I called it CacheMiss)

What about OnDemand?, that would seem to cover anything that was loaded because it was needed immediately (i.e. not warming)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory, it's always triggered by cache miss other than warming. We likely will load more than what is needed for that particular cache miss. So there will be data loaded but not needed immediately. But they are still triggered by cache miss.

Copy link
Contributor

@idegtiarenko idegtiarenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall 👍 from me

Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a question. Also, could we change the label to >non-issue instead of enhancement since the updated code is not used in stateful and end-user should see no difference. The label enhancement generates an entry in release log which is not necessary if the change is not related to stateful.

Comment on lines 68 to 71
meterRegistry.registerLongCounter(
"es.blob_cache.populate_time.total",
"The time spent copying data into the cache",
"milliseconds"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I missed the discussion somewhere: I thought this should be a histogram similar to s3 http request time?

Copy link
Contributor Author

@nicktindall nicktindall Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We did discuss this, the feeling was that because we've got the throughput distribution, it might give us more flexibility to record population bytes and time as raw totals. Leaving them as raw totals leaves more options for aggregation in the charts (e.g. how much did we download when warming shard X, how long did we spend warming index Y, how much did we download due to warming when that node failed) I don't think you can answer those questions with bytes/time histograms, (I think) they can only tell us the distribution of chunk sizes or chunk download times in some window.

double totalSeconds = totalNanoseconds / 1_000_000_000.0;
double totalMegabytes = totalBytes / 1_048_576.0;
return totalMegabytes / totalSeconds;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used Mebibytes because that's what ByteSizeValue#ofMb uses

* When fetching data from a peer node
*/
Peer
}
Copy link
Contributor Author

@nicktindall nicktindall Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This got extracted out to make it easier to use in InputStreamWithSource in stateless, you could argue that CachePopulationReason should be extracted also for consistency, and I would be open to that, but it's not used elsewhere yet so I left it in BlobCacheMetrics.

Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I have some minor comments for your consideration.

Comment on lines 139 to 140
// This is almost certainly paranoid, but if we had a very fast/small copy with a very coarse nanosecond timer it might happen?
if (totalCopyTimeNanos > 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could add a warning log in the else branch similar to how we log a warning if s3 metric does not have a valid request time metric.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in b17f316

I couldn't find the warning you were referring to, but I did add one

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logger.warn("Expected HttpRequestTime to be tracked for request [{}] but found no count.", request);

@nicktindall nicktindall changed the title Add callback for copy-to-cache metrics, additional BlobCacheMetrics Add additional BlobCacheMetrics, expose BlobCacheMetrics via SharedBlobCacheService Aug 15, 2024
@nicktindall nicktindall merged commit 5934190 into elastic:main Aug 15, 2024
15 checks passed
@nicktindall nicktindall deleted the feature/ES-9067_expose_cache_copy_metrics branch August 15, 2024 07:02
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Aug 16, 2024
* upstream/main: (91 commits)
  Mute org.elasticsearch.xpack.test.rest.XPackRestIT org.elasticsearch.xpack.test.rest.XPackRestIT elastic#111944
  Add audit_unenrolled_* attributes to fleet-agents template (elastic#111909)
  Fix windows memory locking (elastic#111866)
  Update OAuth2 OIDC SDK (elastic#108799)
  Adds a warning about manually mounting snapshots managed by ILM (elastic#111883)
  Update geoip fixture files and utility methods (elastic#111913)
  Updated Function Score Query Test with Explain Fixes for 8.15.1 (elastic#111929)
  Mute org.elasticsearch.xpack.sql.qa.security.JdbcCsvSpecIT org.elasticsearch.xpack.sql.qa.security.JdbcCsvSpecIT elastic#111923
  [ESQL] date nanos binary comparisons (elastic#111908)
  [DOCS] Documents output_field behavior after multiple inference runs (elastic#111875)
  Add additional BlobCacheMetrics, expose BlobCacheMetrics via SharedBlobCacheService (elastic#111730)
  Mute org.elasticsearch.xpack.sql.qa.multi_cluster_with_security.JdbcCsvSpecIT org.elasticsearch.xpack.sql.qa.multi_cluster_with_security.JdbcCsvSpecIT elastic#111923
  Mute org.elasticsearch.xpack.sql.qa.multi_cluster_with_security.JdbcCsvSpecIT test {agg-ordering.testHistogramDateTimeWithCountAndOrder_2} elastic#111919
  Mute org.elasticsearch.xpack.sql.qa.multi_cluster_with_security.JdbcCsvSpecIT test {date.testDateParseHaving} elastic#111921
  Mute org.elasticsearch.xpack.sql.qa.multi_cluster_with_security.JdbcCsvSpecIT test {agg-ordering.testHistogramDateTimeWithCountAndOrder_1} elastic#111918
  Mute org.elasticsearch.xpack.sql.qa.multi_cluster_with_security.JdbcCsvSpecIT test {datetime.testDateTimeParseHaving} elastic#111922
  Mute org.elasticsearch.xpack.sql.qa.single_node.JdbcCsvSpecIT org.elasticsearch.xpack.sql.qa.single_node.JdbcCsvSpecIT elastic#111923
  Mute org.elasticsearch.xpack.sql.qa.single_node.JdbcCsvSpecIT test {agg-ordering.testHistogramDateTimeWithCountAndOrder_1} elastic#111918
  Mute org.elasticsearch.xpack.sql.qa.single_node.JdbcCsvSpecIT test {datetime.testDateTimeParseHaving} elastic#111922
  Mute org.elasticsearch.xpack.sql.qa.single_node.JdbcCsvSpecIT test {date.testDateParseHaving} elastic#111921
  ...

# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersions.java
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this pull request Sep 4, 2024
davidkyle pushed a commit to davidkyle/elasticsearch that referenced this pull request Sep 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/Store Issues around managing unopened Lucene indices. If it touches Store.java, this is a likely label. >non-issue Team:Distributed Meta label for distributed team (obsolete) v8.16.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants