Add additional BlobCacheMetrics, expose BlobCacheMetrics via SharedBlobCacheService #111730

nicktindall · 2024-08-09T01:10:59Z

Relates: ES-9067

The additional metrics record throughput, plus the total time spent and bytes read when populating the cache. They distinguish population caused by cache misses from population caused by pre-warming, and population from the blob store from population from a peer node.

elasticsearchmachine · 2024-08-09T01:12:04Z

Pinging @elastic/es-distributed (Team:Distributed)

elasticsearchmachine · 2024-08-09T01:12:05Z

Hi @nicktindall, I've created a changelog YAML for you.

.../src/main/java/org/elasticsearch/xpack/searchablesnapshots/store/input/FrozenIndexInput.java

x-pack/plugin/blob-cache/src/main/java/org/elasticsearch/blobcache/BlobCacheMetrics.java

+         * When fetching a new commit
+         */
+        LoadCommit
+    }


ywangd

I had a quick look and it makes sense to me. Will take a closer read later. Thanks!

ywangd · 2024-08-12T07:41:38Z

x-pack/plugin/blob-cache/src/main/java/org/elasticsearch/blobcache/BlobCacheMetrics.java

+        /**
+         * When warming the cache
+         */
+        Warming,
+        /**
+         * When fetching a new commit
+         */
+        LoadCommit,
+        /**
+         * When the data we need is not in the cache
+         */
+        CacheMiss


Warming is good, but I am not sure about the other names. In theory, they are all just non-warming loads triggered by all sorts of activities, such as opening an engine, indexing, search, etc. Just tossing out a random idea: what about IndexInput?

Yeah, I will update these. I had LoadCommit in there for the downloadCommit use case (which, as you pointed out, is disabled).

Am I right in thinking the other loads you mentioned are all triggered by a cache miss, i.e. the caller asks the SearchIndexInput for the bytes and it loads any missing bytes from the blob store? (That's why I called it CacheMiss.)

What about OnDemand? That would seem to cover anything that was loaded because it was needed immediately (i.e. not warming).

In theory, other than warming, it's always triggered by a cache miss. We will likely load more than what is needed for that particular cache miss, so some data will be loaded that is not needed immediately, but those loads are still triggered by a cache miss.

idegtiarenko

Overall 👍 from me

x-pack/plugin/blob-cache/src/main/java/org/elasticsearch/blobcache/BlobCacheMetrics.java

ywangd

I left a question. Also, could we change the label to >non-issue instead of >enhancement? The updated code is not used in stateful, so end users should see no difference. The >enhancement label generates an entry in the release log, which is not necessary if the change is not related to stateful.

ywangd · 2024-08-15T00:44:15Z

x-pack/plugin/blob-cache/src/main/java/org/elasticsearch/blobcache/BlobCacheMetrics.java

+            meterRegistry.registerLongCounter(
+                "es.blob_cache.populate_time.total",
+                "The time spent copying data into the cache",
+                "milliseconds"


Maybe I missed the discussion somewhere, but I thought this should be a histogram, similar to the S3 HTTP request time?

We did discuss this. The feeling was that, because we already have the throughput distribution, recording population bytes and time as raw totals might give us more flexibility. Raw totals leave more options for aggregation in the charts (e.g. how much did we download when warming shard X, how long did we spend warming index Y, how much did we download due to warming when that node failed). I don't think you can answer those questions with bytes/time histograms; (I think) they can only tell us the distribution of chunk sizes or chunk download times in some window.
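To make the raw-totals argument concrete, here is a minimal sketch. It is not the code in this PR: `PopulationTotalsSketch`, its methods, and the sample shard IDs are all hypothetical, and a real metrics backend would aggregate per attribute set rather than keep every increment in a list. It only illustrates that counter increments tagged with attributes can be summed along any attribute after the fact.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for a metrics backend. It keeps each counter increment
// with its attributes so the totals can be sliced by any attribute later.
class PopulationTotalsSketch {

    // One counter increment: bytes copied and time spent for a single chunk.
    record Increment(String shardId, String reason, long bytes, long millis) {}

    private final List<Increment> increments = new ArrayList<>();

    void record(String shardId, String reason, long bytes, long millis) {
        increments.add(new Increment(shardId, reason, bytes, millis));
    }

    // Total bytes over any slice of the attributes.
    long totalBytes(Predicate<Increment> filter) {
        return increments.stream().filter(filter).mapToLong(Increment::bytes).sum();
    }

    // Total time over any slice of the attributes.
    long totalMillis(Predicate<Increment> filter) {
        return increments.stream().filter(filter).mapToLong(Increment::millis).sum();
    }

    // Sample data (made up) used to illustrate the queries.
    static PopulationTotalsSketch example() {
        PopulationTotalsSketch totals = new PopulationTotalsSketch();
        totals.record("shard-0", "Warming", 4096, 12);
        totals.record("shard-0", "CacheMiss", 1024, 3);
        totals.record("shard-1", "Warming", 8192, 20);
        return totals;
    }
}
```

With this shape, "how much did we download when warming shard X" is just `totalBytes(i -> i.shardId().equals("shard-0") && i.reason().equals("Warming"))`. A histogram of chunk sizes would describe the distribution of the samples, which is a different question.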

x-pack/plugin/blob-cache/src/main/java/org/elasticsearch/blobcache/BlobCacheMetrics.java

nicktindall · 2024-08-15T04:19:55Z

x-pack/plugin/blob-cache/src/main/java/org/elasticsearch/blobcache/BlobCacheMetrics.java

+        double totalSeconds = totalNanoseconds / 1_000_000_000.0;
+        double totalMegabytes = totalBytes / 1_048_576.0;
+        return totalMegabytes / totalSeconds;
+    }


I used Mebibytes because that's what ByteSizeValue#ofMb uses
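For reference, a small sketch of the unit conversion in the snippet above. The class and method names are hypothetical and the caller is assumed to pass a positive duration; only the constants (1,048,576 bytes per MiB, 1,000,000,000 ns per second) come from the diff.

```java
// Sketch of the unit conversion only; class and method names are hypothetical.
class ThroughputUnitsSketch {
    static final double BYTES_PER_MEBIBYTE = 1_048_576.0; // 1024 * 1024
    static final double NANOS_PER_SECOND = 1_000_000_000.0;

    // Throughput in MiB/s. Callers are assumed to pass totalNanoseconds > 0.
    static double mebibytesPerSecond(long totalBytes, long totalNanoseconds) {
        double totalSeconds = totalNanoseconds / NANOS_PER_SECOND;
        double totalMebibytes = totalBytes / BYTES_PER_MEBIBYTE;
        return totalMebibytes / totalSeconds;
    }
}
```

Copying 1,000,000 bytes in one second therefore reports roughly 0.954 MiB/s rather than 1 MB/s, which is worth keeping in mind when labelling charts.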

nicktindall · 2024-08-15T04:21:14Z

x-pack/plugin/blob-cache/src/main/java/org/elasticsearch/blobcache/CachePopulationSource.java

+     * When fetching data from a peer node
+     */
+    Peer
+}


This got extracted to make it easier to use in InputStreamWithSource in stateless. You could argue that CachePopulationReason should also be extracted for consistency, and I would be open to that, but it's not used elsewhere yet, so I left it in BlobCacheMetrics.
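A rough sketch of how the two enums might feed metric attributes, assuming string-valued attributes. `Warming`, `CacheMiss`, `Peer`, and `Unknown` appear in this PR's diffs and commit titles; the `BlobStore` constant, the attribute keys, and the `PopulationAttributesSketch` wrapper are assumptions for illustration, and the enums are nested here only to keep the example self-contained.

```java
import java.util.Map;

// Hypothetical wrapper; in the PR the enums are not nested in a class like this.
class PopulationAttributesSketch {
    enum CachePopulationReason { Warming, CacheMiss }

    // BlobStore is an assumed constant name; Peer and Unknown come from the PR.
    enum CachePopulationSource { BlobStore, Peer, Unknown }

    // Builds attributes as strings. The key names are assumptions.
    static Map<String, Object> attributes(String shardId, CachePopulationReason reason, CachePopulationSource source) {
        return Map.of("shard_id", shardId, "reason", reason.name(), "source", source.name());
    }
}
```

Because the source enum is a standalone type, a caller outside the metrics class can carry a `CachePopulationSource` alongside a stream and pass it through when the copy is recorded.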

ywangd

LGTM

I have some minor comments for your consideration.

x-pack/plugin/blob-cache/src/main/java/org/elasticsearch/blobcache/BlobCacheMetrics.java

ywangd · 2024-08-15T04:44:01Z

x-pack/plugin/blob-cache/src/main/java/org/elasticsearch/blobcache/BlobCacheMetrics.java

+        // This is almost certainly paranoid, but if we had a very fast/small copy with a very coarse nanosecond timer it might happen?
+        if (totalCopyTimeNanos > 0) {


I think we could add a warning log in the else branch similar to how we log a warning if s3 metric does not have a valid request time metric.

Addressed in b17f316.

I couldn't find the warning you were referring to, but I did add one.

elasticsearch/modules/repository-s3/src/main/java/org/elasticsearch/repositories/s3/S3BlobStore.java

Line 229 in 5934190

logger.warn("Expected HttpRequestTime to be tracked for request [{}] but found no count.", request);
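A minimal sketch of the guard with a warning in the else branch. This is not the code added in b17f316: the class name, method shape, and log message are hypothetical, and it uses `java.util.logging` only to stay self-contained, whereas the quoted S3BlobStore line uses the project's own logger.

```java
import java.util.OptionalDouble;
import java.util.logging.Logger;

// Hypothetical sketch of guarding the throughput calculation against a zero duration.
class GuardedThroughputSketch {
    private static final Logger logger = Logger.getLogger(GuardedThroughputSketch.class.getName());

    // Returns MiB/s, or empty (after logging a warning) when the timer reported no elapsed time.
    static OptionalDouble throughput(long bytesCopied, long totalCopyTimeNanos) {
        if (totalCopyTimeNanos > 0) {
            double seconds = totalCopyTimeNanos / 1_000_000_000.0;
            double mebibytes = bytesCopied / 1_048_576.0;
            return OptionalDouble.of(mebibytes / seconds);
        } else {
            // Avoids a divide-by-zero and leaves a trace that a sample was skipped.
            logger.warning("Copy of " + bytesCopied + " bytes reported a non-positive duration; skipping throughput sample");
            return OptionalDouble.empty();
        }
    }
}
```

Returning an empty value keeps the skipped sample out of the throughput distribution instead of recording an infinite or NaN value.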

Add callback for copy-to-cache metrics, additional BlobCacheMetrics

5dacb69

Relates: ES-9067

elasticsearchmachine added the v8.16.0 and needs:triage labels Aug 9, 2024

nicktindall added the :Distributed Indexing/Store, >enhancement, and v8.16.0 labels and removed the needs:triage and v8.16.0 labels Aug 9, 2024

elasticsearchmachine added the Team:Distributed label Aug 9, 2024

Update docs/changelog/111730.yaml

c4b6487

nicktindall requested review from ywangd and idegtiarenko August 9, 2024 01:12

nicktindall added 2 commits August 9, 2024 11:21

Shorten metric names (exceeded max length)

ad68e99

Align attribute key name

181b958

nicktindall commented Aug 9, 2024

.../src/main/java/org/elasticsearch/xpack/searchablesnapshots/store/input/FrozenIndexInput.java (outdated, resolved)

nicktindall commented Aug 9, 2024

x-pack/plugin/blob-cache/src/main/java/org/elasticsearch/blobcache/BlobCacheMetrics.java

This comment was marked as outdated.

ywangd reviewed Aug 9, 2024

x-pack/plugin/blob-cache/src/main/java/org/elasticsearch/blobcache/shared/SharedBytes.java (outdated, resolved)

x-pack/plugin/blob-cache/src/main/java/org/elasticsearch/blobcache/shared/SharedBytes.java (outdated, resolved)

nicktindall added 4 commits August 9, 2024 15:39

Remove callback from method not used by stateless

07dfc5a

Add CachePopulationReason.CacheMiss

9c8ee42

Revert "Remove callback from method not used by stateless"

752b1ef

This reverts commit 07dfc5a.

NO_OP -> NOOP

c0189f4

idegtiarenko reviewed Aug 9, 2024

x-pack/plugin/blob-cache/src/main/java/org/elasticsearch/blobcache/shared/SharedBytes.java (outdated, resolved)

nicktindall added 4 commits August 12, 2024 12:36

Remove metrics from copyToCacheFileAligned not used

7c34720

Merge remote-tracking branch 'origin/main' into feature/ES-9067_expose_cache_copy_metrics

69d58ec

Add todo

c823f75

Protect against divide-by-zero

10a32f6

ywangd reviewed Aug 12, 2024

Remove unused CachePopulationReason#LoadCommit

439152c

idegtiarenko approved these changes Aug 13, 2024

nicktindall added 3 commits August 14, 2024 12:35

Add test for SharedBytes#copyToCacheFileAligned

95d2e5f

Fix comment, ensure room for at least a byte

e18a283

Use strings for metric values

168efe1

nicktindall commented Aug 14, 2024

x-pack/plugin/blob-cache/src/main/java/org/elasticsearch/blobcache/BlobCacheMetrics.java (resolved)

Take shard ID as a string, to reduce string concatenation

f60d341

ywangd reviewed Aug 15, 2024

nicktindall added 5 commits August 15, 2024 12:15

Remove BlobCachePopulationListener

c938e89

Add source to BlobCacheMetrics

8f5967a

Add test, extract CachePopulationSource

45fb734

Expose BlobCacheMetrics from SharedBlobCacheService

71c0b2c

Fix test name

91405f2

nicktindall added >non-issue and removed >enhancement labels Aug 15, 2024

nicktindall and others added 2 commits August 15, 2024 14:10

Delete docs/changelog/111730.yaml

03a0140

Improve metric and attribute names

df97c59

nicktindall commented Aug 15, 2024

nicktindall added 4 commits August 15, 2024 14:25

Randomise BlobCacheMetricsTests

8295d35

De-duplicate

a903534

Fix spotless

e8edc63

Add Unknown CachePopulationSource

81e2776

ywangd approved these changes Aug 15, 2024

nicktindall changed the title ~~Add callback for copy-to-cache metrics, additional BlobCacheMetrics~~ Add additional BlobCacheMetrics, expose BlobCacheMetrics via SharedBlobCacheService Aug 15, 2024

Apply feedback

b17f316

nicktindall merged commit 5934190 into elastic:main Aug 15, 2024
15 checks passed

nicktindall deleted the feature/ES-9067_expose_cache_copy_metrics branch August 15, 2024 07:02

cbuescher pushed a commit to cbuescher/elasticsearch that referenced this pull request Sep 4, 2024

Add additional BlobCacheMetrics, expose BlobCacheMetrics via SharedBlobCacheService (elastic#111730)

Relates: ES-9067

b822152

davidkyle pushed a commit to davidkyle/elasticsearch that referenced this pull request Sep 5, 2024

Add additional BlobCacheMetrics, expose BlobCacheMetrics via SharedBlobCacheService (elastic#111730)

Relates: ES-9067

76755be
