Add additional BlobCacheMetrics, expose BlobCacheMetrics via SharedBlobCacheService #111730
Pinging @elastic/es-distributed (Team:Distributed)
Hi @nicktindall, I've created a changelog YAML for you.
.../src/main/java/org/elasticsearch/xpack/searchablesnapshots/store/input/FrozenIndexInput.java
 * When fetching a new commit
 */
LoadCommit
}
I had a quick look and it makes sense to me. Will take a closer read later. Thanks!
x-pack/plugin/blob-cache/src/main/java/org/elasticsearch/blobcache/shared/SharedBytes.java
/**
 * When warming the cache
 */
Warming,
/**
 * When fetching a new commit
 */
LoadCommit,
/**
 * When the data we need is not in the cache
 */
CacheMiss
Warming is good, but I am not sure about the other names. In theory, they are all just non-warming loads triggered by all sorts of activities, such as opening the engine, indexing, search, etc. Just tossing out a random idea: what about IndexInput?
Yeah, I will update these. I had LoadCommit in there for the downloadCommit use case (which, as you pointed out, is disabled).
Am I right in thinking the other loads you mentioned are all triggered by a cache miss? That is, the caller asks the SearchIndexInput for the bytes and it loads any missing bytes from the blob store (that's why I called it CacheMiss).
What about OnDemand? That would seem to cover anything that was loaded because it was needed immediately (i.e. not warming).
In theory, anything other than warming is always triggered by a cache miss. We will likely load more than what is needed for that particular cache miss, so some data will be loaded that is not needed immediately, but that load is still triggered by a cache miss.
Overall 👍 from me
x-pack/plugin/blob-cache/src/main/java/org/elasticsearch/blobcache/BlobCacheMetrics.java
I left a question. Also, could we change the label to >non-issue instead of enhancement, since the updated code is not used in stateful and end users should see no difference? The enhancement label generates an entry in the release log, which is not necessary if the change is not related to stateful.
meterRegistry.registerLongCounter(
    "es.blob_cache.populate_time.total",
    "The time spent copying data into the cache",
    "milliseconds"
Maybe I missed the discussion somewhere: I thought this should be a histogram similar to s3 http request time?
We did discuss this. The feeling was that, because we already have the throughput distribution, recording population bytes and time as raw totals gives us more flexibility. Raw totals leave more options for aggregation in the charts (e.g. how much did we download when warming shard X, how long did we spend warming index Y, how much did we download due to warming when that node failed). I don't think you can answer those questions with bytes/time histograms; (I think) they can only tell us the distribution of chunk sizes or chunk download times in some window.
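To illustrate the point about raw totals, here is a minimal sketch of counters keyed by attributes that can be summed along any dimension after the fact. It is not the Elasticsearch telemetry API: the class `RawTotalsSketch`, its methods, and the attribute keys `reason` and `shard` are all hypothetical names chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for attribute-tagged counters; not the ES MeterRegistry API.
final class RawTotalsSketch {
    // Key: the attribute set of a recording. Value: {total bytes, total millis}.
    private final Map<Map<String, String>, long[]> totals = new HashMap<>();

    // Add one cache-population event to the running totals for its attribute set.
    void record(Map<String, String> attributes, long bytes, long millis) {
        long[] entry = totals.computeIfAbsent(Map.copyOf(attributes), k -> new long[2]);
        entry[0] += bytes;
        entry[1] += millis;
    }

    // Sum the bytes of every recording whose attribute `key` equals `value`.
    long totalBytesWhere(String key, String value) {
        return sumWhere(key, value, 0);
    }

    // Sum the time of every recording whose attribute `key` equals `value`.
    long totalMillisWhere(String key, String value) {
        return sumWhere(key, value, 1);
    }

    private long sumWhere(String key, String value, int index) {
        long sum = 0;
        for (Map.Entry<Map<String, String>, long[]> e : totals.entrySet()) {
            if (value.equals(e.getKey().get(key))) {
                sum += e.getValue()[index];
            }
        }
        return sum;
    }
}
```

With totals stored this way, a question like "how much did we download while warming" becomes a sum over one attribute, which a histogram of per-chunk sizes cannot reconstruct once the individual samples have been bucketed.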
x-pack/plugin/blob-cache/src/main/java/org/elasticsearch/blobcache/BlobCacheMetrics.java
    double totalSeconds = totalNanoseconds / 1_000_000_000.0;
    double totalMegabytes = totalBytes / 1_048_576.0;
    return totalMegabytes / totalSeconds;
}
I used mebibytes (1,048,576 bytes) because that's what ByteSizeValue#ofMb uses.
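For reference, the conversion in the snippet above can be written as a self-contained method. This is a sketch rather than the code merged in the PR: the class name `ThroughputSketch`, the method name, and the choice to return NaN for a non-positive elapsed time are assumptions made for the example.

```java
// Hypothetical helper illustrating the MiB/s computation; names are not from the PR.
final class ThroughputSketch {
    private static final double BYTES_PER_MEBIBYTE = 1_048_576.0; // 1024 * 1024
    private static final double NANOS_PER_SECOND = 1_000_000_000.0;

    private ThroughputSketch() {}

    // Returns throughput in mebibytes per second.
    // Assumption: a non-positive elapsed time yields NaN instead of dividing by zero.
    static double mebibytesPerSecond(long totalBytes, long totalNanoseconds) {
        if (totalNanoseconds <= 0) {
            return Double.NaN;
        }
        double totalSeconds = totalNanoseconds / NANOS_PER_SECOND;
        double totalMebibytes = totalBytes / BYTES_PER_MEBIBYTE;
        return totalMebibytes / totalSeconds;
    }
}
```

For example, copying 8 MiB (8,388,608 bytes) in 500,000,000 ns gives 8.0 / 0.5 = 16.0 MiB/s.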
 * When fetching data from a peer node
 */
Peer
}
This got extracted out to make it easier to use in InputStreamWithSource in stateless. You could argue that CachePopulationReason should also be extracted for consistency, and I would be open to that, but it's not used elsewhere yet, so I left it in BlobCacheMetrics.
LGTM
I have some minor comments for your consideration.
x-pack/plugin/blob-cache/src/main/java/org/elasticsearch/blobcache/BlobCacheMetrics.java
// This is almost certainly paranoid, but if we had a very fast/small copy with a very coarse nanosecond timer it might happen?
if (totalCopyTimeNanos > 0) {
I think we could add a warning log in the else branch, similar to how we log a warning if the s3 metric does not have a valid request time.
Addressed in b17f316.
I couldn't find the warning you were referring to, but I did add one.
elasticsearch/modules/repository-s3/src/main/java/org/elasticsearch/repositories/s3/S3BlobStore.java
Line 229 in 5934190
logger.warn("Expected HttpRequestTime to be tracked for request [{}] but found no count.", request);
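The guard and the warning discussed in this thread could look roughly like the sketch below. It is not the change made in b17f316: `PopulationRecorderSketch` and its methods are hypothetical, the warning text is invented, and `java.util.logging` stands in for the Elasticsearch logger so that the example compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical recorder; java.util.logging is a stand-in for the ES logger.
final class PopulationRecorderSketch {
    private static final Logger LOGGER = Logger.getLogger(PopulationRecorderSketch.class.getName());

    // Throughput samples in MiB/s, kept in memory purely for illustration.
    private final List<Double> throughputSamples = new ArrayList<>();

    // Records a throughput sample only when the elapsed time is positive.
    // Returns true if a sample was recorded, false if the copy was skipped with a warning.
    boolean recordThroughput(long bytesCopied, long totalCopyTimeNanos) {
        if (totalCopyTimeNanos > 0) {
            double seconds = totalCopyTimeNanos / 1_000_000_000.0;
            double mebibytes = bytesCopied / 1_048_576.0;
            throughputSamples.add(mebibytes / seconds);
            return true;
        } else {
            // A very fast copy measured with a coarse timer can report zero elapsed time.
            LOGGER.warning("Zero-time copy reported, skipping throughput sample: bytesCopied=" + bytesCopied);
            return false;
        }
    }

    // Returns an immutable snapshot of the recorded samples.
    List<Double> samples() {
        return List.copyOf(throughputSamples);
    }
}
```

Skipping the sample avoids a division by zero producing an infinite throughput value, while the warning keeps the anomaly visible.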
x-pack/plugin/blob-cache/src/test/java/org/elasticsearch/blobcache/BlobCacheMetricsTests.java
…obCacheService (elastic#111730) Relates: ES-9067
Relates: ES-9067
The additional metrics allow us to record throughput, and total time/amount read when populating the cache. We can distinguish between population due to cache-misses and pre-warming, and we can distinguish between populating the cache from the blob store or a peer node.