-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Script: Time series compile and cache evict metrics #79078
Script: Time series compile and cache evict metrics #79078
Conversation
Collects compilation and cache eviction metrics for each script context. Metrics are available in _nodes/stats in 5m/15m/1d buckets. Refs: elastic#62899
Pinging @elastic/es-core-infra (Team:Core/Infra) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some initial comments. I need more time to process the time series logic.
server/src/main/java/org/elasticsearch/script/TimeSeriesCounter.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/script/TimeSeriesCounter.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/script/TimeSeriesCounter.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/script/TimeSeriesCounter.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/script/TimeSeriesCounter.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/script/TimeSeriesCounter.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/script/TimeSeriesCounter.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/script/TimeSeriesCounter.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/script/TimeSeriesCounter.java
Outdated
Show resolved
Hide resolved
@elasticmachine update branch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the work here! I think this will be a good improvement in introspectability for users, helping make more informed decisions about script cache size adjustments. I also really like having separate precision buckets to save on memory. I have a few general comments about the approach:
- Simplify and clarify terminology. There are several terms that are not defined, and I'm not sure they are needed. You removed authoritative, but left delegate. I don't think either of these are necessary, and in fact make understanding the code more difficult. There are also several helper methods that are cryptically named. Overall I think the structure needs to be better described, perhaps with ascii diagrams to explain the relationship of one array being a zoomed in version of one bucket from another array, and what the rollover and skew behaviors are.
- Remove flexibility. While it may be that we want to reuse this counter in the future for other cases, right now we have a very specific use case, 24h/15m/5m. I think we should tailor the implementation to work with these constraints. It will simplify the implementation (and testing necessary!).
- Simplify the implementation. Recalculating the current index is unnecessary if we keep track of our current active bucket within each array. Having the current index has a bunch of advantages:
- The offset from the current index can be easily calculated, based on the delta of latest and current time.
- The logic for rolling over (and other operations like summing) can be generalized into a shared method. If the offset goes past the end of the array, it means a rollover is needed.
- If the offset is 0, we can just increment the value at the current index. We don't need the
adder
for the most immediate counter separate from the array.
- Consider using an array per metric. While the high/low precision is interesting, it makes the implementation more difficult to understand. I think it would be more straightforward to have the arrays named based on the metric they are keeping track of. Each of these could even be generalized to a tiny implementation class that wraps the array and current bucket index. This way a lot of these utilities could be implemented directly on this class, like summing the array, advancing the bucket, etc. They can have a "parent" as well, where rolling over propagates to the current bucket of the parent. For this to work we should also have the higher precision array map to a single bucket in the lower precision, I'm not sure if this is actually enforced right now (I would have expected 15 min precision on the low precision bucket, but 30 mins is used).
server/src/main/java/org/elasticsearch/script/ScriptMetrics.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/script/TimeSeriesCounter.java
Outdated
Show resolved
Hide resolved
Re: 4 above. We'll start with the simplest implementation, which is triplicate the write. Once to We can always decide to increase implementation complexity to avoid the writing multiple times. |
…ing/cache-metrics-collect-pr
…ing/cache-metrics-collect-pr
…ing/cache-metrics-collect-pr
…ing/cache-metrics-collect-pr
Updated the PR based on the feedback above.
The following concepts are used:
There are ascii diagrams representing increment within a bucket, roll over to a new bucket, skipping buckets and moving to a new epoch.
The API only exposes 24h/15m/5m.
The active bucket is tracked via
The internal implementation is called |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stu-elastic I left some comments. Additionally, could we add documentation to this PR so we have documentation explaining these stats to users?
import java.io.IOException; | ||
import java.util.Objects; | ||
|
||
public class TimeSeries implements Writeable, ToXContentFragment { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Could we add a javadoc here explaining that this is the response object and that the metrics are collected by TimeSeriesCounter
. This avoids a couple of "find usages" calls in the IDE to link the two if you don't know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
} | ||
|
||
/** | ||
* The total number of events for all time covered gby the counters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* The total number of events for all time covered gby the counters. | |
* The total number of events for all time covered by the counters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
* 300[c]-> 320[f] | ||
* | ||
* [a] Beginning of the current epoch | ||
* startOfEpoch = 200 = (t / duration) * duration = (235 / 100) * 100 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we avoid spacing this out so much so it's easier to read? (applies to below ones too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed excess spacing.
adder.increment(); | ||
lock.writeLock().lock(); | ||
try { | ||
if (t < twentyFourHours.earliestTimeInCounter()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear to me why we would ever expect this to happen? Since this counter is always called with now()
I would have thought that we would never expect t to go backwards between successive calls and even if there is a race condition I would not have thought we would expect it to go back so far?
IF the above is true then I'm not sure the right behaviour is to trash all the current stats and start again if we get an increment from a "long" time in the past? Erroring probably also isn't a good option here since we don't want to stop the compilation or execution of the script (I think? though it might be worthy of an assert to ensure we catch it in tests) but maybe we should just not increment if this happens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're using ThreadPool.absoluteTimeInMillis()
. I'll switch to ThreadPool.relativeTimeInMillis
.
If we have a very large odd update we have three options:
A) Increment the current bucket assuming it will catch back up soon
B) Ignore the update assuming it will catch back up soon
C) Clear the bucket assuming this is the "new normal"
A & B are better with temporary blips, C is good if there's a one-time adjustment but bad if the weird adjustments keep happening.
I chose C to avoid the odd "getting stuck" possibility.
*/ | ||
public long sum(long end) { | ||
long start = end - duration; | ||
if (start >= nextBucketStartTime() || start < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow the thinking on returning 0
if start < 0
here? Below we are saying we will emit incomplete buckets if the start is before the earliest time in the counter so I'm not sure why start < 0
is different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was to avoid issues with math on negative time values. In the current version, TimeSeriesCounter.now()
ensures time is never negative.
public void testOnePerSecond() { | ||
long time = now; | ||
long t; | ||
long next = randomLongBetween(1, HOUR); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename this to something like nextAssertCheck
so its easier to see that this is just controlling when we run the asserts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
server/src/test/java/org/elasticsearch/script/TimeSeriesCounterTests.java
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/script/TimeSeriesCounterTests.java
Show resolved
Hide resolved
After a chat with @colings86, here's the next steps:
|
…ing/cache-metrics-collect-pr
…ing/cache-metrics-collect-pr
@jdconrad and @colings86 I've addressed all outstanding comments. Please re-review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for walking me through it again! Changed LGTM.
Collects compilation and cache eviction metrics for
each script context.
Metrics are available in _nodes/stats in 5m/15m/1d
buckets.
Refs: #62899