Script: Time series compile and cache evict metrics #79078

stu-elastic · 2021-10-13T14:57:58Z

Collects compilation and cache eviction metrics for each script context. Metrics are available in _nodes/stats in 5m/15m/1d buckets.

Refs: elastic#62899

elasticmachine · 2021-10-13T15:02:32Z

Pinging @elastic/es-core-infra (Team:Core/Infra)

jdconrad

Just some initial comments. I need more time to process the time series logic.

stu-elastic · 2021-10-18T14:22:07Z

@elasticmachine update branch

rjernst

Thanks for all the work here! I think this will be a good improvement in introspectability for users, helping make more informed decisions about script cache size adjustments. I also really like having separate precision buckets to save on memory. I have a few general comments about the approach:

1. Simplify and clarify terminology. There are several terms that are not defined, and I'm not sure they are needed. You removed authoritative, but left delegate. I don't think either of these are necessary, and in fact make understanding the code more difficult. There are also several helper methods that are cryptically named. Overall I think the structure needs to be better described, perhaps with ascii diagrams to explain the relationship of one array being a zoomed in version of one bucket from another array, and what the rollover and skew behaviors are.
2. Remove flexibility. While it may be that we want to reuse this counter in the future for other cases, right now we have a very specific use case, 24h/15m/5m. I think we should tailor the implementation to work with these constraints. It will simplify the implementation (and testing necessary!).
3. Simplify the implementation. Recalculating the current index is unnecessary if we keep track of our current active bucket within each array. Having the current index has a bunch of advantages:
   - The offset from the current index can be easily calculated, based on the delta of latest and current time.
   - The logic for rolling over (and other operations like summing) can be generalized into a shared method. If the offset goes past the end of the array, it means a rollover is needed.
   - If the offset is 0, we can just increment the value at the current index. We don't need the adder for the most immediate counter separate from the array.
4. Consider using an array per metric. While the high/low precision is interesting, it makes the implementation more difficult to understand. I think it would be more straightforward to have the arrays named based on the metric they are keeping track of. Each of these could even be generalized to a tiny implementation class that wraps the array and current bucket index. This way a lot of these utilities could be implemented directly on this class, like summing the array, advancing the bucket, etc. They can have a "parent" as well, where rolling over propagates to the current bucket of the parent. For this to work we should also have the higher precision array map to a single bucket in the lower precision, I'm not sure if this is actually enforced right now (I would have expected 15 min precision on the low precision bucket, but 30 mins is used).
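
To make points 3 and 4 concrete, here is a minimal sketch of a "tiny implementation class" that wraps an array and a current bucket index. It is not the code from this PR: the class name `BucketedCounter`, its fields, and the choice to clamp earlier times into the active bucket are all assumptions made for illustration.

```java
import java.util.Arrays;

/** Hypothetical ring of fixed-width buckets that remembers its active bucket. */
final class BucketedCounter {
    private final long resolution;  // width of one bucket, in arbitrary time units
    private final long[] buckets;   // ring buffer of per-bucket event counts
    private int curBucket = 0;      // index of the active bucket
    private long curBucketStart;    // start time of the active bucket

    BucketedCounter(long resolution, int numBuckets, long now) {
        this.resolution = resolution;
        this.buckets = new long[numBuckets];
        this.curBucketStart = (now / resolution) * resolution;
    }

    /** Record one event at time now; earlier times are clamped to the active bucket. */
    void inc(long now) {
        advanceTo(now);
        buckets[curBucket]++;
    }

    /** Total events currently held across all buckets. */
    long sum() {
        return Arrays.stream(buckets).sum();
    }

    /** Shared rollover logic: move the active bucket forward, clearing what it passes. */
    private void advanceTo(long now) {
        if (now < curBucketStart + resolution) {
            return; // offset 0 (or time went backwards): stay in the active bucket
        }
        long offset = (now - curBucketStart) / resolution; // at least 1 here
        if (offset >= buckets.length) {
            Arrays.fill(buckets, 0); // offset runs past the array: everything expired
        } else {
            for (long i = 1; i <= offset; i++) {
                buckets[(int) ((curBucket + i) % buckets.length)] = 0;
            }
        }
        curBucket = (int) ((curBucket + offset) % buckets.length);
        curBucketStart += offset * resolution;
    }
}
```

Because the active index and its start time are stored, the offset is a single division, and both the "skip a few buckets" and "skip the whole array" cases fall out of the same method.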

stu-elastic · 2021-10-20T21:53:50Z

Re: 4 above. We'll start with the simplest implementation, which is to triplicate the write: once to 5m, once to 15m, and once to 24h. This will perform worse than writing once, but writing once is what cost the implementation complexity in the current version.

We can always decide to increase implementation complexity later to avoid writing multiple times.
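
As a rough illustration of the triplicate write, the sketch below sends every event to three independent windows. The names `TripleWriteCounter` and `Window` are hypothetical, and the bucket resolutions (15s, 90s, 15m) are assumptions chosen for the example, not values confirmed in this thread. Times are in seconds.

```java
import java.util.Arrays;

/** Hypothetical sketch: every event is written to three independent windows. */
final class TripleWriteCounter {
    /** Minimal fixed-window ring of buckets; all times are in seconds. */
    static final class Window {
        final long resolution;
        final long[] buckets;
        long curStart; // start time of the active bucket
        int cur;       // index of the active bucket

        Window(long resolution, long duration, long now) {
            this.resolution = resolution;
            this.buckets = new long[(int) (duration / resolution)];
            this.curStart = (now / resolution) * resolution;
        }

        void inc(long now) {
            // Number of whole buckets to move forward; backwards time is clamped to 0.
            long steps = Math.max(0, (now - curStart) / resolution);
            // Clearing more than buckets.length buckets would only repeat work.
            for (long i = 0; i < Math.min(steps, buckets.length); i++) {
                cur = (cur + 1) % buckets.length;
                buckets[cur] = 0;
            }
            curStart += steps * resolution;
            buckets[cur]++;
        }

        long sum() {
            return Arrays.stream(buckets).sum();
        }
    }

    final Window fiveMinutes;
    final Window fifteenMinutes;
    final Window twentyFourHours;

    TripleWriteCounter(long now) {
        fiveMinutes = new Window(15, 5 * 60, now);            // assumed 15s buckets
        fifteenMinutes = new Window(90, 15 * 60, now);        // assumed 90s buckets
        twentyFourHours = new Window(15 * 60, 24 * 60 * 60, now); // assumed 15m buckets
    }

    /** One logical event costs three writes, one per window. */
    void inc(long now) {
        fiveMinutes.inc(now);
        fifteenMinutes.inc(now);
        twentyFourHours.inc(now);
    }
}
```

The trade-off is the one described above: each window is trivially understandable on its own, at the price of three updates per event instead of one.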

stu-elastic · 2021-10-28T02:26:11Z

Updated the PR based on the feedback above.

> Simplify and clarify terminology.

The following concepts are used: bucket, epoch, earliestTimeInCounter, counterExpired, nextBucketStartTime.

> Overall I think the structure needs to be better described, perhaps with ascii diagrams to explain the relationship of one array being a zoomed in version of one bucket from another array, and what the rollover and skew behaviors are.

There are ASCII diagrams representing an increment within a bucket, rolling over to a new bucket, skipping buckets, and moving to a new epoch.

> Remove flexibility. While it may be that we want to reuse this counter in the future for other cases, right now we have a very specific use case, 24h/15m/5m. I think we should tailor the implementation to work with these constraints. It will simplify the implementation (and testing necessary!).

The API only exposes 24h/15m/5m.

> Simplify the implementation. Recalculating the current index is unnecessary if we keep track of our current active bucket within each array. Having the current index has a bunch of advantages...

The active bucket is tracked via curBucket.

> Consider using an array per metric. While the high/low precision is interesting, it makes the implementation more difficult to understand. I think it would be more straightforward to have the arrays named based on the metric they are keeping track of. Each of these could even be generalized to a tiny implementation class that wraps the array and current bucket index. This way a lot of these utilities could be implemented directly on this class, like summing the array, advancing the bucket, etc. They can have a "parent" as well, where rolling over propagates to the current bucket of the parent.

The internal implementation is called Counter. Per the discussion in my comment above, this PR does not implement a parent bucket, to keep the implementation as simple as possible.

colings86

@stu-elastic I left some comments. Additionally, could we add documentation to this PR so we have documentation explaining these stats to users?

colings86 · 2021-10-29T08:25:34Z

server/src/main/java/org/elasticsearch/script/TimeSeries.java

+import java.io.IOException;
+import java.util.Objects;
+
+public class TimeSeries implements Writeable, ToXContentFragment {


nit: Could we add a javadoc here explaining that this is the response object and that the metrics are collected by TimeSeriesCounter? This avoids a couple of "find usages" calls in the IDE to link the two if you don't know.

colings86 · 2021-10-29T08:28:56Z

server/src/main/java/org/elasticsearch/script/TimeSeriesCounter.java

+    }
+
+    /**
+     * The total number of events for all time covered gby the counters.


Suggested change

-     * The total number of events for all time covered gby the counters.
+     * The total number of events for all time covered by the counters.

colings86 · 2021-10-29T08:31:27Z

server/src/main/java/org/elasticsearch/script/TimeSeriesCounter.java

+         * 300[c]->    320[f]
+         *
+         * [a] Beginning of the current epoch
+         * startOfEpoch = 200            = (t / duration) * duration                          = (235 / 100) * 100


nit: can we avoid spacing this out so much so it's easier to read? (applies to below ones too)

Removed excess spacing.
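
For readers following the diagram, the epoch arithmetic in the quoted line can be checked with a few lines of Java. This is a worked example only: the class `EpochMath` and the `bucketIndex` helper (with its resolution parameter) are hypothetical and are not taken from the PR. Only `startOfEpoch = (t / duration) * duration` comes from the quoted comment.

```java
/** Hypothetical helper reproducing the integer arithmetic from the diagram. */
final class EpochMath {
    /** Integer division floors non-negative t to the start of its epoch. */
    static long startOfEpoch(long t, long duration) {
        return (t / duration) * duration;
    }

    /** Assumed helper: which bucket of the epoch a time falls into. */
    static int bucketIndex(long t, long duration, long resolution) {
        return (int) ((t - startOfEpoch(t, duration)) / resolution);
    }

    public static void main(String[] args) {
        // Values from the diagram: t = 235, duration = 100.
        System.out.println(startOfEpoch(235, 100));    // (235 / 100) * 100 = 200
        // Assuming a bucket resolution of 20: (235 - 200) / 20 = 1.
        System.out.println(bucketIndex(235, 100, 20));
    }
}
```

The arithmetic relies on truncating integer division, so it only floors correctly for non-negative times.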

colings86 · 2021-10-29T08:38:18Z

server/src/main/java/org/elasticsearch/script/TimeSeriesCounter.java

+        adder.increment();
+        lock.writeLock().lock();
+        try {
+            if (t < twentyFourHours.earliestTimeInCounter()) {


It's not clear to me why we would ever expect this to happen? Since this counter is always called with now() I would have thought that we would never expect t to go backwards between successive calls and even if there is a race condition I would not have thought we would expect it to go back so far?

If the above is true, then I'm not sure the right behaviour is to trash all the current stats and start again when we get an increment from a "long" time in the past. Erroring probably also isn't a good option here, since we don't want to stop the compilation or execution of the script (I think, though it might be worth an assert to ensure we catch it in tests), but maybe we should just not increment if this happens?

We're using ThreadPool.absoluteTimeInMillis(). I'll switch to ThreadPool.relativeTimeInMillis().

If we have a very large odd update we have three options:
A) Increment the current bucket assuming it will catch back up soon
B) Ignore the update assuming it will catch back up soon
C) Clear the bucket assuming this is the "new normal"

A & B are better with temporary blips, C is good if there's a one-time adjustment but bad if the weird adjustments keep happening.

I chose C to avoid the odd "getting stuck" possibility.
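
The three options can be compared side by side with a deliberately simplified model: a single window with one count. Everything in this sketch is hypothetical (the class `BackwardsTimeDemo`, the `Policy` enum, the single-bucket window); it only illustrates how A, B and C diverge when one timestamp arrives from far in the past.

```java
/** Hypothetical single-window counter showing options A, B and C. */
final class BackwardsTimeDemo {
    enum Policy { INCREMENT_CURRENT, IGNORE, RESET } // options A, B, C

    private final long duration;
    private final Policy policy;
    private long windowStart; // earliest time still covered by the window
    private long count;

    BackwardsTimeDemo(long duration, Policy policy, long now) {
        this.duration = duration;
        this.policy = policy;
        this.windowStart = (now / duration) * duration;
    }

    void inc(long t) {
        if (t < windowStart) { // time went backwards past the window
            switch (policy) {
                case INCREMENT_CURRENT: // A: count it in the current window
                    count++;
                    return;
                case IGNORE:            // B: drop the event
                    return;
                case RESET:             // C: treat t as the "new normal"
                    windowStart = (t / duration) * duration;
                    count = 1;
                    return;
            }
        }
        if (t >= windowStart + duration) { // ordinary expiry on forward movement
            windowStart = (t / duration) * duration;
            count = 0;
        }
        count++;
    }

    long count() {
        return count;
    }
}
```

With option C, a single stray timestamp followed by a return to normal time clears the window twice, which is the "bad if the weird adjustments keep happening" case. Options A and B keep the existing count through the blip.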

colings86 · 2021-10-29T09:35:04Z

server/src/main/java/org/elasticsearch/script/TimeSeriesCounter.java

+         */
+        public long sum(long end) {
+            long start = end - duration;
+            if (start >= nextBucketStartTime() || start < 0) {


I'm not sure I follow the thinking on returning 0 if start < 0 here? Below we are saying we will emit incomplete buckets if the start is before the earliest time in the counter so I'm not sure why start < 0 is different?

This was to avoid issues with math on negative time values. In the current version, TimeSeriesCounter.now() ensures time is never negative.

colings86 · 2021-10-29T09:54:59Z

server/src/test/java/org/elasticsearch/script/TimeSeriesCounterTests.java

+    public void testOnePerSecond() {
+        long time = now;
+        long t;
+        long next = randomLongBetween(1, HOUR);


Can we rename this to something like nextAssertCheck so it's easier to see that this is just controlling when we run the asserts?

stu-elastic · 2021-11-02T11:43:59Z

After a chat with @colings86, here are the next steps:

- Move timeProvider into TimeSeriesCounter; this makes the public interface clear. TimeSeriesCounter.inc() and TimeSeriesCounter.timeSeries() no longer take a long.
- In Counter, rename the parameter t in inc to now, and indicate in the Javadocs that the value of now is treated as metadata: the code assumes the increment happens "now" and uses the parameter only to determine how to update the state in response to forward movements in time. Users of Counter should not dump a bunch of existing events in any order and expect a deterministic outcome.
- TimeSeriesCounter will still reset all counters if it receives a time from timeProvider that is more than 24 hours in the past.
- Counter will still clamp all events from the past through the current bucket time range to the current bucket.
- Counter will not handle negative times.
- TimeSeriesCounter will expect to receive System.currentTimeMillis() values, to avoid requiring Counter to handle negative times.
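
A minimal sketch of what that public interface could look like, with the clock injected rather than passed on each call. The class `ClockedCounter` is hypothetical and collapses the real per-window counters into a single count; the use of `LongSupplier` for the time provider is an assumption for illustration.

```java
import java.util.function.LongSupplier;

/** Hypothetical counter that owns its clock, so inc() takes no time argument. */
final class ClockedCounter {
    private static final long DAY_MILLIS = 24L * 60 * 60 * 1000;

    private final LongSupplier timeProvider; // e.g. System::currentTimeMillis
    private long latest; // latest time seen so far
    private long count;

    ClockedCounter(LongSupplier timeProvider) {
        this.timeProvider = timeProvider;
        this.latest = timeProvider.getAsLong();
    }

    /** The event is assumed to happen "now"; the clock only drives state updates. */
    synchronized void inc() {
        long now = timeProvider.getAsLong();
        if (now < latest - DAY_MILLIS) {
            count = 0;    // more than 24 hours in the past: reset everything
            latest = now;
        } else if (now > latest) {
            latest = now; // only forward movement advances the state
        }
        count++;          // small backwards steps are clamped to the current state
    }

    synchronized long count() {
        return count;
    }
}
```

In production the counter would be built with `System::currentTimeMillis`, while tests can pass a lambda over a mutable value to control time deterministically.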

stu-elastic · 2021-11-03T03:40:37Z

@jdconrad and @colings86 I've addressed all outstanding comments. Please re-review.

colings86

LGTM

jdconrad

Thanks for walking me through it again! Changed LGTM.

Script: Time series compile and cache evict metrics

a6c7a93

elasticsearchmachine added the v8.0.0 label Oct 13, 2021

stu-elastic added >feature v7.16.0 >enhancement :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache and removed >feature labels Oct 13, 2021

stu-elastic assigned rjernst and jdconrad Oct 13, 2021

stu-elastic marked this pull request as ready for review October 13, 2021 15:02

elasticmachine added the Team:Core/Infra Meta label for core/infra team label Oct 13, 2021

rjernst removed their assignment Oct 13, 2021

rjernst self-requested a review October 13, 2021 15:07

checkstyle

829523e

stu-elastic unassigned jdconrad Oct 13, 2021

stu-elastic mentioned this pull request Oct 13, 2021

Provide better compilation and cache stats to remove compiation rate limits #62899

Closed

3 tasks

stu-elastic added 3 commits October 13, 2021 13:24

remove empty if

610b95d

Collectors import unused

c36d060

TimeSeriesCounterTest -> TimeSeriesCounterTests

795baaa

jdconrad reviewed Oct 13, 2021

stu-elastic added 2 commits October 13, 2021 20:00

correctly name seconds, clamp sub series to latest

937037d

Use threadpool for time and time ranges for counters

ee1b1a7

elasticmachine and others added 6 commits October 18, 2021 10:22

Merge branch 'master' into scripting/cache-metrics-collect-pr

ee3e551

Remove spurious newlines

dfe5b4c

Update comments

67d3a43

Test coverage

7c5af25

Rename internal snapshot to timeSuppliedSnapshot to avoid var args collision

3d4d4a5

Constructor test coverage

3384062

rjernst requested changes Oct 20, 2021

stu-elastic added 13 commits October 27, 2021 09:48

Merge branch 'master' of github.com:elastic/elasticsearch into scripting/cache-metrics-collect-pr

40d2598

Simplify Counters, add example documentation

760b832

Update merge from master

3c8ea05

Revert imports IngestServiceTests

d078d19

Merge branch 'master' of github.com:elastic/elasticsearch into scripting/cache-metrics-collect-pr

333dcf3

spotless apply

916a4a1

Merge branch 'master' of github.com:elastic/elasticsearch into scripting/cache-metrics-collect-pr

44ebd20

Tests and total

2822f35

Merge branch 'master' of github.com:elastic/elasticsearch into scripting/cache-metrics-collect-pr

e2f1c6c

Revert total

b2cce9c

align time period

d673c20

total serialized in ScriptContextStats

e9740d5

diagram tweaks

6b3f5d4

stu-elastic added v8.1.0 and removed v8.0.0 labels Oct 28, 2021

stu-elastic requested a review from colings86 October 28, 2021 02:26

colings86 requested changes Oct 29, 2021

stu-elastic added 3 commits November 2, 2021 13:50

Merge branch 'master' of github.com:elastic/elasticsearch into scripting/cache-metrics-collect-pr

8b8c3ae

time moved to TSC, add comments, docs

63a23dc

Merge branch 'master' of github.com:elastic/elasticsearch into scripting/cache-metrics-collect-pr

19a7516

stu-elastic requested review from colings86 and jdconrad November 3, 2021 03:40

colings86 approved these changes Nov 3, 2021

jdconrad approved these changes Nov 3, 2021

stu-elastic merged commit 30e15ba into elastic:master Nov 3, 2021
