Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Script: Time series compile and cache evict metrics #79078

Merged

Conversation

stu-elastic
Copy link
Contributor

Collects compilation and cache eviction metrics for
each script context.

Metrics are available in _nodes/stats in 5m/15m/1d
buckets.

Refs: #62899

Collects compilation and cache eviction metrics for
each script context.

Metrics are available in _nodes/stats in 5m/15m/1d
buckets.

Refs: elastic#62899
@stu-elastic stu-elastic added >feature v7.16.0 >enhancement :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache and removed >feature labels Oct 13, 2021
@stu-elastic stu-elastic marked this pull request as ready for review October 13, 2021 15:02
@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Oct 13, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@rjernst rjernst removed their assignment Oct 13, 2021
@rjernst rjernst self-requested a review October 13, 2021 15:07
Copy link
Contributor

@jdconrad jdconrad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some initial comments. I need more time to process the time series logic.

@stu-elastic
Copy link
Contributor Author

@elasticmachine update branch

Copy link
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the work here! I think this will be a good improvement in introspectability for users, helping make more informed decisions about script cache size adjustments. I also really like having separate precision buckets to save on memory. I have a few general comments about the approach:

  1. Simplify and clarify terminology. There are several terms that are not defined, and I'm not sure they are needed. You removed authoritative, but left delegate. I don't think either of these are necessary, and in fact make understanding the code more difficult. There are also several helper methods that are cryptically named. Overall I think the structure needs to be better described, perhaps with ascii diagrams to explain the relationship of one array being a zoomed in version of one bucket from another array, and what the rollover and skew behaviors are.
  2. Remove flexibility. While it may be that we want to reuse this counter in the future for other cases, right now we have a very specific use case, 24h/15m/5m. I think we should tailor the implementation to work with these constraints. It will simplify the implementation (and testing necessary!).
  3. Simplify the implementation. Recalculating the current index is unnecessary if we keep track of our current active bucket within each array. Having the current index has a bunch of advantages:
    • The offset from the current index can be easily calculated, based on the delta of latest and current time.
    • The logic for rolling over (and other operations like summing) can be generalized into a shared method. If the offset goes past the end of the array, it means a rollover is needed.
    • If the offset is 0, we can just increment the value at the current index. We don't need the adder for the most immediate counter separate from the array.
  4. Consider using an array per metric. While the high/low precision is interesting, it makes the implementation more difficult to understand. I think it would be more straightforward to have the arrays named based on the metric they are keeping track of. Each of these could even be generalized to a tiny implementation class that wraps the array and current bucket index. This way a lot of these utilities could be implemented directly on this class, like summing the array, advancing the bucket, etc. They can have a "parent" as well, where rolling over propagates to the current bucket of the parent. For this to work we should also have the higher precision array map to a single bucket in the lower precision, I'm not sure if this is actually enforced right now (I would have expected 15 min precision on the low precision bucket, but 30 mins is used).

@stu-elastic
Copy link
Contributor Author

Re: 4 above. We'll start with the simplest implementation, which is triplicate the write. Once to 5m, once to 15m and once to 24h, this will have worse performance than writing once but that cost implementation complexity in the current version.

We can always decide to increase implementation complexity to avoid the writing multiple times.

@stu-elastic stu-elastic added v8.1.0 and removed v8.0.0 labels Oct 28, 2021
@stu-elastic
Copy link
Contributor Author

stu-elastic commented Oct 28, 2021

Updated the PR based on the feedback above.

Simplify and clarify terminology.

The following concepts are used: bucket, epoch, earliestTimeInCounter, counterExpired, nextBucketStartTime.

Overall I think the structure needs to be better described, perhaps with ascii diagrams to explain the relationship of one array being a zoomed in version of one bucket from another array, and what the rollover and skew behaviors are.

There are ascii diagrams representing increment within a bucket, roll over to a new bucket, skipping buckets and moving to a new epoch.

Remove flexibility. While it may be that we want to reuse this counter in the future for other cases, right now we have a very specific use case, 24h/15m/5m. I think we should tailor the implementation to work with these constraints. It will simplify the implementation (and testing necessary!).

The API only exposes 24h/15m/5m.

Simplify the implementation. Recalculating the current index is unnecessary if we keep track of our current active bucket within each array. Having the current index has a bunch of advantages...

The active bucket is tracked via curBucket.

Consider using an array per metric. While the high/low precision is interesting, it makes the implementation more difficult to understand. I think it would be more straightforward to have the arrays named based on the metric they are keeping track of. Each of these could even be generalized to a tiny implementation class that wraps the array and current bucket index. This way a lot of these utilities could be implemented directly on this class, like summing the array, advancing the bucket, They can have a "parent" as well, where rolling over propagates to the current bucket of the parent

The internal implementation is called Counter, per discussion on my comment above, this PR does not implement a parent bucket to simplify the implementation as much as possible.

@stu-elastic stu-elastic requested a review from colings86 October 28, 2021 02:26
Copy link
Contributor

@colings86 colings86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stu-elastic I left some comments. Additionally, could we add documentation to this PR so we have documentation explaining these stats to users?

import java.io.IOException;
import java.util.Objects;

public class TimeSeries implements Writeable, ToXContentFragment {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Could we add a javadoc here explaining that this is the response object and that the metrics are collected by TimeSeriesCounter. This avoids a couple of "find usages" calls in the IDE to link the two if you don't know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

}

/**
* The total number of events for all time covered gby the counters.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* The total number of events for all time covered gby the counters.
* The total number of events for all time covered by the counters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

* 300[c]-> 320[f]
*
* [a] Beginning of the current epoch
* startOfEpoch = 200 = (t / duration) * duration = (235 / 100) * 100
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we avoid spacing this out so much so it's easier to read? (applies to below ones too)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed excess spacing.

adder.increment();
lock.writeLock().lock();
try {
if (t < twentyFourHours.earliestTimeInCounter()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me why we would ever expect this to happen? Since this counter is always called with now() I would have thought that we would never expect t to go backwards between successive calls and even if there is a race condition I would not have thought we would expect it to go back so far?

IF the above is true then I'm not sure the right behaviour is to trash all the current stats and start again if we get an increment from a "long" time in the past? Erroring probably also isn't a good option here since we don't want to stop the compilation or execution of the script (I think? though it might be worthy of an assert to ensure we catch it in tests) but maybe we should just not increment if this happens?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're using ThreadPool.absoluteTimeInMillis(). I'll switch to ThreadPool.relativeTimeInMillis.

If we have a very large odd update we have three options:
A) Increment the current bucket assuming it will catch back up soon
B) Ignore the update assuming it will catch back up soon
C) Clear the bucket assuming this is the "new normal"

A & B are better with temporary blips, C is good if there's a one-time adjustment but bad if the weird adjustments keep happening.

I chose C to avoid the odd "getting stuck" possibility.

*/
public long sum(long end) {
long start = end - duration;
if (start >= nextBucketStartTime() || start < 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow the thinking on returning 0 if start < 0 here? Below we are saying we will emit incomplete buckets if the start is before the earliest time in the counter so I'm not sure why start < 0 is different?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was to avoid issues with math on negative time values. In the current version, TimeSeriesCounter.now() ensures time is never negative.

public void testOnePerSecond() {
long time = now;
long t;
long next = randomLongBetween(1, HOUR);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename this to something like nextAssertCheck so its easier to see that this is just controlling when we run the asserts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@stu-elastic
Copy link
Contributor Author

After a chat with @colings86, here's the next steps:

  • Move timeProvider into TimeSeriesCounter, this makes the public interface clear. TimeSeriesCounter.inc() and TimeSeriesCounter.timeSeries() no longer take a long.
  • in Counter, change the parameter t in inc to now and indicate in Javadocs that the value of now is treated as metadata, the code assumes the increment happens "now" and uses the parameter to determine how to update the state in response to forward movements in time. Users of Counter should not dump a bunch of existing events in any order and expect a deterministic outcome.
  • TimeSeriesCounter will still reset all counters if it receives an event from timeProvider greater than 24 hours ago.
  • Counter will still clamp all events from the past through the current bucket time range to the current bucket.
  • Counter will not handle zero negative times.
  • TimeSeriesCounter will expect to recieve System.currentTimeMillis to avoid requiring Counter to handle negative times.

@stu-elastic
Copy link
Contributor Author

@jdconrad and @colings86 I've addressed all outstanding comments. Please re-review.

Copy link
Contributor

@colings86 colings86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@jdconrad jdconrad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for walking me through it again! Changed LGTM.

@stu-elastic stu-elastic merged commit 30e15ba into elastic:master Nov 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Core/Infra/Scripting Scripting abstractions, Painless, and Mustache >enhancement Team:Core/Infra Meta label for core/infra team v8.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants