Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a MovingFunction pipeline aggregation, deprecate MovingAvg agg #29594

Merged
merged 20 commits into from
May 16, 2018

Conversation

polyfractal
Copy link
Contributor

This pipeline aggregation gives the user the ability to script functions that "move" across a window
of data, instead of single data points. Through custom script contexts, we also expose a number of convenience methods:

  • windowMax()
  • windowMin()
  • windowSum()
  • simpleMovAvg()
  • linearMovAvg()
  • ewmaMovAvg()
  • holtMovAvg()
  • holtWintersMovAvg()

Since we can easily expose moving average functionality through custom script contexts, I would like to deprecate the moving_avg agg at the same time if there isn't objection. Thoughts?

At the moment, this agg doesn't support "predictions" like the old agg... but I think that is a good thing. The "prediction" functionality is just super basic extrapolation, is fairly confusing for users and of questionable value. It also means we can eventually remove the simulated annealing optimizer, since that was mainly there to help minimize the parameters for holt-winters prediction.

I gutted the math portions of the moving avg models and consolidated them in this PR. There's still a lot of boilerplate cruft from the old models left, but that's just the way it is. In the process I de-generified the models to make it work with MovFn, and generics weren't needed anyway.

This is still a WIP, needs docs, etc. @rjernst would you mind taking a look at the scripting context stuff to make sure it looks ok? I'll track down a search/aggs team member for the rest of the review when it's ready.

This is a reboot of #25137

This pipeline aggregation gives the user the ability to script functions that "move" across a window
of data, instead of single data points.  It is the scripted version of MovingAvg pipeline agg.

Through custom script contexts, we expose a number of convenience methods:

 - windowMax
 - windowMin
 - windowSum
 - simpleMovAvg
 - linearMovAvg
 - ewmaMovAvg
 - holtMovAvg
 - holtWintersMovAvg

The user can also define any arbitrary logic via their own scripting, or combine with the above methods.
No need to c/p the code, we can basically share the movavg models
with an easy path towards deprecation in the future.
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search-aggs

Copy link
Contributor

@colings86 colings86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@polyfractal I took a first pass on this and left some comments but I like the approach

moving_fn:
buckets_path: "the_avg"
window: 3
script: "MovingFunctions.windowMax(values)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should test that the other functions are exposed correctly too?

@@ -79,7 +79,7 @@ public static GapPolicy parse(String text, XContentLocation tokenLocation) {
for (GapPolicy policy : values()) {
validNames.add(policy.getName());
}
throw new ParsingException(tokenLocation, "Invalid gap policy: [" + text + "], accepted values: " + validNames);
throw new IllegalArgumentException("Invalid gap policy: [" + text + "], accepted values: " + validNames);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we removing the token location here, it seems like useful information to return to the user?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was incompatible with the new parser style (I didnt think you could get token info with the new way of writing parsers), but while writing this comment I noticed how BucketSort did it:

         PARSER.declareField(BucketSortPipelineAggregationBuilder::gapPolicy, p -> {
             if (p.currentToken() == XContentParser.Token.VALUE_STRING) {
                return GapPolicy.parse(p.text().toLowerCase(Locale.ROOT), p.getTokenLocation());
             }
             throw new IllegalArgumentException("Unsupported token [" + p.currentToken() + "]");
         }

which I somehow missed before. I'll put it back to the way it was and using the above snippet :)

/**
* Find the maximum value in a window of values
*/
public static double windowMax(Collection<Double> values) {
Copy link
Contributor

@colings86 colings86 Apr 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the JavaDocs can we qualify the behaviour if there are no values for all these functions? e.g. for this one it will return Double.NaN

*/
public static double linearMovAvg(Collection<Double> values) {
double avg = 0;
long totalWeight = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it correct that the total weight starts at 1 here? This means that if there are no values the function will return 0 which seems misleading?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good point. OTOH, that's how the linear avg works today so it'd be "breaking" the current behavior (these functions were basically lifted directly from existing moving model code)

I think we should probably change it anyway, but wanted to raise that point.

avg = (v * alpha) + (avg * (1 - alpha));
}
}
return avg;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function will also return 0 if there are no values, is this intended?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto above

@polyfractal polyfractal added review and removed WIP labels Apr 27, 2018
@polyfractal
Copy link
Contributor Author

Review comments addressed, and added documentation + deprecation notices + more tests

Copy link
Contributor

@colings86 colings86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments, mostly about documentation

etc.

This is conceptually very similar to the <<search-aggregations-pipeline-movavg-aggregation, Moving Average>> pipeline aggregation, except
it provides more functionality and better flexibility.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more functionality and better flexibility -> I think we can just say more functionality here rather than repeating?

|`buckets_path` |Path to the metric of interest (see <<buckets-path-syntax, `buckets_path` Syntax>> for more details |Required |
|`window` |The size of window to "slide" across the histogram. |Optional |`5`
|`script` |The script that should be executed on each window of data |Required |
|`params` |User-defined extra parameters to be injected into the script |Optional |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be needed as the script field itself can handle params?

// TEST[setup:sales]

<1> A `date_histogram` named "my_date_histo" is constructed on the "timestamp" field, with one-day intervals
<2> A `sum` metric is used to calculate the sum of a field. This could be any metric (sum, min, max, etc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should say this could be any numeric metric as I don't think you could point a moving_fn` at, for example, a geo_centroid agg?

<3> Finally, we specify a `moving_fn` aggregation which uses "the_sum" metric as its input.

Moving averages are built by first specifying a `histogram` or `date_histogram` over a field. You can then optionally
add normal metrics, such as a `sum`, inside of that histogram. Finally, the `moving_fn` is embedded inside the histogram.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

normal -> numeric ?

- `linearMovAvg()`
- `ewmaMovAvg()`
- `holtMovAvg()`
- `holtWintersMovAvg()`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be useful to add windowStdDev() to this list too as I have seen some requests for moving standard deviations too. However I imagine given that it would be used together with moving average, it would need the simple, linear, ewma, holt and holtWinters forms to make sense with moving average, so maybe its too complex to add it in this PR?

@@ -84,7 +84,11 @@ public MovAvgModel clone() {

@Override
public double next(Collection<Double> values) {
return MovingFunctions.simpleMovAvg(values);
double avg = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we call this sum as its not actually the avg? 😉

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

durp. :)

@@ -86,7 +86,16 @@ public MovAvgModel clone() {

@Override
public double next(Collection<Double> values) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment here explaining why this is duplicating functionality provided by MovingFunctions? I think someone could try to deduplicate the code without this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof, actually, this should be pulled back out. This was added when I thought it was a breaking change but it's not actually (since we don't allow null/NaN as a value because of gap_policy) in regular mov avg.

Thanks for noticing, I'll revert this to pointing at the static MovingFunctions version.

*/
public static double windowSum(Collection<Double> values) {
return values.stream().mapToDouble(Double::doubleValue).sum();
if (values.size() == 0) {
return 0.0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the docs change above it says:

If the window is empty, or all values are null/NaN, 0.0 is returned as the result.

This seems to not be the case here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

? If the size is zero, it returns 0.0. And in the mapToDouble(), we only use the value if it's non-null and non-NaN (otherwise we return 0.0 for that position).

There's a test to verify this: https://github.com/elastic/elasticsearch/pull/29594/files/83778999a9aec1e0170efba0d4857100a927a770..c93d065b6ffd4e37089b4e251ad784380b75a942#diff-a35897b4e9c0a693a4515cab553a6427R73

private final String bucketsPathString;
private String format = null;
private GapPolicy gapPolicy = GapPolicy.SKIP;
private int window;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is optional we should set the default here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc error. Window isn't optional, I'll make the fix in the documentation.

};


public MovFnPipelineAggregationBuilder(String name, String bucketsPath, Script script, int window) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the window is optional we should not provide it in the constructor here but rather with a setter method.

@polyfractal
Copy link
Contributor Author

  • Added more details in docs about handling of NaN and null, and how it affects avg counts
  • Made window mandatory in docs
  • Added a standard deviation. I decided to sidestep the issue about what avg to use (and proliferation of a std dev to match each avg) by just making the user pass in the avg. The example shows how to combine one of the moving avg functions with std dev to get the desired behavior. Seemed only slightly less user-friendly, but a lot more flexible and less hassle to maintain
  • Fixed up a bunch of doc testing issues

Copy link
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script part looks pretty good. I left a couple suggestions.

* This class provides a custom script context for the Moving Function pipeline aggregation,
* so that we can expose a number of pre-baked moving functions like min, max, movavg, etc
*/
public abstract class MovFnScript {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the name not be so cryptic? MovingFunctionScript?

* If all values are missing/null/NaN, the return value will be NaN.
* The average is based on the count of non-null, non-NaN values.
*/
public static double simpleMovAvg(Collection<Double> values) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just call this unweightedAvg? Is the MovAvg suffix really necessary given this is within MovingFunctions? Ditto for the methods below.

* Find the maximum value in a window of values.
* If all values are missing/null/NaN, the return value will be NaN
*/
public static double windowMax(Collection<Double> values) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of using DoubleStream as the type here? I think it could save boxing in some cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was ++ on this and got it all coded up, but realized this makes certain things impossible because streams can only be used once.

The standard deviation for example just accepts an avg and the values:

MovingFunctions.windowStdDev(values, MovingFunctions.simpleMovAvg(values))

Which won't work because the stream will get used up in the movavg, and throw an exception when we try to iterate on it again :(

@polyfractal
Copy link
Contributor Author

More cleanup:

  • Renamed MovFnScript to MovingFunctionScript
  • Test fix for movAvg, needed to assertWarning in a few places
  • Discussed with @colings86 about naming and we decided to overhaul all the names, not just simpleMovAvg. The list is now:
    • max()
    • min()
    • sum()
    • stdDev()
    • unweightedAvg()
    • linearWeightedAvg()
    • ewma()
    • holt()
    • holtWinters()
      Chose to keep ewma/holt/holtWinters instead of exponentionalWeighted/doubleExponentialWeighted/tripleExponentialWeighted for brevity :)

Copy link
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the new names! I left a couple more suggestions.

}

public static final String[] PARAMETERS = new String[] {"params", "values"};
public static final ScriptContext<Factory> CONTEXT = new ScriptContext<>("movfn", Factory.class);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be "moving-function" to parallel the class name? While this is unseen for users most of the time, it can be used when putting a stored script to compile against that context.

* Find the maximum value in a window of values.
* If all values are missing/null/NaN, the return value will be NaN
*/
public static double max(Collection<Double> values) {
Copy link
Member

@rjernst rjernst May 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have 2 concerns still:

  • Painless doesn't know anything about generics. So to a script, this means passing in any Collection (that is all painless will check for). What about double[]?
  • Collection is unordered in java, but the order matters (I think?) for some of these calculations (not max, but others). Using double[] would solve this, but if we must stick with boxed types, at least use List<Double>?

Note that doc['fieldname'] returns an object that implements List, but we could make this easier by adding an asArray() method to DoubleScriptDocValues?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think double[] should be doable. 👍

I didn't do that originally to avoid unboxing the internal EvictingQueue into an array since it wouldn't be necessary most of the time. And I avoided specifying a Queue as the type since the user might want to use these functions outside of the values we provide (or in addition to).

But I didn't think about the unordered nature of Collection, that's definitely an issue. Lemme poke at it tomorrow with double[], I think that'll work fine.

Sidebar
Alternatively, I could replace the evicting queue with a primitive-based ring buffer. I don't think that's any better though. We'd still have to arrayCopy the relevant slice into the script since there are null/NaNs to deal with, and we probably shouldn't make the method signature some weird ring buffer class :)

Note that doc['fieldname'] returns an object that implements List, but we could make this easier by adding an asArray() method to DoubleScriptDocValues?

Ah interesting. I'm not sure it's needed... do we think users will plop a field's value into the functions instead of the values? I was thinking users might tweak the values before a function, or write their own based on the values... but not use an entire field in place of the provided values. Thoughts?

Copy link
Member

@rjernst rjernst May 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forget what I said about ScriptDocValues. I was thinking this was like scripted metric agg, which uses a SearchScript for the map script, but you did this correctly (no getDoc()). So no need to do anything with the doc values accessors.

@polyfractal
Copy link
Contributor Author

@rjernst swapped over to double[]. As a side benefit, it makes things a lot easier for the user in the script (values[0] just seems more ergonomic than working with collections/lists) :)

These will be bumped to `6.3.99` after the feature has been backported
to 6.4
@polyfractal
Copy link
Contributor Author

@colings86 @rjernst did either of you want another review pass on this?

Copy link
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@colings86 colings86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM too

Test was previously muted which hid the failure.  The holt model in the
test needed updating like the real model (was using incorrect starting
count)
@polyfractal polyfractal merged commit df853c4 into elastic:master May 16, 2018
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request May 16, 2018
…ngs-to-true

* elastic/master: (34 commits)
  Test: increase search logging for LicensingTests
  Adjust serialization version in IndicesOptions
  [TEST] Fix compilation
  Remove version argument in RangeFieldType (elastic#30411)
  Remove unused DirectoryUtils class. (elastic#30582)
  Mitigate date histogram slowdowns with non-fixed timezones. (elastic#30534)
  Add a MovingFunction pipeline aggregation, deprecate MovingAvg agg (elastic#29594)
  Removes AwaitsFix on IndicesOptionsTests
  Template upgrades should happen in a system context (elastic#30621)
  Fix bug in BucketMetrics path traversal (elastic#30632)
  Fixes IndiceOptionsTests to serialise correctly (elastic#30644)
  S3 repo plugin populate SettingsFilter (elastic#30652)
  mute IndicesOptionsTests.testSerialization
  Rest High Level client: Add List Tasks (elastic#29546)
  Refactors ClientHelper to combine header logic (elastic#30620)
  [ML] Wait for ML indices in rolling upgrade tests (elastic#30615)
  Watcher: Ensure secrets integration tests also run triggered watch (elastic#30478)
  Move allocation awareness attributes to list setting (elastic#30626)
  [Docs] Update code snippet in has-child-query.asciidoc (elastic#30510)
  Replace custom reloadable Key/TrustManager (elastic#30509)
  ...
martijnvg added a commit that referenced this pull request May 17, 2018
* es/master: (74 commits)
  Preserve REST client auth despite 401 response (#30558)
  [test] packaging: add windows boxes (#30402)
  Make xpack modules instead of a meta plugin (#30589)
  Mute ShrinkIndexIT
  [ML] DeleteExpiredDataAction should use client with origin (#30646)
  Reindex: Fixed typo in assertion failure message (#30619)
  [DOCS] Fixes list of unconverted snippets in build.gradle
  [DOCS] Reorganizes RBAC documentation
  SQL: Remove dependency for server's version from JDBC driver (#30631)
  Test: increase search logging for LicensingTests
  Adjust serialization version in IndicesOptions
  [TEST] Fix compilation
  Remove version argument in RangeFieldType (#30411)
  Remove unused DirectoryUtils class. (#30582)
  Mitigate date histogram slowdowns with non-fixed timezones. (#30534)
  Add a MovingFunction pipeline aggregation, deprecate MovingAvg agg (#29594)
  Removes AwaitsFix on IndicesOptionsTests
  Template upgrades should happen in a system context (#30621)
  Fix bug in BucketMetrics path traversal (#30632)
  Fixes IndiceOptionsTests to serialise correctly (#30644)
  ...
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request May 17, 2018
* es/ccr: (75 commits)
  Preserve REST client auth despite 401 response (elastic#30558)
  [test] packaging: add windows boxes (elastic#30402)
  Make xpack modules instead of a meta plugin (elastic#30589)
  Mute ShrinkIndexIT
  [ML] DeleteExpiredDataAction should use client with origin (elastic#30646)
  Reindex: Fixed typo in assertion failure message (elastic#30619)
  [DOCS] Fixes list of unconverted snippets in build.gradle
  [DOCS] Reorganizes RBAC documentation
  SQL: Remove dependency for server's version from JDBC driver (elastic#30631)
  Test: increase search logging for LicensingTests
  Adjust serialization version in IndicesOptions
  [TEST] Fix compilation
  Remove version argument in RangeFieldType (elastic#30411)
  Remove unused DirectoryUtils class. (elastic#30582)
  Mitigate date histogram slowdowns with non-fixed timezones. (elastic#30534)
  Add a MovingFunction pipeline aggregation, deprecate MovingAvg agg (elastic#29594)
  Removes AwaitsFix on IndicesOptionsTests
  Template upgrades should happen in a system context (elastic#30621)
  Fix bug in BucketMetrics path traversal (elastic#30632)
  Fixes IndiceOptionsTests to serialise correctly (elastic#30644)
  ...
ywelsch pushed a commit to ywelsch/elasticsearch that referenced this pull request May 23, 2018
…lastic#29594)

This pipeline aggregation gives the user the ability to script functions that "move" across a window
of data, instead of single data points.  It is the scripted version of MovingAvg pipeline agg.

Through custom script contexts, we expose a number of convenience methods:

 - MovingFunctions.max()
 - MovingFunctions.min()
 - MovingFunctions.sum()
 - MovingFunctions.unweightedAvg()
 - MovingFunctions.linearWeightedAvg()
 - MovingFunctions.ewma()
 - MovingFunctions.holt()
 - MovingFunctions.holtWinters()
 - MovingFunctions.stdDev()

The user can also define any arbitrary logic via their own scripting, or combine with the above methods.
Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Sep 24, 2018
russcam pushed a commit to elastic/elasticsearch-net that referenced this pull request Oct 16, 2018
russcam pushed a commit to elastic/elasticsearch-net that referenced this pull request Oct 26, 2018
jrodewig added a commit that referenced this pull request Sep 20, 2021
Documents the removal of the `moving_avg` aggregation in the 8.0 breaking changes.

Relates to #29594.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants