-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Aggregations: Add moving average aggregation #10024
Aggregations: Add moving average aggregation #10024
Conversation
Allows the user to calculate a Moving Average over a histogram of buckets. Provides four different moving averages: - Simple - Linear weighted - Single Exponentially weighted (aka EWMA) - Double Exponentially weighted (aka Holt-winters)
return this; | ||
} | ||
|
||
public MovAvgBuilder weightingType(Weighting weightingType) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to have the Enum and method name the same? I have no preference as to whether we call it Weighting
or WeightingType
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, cool...wasn't sure if that was kosher or not. Will change everything to Weighting!
@polyfractal looks good so far. Left some comments, mostly around making the Weighting use a common interface with each weighting as a sub-class which might help make it easier to plug in new weighting types |
Awesome, thanks @colings86, all seemed reasonable but had question about the Weighting builder stuff. I'll fixup the other stuff right now. |
Lots of extra boilerplate, but mostly around registering streams and parsers. Will make future implementations very simple to build, since they just need to subclass MovAvgModel and provide some code for serialization and parsing. Also allows each model to manage their own settings configuration, which is a lot cleaner.
@colings86 @clintongormley Ok, so I refactored everything as a compromise between generic API and specific internal structure. I think it looks good now. API looks the same as before (settings hash, optional): "movavg": {
"bucketsPath": "the_sum",
"weighting" : "double_exp",
"window": 5,
"settings": {
"alpha": 0.8
}
} Internally, it uses a similar architecture as the SignificantHeuristics, e.g. a parsers are stored in a central map, the relevant parser is retrieved after we are done parsing the request, a MovAvgModel object is constructed and given to the aggregation. All models are responsible for construction, custom settings, serialization, etc. The only difference is that we extract a Map<> for settings (if it exists), instead of using a more specialized XContent parser. I think this is a fair tradeoff.
|
@polyfractal I like the new changes but I wonder if we should have a MovAvgModelBuilder where you can set the settings for a particular model and then we call settings() to get the settings object to add to the |
@colings86 Newbie java dev question: what does a builder buy us? It doesn't seem necessary from a REST point of view (still need to parse a Map), and in java-land users can just make a Just seems like even more boilerplate for such a minor thing :) |
@polyfractal It allows someone using the Java API to have concrete methods describing the options available for each model rather than having to know what keys are valid for the settings map. At the moment the MovAvgBuilder takes the weight and settings separately and as far as I can see there is no way to pass a MovAvgModel into it? When #10217 is complete we should have a MovAvgModel objects which has builders and parser methods as well being POJOs which we can pass around in the Java API. To avoid creating the MovAvgModelBuilder classes you could pre-empt this by adding the builder methods to the MovAvgModel classes now. |
@colings86 Ah, that makes sense...thanks for the explanation. I'll fix it up on my flight home, will ping you when done :) |
Better explains what is being done, and allows extensibility in the future (e.g. wavelets can be used to smooth, not just moving average)
|
Won't the different models have different parameters potentially? The name |
@rjernst Yeah, different models may will have different parameters. The current setup is that you specify which model you want to use, then pass in a generic
This is my current thinking...please feel free to poke holes in it. :) I'm thinking that we should to group functionality by "usage" whenever possible, since we shouldn't expect the average user to know how all the specific parts work. Obviously some will be experts, but I think most will be new to time series analysis. If we clump functionality by usage, a user can sit down and say "I would like to smooth my data", pull out the smoothing agg, and start fiddling with different models and parameters. Similarly, they should be able to pull out an Importantly, an "outlier" agg and "prediction" agg will also support some of the same models as smoothing. There is a lot of overlap between the three...but not full overlap, which is really the problem. The alternative is that users will need to navigate a large set of aggs that inconsistently support functionality. E.g. this table illustrates the problem:
The other concern is that if each agg supports several "modes", the output from that agg will depend on which options you have toggled. That might be irritating if toggling a new param changes the output (e.g. predict will start adding new buckets, outlier will add a new "was_outlier" field or something) Basically, the goal was to prevent an explosion of small aggs that each do one highly technical thing...leaving a lot of users in the dust because they've never done time series analysis before. There is also a bunch of extra stuff that can't be grouped easily, like autocorrelation, changepoint, statistical testing, etc. _(_Oh goodness, that turned into an essay. Sorry :( ) |
@polyfractal That totally makes sense. +1 for |
@rjernst Urgh, second guessing myself now...thoughts? I agree there is concern about making things toooo generic. There is also something nice about having an agg represent a single-purpose, with one well defined set of params. Maybe it's a pain to have one agg with like 10 different modes? Maybe we could do a hybrid approach? I was thinking about it last night and realized that most functionality has an innate purpose, but can be used for prediction or outlier based on that functionality. E.g. a moving average smooths the data, but can find outliers by comparing a value against the smoothed average. So we could do:
I think I'm still leaning towards the prior option, but wanted to write this down. Not sure, just braindumping now :) |
So...after a lot of discussion we decided to revert the naming change and go back to
/cc @rjernst |
import java.io.IOException; | ||
|
||
|
||
public interface MovAvgModelBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably have some JavaDocs since it forms part of the Java API. Also, should this not implement toXContent?
@polyfractal left a comment but I think it's pretty close now |
Allows the user to calculate a Moving Average over a histogram of buckets. Provides four different moving averages: - Simple - Linear weighted - Single Exponentially weighted (aka EWMA) - Double Exponentially weighted (aka Holt-winters) Closes #10024
Closed via a squash-merge in a824184, because I'm terrible at Git and messed up the merge process :( |
Adds an aggregation to calculate the moving average of sibling metrics in histogram-style data (histogram, date_histogram). Moving averages are useful when time series data is locally stationary and has a mean that changes slowly over time.
Seasonal data may need a different analysis, as well as data that is bimodal, "bursty" or contains frequent extreme values (which are not necessarily outliers).
Request
bucketsPath
(Required)Path to buckets/metric values to calculate moving average over
window
(Optional - default 5)The user specifies the window size they wish to calculate a moving average for. E.g. a user may want a 30-day sliding window over a histogram of 90 days total.
Currently, if there is not enough data to "fill" the window, the moving average will be calculated with whatever is available. For example, if a user selects 30-day window, days 1-29 will calculate the moving average with between 1-29 days of data.
We could investigate adding more "edge policies", which determine how to handle gaps at the edge of the moving average
weighting
(Optional - defaultsimple
)Currently, the agg supports four types of weighting:
gap_policy
(Optional - defaultignore
)Determines the policy for handling gaps in the data. Default is to ignore gaps.
settings
(Optional)Extra settings which apply to individual weighting types.
alpha
can be set forsingle_exp
. Defaults to 0.5alpha
andbeta
can be set fordouble_exp
. Defaults to 0.5 and 0.5 respectively.Response
Closes #10002