Aggregations: add serial differencing pipeline aggregation #11196
@colings86 Low priority, but this is up for review whenever you have a few spare minutes. It is blissfully simple compared to Open question: currently, if there is not enough data (or the lag is too large), you just don't get any
PipelineAggregatorStreams.registerStream(STREAM, TYPE.stream());
}

private static final Function<Aggregation, InternalAggregation> FUNCTION = new Function<Aggregation, InternalAggregation>() {
Could use PipelineAggregator.AGGREGATION_TRANFORM_FUNCTION instead of this?
@polyfractal left some comments, but I really like this aggregation and the documentation for it is great :) To your question: I am struggling to decide what is best. As you say, throwing an exception seems unfriendly and would be different behaviour to other aggregations. But equally, if we just don't output anything, it can easily confuse users as to why the aggregation is not working, since there will be no message or indication anywhere of what caused the aggregation to not output any data.
@colings86 cleaned up, ready at your leisure :)
LGTM
Aggregations: add serial differencing pipeline aggregation
No need for assignment or review yet, still need to write tests!
Serial Differencing
Serial differencing (or just differencing) is a technique where each value in a time series has an earlier value of the same series subtracted from it, at a chosen time lag or period. For example, the differenced datapoint is f(x) = f(x_t) - f(x_{t-n}), where n is the period being used.
A period of 1 is equivalent to a derivative: it is simply the change from one point to the next. Single periods are useful for removing constant, linear trends.
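The formula above can be sketched in a few lines of plain Java. This is only an illustration of the arithmetic, not the code in this PR: the class name SerialDiffSketch and the method diff are hypothetical, and filling the first `lag` slots with NaN is an assumption made for the sketch (how the aggregation itself handles missing data is the open question discussed in the comments).

```java
import java.util.Arrays;

// Hypothetical helper, not part of Elasticsearch: illustrates f(x_t) - f(x_{t-n}).
final class SerialDiffSketch {

    /**
     * Returns out[t] = values[t] - values[t - lag].
     * The first `lag` positions have no lagged partner; this sketch marks them
     * as NaN (an assumption, not necessarily what the aggregation does).
     */
    static double[] diff(double[] values, int lag) {
        if (lag < 1) {
            throw new IllegalArgumentException("lag must be >= 1");
        }
        double[] out = new double[values.length];
        Arrays.fill(out, Double.NaN);
        for (int t = lag; t < values.length; t++) {
            out[t] = values[t] - values[t - lag];
        }
        return out;
    }
}
```

With a lag of 1, a purely linear series such as 5, 7, 9, 11 differences to a constant 2 at every position after the first, which is the sense in which a single period removes a constant linear trend.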
Single periods are also useful for transforming data into a stationary series. In this example, the Dow Jones is plotted over ~250 days. The raw data is not stationary, which would make it difficult to use with some techniques.
But once we plot the first-difference, it becomes a stationary series (we know this because the first difference is randomly distributed around zero, and doesn't seem to exhibit any pattern/behavior). The transformation reveals that the dataset follows a random-walk model, which allows us to apply further analysis.
Larger periods can be used to remove seasonal / cyclic behavior. In this example, a population of lemmings was synthetically generated with a sine wave + constant linear trend + random noise. The sine wave has a period of 30 days.
The first-difference removes the constant trend, leaving just a sine wave. The 30th-difference is then applied to the first-difference to remove the cyclic behavior, leaving a stationary series which is amenable to other analysis.
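The two-step transformation described above can be reproduced with a small, self-contained Java sketch. Everything here is illustrative: the class LemmingDiffSketch, its methods, and the slope, amplitude, noise, and seed parameters are made up for this example, and the generator is not the one used to produce the plots in this PR. NaN marks positions that have no lagged value, which is a choice made for the sketch.

```java
import java.util.Random;

// Hypothetical sketch: synthetic "lemmings" series = linear trend + 30-day sine + noise.
final class LemmingDiffSketch {

    /** Generates slope*t + amplitude*sin(2*pi*t/30) + noise*gaussian for t in [0, days). */
    static double[] lemmings(int days, double slope, double amplitude, double noise, long seed) {
        Random rnd = new Random(seed);
        double[] out = new double[days];
        for (int t = 0; t < days; t++) {
            out[t] = slope * t
                    + amplitude * Math.sin(2 * Math.PI * t / 30.0)
                    + noise * rnd.nextGaussian();
        }
        return out;
    }

    /** out[t] = values[t] - values[t - lag]; NaN where no lagged value exists. */
    static double[] diff(double[] values, int lag) {
        double[] out = new double[values.length];
        for (int t = 0; t < values.length; t++) {
            out[t] = t < lag ? Double.NaN : values[t] - values[t - lag];
        }
        return out;
    }

    /** First-difference (removes the trend), then 30th-difference (removes the cycle). */
    static double[] deseasonalize(double[] values) {
        return diff(diff(values, 1), 30);
    }
}
```

With the noise amplitude set to zero, the result is numerically zero from index 31 onward, since both the trend and the 30-day cycle cancel. With noise enabled, what remains is only differenced noise scattered around zero.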
(Old PR and comments: #10190)
API