Aggregations: add serial differencing pipeline aggregation #11196

Merged: 1 commit, Jul 10, 2015

polyfractal
Contributor

No need for assignment or review yet, still need to write tests!

Serial Differencing

Serial differencing (or just differencing) is a technique where a lagged copy of a time series is subtracted from the series itself: each value is reduced by the value n periods earlier. For example, the differenced datapoint is f(x) = f(x_t) - f(x_{t-n}), where n is the period (lag) being used.

A period of 1 is equivalent to a derivative: it is simply the change from one point to the next. Single periods are useful for removing constant, linear trends.
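As a rough illustration of the definition above, here is a minimal Python sketch of lag-n differencing. The function name `serial_diff` and the choice to emit `None` where no lagged value exists are assumptions made for this example; they are not taken from the PR's Java implementation.

```python
from typing import List, Optional


def serial_diff(values: List[float], lag: int = 1) -> List[Optional[float]]:
    """Return values[t] - values[t - lag] for each t.

    Positions with no value `lag` steps earlier are reported as None
    (an assumption of this sketch, not necessarily what the PR does).
    """
    if lag < 1:
        raise ValueError("lag must be a positive integer")
    return [
        values[t] - values[t - lag] if t >= lag else None
        for t in range(len(values))
    ]


# A constant linear trend with slope 2: the first difference is flat.
trend = [3, 5, 7, 9, 11]
print(serial_diff(trend, lag=1))  # [None, 2, 2, 2, 2]
```

With a lag of 1 the linear trend collapses to its constant slope, which is the sense in which a single period removes a constant, linear trend.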

Single periods are also useful for transforming data into a stationary series. In this example, the Dow Jones is plotted over ~250 days. The raw data is not stationary, which would make it difficult to use with some techniques.

But once we plot the first-difference, it becomes a stationary series (we know this because the first difference is randomly distributed around zero, and doesn't seem to exhibit any pattern/behavior). The transformation reveals that the dataset is a random-walk model, which allows us to use further analysis.

[Screenshot: Dow Jones over ~250 days, raw series and first difference]
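The random-walk observation can be sketched with synthetic data. The example below does not use the Dow Jones series; it builds a hypothetical random walk from +1/-1 steps (the seed and length are arbitrary assumptions) and shows that the first difference recovers the underlying steps, which carry no trend.

```python
import random
from itertools import accumulate

random.seed(42)  # arbitrary seed, only so the run is repeatable

# A random walk is the running sum of independent steps.
steps = [random.choice([-1, 1]) for _ in range(250)]
walk = list(accumulate(steps))

# First difference: walk[t] - walk[t - 1].
first_diff = [walk[t] - walk[t - 1] for t in range(1, len(walk))]

# Differencing undoes the running sum, leaving only the steps.
print(first_diff == steps[1:])  # True
```

The raw walk can drift far from its starting point, while the differenced series only ever takes the values of the individual steps.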

Larger periods can be used to remove seasonal / cyclic behavior. In this example, a population of lemmings was synthetically generated with a sine wave + constant linear trend + random noise. The sine wave has a period of 30 days.

The first-difference removes the constant trend, leaving just a sine wave. The 30th-difference is then applied to the first-difference to remove the cyclic behavior, leaving a stationary series which is amenable to other analysis.

[Screenshot: synthetic lemming population, raw series, first difference, and 30th difference of the first difference]
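The two-stage differencing described above can be approximated in a few lines. This sketch regenerates a lemming-like series from a linear trend plus a 30-day sine wave; the slope, amplitude, and length are assumed values, and the random noise is left out so that the residual is easy to check. In this noiseless version the first difference is a periodic wave offset by the slope, and the 30th difference of that wave is zero up to floating-point error.

```python
import math


def diff(values, lag):
    """Return values[t] - values[t - lag], dropping the first `lag` points."""
    return [values[t] - values[t - lag] for t in range(lag, len(values))]


DAYS = 120        # assumed series length
PERIOD = 30       # sine period in days, as in the description
SLOPE = 0.5       # assumed linear trend per day
AMPLITUDE = 10.0  # assumed sine amplitude

lemmings = [
    SLOPE * t + AMPLITUDE * math.sin(2 * math.pi * t / PERIOD)
    for t in range(DAYS)
]

first = diff(lemmings, 1)            # removes the linear trend
thirtieth = diff(first, PERIOD)      # removes the 30-day cycle

max_residual = max(abs(v) for v in thirtieth)
print(max_residual < 1e-9)  # True
```

If noise were added to `lemmings`, the final series would consist of differenced noise rather than zeros.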

(Old PR and comments: #10190)

API

{
   "aggs": {
      "my_date_histo": {
         "date_histogram": {
            "field": "timestamp",
            "interval": "day"
         },
         "aggs": {
            "the_sum": {
               "sum": {
                  "field": "lemmings"
               }
            },
            "first_difference": {
               "serial_diff": {
                  "buckets_path": "the_sum",
                  "lag" : 1
               }
            },
            "thirtieth_difference": {
               "serial_diff": {
                  "buckets_path": "first_difference",
                  "lag" : 30
               }
            }
         }
      }
   }
}
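For readers assembling this request programmatically, the body above could be built as a plain dictionary. The helper `serial_diff_agg` is a hypothetical convenience for this sketch and is not part of any Elasticsearch client; only the field names shown in the request above are used.

```python
import json


def serial_diff_agg(buckets_path: str, lag: int) -> dict:
    """Build one serial_diff clause (hypothetical helper for this sketch)."""
    return {"serial_diff": {"buckets_path": buckets_path, "lag": lag}}


body = {
    "aggs": {
        "my_date_histo": {
            "date_histogram": {"field": "timestamp", "interval": "day"},
            "aggs": {
                "the_sum": {"sum": {"field": "lemmings"}},
                # The first difference reads the per-bucket sum...
                "first_difference": serial_diff_agg("the_sum", 1),
                # ...and the 30th difference reads the first difference.
                "thirtieth_difference": serial_diff_agg("first_difference", 30),
            },
        }
    }
}

print(json.dumps(body, indent=3))
```

Chaining works through `buckets_path`: the second `serial_diff` points at the name of the first one rather than at the raw sum.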

@polyfractal polyfractal force-pushed the feature/aggs_2_0_diff branch from d6e0c55 to b1d07f0 Compare May 28, 2015 19:05
@clintongormley clintongormley changed the title Aggregations: Add serial differencing aggregation Add serial differencing aggregation Jun 8, 2015
@polyfractal polyfractal force-pushed the feature/aggs_2_0_diff branch 3 times, most recently from 4cc3ed9 to e133bdf Compare July 7, 2015 19:24
@polyfractal
Contributor Author

@colings86 Low priority, but this is up for review whenever you have a few spare minutes. It is blissfully simple compared to moving_avg :)

Open question: currently, if there is not enough data (or the lag is too large), you just don't get any serial_diff metric values. We could also throw an exception, but that seems like poor behavior (the rest of your aggs may work fine). Thoughts?
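To make the two options in this open question concrete, here is a small Python sketch of both behaviours. It is only an illustration of the trade-off, not the PR's implementation; the function names and the use of `None` for a missing value are assumptions.

```python
from typing import List, Optional


def serial_diff_lenient(values: List[float], lag: int) -> List[Optional[float]]:
    """Emit None wherever there is no value `lag` buckets earlier."""
    return [
        values[t] - values[t - lag] if t >= lag else None
        for t in range(len(values))
    ]


def serial_diff_strict(values: List[float], lag: int) -> List[Optional[float]]:
    """Fail loudly when the lag leaves nothing to compute."""
    if lag >= len(values):
        raise ValueError(
            f"lag {lag} is too large for {len(values)} buckets"
        )
    return serial_diff_lenient(values, lag)


daily_sums = [10.0, 12.0, 15.0]

# Lenient: the caller gets no differenced values and no explanation.
print(serial_diff_lenient(daily_sums, lag=30))  # [None, None, None]

# Strict: the caller gets an explanation, but the whole call fails.
try:
    serial_diff_strict(daily_sums, lag=30)
except ValueError as err:
    print(err)  # lag 30 is too large for 3 buckets
```

The lenient variant lets everything else in the request succeed, at the cost of a silent gap; the strict variant surfaces the cause but aborts work that would otherwise have been fine.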

@polyfractal polyfractal added review and removed WIP labels Jul 7, 2015
PipelineAggregatorStreams.registerStream(STREAM, TYPE.stream());
}

private static final Function<Aggregation, InternalAggregation> FUNCTION = new Function<Aggregation, InternalAggregation>() {
Contributor

Could use PipelineAggregator.AGGREGATION_TRANFORM_FUNCTION instead of this?

@colings86
Contributor

@polyfractal left some comments, but I really like this aggregation and the documentation for it is great :)

To your question: I am struggling to decide what is best. As you say, throwing an exception seems unfriendly and would differ from how other aggregations behave. But equally, if we just don't output anything, users can easily be confused about why the aggregation is not working, since there will be no message or indication anywhere of what caused it to produce no data.

@polyfractal
Contributor Author

@colings86 cleaned up, ready at your leisure :)

@colings86
Contributor

LGTM

@polyfractal polyfractal changed the title Add serial differencing aggregation Aggregations: add serial differencing pipeline aggregation Jul 10, 2015