counter reset on aggregation boundaries #1568

alfred-landrum · 2019-09-25T17:11:48Z

My colleague @aponjavic and I have a question about the counter aggregation implementation: We were wondering if the following scenario could occur in real world usage:

2 consecutive aggregation chunks, A1 and A2 are created
a counter reset occurs at the boundary between A1 and A2
the counter state for A2 is greater than the last raw for A1

In this scenario, a CounterSeriesIterator wouldn't treat the A1 -> A2 transition as a reset. I can force this scenario by modifying the unit tests in downsample_test.go; for example, using raw values of:

{t: 5, v: 95}, {t: 15, v: 100}, {t: 25, v: 97}, {t: 35, v: 107}

and forcing aggregation at 20 time unit boundaries causes 2 aggregation chunks, whose values are:

[]downsample.sample{downsample.sample{t:15, v:100}, downsample.sample{t:15, v:100}, downsample.sample{t:35, v:107}, downsample.sample{t:35, v:107}}

Since the value of 107 is > 100, it seems that any iteration over these wouldn't track the reset that occurred between time 15 and time 25.

If this can actually occur, we imagine that adding a "first raw" sample point in A2 for {t: 25: v: 97}, and then ensuring its passed through the aggregation iterations, would address the issue.

Please let us know if this is a real possibility, or if we've misunderstood how counter downsampling works!

The text was updated successfully, but these errors were encountered:

brian-brazil · 2019-09-26T15:06:36Z

I don't think that's the output that downsampleRaw would produce, it'd be {t: 15, v: 95}, {t: 15, v: 100}, {t: 35, v: 197}, {t: 35, v: 107}

I did notice that

thanos/pkg/compact/downsample/downsample.go

Line 594 in 99bc7d2

return it.Next()

may cause a call stack overflow, a for loop would be better.

alfred-landrum · 2019-09-26T15:45:08Z

@brian-brazil : thanks for looking at this. Are the points you posted from a single aggregation chunk? If not, where did the aggregation boundary fall?

brian-brazil · 2019-09-26T17:22:33Z

Oops, got my timestamps wrong. That's two counter chunk, with two samples each.

alfred-landrum · 2019-10-23T17:49:21Z

Here's a diagram that may better illustrate the potential issue. It shows two sections of the same time series, where a counter reset occurred between time 3 & 4.

If that also happens to be the boundary between aggregated chunks, the reset will effectively get lost when using the aggregated counter state: the reset check will compare the state of the second section (4) to the last value of the first section (3).

That could be avoided by recording the 'first value' of an aggregated chunk, and then use it use it instead of the counter state.

brian-brazil · 2019-10-23T18:15:29Z

I'm not seeing that in the code,

thanos/pkg/compact/downsample/downsample.go

Line 582 in bd9aa1b

it.totalV += v

should catch it. Maybe a unittest would demonstrate better?

thanos-io#1568 Signed-off-by: Alfred Landrum <[email protected]>

alfred-landrum · 2019-10-23T20:24:32Z

Yes - here's a unit test that fails due to the missing reset, and also demos storing the first raw value in the chunk to address the issue:

https://github.com/thanos-io/thanos/compare/master...alfred-landrum:ctr-agg-1568-verify?expand=1

brian-brazil · 2019-10-23T20:56:26Z

You're missing this bit of logic in that test:

thanos/pkg/compact/downsample/downsample.go

Line 523 in bd9aa1b

ab.apps[AggrCounter].Append(lastT, it.lastV)

which I believe handles it.

alfred-landrum · 2019-10-23T21:27:06Z

it's there; it's called via aggrChunkBuilder.finalizeChunk:
https://github.com/thanos-io/thanos/blob/master/pkg/compact/downsample/downsample.go#L292

brian-brazil · 2019-10-24T11:03:16Z

Your test is calling downsampleRaw twice, but it's only called once with all the data https://github.com/thanos-io/thanos/blob/master/pkg/compact/downsample/downsample.go#L133 in the current code. Adjusting the test for that, I get the expected result.

alfred-landrum · 2019-10-24T14:28:23Z

The posted test doesn't call downsampleRaw. It calls downsampleOneBatch - a function that consolidates the core loop logic of downsampleRaw:

https://github.com/thanos-io/thanos/compare/master...alfred-landrum:ctr-agg-1568-verify?expand=1#diff-dc3c081796dd4ba1a1fdf96d58ab1f44R333

downsampleRaw calls downsampleOneBatch (or the equivalent logic in master) for batches of the raw input samples, along aggregation boundaries. The logic to choose an aggregation boundary is driven by the targetChunkCount heuristic, and the requirement to include all samples for a given resolution window:

https://github.com/thanos-io/thanos/compare/master...alfred-landrum:ctr-agg-1568-verify?expand=1#diff-dc3c081796dd4ba1a1fdf96d58ab1f44R327

neither of which would prevent the chosen boundary from occurring at the same time as a counter reset.

So, the test calls downsampleOneBatch twice to demonstrate the case where a counter reset and aggregation boundary align.

brian-brazil · 2019-10-25T10:39:41Z

I see what you're saying. I'm guessing maintaining state across the boundaries would handle this.

We'd also want to check for the across-blocks case.

alfred-landrum · 2019-10-25T15:43:31Z

The posted branch includes a proof-of-concept solution:
https://github.com/thanos-io/thanos/compare/master...alfred-landrum:ctr-agg-1568-verify?expand=1#diff-02576818966950907446c791c58856e4R69

which is to store the first raw value in the aggregated chunk. The first values would also need to be preserved during downsampling of aggregated chunks as well, which is not demonstrated in the posted branch.

Preserving the first raw value seems like a minimal change that may require no changes to CounterSeriesIterator.

alfred-landrum · 2019-10-30T15:41:44Z

fyi, I plan to put up a PR that adds the first raw value for the first downsampling (as in the posted branch above), and also preserves the first value at subsequent aggregations as well.

As discussed in thanos-io#1568, storing only the last raw value of a chunk will lose a counter reset when: a) the reset occurs at a chunk boundary, and b) the last raw value of the earlier chunk is less than the first aggregated value of the later chunk. This commit stores the first raw value of a chunk during the initial raw aggregation, and retains it during subsequent aggregations. This is similar to the existing handling for the last raw value of a chunk. With this change, when counterSeriesIterator iterates over a chunk boundary, it will see the last raw value of the earlier chunk, then the first raw value of the later chunk, and then the first aggregated value of the later chunk. The first raw value will always be less than or equal to the first aggregated value, so the only difference in counterSeriesIterator's output will be the possible detection of a reset and an extra sample after the chunk boundary. Fixes: thanos-io#1568 Signed-off-by: Alfred Landrum <[email protected]>

* store the first raw value of a chunk during downsampling As discussed in #1568, storing only the last raw value of a chunk will lose a counter reset when: a) the reset occurs at a chunk boundary, and b) the last raw value of the earlier chunk is less than the first aggregated value of the later chunk. This commit stores the first raw value of a chunk during the initial raw aggregation, and retains it during subsequent aggregations. This is similar to the existing handling for the last raw value of a chunk. With this change, when counterSeriesIterator iterates over a chunk boundary, it will see the last raw value of the earlier chunk, then the first raw value of the later chunk, and then the first aggregated value of the later chunk. The first raw value will always be less than or equal to the first aggregated value, so the only difference in counterSeriesIterator's output will be the possible detection of a reset and an extra sample after the chunk boundary. Fixes: #1568 Signed-off-by: Alfred Landrum <[email protected]> * changelog for #1709 Signed-off-by: Alfred Landrum <[email protected]> * adjust existing downsampling tests Signed-off-by: Alfred Landrum <[email protected]> * add counter aggregation comments to CounterSeriesIterator Signed-off-by: Alfred Landrum <[email protected]>

) * store the first raw value of a chunk during downsampling As discussed in thanos-io#1568, storing only the last raw value of a chunk will lose a counter reset when: a) the reset occurs at a chunk boundary, and b) the last raw value of the earlier chunk is less than the first aggregated value of the later chunk. This commit stores the first raw value of a chunk during the initial raw aggregation, and retains it during subsequent aggregations. This is similar to the existing handling for the last raw value of a chunk. With this change, when counterSeriesIterator iterates over a chunk boundary, it will see the last raw value of the earlier chunk, then the first raw value of the later chunk, and then the first aggregated value of the later chunk. The first raw value will always be less than or equal to the first aggregated value, so the only difference in counterSeriesIterator's output will be the possible detection of a reset and an extra sample after the chunk boundary. Fixes: thanos-io#1568 Signed-off-by: Alfred Landrum <[email protected]> * changelog for thanos-io#1709 Signed-off-by: Alfred Landrum <[email protected]> * adjust existing downsampling tests Signed-off-by: Alfred Landrum <[email protected]> * add counter aggregation comments to CounterSeriesIterator Signed-off-by: Alfred Landrum <[email protected]> Signed-off-by: Aleksey Sin <[email protected]>

alfred-landrum added a commit to alfred-landrum/thanos that referenced this issue Oct 23, 2019

Test case showing issue described in:

9f7194e

thanos-io#1568 Signed-off-by: Alfred Landrum <[email protected]>

alfred-landrum added a commit to alfred-landrum/thanos that referenced this issue Oct 23, 2019

Test case showing issue described in:

26806b2

thanos-io#1568 Signed-off-by: Alfred Landrum <[email protected]>

alfred-landrum added a commit to alfred-landrum/thanos that referenced this issue Oct 23, 2019

Test case showing issue described in:

ece4493

thanos-io#1568 Signed-off-by: Alfred Landrum <[email protected]>

alfred-landrum mentioned this issue Nov 3, 2019

store the first raw value of a chunk during downsampling #1709

Merged

2 tasks

bwplotka closed this as completed in #1709 Nov 9, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

counter reset on aggregation boundaries #1568

counter reset on aggregation boundaries #1568

alfred-landrum commented Sep 25, 2019

brian-brazil commented Sep 26, 2019 •

edited

Loading

alfred-landrum commented Sep 26, 2019

brian-brazil commented Sep 26, 2019

alfred-landrum commented Oct 23, 2019

brian-brazil commented Oct 23, 2019

alfred-landrum commented Oct 23, 2019

brian-brazil commented Oct 23, 2019

alfred-landrum commented Oct 23, 2019

brian-brazil commented Oct 24, 2019

alfred-landrum commented Oct 24, 2019

brian-brazil commented Oct 25, 2019

alfred-landrum commented Oct 25, 2019

alfred-landrum commented Oct 30, 2019

counter reset on aggregation boundaries #1568

counter reset on aggregation boundaries #1568

Comments

alfred-landrum commented Sep 25, 2019

brian-brazil commented Sep 26, 2019 • edited Loading

alfred-landrum commented Sep 26, 2019

brian-brazil commented Sep 26, 2019

alfred-landrum commented Oct 23, 2019

brian-brazil commented Oct 23, 2019

alfred-landrum commented Oct 23, 2019

brian-brazil commented Oct 23, 2019

alfred-landrum commented Oct 23, 2019

brian-brazil commented Oct 24, 2019

alfred-landrum commented Oct 24, 2019

brian-brazil commented Oct 25, 2019

alfred-landrum commented Oct 25, 2019

alfred-landrum commented Oct 30, 2019

brian-brazil commented Sep 26, 2019 •

edited

Loading