Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

counter reset on aggregation boundaries #1568

Closed
alfred-landrum opened this issue Sep 25, 2019 · 13 comments · Fixed by #1709
Closed

counter reset on aggregation boundaries #1568

alfred-landrum opened this issue Sep 25, 2019 · 13 comments · Fixed by #1709

Comments

@alfred-landrum
Copy link
Contributor

My colleague @aponjavic and I have a question about the counter aggregation implementation: We were wondering if the following scenario could occur in real world usage:

  • 2 consecutive aggregation chunks, A1 and A2 are created
  • a counter reset occurs at the boundary between A1 and A2
  • the counter state for A2 is greater than the last raw for A1

In this scenario, a CounterSeriesIterator wouldn't treat the A1 -> A2 transition as a reset. I can force this scenario by modifying the unit tests in downsample_test.go; for example, using raw values of:

{t: 5, v: 95}, {t: 15, v: 100}, {t: 25, v: 97}, {t: 35, v: 107}

and forcing aggregation at 20 time unit boundaries causes 2 aggregation chunks, whose values are:

[]downsample.sample{downsample.sample{t:15, v:100}, downsample.sample{t:15, v:100}, downsample.sample{t:35, v:107}, downsample.sample{t:35, v:107}}

Since the value of 107 is > 100, it seems that any iteration over these wouldn't track the reset that occurred between time 15 and time 25.

If this can actually occur, we imagine that adding a "first raw" sample point in A2 for {t: 25: v: 97}, and then ensuring its passed through the aggregation iterations, would address the issue.

Please let us know if this is a real possibility, or if we've misunderstood how counter downsampling works!

@brian-brazil
Copy link
Contributor

brian-brazil commented Sep 26, 2019

I don't think that's the output that downsampleRaw would produce, it'd be {t: 15, v: 95}, {t: 15, v: 100}, {t: 35, v: 197}, {t: 35, v: 107}

I did notice that

may cause a call stack overflow, a for loop would be better.

@alfred-landrum
Copy link
Contributor Author

@brian-brazil : thanks for looking at this. Are the points you posted from a single aggregation chunk? If not, where did the aggregation boundary fall?

@brian-brazil
Copy link
Contributor

Oops, got my timestamps wrong. That's two counter chunk, with two samples each.

@alfred-landrum
Copy link
Contributor Author

counter-state

Here's a diagram that may better illustrate the potential issue. It shows two sections of the same time series, where a counter reset occurred between time 3 & 4.

If that also happens to be the boundary between aggregated chunks, the reset will effectively get lost when using the aggregated counter state: the reset check will compare the state of the second section (4) to the last value of the first section (3).

That could be avoided by recording the 'first value' of an aggregated chunk, and then use it use it instead of the counter state.

@brian-brazil
Copy link
Contributor

I'm not seeing that in the code,

should catch it. Maybe a unittest would demonstrate better?

alfred-landrum added a commit to alfred-landrum/thanos that referenced this issue Oct 23, 2019
alfred-landrum added a commit to alfred-landrum/thanos that referenced this issue Oct 23, 2019
alfred-landrum added a commit to alfred-landrum/thanos that referenced this issue Oct 23, 2019
@alfred-landrum
Copy link
Contributor Author

Yes - here's a unit test that fails due to the missing reset, and also demos storing the first raw value in the chunk to address the issue:

https://github.com/thanos-io/thanos/compare/master...alfred-landrum:ctr-agg-1568-verify?expand=1

@brian-brazil
Copy link
Contributor

You're missing this bit of logic in that test:

ab.apps[AggrCounter].Append(lastT, it.lastV)
which I believe handles it.

@alfred-landrum
Copy link
Contributor Author

it's there; it's called via aggrChunkBuilder.finalizeChunk:
https://github.com/thanos-io/thanos/blob/master/pkg/compact/downsample/downsample.go#L292

@brian-brazil
Copy link
Contributor

Your test is calling downsampleRaw twice, but it's only called once with all the data https://github.com/thanos-io/thanos/blob/master/pkg/compact/downsample/downsample.go#L133 in the current code. Adjusting the test for that, I get the expected result.

@alfred-landrum
Copy link
Contributor Author

The posted test doesn't call downsampleRaw. It calls downsampleOneBatch - a function that consolidates the core loop logic of downsampleRaw:

https://github.com/thanos-io/thanos/compare/master...alfred-landrum:ctr-agg-1568-verify?expand=1#diff-dc3c081796dd4ba1a1fdf96d58ab1f44R333

downsampleRaw calls downsampleOneBatch (or the equivalent logic in master) for batches of the raw input samples, along aggregation boundaries. The logic to choose an aggregation boundary is driven by the targetChunkCount heuristic, and the requirement to include all samples for a given resolution window:

https://github.com/thanos-io/thanos/compare/master...alfred-landrum:ctr-agg-1568-verify?expand=1#diff-dc3c081796dd4ba1a1fdf96d58ab1f44R327

neither of which would prevent the chosen boundary from occurring at the same time as a counter reset.

So, the test calls downsampleOneBatch twice to demonstrate the case where a counter reset and aggregation boundary align.

@brian-brazil
Copy link
Contributor

I see what you're saying. I'm guessing maintaining state across the boundaries would handle this.

We'd also want to check for the across-blocks case.

@alfred-landrum
Copy link
Contributor Author

The posted branch includes a proof-of-concept solution:
https://github.com/thanos-io/thanos/compare/master...alfred-landrum:ctr-agg-1568-verify?expand=1#diff-02576818966950907446c791c58856e4R69

which is to store the first raw value in the aggregated chunk. The first values would also need to be preserved during downsampling of aggregated chunks as well, which is not demonstrated in the posted branch.

Preserving the first raw value seems like a minimal change that may require no changes to CounterSeriesIterator.

@alfred-landrum
Copy link
Contributor Author

fyi, I plan to put up a PR that adds the first raw value for the first downsampling (as in the posted branch above), and also preserves the first value at subsequent aggregations as well.

alfred-landrum added a commit to alfred-landrum/thanos that referenced this issue Nov 3, 2019
As discussed in thanos-io#1568, storing only the last raw value
of a chunk will lose a counter reset when:
a) the reset occurs at a chunk boundary, and
b) the last raw value of the earlier chunk is less than
the first aggregated value of the later chunk.

This commit stores the first raw value of a chunk during
the initial raw aggregation, and retains it during
subsequent aggregations. This is similar to the existing
handling for the last raw value of a chunk.

With this change, when counterSeriesIterator iterates over
a chunk boundary, it will see the last raw value of the
earlier chunk, then the first raw value of the later chunk,
and then the first aggregated value of the later chunk. The
first raw value will always be less than or equal to the
first aggregated value, so the only difference in
counterSeriesIterator's output will be the possible detection
of a reset and an extra sample after the chunk boundary.

Fixes: thanos-io#1568

Signed-off-by: Alfred Landrum <[email protected]>
bwplotka pushed a commit that referenced this issue Nov 9, 2019
* store the first raw value of a chunk during downsampling

As discussed in #1568, storing only the last raw value
of a chunk will lose a counter reset when:
a) the reset occurs at a chunk boundary, and
b) the last raw value of the earlier chunk is less than
the first aggregated value of the later chunk.

This commit stores the first raw value of a chunk during
the initial raw aggregation, and retains it during
subsequent aggregations. This is similar to the existing
handling for the last raw value of a chunk.

With this change, when counterSeriesIterator iterates over
a chunk boundary, it will see the last raw value of the
earlier chunk, then the first raw value of the later chunk,
and then the first aggregated value of the later chunk. The
first raw value will always be less than or equal to the
first aggregated value, so the only difference in
counterSeriesIterator's output will be the possible detection
of a reset and an extra sample after the chunk boundary.

Fixes: #1568

Signed-off-by: Alfred Landrum <[email protected]>

* changelog for #1709

Signed-off-by: Alfred Landrum <[email protected]>

* adjust existing downsampling tests

Signed-off-by: Alfred Landrum <[email protected]>

* add counter aggregation comments to CounterSeriesIterator

Signed-off-by: Alfred Landrum <[email protected]>
IKSIN pushed a commit to monitoring-tools/thanos that referenced this issue Nov 26, 2019
)

* store the first raw value of a chunk during downsampling

As discussed in thanos-io#1568, storing only the last raw value
of a chunk will lose a counter reset when:
a) the reset occurs at a chunk boundary, and
b) the last raw value of the earlier chunk is less than
the first aggregated value of the later chunk.

This commit stores the first raw value of a chunk during
the initial raw aggregation, and retains it during
subsequent aggregations. This is similar to the existing
handling for the last raw value of a chunk.

With this change, when counterSeriesIterator iterates over
a chunk boundary, it will see the last raw value of the
earlier chunk, then the first raw value of the later chunk,
and then the first aggregated value of the later chunk. The
first raw value will always be less than or equal to the
first aggregated value, so the only difference in
counterSeriesIterator's output will be the possible detection
of a reset and an extra sample after the chunk boundary.

Fixes: thanos-io#1568

Signed-off-by: Alfred Landrum <[email protected]>

* changelog for thanos-io#1709

Signed-off-by: Alfred Landrum <[email protected]>

* adjust existing downsampling tests

Signed-off-by: Alfred Landrum <[email protected]>

* add counter aggregation comments to CounterSeriesIterator

Signed-off-by: Alfred Landrum <[email protected]>
Signed-off-by: Aleksey Sin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants