-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
querier: bugfix - data gaps when switching iterators #3010
querier: bugfix - data gaps when switching iterators #3010
Conversation
83b1a92
to
2c6d15d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, can we make it buildable AND add test case you were reproing? Without that - hard to prove it works. (:
53ccdba
to
6c06810
Compare
8fb0fe0
to
db2eb42
Compare
6fa5a6a
to
b64893b
Compare
fd6fa68
to
9443274
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, this is critical path so let's take it carefully. I am not fan of changing all the tests and carefully adjusted test cases.
Please be aware that we planned to have this algo stabilized to ensure we can run offline deduplication safely. With this we just start from scratch as we changed a bit - plus we change unit tests. This is quite risky.
I checked all and I don't think step based dedup will work ):
As commented below:
Let's take following query: sum_over_time(my_metric[1h]) for 4d with step 1h. Let's say we have 15s scrape interval of this metric and 2 replicas with rollout in between. So let's say:
a replica samples (timestamps after start time): 0, 1, 2, 3,..... 59m58s, 59m59s, 1h, <rollout take 10m>, 1h20m, 1h20m1s, 1h20m2s...
b replica samples 1, 2, 3, 4, ..... every second always
Now our algo will choose replica a first and add step penalty (1h). At the last sample before rollout we will make penalty of B = step (1h), so next iteration we have:
1h20m form a, and 2h from b. This is a gap that could be totally avoided because b has that data. This will be highly visible on PromQL result despite the large step, as certain queries care about each samples - this has to be reliable ):
cc @brian-brazil if you have time.
Let's also do research on how others are doing it, but I believe the algo should be scrape interval-based - we need to assume it based on seen samples. I don't see any other way right now.
Curious if we could team up with promxy creator @jacksontj - we are in open source after all 🤗
We definitely work on something similar, he has some interesting approach to parallel queries on leafs: https://github.com/jacksontj/promxy/blob/master/pkg/proxystorage/proxy.go#L179 but cannot find any deduplication specifics.
As a matter of general principle, I agree. Due to races etc. you're going to need to look at 2-3 intervals minimum.
Indeed, and there will always have to be a gap left for safety when you switch replicas, otherwise you're going to have issues with counters. |
cc9dd30
to
5293d3d
Compare
Turns out the deduplication alg works as expected and the problem is that in our case the data was sent at an interval of 4.5min and the dedup alg uses 2xdelta so 9min which is more then the default lookBackDelta in promql(5min). Changed the PR to make the look back configurable. |
1a9c9af
to
15724ff
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM, just minor nit in terms of description, also let's fix changelog.
And please remove fixes
stuff - it does not fix any of the issues mentioned I think.
cmd/thanos/query.go
Outdated
@@ -70,6 +70,8 @@ func registerQuery(m map[string]setupFunc, app *kingpin.Application) { | |||
maxConcurrentQueries := cmd.Flag("query.max-concurrent", "Maximum number of queries processed concurrently by query node."). | |||
Default("20").Int() | |||
|
|||
lookbackDelta := cmd.Flag("query.lookback-delta", "The maximum lookback duration for retrieving metrics during expression evaluations.").Duration() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we describe it better? It's not maximum but actually lookback delta right? And also we should explain what it means for PromQL or link to docs 🤗
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some more details, let me know what you think.
Unfortunately, there are official Prom docs about this flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still not addressed comments ):
d89e0a5
to
128e71e
Compare
@krasi-georgiev We can include this in the next release candidate, so could you baed ut on |
9447fd5
to
9ce6b64
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One suggestion, otherwise LGTM! 💪
CHANGELOG.md
Outdated
@@ -16,6 +16,7 @@ We use *breaking :warning:* word for marking changes that are not backward compa | |||
* [#3095](https://github.com/thanos-io/thanos/pull/3095) Rule: update manager when all rule files are removed. | |||
* [#3098](https://github.com/thanos-io/thanos/pull/3098) ui: Fix Block Viewer for Compactor and Store | |||
* [#3105](https://github.com/thanos-io/thanos/pull/3105) Query: Fix overwriting maxSourceResolution when auto downsampling is enabled. | |||
* [#3010](https://github.com/thanos-io/thanos/pull/3010) Querier: Added a flag to override the default look back delta in promql. The flag should be set to at least 2 times the slowest scrape interval. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not forget to fix this on @kakkoyun PR (:
8cda4b8
to
40795d1
Compare
…om> Date: Thu Aug 27 14:10:19 2020 +0300 make the promql lookBack configurable with a flag Signed-off-by: Krasi Georgiev <[email protected]>
40795d1
to
365746a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🥇
…om> Date: Thu Aug 27 14:10:19 2020 +0300 (thanos-io#3010) Make the PromQL lookBack configurable with a flag Signed-off-by: Krasi Georgiev <[email protected]>
…om> Date: Thu Aug 27 14:10:19 2020 +0300 (thanos-io#3010) Make the PromQL lookBack configurable with a flag Signed-off-by: Krasi Georgiev <[email protected]> Signed-off-by: Kemal Akkoyun <[email protected]>
* Made sure old sse S3 option if specified, produces error. Signed-off-by: Bartlomiej Plotka <[email protected]> Signed-off-by: Kemal Akkoyun <[email protected]> * Cut v0.15.0-rc.0 Signed-off-by: Bartlomiej Plotka <[email protected]> Signed-off-by: Kemal Akkoyun <[email protected]> * tsdbstore: Optimized response framing if iterator finished. Signed-off-by: Bartlomiej Plotka <[email protected]> Signed-off-by: Kemal Akkoyun <[email protected]> * Apply suggestions from code review Co-authored-by: Krasimir Georgiev <[email protected]> Signed-off-by: Kemal Akkoyun <[email protected]> * Rule: update manager when all rule files are removed (#3095) This bug was already fixed in #2615 but it got lost when we merged #2200. Co-authored-by: johncming <[email protected]> Co-authored-by: Lili Cosic <[email protected]> Signed-off-by: Simon Pasquier <[email protected]> Co-authored-by: johncming <[email protected]> Co-authored-by: Lili Cosic <[email protected]> Signed-off-by: Kemal Akkoyun <[email protected]> * ui: Fix Block Viewer for Compactor and Store (#3098) * ui: Fix Block viewer for Compactor and Store Signed-off-by: Prem Kumar <[email protected]> * Add an entry in CHANGELOG.md Signed-off-by: Prem Kumar <[email protected]> * ui: react: Fixed tests for Blocks Signed-off-by: Prem Kumar <[email protected]> Signed-off-by: Kemal Akkoyun <[email protected]> * querier: Fix overwriting maxSourceResolution when auto downsampling is enabled (#3105) * Fix overwriting maxSourceResolution when auto downsampling is enabled Signed-off-by: Ben Ye <[email protected]> * add changelog Signed-off-by: Ben Ye <[email protected]> Signed-off-by: Kemal Akkoyun <[email protected]> * Author: Krasi Georgiev <[email protected]> Date: Thu Aug 27 14:10:19 2020 +0300 (#3010) Make the PromQL lookBack configurable with a flag Signed-off-by: Krasi Georgiev <[email protected]> Signed-off-by: Kemal Akkoyun <[email protected]> * Cut release v0.15.0-rc.1 (#3103) * Cut release v0.15.0-rc.1 Signed-off-by: Kemal Akkoyun <[email protected]> * Address review issues Signed-off-by: Kemal Akkoyun <[email protected]> * Fix changelog Signed-off-by: Kemal Akkoyun <[email protected]> * Cut release v0.15.0 (#3129) Signed-off-by: Kemal Akkoyun <[email protected]> * Minor fix Signed-off-by: Kemal Akkoyun <[email protected]> * Address review comments Signed-off-by: Kemal Akkoyun <[email protected]> Co-authored-by: Bartlomiej Plotka <[email protected]> Co-authored-by: Krasimir Georgiev <[email protected]> Co-authored-by: Simon Pasquier <[email protected]> Co-authored-by: johncming <[email protected]> Co-authored-by: Lili Cosic <[email protected]> Co-authored-by: Prem Kumar <[email protected]> Co-authored-by: Ben Ye <[email protected]>
Behaviour of this changed in thanos-io/thanos#3010 -- this now generates an error, as it probably should, but we sent back a series with no datapoints in it previously.
closes: #3001
Promql has a default 5m look back delta so if the penalty causes the
duration between the samples to be greater than 5m this will show as
gaps in the visualisation and when using the count function returns a
lower number then expected.
For example:
time is in mins
series 1
it.penB = 2 * (ta - it.lastT), 2 * (3 - 1) = 4
t4 is missing because the it.penB is 4 so it skips it.
When the time between
t3 0.3
andt9 0.9
is more then 5min(the default lookback) so when visualizing this it shows as gaps in the graph.
When the penalty is resolution based the seek for itB always advances NOmore then the requested time so doesn’t skip samples when switching
between iterators.
UPDATE: step based penalty wouldn't work when using promql functions. Adjusting the look back solves the problem so PR is updated to add it as a configurable flag.