query: invalid rate/irate with deduplication for most recent values #2890
Comments
Thanks for reporting. We thought we had found all those kinds of issues, but there might be more. The last one was supposed to be fixed in 0.13.0. Can you double-check that you are running Thanos with the fix from #2401 included? (:
Note that it's the Querier version that matters for deduplication.
Here's the Querier version report:
I can try 0.14.0 if it contains any relevant fixes.
Please update, but I think it should have the fix. If 0.14 won't work, then what would be awesome is to have the exact chunks for that problematic period. You can obtain those by running the following script against the Querier gRPC API directly: https://github.com/thanos-io/thanos/blob/40526f52f54d4501737e5246c0e71e56dd7e0b2d/scripts/insecure_grpcurl_series.sh (: This will give us the exact input that the deduplication logic is using. I think it has to do with some edge values.. 🤔
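If it helps, that script is roughly equivalent to the following Go sketch against the StoreAPI (a minimal sketch only; the address, time range and matcher below are placeholders, and storepb field names may differ slightly between Thanos versions):

```go
package main

import (
	"context"
	"fmt"
	"io"
	"log"

	"github.com/thanos-io/thanos/pkg/store/storepb"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Placeholder Querier gRPC address; use your own endpoint.
	conn, err := grpc.Dial("localhost:10901",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := storepb.NewStoreClient(conn)
	// Placeholder time range (milliseconds) and matcher for the problematic period/series.
	stream, err := client.Series(context.Background(), &storepb.SeriesRequest{
		MinTime:  1594800000000,
		MaxTime:  1594803600000,
		Matchers: []storepb.LabelMatcher{{Type: storepb.LabelMatcher_EQ, Name: "__name__", Value: "up"}},
	})
	if err != nil {
		log.Fatal(err)
	}
	for {
		resp, err := stream.Recv()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		// Each response carries one raw series (labels + chunks) or a warning,
		// i.e. exactly the input the deduplication logic sees.
		fmt.Println(resp.String())
	}
}
```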
I was about to report this same issue! What is happening in my system, at least, is that the initialPenalty is 5000ms by default, but I have scrape intervals in the 15s-30s range. The two HA Prometheus instances will commonly scrape the same target ~10s apart, which means the deduplication logic will always switch series after the first sample, causing extra data points. Here is a failing test that reproduces the behavior I see: master...csmarchbanks:unstable-queries I pushed out a custom version of Thanos that uses a 30s initial penalty and the problem has gone away for me. However, if someone had a 1m scrape interval, a 30s initial penalty still would not be enough, and it would be way too big for someone with a 5s scrape interval.
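To make the failure mode concrete, here is a deliberately simplified model of penalty-based deduplication (a sketch only, not the actual Thanos dedup code): with a 5s initial penalty and replicas scraping ~10s apart, the merger flips replicas after the first sample and emits extra points.

```go
package main

import "fmt"

type sample struct {
	t int64   // timestamp in milliseconds
	v float64 // value
}

// initialPenalty mirrors the 5000ms default discussed above (illustrative only).
const initialPenalty = int64(5000)

// dedup follows one replica until the gap to its next sample exceeds the
// current penalty, then switches to the other replica and doubles the penalty.
func dedup(a, b []sample) []sample {
	var out []sample
	penalty := initialPenalty
	i, j := 0, 0
	useA := true
	lastT := int64(-1)
	for i < len(a) || j < len(b) {
		cur, other := &i, &j
		curSeries, otherSeries := a, b
		if !useA {
			cur, other = &j, &i
			curSeries, otherSeries = b, a
		}
		switch {
		case *cur >= len(curSeries):
			// Followed replica is exhausted: switch to the other one.
			useA = !useA
		case lastT >= 0 && curSeries[*cur].t-lastT > penalty && *other < len(otherSeries):
			// Gap on the followed replica exceeds the penalty: switch replicas.
			useA = !useA
			penalty *= 2
		default:
			if curSeries[*cur].t > lastT {
				out = append(out, curSeries[*cur])
				lastT = curSeries[*cur].t
			}
			*cur++
		}
	}
	return out
}

func main() {
	// Two HA replicas scraping the same counter ~10s apart with a 20s interval.
	r1 := []sample{{0, 1}, {20000, 2}, {40000, 3}}
	r2 := []sample{{10000, 1}, {30000, 2}, {50000, 3}}
	// With a 5s initial penalty the merged output interleaves both replicas and
	// contains more points than either input, which is what breaks rate()/irate()
	// on the most recent values.
	fmt.Println(dedup(r1, r2))
}
```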
We have a similar setup: two Prometheus instances scraping the same targets with a 10s interval.
Awesome. Looks like we might want to adjust the penalty based on the actual scrape interval. The edge case is when the interval is reconfigured to something different (:
I opened a PR that uses the request resolution to handle cases like this one. I am still working on the tests, but so far it is looking good.
Update: resolution-based dedup doesn't work with PromQL functions, so I reverted the PR. In our case, adjusting the default lookback delta solved the problem, so it seems that the problem here is different and is caused by scrape time shifting. @csmarchbanks in your case I think the main problem is that scraping between the different replicas is shifted by 30s. That seems like a lot; if you manage to align them better, the problem should be solved.
I think I'm running into a similar issue on Thanos 0.14.0. The PromQL is:
Thanos with deduplication unchecked:
I have 2 replica Prometheus pollers scraping with a 60s interval. I tried the suggestion here to increase "initialPenalty" for a few tests, but even setting "initialPenalty = 60,000" didn't help: #2890 (comment)
It's possible they are out of sync - what's the best way to get them back in sync? Restart both pollers simultaneously and pray?
Try adjusting the look-back delta - this should solve the problem.
Thanks - looks like I need to upgrade to 0.15.0+ to be able to use this new flag? What's a good value to pick with a scrape interval of 60s? Is the 5-minute default not big enough?
Looks like upgrading to 0.16.0-rc0 and modifying
Hm, strange. Then I am out of ideas, sorry.
Hello 👋 Looks like there was no activity on this issue for the last two months.
Still to be investigated, to ensure a solid deduplication algorithm (:
Kind Regards,
Bartek Płotka (@bwplotka)
Hello 👋 Looks like there was no activity on this issue for the last two months.
Still to investigate / try to repro.
Still valid and needs investigation.
Hello 👋 Looks like there was no activity on this issue for the last two months.
Still valid.
Hello 👋 Looks like there was no activity on this issue for the last two months.
Closing for now as promised, let us know if you need this to be reopened! 🤗
I think this is still valid, not stale.
Just hit and discovered this issue with Thanos 0.32.4.
@bwplotka this issue has been closed by the bot while the bug is still valid. Can we have feedback and reopen the issue?
Could someone help by uploading two blocks and then sharing what query to execute, and at which timestamp, to reproduce this?
This may be helpful if someone encounters the same behaviour, and/or for deeper diagnostics of this issue. I had two metric sources which provided the same time series but with different values. Due to a misconfiguration, these sources had different replica labels set, so the Querier tried to dedupe those time series. That still does not explain why it occurs only for the latest datapoint in any series, though. After removing the replica label from those sources, the issue with incorrect rate() results was gone.
@GiedriusS
Thanos version: v0.28.0. thanos tools bucket inspect shows the labels are all the same. Using thanos-kit (https://github.com/sepich/thanos-kit):
01J9N7Z69HWBMSY239G1KEE20A (prometheus-1, raw)
01J9N7Z64ZBBNF1DSZRPHGFTAC (prometheus-0, raw)
01J9N9TRD19XPK90JFQXGBTGS9 (compact)
Thanos, Prometheus and Golang version used:
bitnami/thanos:0.13.0
prom/prometheus:v2.19.0
Prometheus in HA configuration, 2 instances. Single Thanos querier instance.
Object Storage Provider:
On-premises MinIO deployment
What happened:
Executing rate or irate with deduplication sometimes results in the most recent value being invalid: it either shoots up to a very high value or drops to a very low one. Turning off deduplication produces correct results from both Prometheus instances.
With deduplication, good result:
With deduplication, bad result - executing the same query eventually gives an incorrect result like this:
The same query without deduplication always produces a correct result:
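For intuition, here is a toy numeric illustration (hypothetical values, not taken from the graphs above) of why a single extra, near-duplicate most-recent sample is enough to skew irate, which only looks at the last two points of a series:

```go
package main

import "fmt"

// irate-style calculation: PromQL's irate() uses only the last two samples
// of each series in the range: (v2 - v1) / (t2 - t1), with time in seconds.
func irate(t1, v1, t2, v2 float64) float64 {
	return (v2 - v1) / (t2 - t1)
}

func main() {
	// Clean series from a single replica: one sample per 60s, counter growing by ~6/min.
	fmt.Println(irate(0, 100, 60, 106)) // ≈ 0.1 per second - the expected result

	// After a bad dedup switch, the newest point can be an extra sample from the
	// other replica only ~2s later with a slightly different raw value.
	fmt.Println(irate(60, 106, 62, 106.6)) // ≈ 0.3 per second - spikes high

	// Or the extra point can carry an almost identical value, collapsing the rate.
	fmt.Println(irate(60, 106, 62, 106.01)) // ≈ 0.005 per second - drops low
}
```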
What you expected to happen:
Queries with deduplication always giving a correct result, like this:
How to reproduce it (as minimally and precisely as possible):
Prometheus in HA, any rate or irate query.
Full logs to relevant components:
Nothing that would correlate with queries.