query: invalid rate/irate with deduplication for most recent values #2890
Comments
Thanks for reporting. We thought we had found all those kinds of issues, but there might be more. The last one was supposed to be fixed in 0.13.0. Can you double-check that you are running Thanos with the fix from #2401 included? (:
Note that it's the Querier version that matters for deduplication.
Here's the Querier version report:
I can try 0.14.0 if it contains any relevant fixes.
Please update, but I think it should have the fix. If 0.14 won't work, then what would be awesome is to have the exact chunks for that problematic period. You can obtain those by running the following script against the Querier gRPC API directly: https://github.com/thanos-io/thanos/blob/40526f52f54d4501737e5246c0e71e56dd7e0b2d/scripts/insecure_grpcurl_series.sh (: This will give us the exact input that the deduplication logic is using. I think it has to do with some edge values.. 🤔
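If it helps, that script is roughly equivalent to the following Go sketch against the StoreAPI (a minimal sketch only; the address, time range and matcher below are placeholders, and storepb field names may differ slightly between Thanos versions):

```go
package main

import (
	"context"
	"fmt"
	"io"
	"log"

	"github.com/thanos-io/thanos/pkg/store/storepb"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Placeholder Querier gRPC address; use your own endpoint.
	conn, err := grpc.Dial("localhost:10901",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := storepb.NewStoreClient(conn)
	// Placeholder time range (milliseconds) and matcher for the problematic period/series.
	stream, err := client.Series(context.Background(), &storepb.SeriesRequest{
		MinTime:  1594800000000,
		MaxTime:  1594803600000,
		Matchers: []storepb.LabelMatcher{{Type: storepb.LabelMatcher_EQ, Name: "__name__", Value: "up"}},
	})
	if err != nil {
		log.Fatal(err)
	}
	for {
		resp, err := stream.Recv()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		// Each response carries one raw series (labels + chunks) or a warning,
		// i.e. exactly the input the deduplication logic sees.
		fmt.Println(resp.String())
	}
}
```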
I was about to report this same issue! What is happening in my system, at least, is that the initialPenalty is 5000ms by default, but I have scrape intervals in the 15s-30s range. The two HA Prometheus instances will commonly scrape the same target ~10s apart, which means the deduplication logic will always switch series after the first sample, causing extra data points. Here is a failing test that reproduces the behavior I see: master...csmarchbanks:unstable-queries I pushed out a custom version of Thanos that uses a 30s initial penalty and the problem has gone away for me. However, if someone had a 1m scrape interval, a 30s initial penalty still would not be enough, and it would be way too big for someone with a 5s scrape interval.
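To make the failure mode concrete, here is a deliberately simplified model of penalty-based deduplication (a sketch only, not the actual Thanos dedup code): with a 5s initial penalty and replicas scraping ~10s apart, the merger flips replicas after the first sample and emits extra points.

```go
package main

import "fmt"

type sample struct {
	t int64   // timestamp in milliseconds
	v float64 // value
}

// initialPenalty mirrors the 5000ms default discussed above (illustrative only).
const initialPenalty = int64(5000)

// dedup follows one replica until the gap to its next sample exceeds the
// current penalty, then switches to the other replica and doubles the penalty.
func dedup(a, b []sample) []sample {
	var out []sample
	penalty := initialPenalty
	i, j := 0, 0
	useA := true
	lastT := int64(-1)
	for i < len(a) || j < len(b) {
		cur, other := &i, &j
		curSeries, otherSeries := a, b
		if !useA {
			cur, other = &j, &i
			curSeries, otherSeries = b, a
		}
		switch {
		case *cur >= len(curSeries):
			// Followed replica is exhausted: switch to the other one.
			useA = !useA
		case lastT >= 0 && curSeries[*cur].t-lastT > penalty && *other < len(otherSeries):
			// Gap on the followed replica exceeds the penalty: switch replicas.
			useA = !useA
			penalty *= 2
		default:
			if curSeries[*cur].t > lastT {
				out = append(out, curSeries[*cur])
				lastT = curSeries[*cur].t
			}
			*cur++
		}
	}
	return out
}

func main() {
	// Two HA replicas scraping the same counter ~10s apart with a 20s interval.
	r1 := []sample{{0, 1}, {20000, 2}, {40000, 3}}
	r2 := []sample{{10000, 1}, {30000, 2}, {50000, 3}}
	// With a 5s initial penalty the merged output interleaves both replicas and
	// contains more points than either input, which is what breaks rate()/irate()
	// on the most recent values.
	fmt.Println(dedup(r1, r2))
}
```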
We have a similar setup: two Prometheus instances scraping the same targets with a 10s interval.
Awesome. Looks like we might want to adjust the penalty based on the actual scrape interval. The edge case is when the interval is reconfigured to something different (:
I opened a PR that uses the request resolution to handle cases like this one. I am still working on the tests, but so far it is looking good.
Update: resolution-based dedup doesn't work with PromQL functions, so I reverted the PR. In our case, adjusting the default lookback delta solved the problem, so it seems that the problem here is different and is caused by scrape time shifting. @csmarchbanks in your case I think the main problem is that scraping between the different replicas is shifted by 30s. That seems like a lot; if you manage to align them better, the problem should be solved.
I think I'm running into a similar issue on Thanos 0.14.0. The PromQL is:
Thanos with deduplication unchecked:
I have 2 replica Prometheus pollers scraping with a 60s interval. I tried the suggestion here to increase "initialPenalty" for a few tests, but even setting "initialPenalty = 60,000" didn't help: #2890 (comment)
It's possible they are out of sync - what's the best way to get them back in sync? Restart both pollers simultaneously and pray?
Try adjusting the look-back delta - this should solve the problem.
Thanks - looks like I need to upgrade to 0.15.0+ to be able to use this new flag? What's a good value to pick with a scrape interval of 60s? Is the 5-minute default not big enough?
Looks like upgrading to 0.16.0-rc0 and modifying
Hm, strange. Then I am out of ideas, sorry.
Hello 👋 Looks like there was no activity on this issue for the last two months.
Still to be investigated, to ensure a solid deduplication algorithm (:
Kind Regards,
Bartek Płotka (@bwplotka)
Hello 👋 Looks like there was no activity on this issue for the last two months.
Still to investigate / try to repro.
Still valid and needs investigation.
Hello 👋 Looks like there was no activity on this issue for the last two months.
Still valid.
Hello 👋 Looks like there was no activity on this issue for the last two months.
Closing for now as promised, let us know if you need this to be reopened! 🤗
I think this is still valid, not stale.
Just hit and discovered this issue with Thanos 0.32.4.
@bwplotka this issue has been closed by the bot while the bug is still valid. Can we have feedback and reopen the issue?
Could someone help by uploading two blocks and then sharing what query to execute, and at which timestamp, to reproduce this?
This may be helpful if someone encounters the same behaviour, and/or for deeper diagnostics of this issue. I had two metric sources which provided the same time series but with different values. Due to a misconfiguration, these sources had different replica labels set, so the Querier tried to dedupe those time series. That still does not explain why it occurs only for the latest datapoint in any series, though. After removing the replica label from those sources, the issue with incorrect rate() results was gone.
@GiedriusS
Thanos version: v0.28.0. thanos tools bucket inspect shows the labels are all the same. Using thanos-kit (https://github.com/sepich/thanos-kit):
01J9N7Z69HWBMSY239G1KEE20A (prometheus-1, raw)
01J9N7Z64ZBBNF1DSZRPHGFTAC (prometheus-0, raw)
01J9N9TRD19XPK90JFQXGBTGS9 (compact)
Thanos, Prometheus and Golang version used:
bitnami/thanos:0.13.0
prom/prometheus:v2.19.0
Prometheus in HA configuration, 2 instances. Single Thanos querier instance.
Object Storage Provider:
On-premises MinIO deployment
What happened:
Executing rate or irate with deduplication sometimes results in the most recent value being invalid: it either shoots up to a very high value or drops to a very low one. Turning off deduplication produces correct results from both Prometheus instances.
With deduplication, good result:
With deduplication, bad result - executing the same query eventually gives an incorrect result like this:
The same query without deduplication always produces a correct result:
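For intuition, here is a toy numeric illustration (hypothetical values, not taken from the graphs above) of why a single extra, near-duplicate most-recent sample is enough to skew irate, which only looks at the last two points of a series:

```go
package main

import "fmt"

// irate-style calculation: PromQL's irate() uses only the last two samples
// of each series in the range: (v2 - v1) / (t2 - t1), with time in seconds.
func irate(t1, v1, t2, v2 float64) float64 {
	return (v2 - v1) / (t2 - t1)
}

func main() {
	// Clean series from a single replica: one sample per 60s, counter growing by ~6/min.
	fmt.Println(irate(0, 100, 60, 106)) // ≈ 0.1 per second - the expected result

	// After a bad dedup switch, the newest point can be an extra sample from the
	// other replica only ~2s later with a slightly different raw value.
	fmt.Println(irate(60, 106, 62, 106.6)) // ≈ 0.3 per second - spikes high

	// Or the extra point can carry an almost identical value, collapsing the rate.
	fmt.Println(irate(60, 106, 62, 106.01)) // ≈ 0.005 per second - drops low
}
```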
What you expected to happen:
Queries with deduplication always giving a correct result, like this:
How to reproduce it (as minimally and precisely as possible):
Prometheus in HA, any rate or irate query.
Full logs to relevant components:
Nothing that would correlate with queries.