
thanos-query: deduplication picks up time-series with missing data #981

Open
Hashfyre opened this issue Mar 27, 2019 · 30 comments
Labels
bug, component: query, difficulty: hard, dont-go-stale, help wanted, priority: P0

Comments

@Hashfyre

Hashfyre commented Mar 27, 2019

Thanos, Prometheus and Golang version used
thanos: v0.3.1
prometheus: v2.5.0
kubernetes: v1.12.6
Kubernetes Distro: KOPS
weave: weaveworks/weave-kube:2.5.0
Cloud Platform: AWS
EC2 Instance Type: R5.4XL

Architecture


      G1                               G2
      |                                |
      |                                |
      TQ1                              TQ2
      |                                |
 --------------                        |
 |------------|-------------------------                 
 |            |                        |
TSC1        TSC2                       TS
 |            |
P1           P2

G1: Grafana realtime
G2: Grafana Historical
TQ1: Thanos Query realtime (15d retention)
TQ2: Thanos Query historical
TSC: Thanos Sidecars
TS: Thanos store

Each sidecar and the store are fronted by a service with a *.svc.cluster.local DNS name, which the --store flag points to.

G2 and TQ2 are not involved in this RCA.

What happened
Event Timeline:

  • Due to some weave-net issues on our monitoring instance group, one of the Prometheus replicas, P1, stops scraping some targets.

[Screenshot: 2019-03-25 7:17 PM]

  • We see the following metric gap in Grafana (G1)
    [Screenshot: 2019-03-26 7:24 PM]
    This particular metric was being scraped from cloudwatch-exporter

  • We investigate thanos-query and see the following deduplication behavior:

[Screenshot: 2019-03-24 6:37:17 PM]

[Screenshot: 2019-03-24 6:37:26 PM]

  • We can see that instead of having two series per metric we have only one; however, thanos-query seems to produce contiguous data with dedup=true, which is enabled by default.

  • Later on we migrate the data of the bad Prometheus pod to a new volume and make P2 live.

  • We see the following data in thanos-query with dedup=false

[Screenshot: 2019-03-26 10:48:41 PM]

[Screenshot: 2019-03-26 10:48:47 PM]

We can clearly see that one Prometheus has data and the other is missing it.

  • However, when we query with dedup=true, the merged set displays missing data instead of the expected contiguous data.

[Screenshot: 2019-03-26 10:48:26 PM]

What you expected to happen

We expected Thanos deduplication to trust the series that has contiguous data over the one with missing data and to produce a series with contiguous data. Missing a scrape in an HA Prometheus environment is expected at times; if one of the Prometheus replicas has data, the final output should not show missing data.

How to reproduce it (as minimally and precisely as possible):

  • Have a setup as in the examples of the repo in Kubernetes, or as described in the architecture above.
  • Block the network on one of the Prometheus replicas so that it misses scrapes and hence has gaps in its data.
  • Make the blocked Prometheus available again after a significant deltaT.
  • Use thanos-query to deduplicate the dataset and compare results (a rough comparison sketch follows below).
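
For the last step, one way to compare the two result sets side by side is to issue the same range query against the Querier's Prometheus-compatible HTTP API with the dedup parameter toggled (the same switch as the dedup checkbox in the UI). The sketch below is illustrative only: the querier address, metric name, and time range are placeholders, not values taken from this report.

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// queryRange runs the same range query against a Thanos Querier with
// deduplication toggled via the `dedup` parameter of /api/v1/query_range.
func queryRange(base string, dedup bool) (string, error) {
	params := url.Values{}
	params.Set("query", "up")                   // placeholder metric
	params.Set("start", "2019-03-26T00:00:00Z") // placeholder time range
	params.Set("end", "2019-03-26T12:00:00Z")
	params.Set("step", "30s")
	params.Set("dedup", fmt.Sprintf("%t", dedup))

	resp, err := http.Get(base + "/api/v1/query_range?" + params.Encode())
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	// Placeholder in-cluster address of the Thanos Querier.
	base := "http://thanos-query.monitoring.svc.cluster.local:9090"
	for _, dedup := range []bool{false, true} {
		body, err := queryRange(base, dedup)
		if err != nil {
			fmt.Println("query failed:", err)
			continue
		}
		// Compare the two JSON result sets (e.g. series and sample counts) to
		// spot where the deduplicated view drops data that one replica has.
		fmt.Printf("dedup=%t: %d bytes returned\n", dedup, len(body))
	}
}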

Environment:
Underlying K8S Worker Node:

  • OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="16.04.4 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.4 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
@bwplotka
Member

bwplotka commented Mar 28, 2019

Hi, thanks for the report 👋

Yea, I think this is essentially some edge case for our penalty algorithm. The code is here: https://github.com/improbable-eng/thanos/blob/master/pkg/query/iter.go#L416

The problem is that this case is pretty rare (e.g. we cannot repro it). I would say adding more unit tests would be nice and would help to narrow down what's wrong. Help wanted (:
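
For context, here is a minimal, self-contained Go sketch of the penalty idea. It is not the actual Thanos iterator: the names, the fixed penalty value, and the sample data are illustrative assumptions (the real iterator derives an adaptive penalty from the observed scrape interval). It shows the intended behavior of switching to the healthy replica once the current one falls behind; the report above describes an edge case where the real iterator fails to do this and carries the gap through.

package main

import "fmt"

type sample struct {
	t int64   // timestamp in milliseconds
	v float64 // value
}

// dedup merges two replica series. It stays "locked" onto one replica and only
// switches to the other when the current replica's next sample is further than
// `penalty` ms behind, e.g. because it missed scrapes. This is a simplified,
// fixed-penalty version of the idea, not the Thanos implementation.
func dedup(a, b []sample, penalty int64) []sample {
	var out []sample
	ai, bi := 0, 0
	useA := true
	lastT := int64(-1 << 62) // effectively minus infinity

	for {
		// Drop samples at or before the last emitted timestamp on both replicas.
		for ai < len(a) && a[ai].t <= lastT {
			ai++
		}
		for bi < len(b) && b[bi].t <= lastT {
			bi++
		}
		switch {
		case ai >= len(a) && bi >= len(b):
			return out
		case ai >= len(a):
			useA = false
		case bi >= len(b):
			useA = true
		case useA && a[ai].t > b[bi].t+penalty:
			useA = false // replica A has a gap larger than the penalty: switch to B
		case !useA && b[bi].t > a[ai].t+penalty:
			useA = true // replica B has a gap larger than the penalty: switch to A
		}
		if useA {
			out = append(out, a[ai])
			lastT = a[ai].t
			ai++
		} else {
			out = append(out, b[bi])
			lastT = b[bi].t
			bi++
		}
	}
}

func main() {
	// Replica A missed every scrape between t=30s and t=120s; replica B is complete.
	a := []sample{{0, 1}, {15000, 1}, {30000, 1}, {120000, 1}}
	b := []sample{
		{0, 1}, {15000, 1}, {30000, 1}, {45000, 1}, {60000, 1},
		{75000, 1}, {90000, 1}, {105000, 1}, {120000, 1},
	}
	// With a 30s penalty the merged series switches to B across A's gap,
	// so the expected output has no missing samples.
	fmt.Println(dedup(a, b, 30000))
}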

@MacroPower
Contributor

I am having this same issue. I can actually reproduce it by having a couple of Prometheus instances scraping the same target, then just rebooting (recreating the pod, in my case) a single node. It will miss one or two scrapes. You'll then start to see gaps in the data if Thanos happens to query the node that was rebooted.

@stale

stale bot commented Feb 19, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

@stale stale bot added the stale label Feb 19, 2020
@brancz
Member

brancz commented Feb 19, 2020

@bwplotka if we had a data dump of one of these, we should be able to extract the time series with the raw data that cause this, no? In that case, if someone would share a data dump like that, it would help us a lot. If you feel it's confidential data, I think we'd also be open to accepting the data privately and extracting the time series ourselves. That is, if you trust us of course :)

@stale stale bot removed the stale label Feb 19, 2020
@bwplotka
Member

Yes! We care about the samples only as well, so you can mask the series if you want for privacy reasons! 👍 (:

@stale

stale bot commented Apr 19, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

@bwplotka
Member

bwplotka commented May 19, 2020

Looks like this is the last remaining deduplication characteristic we could improve. I would not call it a bug necessarily; it is just not responsive enough by design. I plan to adjust it in the near future.

It also looks like this is the last remaining bug blocking offline compaction from working!

@sepich
Contributor

sepich commented Jun 9, 2020

We have the same issue with v0.13.0-rc.1:

Here the target has been unavailable from 4:30 to 7:00, and this gap is OK. But we also see gaps from 10:00 until now.
But the data actually exists; here I'm changing the zoom from 12h to 6h:
[Screenshot]
And then back to the 12h zoom, but this time with deduplication turned off (it is --query.replica-label=replica on the querier side):

I've tried changing different query params (like resolution, partial response, etc.), but only deduplication combined with a time range that includes the initial gap leads to such a result.
So it seems that having a stale metric in the time range leads to gaps on each replica-label change.
Here is the same 6h window, moved to the time of the initial gap:

And you can see the gap after 10:00 appears in the 6h window too.

@stale

stale bot commented Jul 9, 2020

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jul 9, 2020
@stale

stale bot commented Jul 16, 2020

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Jul 16, 2020
@omron93

omron93 commented May 28, 2021

Looks like this is the last remaining deduplication characteristic we could improve. I would not call it a bug necessarily; it is just not responsive enough by design. I plan to adjust it in the near future.

@bwplotka Was this already done? Or is there a config change to work around this issue? We see the same issue with Thanos 0.18.0.

@omron93

omron93 commented Jun 7, 2021

@onprem @kakkoyun Is there a way to reopen this issue, or would it be better to create a new one?

@kakkoyun
Member

kakkoyun commented Jun 7, 2021

Hello 👋 Could you please try out a newer version of Thanos to see if it's still valid? Of course we can reopen this issue.

@omron93

omron93 commented Jun 7, 2021

@kakkoyun I've installed 0.21.1 and we're still seeing the same behaviour.

@kakkoyun kakkoyun reopened this Jun 7, 2021
@stale stale bot removed the stale label Jun 7, 2021
@malejpavouk

malejpavouk commented Jun 24, 2021

We see the same behavior. It seems like only one instance (we have 2 Prometheus instances scraping the same targets) is taken into account and the other one is completely ignored (so dedup(A, B) == A)

thanos:v0.21.1

@stale

stale bot commented Aug 25, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Aug 25, 2021
@malejpavouk

/notstale

@stale stale bot removed the stale label Aug 25, 2021
@stale

stale bot commented Oct 30, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Oct 30, 2021
@omron93

omron93 commented Nov 4, 2021

/notstale

@stale stale bot removed the stale label Nov 4, 2021
@stale

stale bot commented Jan 9, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@jmichalek132
Contributor

still relevant

@stale stale bot removed the stale label Feb 9, 2022
@stale

stale bot commented Apr 16, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Apr 16, 2022
@omron93

omron93 commented Apr 19, 2022

Still relevant

@stale stale bot removed the stale label Apr 19, 2022
@aarontams
Contributor

Adding myself here to watch this issue.

@clalos2592

Adding myself here too.

@jamessewell

We are seeing this issue as well. Dedup ignores a series which has no breaks in favour of one which does.

@matej-g matej-g added the dont-go-stale label Oct 24, 2022
@Antiarchitect
Contributor

Antiarchitect commented Dec 9, 2022

Seems like we've faced this too on 0.29.0. Thanos Query has multiple sources and selects a Prometheus sidecar with data gaps in recent data.
It's very strange that this issue has persisted for so long.

@caoimheharvey

Seems like we've faced this too on 0.29.0. Thanos Query has multiple sources and selects a Prometheus sidecar with data gaps in recent data. It's very strange that this issue has persisted for so long.

I've also had this issue on the same version. I have been able to verify that all of the metrics are being received correctly, so the issue appears to occur when the data is queried.

@saikatg3

saikatg3 commented Feb 9, 2024

Facing a similar issue with missing metrics in v0.32.3. The metrics are being remotely written from two Prometheus replica instances, each with unique external replica labels, into the Receiver. The Receiver itself runs multiple replicas for a high-availability setup. However, with deduplication enabled in Thanos Query, metrics are intermittently missing in Grafana.

[Screenshot: 2024-02-09 12:16 PM]

@mdraijer

mdraijer commented Dec 2, 2024

Is there any progress on this issue?

We also see these gaps in the displayed metrics: we know that the metrics are stored, but sometimes they are not completely shown in Grafana or Thanos Query-frontend. There are gaps of several minutes, hours, or even 1-2 days. Sometimes a restart of a component, e.g. the Store, solves the problem. Sometimes changing the period (zooming in or out) closes the gap. Sometimes changing the resolution solves it.

There is no definite solution to fill the gaps; sometimes all we can do is wait (a couple of days) and then the metrics appear again.

Our stack is:

  • 2 Prometheus replicas (HA) in each Kubernetes cluster;
  • Remote_write from Prometheus to Thanos Receiver;
  • Receiver receiving from 7 clusters, that is 14 Prometheuses, in a ring of 8 instances;
  • Receiver writing to MinIO object storage; retention in Receiver 1d;
  • 1 Store pod for reading in MinIO;
  • Currently data in MinIO is aggregated to ~9 months.
