thanos-query: deduplication picks up time-series with missing data #981
Comments
Hi, thanks for the report 👋 Yeah, I think this is essentially some edge case for our penalty algorithm. The code is here: https://github.com/improbable-eng/thanos/blob/master/pkg/query/iter.go#L416 The problem is that this case is pretty rare (e.g. we cannot repro it). I would say adding more unit tests would be nice and would help to narrow down what's wrong. Help wanted (:
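For readers who want to follow along, here is a rough, self-contained sketch of the penalty idea as I understand it (plain slices instead of storage iterators, illustrative names and constants; this is not the code behind the link above): emit whichever replica has the earlier sample, then skip the other replica forward past that timestamp plus a penalty of roughly twice the last observed sampling interval, so that two replicas scraping at slightly different offsets do not double the effective sample rate.

```go
package dedupsketch

import "math"

// sample is a (millisecond timestamp, value) pair; replica series are plain
// sorted slices here purely for the sake of the sketch.
type sample struct {
	t int64
	v float64
}

// dedupPenalty merges two replica series. It always emits the sample with
// the lower timestamp, then skips the other replica past the emitted
// timestamp plus a penalty of twice the last observed interval (5s before
// any interval is known). The penalty prevents a doubled sample frequency,
// but it also means samples from the other replica can be skipped even when
// the chosen replica later turns out to have a gap, which is the kind of
// edge case discussed in this issue.
func dedupPenalty(a, b []sample) []sample {
	const initialPenalty = int64(5000) // ms; assumed default for the sketch

	var out []sample
	lastT := int64(math.MinInt64)
	i, j := 0, 0

	// seekPast advances idx in s to the first sample strictly after t.
	seekPast := func(s []sample, idx int, t int64) int {
		for idx < len(s) && s[idx].t <= t {
			idx++
		}
		return idx
	}

	for i < len(a) || j < len(b) {
		var t int64
		var v float64

		switch {
		case j >= len(b): // only replica A left
			t, v = a[i].t, a[i].v
			i++
		case i >= len(a): // only replica B left
			t, v = b[j].t, b[j].v
			j++
		case a[i].t <= b[j].t: // pick A, penalise B
			t, v = a[i].t, a[i].v
			i++
			pen := initialPenalty
			if lastT != math.MinInt64 {
				pen = 2 * (t - lastT)
			}
			j = seekPast(b, j, t+pen)
		default: // pick B, penalise A
			t, v = b[j].t, b[j].v
			j++
			pen := initialPenalty
			if lastT != math.MinInt64 {
				pen = 2 * (t - lastT)
			}
			i = seekPast(a, i, t+pen)
		}

		out = append(out, sample{t: t, v: v})
		lastT = t
	}
	return out
}
```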
I am having this same issue. I can actually reproduce it by having a couple of Prometheus instances scraping the same target, then just rebooting (recreating the pod, in my case) a single node. It will miss one or two scrapes. You'll then start to see gaps in the data if Thanos happens to query the node that was rebooted.
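That reproduction translates fairly directly into the kind of unit test asked for above. The hypothetical test below exercises the simplified dedupPenalty sketch from the previous comment (not the real Thanos iterator) against two replicas where one missed two scrapes; with the penalty logic as sketched, the merged series ends up with a three-interval hole and the test fails, mirroring the gaps described here.

```go
package dedupsketch

import "testing"

// TestDedupFillsGapFromOtherReplica simulates two Prometheus replicas
// scraping the same target every 15s, where replica A (the rebooted node)
// missed two scrapes while replica B has them all. We expect the merged,
// deduplicated series to have no large gaps.
func TestDedupFillsGapFromOtherReplica(t *testing.T) {
	step := int64(15_000) // 15s scrape interval, in milliseconds

	mk := func(ts ...int64) []sample {
		out := make([]sample, 0, len(ts))
		for _, ts0 := range ts {
			out = append(out, sample{t: ts0, v: 1})
		}
		return out
	}

	// Replica A missed the scrapes at 4*step and 5*step.
	a := mk(0, step, 2*step, 3*step, 6*step, 7*step)
	// Replica B has every scrape.
	b := mk(0, step, 2*step, 3*step, 4*step, 5*step, 6*step, 7*step)

	got := dedupPenalty(a, b)

	// Allow up to two intervals of slack for replica offsets; anything
	// larger is a real gap that the other replica could have filled.
	for k := 1; k < len(got); k++ {
		if d := got[k].t - got[k-1].t; d > 2*step {
			t.Errorf("gap of %dms between t=%d and t=%d", d, got[k-1].t, got[k].t)
		}
	}
}
```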
This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.
@bwplotka if we had a data dump of one of these, we should be able to extract the time series with the raw data that cause this, no? In that case, if someone would share a data dump like that, it would help us a lot. If you feel it's confidential data, I think we'd also be open to accepting the data privately and extracting the time series ourselves. That is, if you trust us of course :)
Yes! We only care about the samples as well, so you can mask the series if you want for privacy reasons! 👍 (:
This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.
Looks like this is the last standing deduplication characteristic we could improve. I would not necessarily call it a bug; it is just not responsive enough by design. I plan to adjust it in the near future. Looks like this is also the only remaining bug blocking offline compaction from working!
Hello 👋 Looks like there was no activity on this issue for the last 30 days.
Closing for now as promised, let us know if you need this to be reopened! 🤗
@bwplotka Was this already done? Or is there a config change to work around this issue? We see the same issue with Thanos 0.18.0.
Hello 👋 Could you please try out a newer version of Thanos to see if this is still valid? Of course we could reopen this issue.
@kakkoyun I've installed 0.21.1 and we're still seeing the same behaviour.
We see the same behavior. It seems like only one instance (we have 2 Prometheus instances scraping the same targets) is taken into account and the other one is completely ignored (so dedup(A, B) == A). Thanos: v0.21.1.
Hello 👋 Looks like there was no activity on this issue for the last two months.
/notstale
Hello 👋 Looks like there was no activity on this issue for the last two months.
/notstale
Hello 👋 Looks like there was no activity on this issue for the last two months.
still relevant
Hello 👋 Looks like there was no activity on this issue for the last two months.
Still relevant
Adding myself here to watch this issue.
Adding myself here too.
We are seeing this issue as well. Dedup ignores a series which has no breaks in favour of one which does.
Seems like we faced this too on 0.29.0. Thanos Query has multiple sources and selects a Prometheus sidecar with data gaps in recent data.
I've also had this issue on the same version. I've been able to verify that all of the metrics are being received correctly, so the issue appears to be in how the data is queried.
Is there any progress on this issue? We also see gaps in the displayed metrics: we know that the metrics are stored, but sometimes they're not completely shown in Grafana or Thanos Query-frontend. There are gaps of several minutes, hours, or even 1-2 days. Sometimes a restart of a component, e.g. the Store, solves the problem. Sometimes changing the period (zooming in or out) closes the gap. Sometimes changing the resolution solves it. There is no definite way to fill the gaps; sometimes all there is to it is to wait (a couple of days) and then the metrics appear again. Our stack is:
Thanos, Prometheus and Golang version used
thanos: v0.3.1
prometheus: v2.5.0
kubernetes: v1.12.6
Kubernetes Distro: KOPS
weave: weaveworks/weave-kube:2.5.0
Cloud Platform: AWS
EC2 Instance Type: R5.4XL
Architecture
G1: Grafana realtime
G2: Grafana Historical
TQ1: Thanos Query realtime (15d retention)
TQ2: Thanos Query historical
TSC: Thanos Sidecars
TS: Thanos store
Each sidecar and the store is fronted by a service with *.svc.cluster.local DNS, to which the --store flag points. G2 and TQ2 are not involved in this RCA.
What happened
Event Timeline:
We see the following metric gap in Grafana (G1).
This particular metric was being scraped from cloudwatch-exporter.
We investigate thanos-query and see the following deduplication behavior:
We can see that instead of having two series per metric we have only one; however, thanos-query does not seem to produce contiguous data with dedup=true, which is enabled by default. Later on we migrate the data of the bad Prometheus pod to a new volume and make P2 live.
We see the following data in thanos-query with dedup=false:
We can clearly see that one Prometheus has data and the other is missing it. With dedup=true, the merged set displays missing data instead of the contiguous data we expected.
What you expected to happen
We expected Thanos deduplication to trust the series that has contiguous data over the one with the missing data, and to produce a series with contiguous data. Missing a scrape in an HA Prometheus environment is expected at times; if one of the Prometheus replicas has the data, the final output should not show missing data.
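To make that expectation concrete, the sketch below (again using the simplified sample type from the earlier sketch, and definitely not how Thanos implements deduplication, since the real iterator streams samples and deliberately penalises replica switches to avoid doubling the sample rate) simply takes the union of timestamps from both replicas, which is the gap-free output the report expects whenever at least one replica has the data.

```go
package dedupsketch

import "sort"

// naiveUnion merges two replica series by timestamp: for every timestamp at
// which at least one replica has a sample, the output has a sample too. It
// ignores the offset/duplicate problem the penalty algorithm solves; it only
// illustrates the desired property "no gap if either replica has the data".
func naiveUnion(a, b []sample) []sample {
	byT := make(map[int64]float64, len(a)+len(b))
	for _, s := range b {
		byT[s.t] = s.v
	}
	for _, s := range a {
		byT[s.t] = s.v // on exact-timestamp collisions, arbitrarily prefer replica A
	}

	ts := make([]int64, 0, len(byT))
	for ts0 := range byT {
		ts = append(ts, ts0)
	}
	sort.Slice(ts, func(i, j int) bool { return ts[i] < ts[j] })

	out := make([]sample, 0, len(ts))
	for _, ts0 := range ts {
		out = append(out, sample{t: ts0, v: byT[ts0]})
	}
	return out
}
```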
How to reproduce it (as minimally and precisely as possible):
Environment:
Underlying K8S Worker Node (uname -a): Linux ip-10-100-6-218 4.4.0-1054-aws #63-Ubuntu SMP Wed Mar 28 19:42:42 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux