How Can Cortex Handle Data Gaps in Network Partition Scenarios in an HA Cluster of Prometheus Instances? #5633

aleskxyz · 2023-11-06T09:08:29Z

Hi,

Cortex elects one replica from an HA Prometheus cluster and accepts samples only from it. Imagine a network partition in which each Prometheus can scrape only some of the instances. With the current Cortex design, only samples from the elected Prometheus are written to long-term storage, and samples from the other Prometheus replicas are discarded. This leaves gaps for series that only the standby Prometheus was able to scrape.

Does Cortex have a solution for this, or can it handle this situation the way Thanos does, by deduplicating data at query time?

Thanks.

alanprot (Maintainer) · 2023-11-06T16:58:22Z

Why do you have samples that are scraped only by the standby Prometheus? I think the idea is to have two Prometheus instances scraping exactly the same metrics. We accept only one replica because the TSDB rejects duplicate samples (samples with the same timestamp for the same series but different values).
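To illustrate the duplicate-sample rule described above, here is a minimal sketch in Python. It is a toy model, not Prometheus or Cortex code: `ToySeriesStore` and `DuplicateSampleError` are hypothetical names, and the sketch assumes that re-sending an identical value at the same timestamp is harmless while a different value is rejected.

```python
class DuplicateSampleError(Exception):
    """Raised when a series already holds a different value at this timestamp."""


class ToySeriesStore:
    """Toy stand-in for a TSDB; hypothetical, for illustration only."""

    def __init__(self):
        # Maps (series, timestamp_ms) -> value.
        self._samples = {}

    def append(self, series, timestamp_ms, value):
        key = (series, timestamp_ms)
        existing = self._samples.get(key)
        if existing is not None and existing != value:
            # Same series, same timestamp, different value: reject.
            raise DuplicateSampleError(
                f"{series} at {timestamp_ms}: have {existing}, got {value}"
            )
        # First write, or an identical re-send (assumed to be a no-op).
        self._samples[key] = value
```

If two HA replicas both wrote to such a store, any scrape where they observed different values at the same timestamp would raise an error, which is why only one replica is accepted at a time.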

1 reply

aleskxyz Nov 6, 2023
Author

Thanks for your reply!
As I said above, we may see this inconsistency during a network partition.
Imagine we have two Prometheus instances in two different racks, both scraping all instances.
When the internal connection between the two racks is disrupted, the active Prometheus cannot scrape resources in the other rack, but that rack's local Prometheus keeps working.
Thanks
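The rack scenario above can be sketched as a small Python model. All names here (`prom-rack-a`, `app-b1`, `gap_targets`, and so on) are hypothetical and chosen only for illustration. The sketch assumes that during the partition each Prometheus reaches only the targets in its own rack, and that only the elected replica's samples are kept.

```python
def accepted_targets(elected, scraped_by):
    """Targets whose samples are kept when only `elected` is accepted."""
    return set(scraped_by[elected])


def gap_targets(elected, scraped_by):
    """Targets scraped by some replica but dropped because it is not elected."""
    all_scraped = set().union(*scraped_by.values())
    return all_scraped - accepted_targets(elected, scraped_by)


# During the partition, each replica can reach only its own rack (assumption).
scraped_by = {
    "prom-rack-a": {"app-a1", "app-a2"},
    "prom-rack-b": {"app-b1", "app-b2"},
}

# With prom-rack-a elected, the rack-b targets have no samples in storage.
missing = gap_targets("prom-rack-a", scraped_by)
```

Here `missing` contains the two rack-b targets: the standby scraped them successfully, but its samples were discarded.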

alanprot (Maintainer) · 2023-11-06T19:08:20Z

So the problem is the failover time? The default value is 15 seconds, and it is configurable: ha_tracker_update_timeout
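For readers unfamiliar with timeout-based failover, here is a simplified Python sketch of the idea. It is not the Cortex HA tracker implementation: `ToyHATracker` and its `timeout_s` parameter are hypothetical, a single timeout stands in for whatever settings the real tracker uses, and the 15-second default is simply the value quoted in this reply.

```python
class ToyHATracker:
    """Toy model of replica election with a failover timeout (illustrative only)."""

    def __init__(self, timeout_s=15.0):
        self.timeout_s = timeout_s
        self.elected = None    # name of the currently accepted replica
        self.last_seen = None  # time of the last sample from the elected replica

    def accept(self, replica, now_s):
        """Return True if a sample from `replica` at `now_s` should be kept."""
        if self.elected is None or replica == self.elected:
            # First replica seen, or the elected one: accept and refresh.
            self.elected = replica
            self.last_seen = now_s
            return True
        if now_s - self.last_seen > self.timeout_s:
            # Elected replica has been silent for too long: fail over.
            self.elected = replica
            self.last_seen = now_s
            return True
        # Elected replica is still alive: drop the standby's sample.
        return False
```

In this model, failover happens only when the elected replica stops sending samples. In the partition described in this thread the elected replica keeps sending data for the targets it can still reach, so under these assumptions the standby would never take over.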

0 replies
