compact: Penalty deduplication very slow during compaction #7759
Hey @mkrull! I'm here to help you with any bugs, questions, or contributions you have in mind. Let's work together to solve this issue.

The penalty deduplication process in Thanos can take an excessively long time during compaction if no replica label is specified. Specifying at least one replica label is a requirement for the penalty-based deduplication algorithm to function correctly; without it, the deduplication process becomes inefficient, leading to prolonged processing times. To resolve this issue, ensure that you specify at least one replica label via the `--deduplication.replica-label` flag, as described in the flag definitions:

```go
cmd.Flag("deduplication.func", "Experimental. Deduplication algorithm for merging overlapping blocks. "+
	"Possible values are: \"\", \"penalty\". If no value is specified, the default compact deduplication merger is used, which performs 1:1 deduplication for samples. "+
	"When set to penalty, penalty based deduplication algorithm will be used. At least one replica label has to be set via --deduplication.replica-label flag.").
	Default("").EnumVar(&cc.dedupFunc, compact.DedupAlgorithmPenalty, "")

cmd.Flag("deduplication.replica-label", "Label to treat as a replica indicator of blocks that can be deduplicated (repeated flag). This will merge multiple replica blocks into one. This process is irreversible."+
	"Experimental. When one or more labels are set, compactor will ignore the given labels so that vertical compaction can merge the blocks."+
	"Please note that by default this uses a NAIVE algorithm for merging which works well for deduplication of blocks with **precisely the same samples** like produced by Receiver replication."+
	"If you need a different deduplication algorithm (e.g one that works well with Prometheus replicas), please set it via --deduplication.func.").
	StringsVar(&cc.dedupReplicaLabels)
```

Additionally, there are further points to consider for optimizing the compaction process; addressing them can significantly improve the efficiency of the penalty deduplication process during compaction in Thanos.
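On the command line this corresponds to something like `--deduplication.func=penalty --deduplication.replica-label=replica`, where `replica` is just an example label name. As a minimal, standalone sketch (not the actual Thanos compact command wiring), the following shows how these two flags could be registered with kingpin v2 and how the "at least one replica label" rule from the help text could be checked; the program name and the placement of the validation are illustrative assumptions:

```go
// Minimal sketch, not the actual Thanos compact command wiring.
// It only illustrates the rule stated in the flag help text above:
// --deduplication.func=penalty requires at least one --deduplication.replica-label.
package main

import (
	"fmt"
	"os"

	kingpin "gopkg.in/alecthomas/kingpin.v2"
)

func main() {
	app := kingpin.New("compact-sketch", "Illustration of the dedup flags.")

	dedupFunc := app.Flag("deduplication.func",
		"Deduplication algorithm for merging overlapping blocks (\"\" or \"penalty\").").
		Default("").Enum("", "penalty")

	replicaLabels := app.Flag("deduplication.replica-label",
		"Label to treat as a replica indicator (repeated flag).").
		Strings()

	kingpin.MustParse(app.Parse(os.Args[1:]))

	// The help text states penalty dedup needs at least one replica label.
	if *dedupFunc == "penalty" && len(*replicaLabels) == 0 {
		fmt.Fprintln(os.Stderr, "penalty deduplication requires at least one --deduplication.replica-label")
		os.Exit(1)
	}

	fmt.Printf("dedup func=%q, replica labels=%v\n", *dedupFunc, *replicaLabels)
}
```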
To address the above:
The metric causing the slow deduplication has pretty high cardinality, with ~200,000 series in the 2h block. Those series are distributed across many chunks (~30-50) due to how the metric is exposed. Removing that metric brings the compaction time down to expected levels. While this is certainly not an ideal use of Prometheus, it is still a non-obvious impact from a user perspective, and it is tricky to deal with if the maintainers of a Prometheus + Thanos stack have limited control over the metrics being produced.
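For a rough order-of-magnitude check (a sketch with assumed numbers, not measurements): taking ~200,000 series, ~40 chunks per series, and up to ~120 samples per chunk (the TSDB target; actual chunks may hold fewer), a single replica of the 2h block holds on the order of 10^9 samples, so the hot-line hit counts mentioned in the profile notes below are several orders of magnitude more than one per sample:

```go
// Back-of-the-envelope estimate with assumed numbers (not measured):
// ~200,000 series in the 2h block, ~40 chunks per series (30-50 reported),
// up to ~120 samples per chunk (the TSDB target; real chunks may hold fewer).
package main

import "fmt"

func main() {
	const (
		series          = 200_000
		chunksPerSeries = 40  // reported range: 30-50
		samplesPerChunk = 120 // upper-bound assumption
	)
	samples := int64(series) * chunksPerSeries * samplesPerChunk
	fmt.Printf("<= ~%d samples per replica per 2h block\n", samples) // ~9.6e8
	// Hit counts in the hundreds of billions to trillions would therefore be
	// roughly 100-1000x the number of samples, i.e. far more than one
	// evaluation of the hot line per sample.
}
```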
Thanos, Prometheus and Golang version used:
Thanos 0.31.0 and 0.36.1 yielded similar results.
Object Storage Provider:
S3
What happened:
We have blocks collected from a specific service that hit either a bug or a worst-case scenario in penalty-based vertical compaction. The example below is an extreme case, but other similar blocks still take 2h-8h, with an occasional 15+h duration, on decent AWS hardware (m7i.8xl or similar). Other larger or similarly sized blocks in our infrastructure take a fraction of that time with the same settings.
The reported duration of

`duration=24h24m14.990136102s`

seems completely out of proportion for the two blocks involved, which are large but not outrageously large.
The profile suggests most of the time is spent at https://github.com/thanos-io/thanos/blob/v0.36.1/pkg/dedup/iter.go#L422. That is somewhat expected, but that line gets hit hundreds of billions to trillions of times.
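For intuition, here is a minimal, self-contained sketch of what penalty-based deduplication of two replica streams roughly looks like. This is a simplification and not the actual pkg/dedup/iter.go code (the real algorithm adapts the penalty to the observed scrape interval, handles more than two replicas, and works on chunk iterators), but it shows that the penalty adjustment runs for every sample of every overlapping series, which is why a few hundred thousand high-chunk-count series translate into such enormous hit counts on that line:

```go
// Simplified sketch of penalty-based deduplication of two replicas of one
// series. NOT the actual Thanos implementation; the fixed penalty window and
// the two-replica limitation are simplifying assumptions. It only illustrates
// that the dedup work happens per sample, for every overlapping series.
package main

import "fmt"

type sample struct {
	t int64   // timestamp in ms
	v float64 // value
}

const penalty = int64(5000) // illustrative fixed penalty window in ms

// dedupTwoReplicas merges two timestamp-sorted replica streams, emitting the
// earlier sample at each step and dropping samples from either replica that
// fall inside the penalty window of the sample just emitted.
func dedupTwoReplicas(a, b []sample) (out []sample, adjustments int64) {
	i, j := 0, 0
	for i < len(a) || j < len(b) {
		var s sample
		if j >= len(b) || (i < len(a) && a[i].t <= b[j].t) {
			s, i = a[i], i+1
		} else {
			s, j = b[j], j+1
		}
		out = append(out, s)

		// Per-sample adjustment: skip duplicates within the penalty window.
		// Each evaluation here corresponds to the kind of per-sample work
		// that shows up as the hot line in the profile.
		for i < len(a) && a[i].t <= s.t+penalty {
			i, adjustments = i+1, adjustments+1
		}
		for j < len(b) && b[j].t <= s.t+penalty {
			j, adjustments = j+1, adjustments+1
		}
	}
	return out, adjustments
}

func main() {
	// Two replicas scraping the same series every 15s, slightly offset.
	a := []sample{{t: 0, v: 1}, {t: 15000, v: 2}, {t: 30000, v: 3}}
	b := []sample{{t: 100, v: 1}, {t: 15100, v: 2}, {t: 30100, v: 3}}
	merged, adj := dedupTwoReplicas(a, b)
	fmt.Printf("kept %d of %d samples, per-sample adjustments: %d\n",
		len(merged), len(a)+len(b), adj)
}
```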
What you expected to happen:
Compaction to finish well below 2h so that the compactors can keep up.
How to reproduce it (as minimally and precisely as possible):
I am still trying to figure out what exactly happens. I can reproduce the long durations with the same blocks on my local machine and will spend some time on it, hopefully getting more information.
Full logs to relevant components:
Nothing out of the ordinary, even in debug mode.
Anything else we need to know:
I am pretty sure we are "holding it wrong" and would like to figure out in what way. I will add more information as requested or once I find something.