-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
compactor: change GatherIndexIssueStats validation logic on sorted labels #831
Conversation
deliberately exclude the first element when checking that a label set is in sorted order. The first element is always "__name__" for prometheus and while '_' is lexographically before all lowercase letters, it does not precede upper case letters. This made compaction fail on valid metrics such as up{Z="one", a="two"} since the expected order was [Z, __name__, a]. This change brings the validation more in line with what prometheus is using at https://github.com/prometheus/prometheus/blob/9f903fb3f7841f55d1b634d5fb02986a8bfc5e0c/pkg/textparse/promparse.go#L230 Add additional check may also be needed to make sure that the first value is always __name__
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I am ok with this change, thanks for this!
But the problem is that really upper case label names are quite wrong -> It quite massively increase lookup time/effort. So they should be discouraged. Anyway, we should do it explicitely if any, not by accident sorting bug. Thanks (:
Agreed, although I didn't know about the perf hit aspect of it. The metrics that tickled this bug were changed as soon as we found them but unfortunately, we can't rewrite history, so the compactor needs to be flexible enough to deal with (and eventually erase) our bad decisions. Related to this: I'd be interested if you can you point me to any discussions or pages that talk about this perf hit? Thanks for the quick reply |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related to this: I'd be interested if you can you point me to any discussions or pages that talk about this perf hit?
I will be honest, there is nothing and I only remember @fabxc (co-author of Thanos and author of TSDB) mentioning some perf consequnces. But cannot find any immdiate answer why - anyway we should double check if name is actually first. Because I though it is not. I was thinking that the reason of perf issues is exactly that name is somewhere between upper and lower case labels. Maybe for your block we remove Z
and that's why this code works? ;p (your test does not cover this case)
l0 = l | ||
|
||
// Ensure all but the first value (__name__) are sorted | ||
if !sort.SliceIsSorted(lset[1:], func(i, j int) bool { return lset[i+1].Name <= lset[j+1].Name }) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's actually check if name is first maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because I think that's the whole point.. if the labels are ordered name is between uppercase and lowercase already.
pkg/block/index.go
Outdated
} | ||
l0 = l | ||
|
||
// Ensure all but the first value (__name__) are sorted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing trailing petiof
I initially omitted the check for Issue #32 in prometheus/tsdb was the best guidance I found on how this should be implemented and after reading your comment, I've added another test case that checks bad (at least in my eyes) labelsets correctly throw an error. Unfortunately, the old code happily accepts these label sets while my new changes result in an error (the expected output). So now I'm not 100% sure this won't break anything 🤷♂️. |
Yuck. Ensuring So I guess the question becomes: when is |
Let's focus on what we want to achieve here. Are you blocked? Do you use upper case label names? (: |
} | ||
|
||
// Ensure all but the first value (__name__) are sorted. | ||
if !sort.SliceIsSorted(lset[1:], func(i, j int) bool { return lset[i+1].Name <= lset[j+1].Name }) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit different and I might raise a different pull request but what if we would relax this check a bit and switch it to <
here? Essentially, it seems like Prometheus accepts duplicate labels and if that happens (happened in our case) then compactor doesn't work anymore due to this check. Case in point:
{"caller":"main.go:181","err":"error executing compaction: compaction failed: compaction: gather index issues for block /data/compact/0@{monitor=\"monitor\",replica=\"repl\"}/01D34EDQMSQ29RHAC47XGKHGC7: out-of-order label set {__name__=\"foo\",exported_job=\"vv\",host=\"172_16_226_56\",host=\"172_16_226_56\",region=\"lt\",subtask_index=\"5\",task_attempt_id=\"32e4b047bb768583ff57c709be3b1046\",task_attempt_num=\"8\",task_id=\"688c028a219ff3372f3eecb0ee5811f9\",task_name=\"Source:_foo\",tenant=\"abc\",tier=\"cooltier\",tm_id=\"53b2ed987b08f427dec4ee1465df91fa\"} for series 2594231","level":"error","msg":"running command failed","ts":"2019-02-11T13:30:33.901722306Z"}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to #848.
I'm dogfooding my fork and it's running great so far. The metrics with capital labels are gone but unfortunately, I'll still need to use my fork until those troublesome metrics age out of the retention window (which by design will be a while). I think the ultimate goal should be that compactor validation should be as close to prometheus' validation but no more strict than prometheus. This PR specifically should just cover the edge case of labels that start with a capital letter. A temporary fix until we can agree on the right impl here, might be completing #711. |
Ok, so with #848 being merged, can we rebase this / or decide to close? (: There is definitely more work on this if we want to productionize that (: |
I'm ok with closing this. |
I just ran into this bug. @sgmitchell does this PR still fix the issue? |
Yea, some idea is implement this:
#415 to allow back in time
fixes.
śr., 13 mar 2019 o 17:10 Jarod Watkins <[email protected]>
napisał(a):
… I just ran into this bug. @sgmitchell <https://github.com/sgmitchell>
does this PR still fix the issue?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#831 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AGoNuw88w9RfejlmW4Yc1yO8Xk__H-ZIks5vWTEbgaJpZM4azS0w>
.
|
I just built this branch (after rebasing off master) and it does not solve my issue (some labels that are all uppercase). You can see my update here, maybe I did the merge wrong? |
Ok, rebased myself to double check @ipstatic's work. https://github.com/jjneely/thanos/tree/GatherIndexIssueStats-label-sorting I updated my compact docker image and re-ran, and I get an error that Can we at least skip compaction of bad blocks? Or re-order the list correctly here?
I actually think that this patch fixes some of my woes with Compact, but I do appear to have conditions where |
Ok, looked a bit more into this and it seems like in the Prometheus code there is some single package for all things related to labels: https://github.com/prometheus/prometheus/blob/9f903fb3f7841f55d1b634d5fb02986a8bfc5e0c/pkg/labels/labels.go. As you can see, everywhere there the label set is being sorted before being returned to the users of that package. In here as mentioned in the original issue: https://github.com/prometheus/prometheus/blob/9f903fb3f7841f55d1b634d5fb02986a8bfc5e0c/pkg/textparse/promparse.go#L230 the |
@jjneely indeed this seems like a Prometheus bug. I added another test to the parser in Prometheus and ran the tests with: Result:
So yes, it seems that Prometheus parser itself isn't following the invariants laid out in the labels package. Needs fixing there, IMHO. |
Yay! So labels values should always be in sorted order with no special treatment of
Which was the original error I was facing that lead me here. Is that correct? |
Yes. So now we should add an extra flag to Thanos Compact to make it ignore this error, and print information about this in case it happens. We should explicitly note that all versions before and including Prometheus 2.8.0 are affected by this. Care to make a patch? |
Closing this because this has been outlined already in #888. Will add the relevant info there. |
Yes. I'll be glad to work on a PR for this. Thanks a bunch for the help, makes contributing easy. |
Is it possible to disable sorting? |
No, probably because it permits Prometheus to use binary search. What problem are you trying to solve? |
I use HTTP URL parameters. (target, user, pass). Because of the sorting, I have the wrong request url. |
I'm sorry but I think that your issue is completely unrelated to this one and even the Thanos project 😄 pleask ask in https://github.com/prometheus/prometheus about this. But most likely you will be told that your application must not depend on such minute details because IMHO it would be hard to implement any kind of ability to specify an arbitrary ordering of parameters when scraping. |
Deliberately exclude the first element when checking that a label set is in sorted order.
The first element is always
__name__
for prometheus and while_
is lexicographically before all lowercase letters, it does not precede upper case letters. This made compaction fail on valid metrics such asup{Z="one", a="two"}
since the expected order was[Z, __name__, a]
.This change brings the validation more in line with what prometheus is using in promparse.
An additional check may also be needed to make sure that the first value is always
__name__
Changes
GatherIndexIssueStats
Verification
Here's a simple config that generates data that will crash the compactor. I validated that my fix no longer crashed the compactor on a similar dataset.
Use the following scrape config in prometheus to generate a metrics
up{Z="one", a="two", instance="localhost:9090",job="prometheus"}
Start the compactor running the
improbable/thanos:v0.3.0
image and read the bucket that the thanos sidecar is writing. When the container encounters this timeseries, it will crash with the the following log linelevel=error ts=... caller=main.go:181 msg="running command failed" err="error executing compaction: compaction failed: compaction: gather index issues for block /var/thanos/store/compact/0@{monitor=\"prometheus\",replica=\"prometheus-0\"}/01D36C79GDX5DVXZ1PAD5ZT65M: out-of-order label set {__name__=\"up\",Z=\"one\",a=\"two\",instance=\"localhost:9090\",job=\"prometheus\"} for series 2739605"