-
Notifications
You must be signed in to change notification settings - Fork 807
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Follow up for #3895 - returning 422 (instead of 500) when query hits max_chunks_per_query
limit with block storage
#3937
Follow up for #3895 - returning 422 (instead of 500) when query hits max_chunks_per_query
limit with block storage
#3937
Conversation
d483c4d
to
906e883
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this! The changes LGTM, including Thanos, but we should investigate why the compactor cleanup fails.
pkg/compactor/compactor_test.go
Outdated
@@ -1019,7 +1019,7 @@ func removeIgnoredLogs(input []string) []string { | |||
|
|||
for i := 0; i < len(input); i++ { | |||
log := input[i] | |||
if strings.Contains(log, "block.MetaFetcher") || strings.Contains(log, "block.BaseFetcher") { | |||
if strings.Contains(log, "block.MetaFetcher") || strings.Contains(log, "block.BaseFetcher") || strings.Contains(log, "failed deleting non-compaction group directories/files, some disk space usage might have leaked") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't sound correct. We shouldn't filter out error logs. If there's an issue we should fix it. Could you investigate, please?
Suggestion: try to upgrade to Thanos master
again. The compactor local filesystem cleanup is something still under work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @pracucci.
I did upgrade Thanos and the problem persists.
Investigating this issue seems that the problem is cause during the first compaction the compactDir
is not yet created and then we get an error - no such file or directory when trying to delete on this line.
We can see that the folder is immediately created when the first compaction run from here, here and here
I think its fine to ignore this new log line there or we should try to do a change on thanos to not log that warn in case of a ENOENT
. Thoughts?
BTW, the log level of this message is Warn
not Error
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks harmless to me. Changing thanos code to not generate error if dir doesn't exist would be another way of fixing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree on not fixing it (and ignoring the log) because of a couple of reasons:
- If a Cortex user sees this warning, it may be "warned" even if it's a non issue
- If in the future other issues cause this deletion to fail, we'll not notice it if we ignore such log
I agree it's "not a big deal" right now, but given the fix looks trivial we could just go ahead and fix it:
thanos-io/thanos#3974
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!!! And thanks for following up with thanos! :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanos-io/thanos#3974 has been merged so if you upgrade Thanos again (which would be required anyway because of fixes done in the meanwhile) you can then remove the change from removeIgnoredLogs()
. Thanks!
pkg/storegateway/bucket_stores.go
Outdated
type ChunkLimiter struct { | ||
limiter *store.Limiter | ||
} | ||
|
||
func (c *ChunkLimiter) Reserve(num uint64) error { | ||
err := c.limiter.Reserve(num) | ||
if err != nil { | ||
return httpgrpc.Errorf(http.StatusUnprocessableEntity, err.Error()) | ||
} | ||
|
||
return nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of nits:
- I would move it at the end of the file, to keep
NewBucketStores()
close to itsstruct
definition - I think it's not required to export
ChunkLimiter
so I would rename it tochunkLimiter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alanprot could you address this comment, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! :D Thanks again!! :D
Overall, this looks good to me (after fixing Marco's suggestions). Please fix DCO and rebase to fix CHANGELOG conflict before we can merge. |
8979f8b
to
b9035fa
Compare
We need to upgrade Thanos again, because of some fixes done in the meanwhile (see thanos-io/thanos#3978). Upgrading to current Please also rebase |
@alanprot to keep the Thanos review easier, could you update Thanos at the commit of when thanos-io/thanos#3974 has been merged? Other PRs have been merged after that and I would prefer reviewing them in a separate PR. |
0b8cba1
to
c313cb9
Compare
18f8fef
to
c038351
Compare
Signed-off-by: Alan Protasio <[email protected]>
Signed-off-by: Alan Protasio <[email protected]>
Signed-off-by: Alan Protasio <[email protected]>
Signed-off-by: Alan Protasio <[email protected]>
c038351
to
20fefb0
Compare
Thanks @pracucci .. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good, thanks! Sorry for my very late reply :( I double checked Thanos changes and LGTM as well.
What this PR does:
Following up PR #3895
There were still some cases where Cortex was returning 5XX (instead of 422) when a user sends queries that hit
max_chunks_per_query
limit.Specifically, this limit is also checked inside Thanos and returned as an generic error here
This PR updates Thanos as it needs this change to work.
Basically a custom limiter was created to return an error with a specific GRPC code when those limits are hit, and thanos will take care of propagating this error code.
Its important to note that when updating thanos we brought this change that added a parameter on some methods.
The value
metadata.NoneFunc
as used on this extra parameter. (I don't know if cortex want to enable this hash check in the future)Ex: b5b572c#diff-adc81f6ffba52bc1871b0dac7c88237aa1cdea40e7691e35e834b40129ab723fR56
Which issue(s) this PR fixes:
Fixes #3895
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]