Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Follow up for #3895 - returning 422 (instead of 500) when query hits max_chunks_per_query limit with block storage #3937

Merged
merged 4 commits into from
Apr 8, 2021

Conversation

alanprot
Copy link
Member

@alanprot alanprot commented Mar 11, 2021

What this PR does:
Following up PR #3895

There were still some cases where Cortex was returning 5XX (instead of 422) when a user sends queries that hit max_chunks_per_query limit.

Specifically, this limit is also checked inside Thanos and returned as an generic error here

This PR updates Thanos as it needs this change to work.

Basically a custom limiter was created to return an error with a specific GRPC code when those limits are hit, and thanos will take care of propagating this error code.

Its important to note that when updating thanos we brought this change that added a parameter on some methods.
The value metadata.NoneFunc as used on this extra parameter. (I don't know if cortex want to enable this hash check in the future)

Ex: b5b572c#diff-adc81f6ffba52bc1871b0dac7c88237aa1cdea40e7691e35e834b40129ab723fR56

Which issue(s) this PR fixes:
Fixes #3895

Checklist

  • Tests updated
  • [NA] Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@alanprot alanprot force-pushed the feature/limits-err-code-2 branch from d483c4d to 906e883 Compare March 11, 2021 00:37
@tomwilkie tomwilkie requested a review from pracucci March 18, 2021 16:16
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! The changes LGTM, including Thanos, but we should investigate why the compactor cleanup fails.

CHANGELOG.md Outdated Show resolved Hide resolved
@@ -1019,7 +1019,7 @@ func removeIgnoredLogs(input []string) []string {

for i := 0; i < len(input); i++ {
log := input[i]
if strings.Contains(log, "block.MetaFetcher") || strings.Contains(log, "block.BaseFetcher") {
if strings.Contains(log, "block.MetaFetcher") || strings.Contains(log, "block.BaseFetcher") || strings.Contains(log, "failed deleting non-compaction group directories/files, some disk space usage might have leaked") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't sound correct. We shouldn't filter out error logs. If there's an issue we should fix it. Could you investigate, please?

Suggestion: try to upgrade to Thanos master again. The compactor local filesystem cleanup is something still under work.

Copy link
Member Author

@alanprot alanprot Mar 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @pracucci.

I did upgrade Thanos and the problem persists.

Investigating this issue seems that the problem is cause during the first compaction the compactDir is not yet created and then we get an error - no such file or directory when trying to delete on this line.

We can see that the folder is immediately created when the first compaction run from here, here and here

I think its fine to ignore this new log line there or we should try to do a change on thanos to not log that warn in case of a ENOENT. Thoughts?

BTW, the log level of this message is Warn not Error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks harmless to me. Changing thanos code to not generate error if dir doesn't exist would be another way of fixing this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree on not fixing it (and ignoring the log) because of a couple of reasons:

  1. If a Cortex user sees this warning, it may be "warned" even if it's a non issue
  2. If in the future other issues cause this deletion to fail, we'll not notice it if we ignore such log

I agree it's "not a big deal" right now, but given the fix looks trivial we could just go ahead and fix it:
thanos-io/thanos#3974

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!!! And thanks for following up with thanos! :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanos-io/thanos#3974 has been merged so if you upgrade Thanos again (which would be required anyway because of fixes done in the meanwhile) you can then remove the change from removeIgnoredLogs(). Thanks!

Comment on lines 76 to 87
type ChunkLimiter struct {
limiter *store.Limiter
}

func (c *ChunkLimiter) Reserve(num uint64) error {
err := c.limiter.Reserve(num)
if err != nil {
return httpgrpc.Errorf(http.StatusUnprocessableEntity, err.Error())
}

return nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of nits:

  1. I would move it at the end of the file, to keep NewBucketStores() close to its struct definition
  2. I think it's not required to export ChunkLimiter so I would rename it to chunkLimiter

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alanprot could you address this comment, please?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! :D Thanks again!! :D

@pstibrany
Copy link
Contributor

Overall, this looks good to me (after fixing Marco's suggestions). Please fix DCO and rebase to fix CHANGELOG conflict before we can merge.

@alanprot alanprot force-pushed the feature/limits-err-code-2 branch 4 times, most recently from 8979f8b to b9035fa Compare March 25, 2021 21:45
@pracucci
Copy link
Contributor

pracucci commented Apr 1, 2021

We need to upgrade Thanos again, because of some fixes done in the meanwhile (see thanos-io/thanos#3978). Upgrading to current main should be fine (I just checked it).

Please also rebase master before upgrading Thanos cause we did some changes to go.mod which may have an impact to the updated dependencies.

@pracucci
Copy link
Contributor

pracucci commented Apr 1, 2021

Please also rebase master before upgrading Thanos cause we did some changes to go.mod which may have an impact to the updated dependencies.

@alanprot to keep the Thanos review easier, could you update Thanos at the commit of when thanos-io/thanos#3974 has been merged? Other PRs have been merged after that and I would prefer reviewing them in a separate PR.

@alanprot alanprot force-pushed the feature/limits-err-code-2 branch from 0b8cba1 to c313cb9 Compare April 5, 2021 04:41
@alanprot alanprot force-pushed the feature/limits-err-code-2 branch 2 times, most recently from 18f8fef to c038351 Compare April 5, 2021 05:11
alanprot added 4 commits April 4, 2021 22:19
Signed-off-by: Alan Protasio <[email protected]>
Signed-off-by: Alan Protasio <[email protected]>
Signed-off-by: Alan Protasio <[email protected]>
@alanprot alanprot force-pushed the feature/limits-err-code-2 branch from c038351 to 20fefb0 Compare April 5, 2021 05:19
@alanprot
Copy link
Member Author

alanprot commented Apr 5, 2021

Thanks @pracucci ..
I just updated the PR

Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good, thanks! Sorry for my very late reply :( I double checked Thanos changes and LGTM as well.

@pracucci pracucci merged commit 02bcf27 into cortexproject:master Apr 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants