Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.*: Don't report "object not found" as error for Get/GetRange #2365

Closed
wants to merge 2 commits into from
Closed

.*: Don't report "object not found" as error for Get/GetRange #2365

wants to merge 2 commits into from

Conversation

pstibrany
Copy link
Contributor

@pstibrany pstibrany commented Apr 2, 2020

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Don't treat "object not found" as failures when reporting metrics for Get/GetRange operations.

Recently introduced IgnoreDeletionMarkFilter uses Get operation to check if block is marked for deletion, and since many blocks are not, bucket store returns "object not found" error. Since metricBucket reports this as failed operation, it looks as if there were many failed operations. But this is expected outcome for Get/GetRange, so I don't think it should be reported as failure.

Alternatively, we can modify IgnoreDeletionMarkFilter to check for file first, but that would require two operations.

Verification

See decreased number of reported failures thanos_objstore_bucket_operation_failures_total for Get operation.

Signed-off-by: Peter Štibraný <[email protected]>
@pstibrany
Copy link
Contributor Author

check-doc target failed with

docs/getting-started.md
	ERROR	https://fosdem.org/2019/schedule/event/thanos_transforming_prometheus_to_a_global_scale_in_a_seven_simple_steps/
		Dialing to the given TCP address timed out

I believe it's unrelated to my change.

@GiedriusS
Copy link
Member

Copy link
Member

@GiedriusS GiedriusS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO it would be cleaner to do this in deletionmark.go itself because some code might actually try to Get something that the code assumes that some object exists, and that should probably be reflected in the metrics. For example, the user could still run into consistency issues even with all of these safeguards in place but with this if it would be impossible to detect such a problem easily.

@pstibrany
Copy link
Contributor Author

IMHO it would be cleaner to do this in deletionmark.go itself because some code might actually try to Get something that the code assumes that some object exists, and that should probably be reflected in the metrics. For example, the user could still run into consistency issues even with all of these safeguards in place but with this if it would be impossible to detect such a problem easily.

I've created #2369 that implements alternative fix in ReadDeletionMark function. Personally I weakly prefer this PR, but either one is fine. Thank you.

@pracucci
Copy link
Contributor

pracucci commented Apr 3, 2020

IMHO it would be cleaner to do this in deletionmark.go itself because some code might actually try to Get something that the code assumes that some object exists, and that should probably be reflected in the metrics.

From my perspective, bucket level metrics should track a failure only when a bucket-related failure occurs. If the object does not exist, it's not a bucket-level error (ie. networking error, storage unavailable, etc...). I personally think this PR is better than #2369.

@bwplotka
Copy link
Member

bwplotka commented Apr 3, 2020

Looking.

NotFound is a problem indeed - it's most of the time user error, but for us we used to treat it as a server error if it's unexpected.

@bwplotka
Copy link
Member

bwplotka commented Apr 3, 2020

One way of doing it is to always use Exists method before.. but that seems like overhead for most of the times ):

@pstibrany
Copy link
Contributor Author

One way of doing it is to always use Exists method before.. but that seems like overhead for most of the times ):

#2369 :)

@bwplotka
Copy link
Member

bwplotka commented Apr 3, 2020

or actually you did that version too #2369

Will reach you offline

@bwplotka
Copy link
Member

bwplotka commented Apr 3, 2020

I think I like #2369 more, because we will be still notified for cases that not found is unexpected...

Other option is to extend the API a bit and add something like NewBucketNotFoundIsOk ;p

@bwplotka
Copy link
Member

bwplotka commented Apr 3, 2020

Slack discussion: https://cloud-native.slack.com/archives/CL25937SP/p1585903230146500

@pracucci Re #2365 (comment)

You are right, but we still need to have some metric if this happens in unexpectedly. Fourth alternative is another metric (: But this will be inconsistent with our bucket Lvl metric because we don't separate user-level errors at the moment.

@bwplotka
Copy link
Member

bwplotka commented Apr 3, 2020

Alternative we agrred offline for: #2370 Let me know if that's make sense 🤗

@pstibrany
Copy link
Contributor Author

#2370 seems to fix the problem as well. Thanks!

@bwplotka
Copy link
Member

bwplotka commented Apr 6, 2020

Thanks to you for starting discussion and ideas 👍

@bwplotka bwplotka closed this Apr 6, 2020
@pstibrany
Copy link
Contributor Author

Thanks Bartek!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants