docs(specs): Add specification for partial-write errors #16034

Draft
wants to merge 2 commits into
base: master

Conversation

srebhan
Member

@srebhan srebhan commented Oct 16, 2024

Summary

Add a specification for handling partial-write errors on outputs, defining the expected behavior and the content of the error.

Checklist

  • No AI generated code was used in this PR

Related issues

related to #11942
related to #14802
related to #15908
related to #15742

@telegraf-tiger telegraf-tiger bot added the docs Issues related to Telegraf documentation and configuration descriptions label Oct 16, 2024
@srebhan srebhan added the ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. label Oct 16, 2024
@srebhan srebhan marked this pull request as draft October 16, 2024 18:35
Member

@DStrand1 DStrand1 left a comment

Looks great! I just have one question and a small fix.

docs/specs/tsd-008-partial-write-error-handling.md (outdated)
Comment on lines +33 to +35
To do so, the error must contain a list of successfully
written metrics, which must be marked as __accepted__ and must be removed from
the buffer. The error must contain a list of metrics fatally failed to be
Member

Should accepted metrics be guaranteed to be in the error? Or could they be inferred based on any errored metrics?

Member Author

I would appreciate your help with the wording, as I feel mine is ambiguous and overly complex. What I want to say is that we get the list of metrics that were accepted (i.e. can be dropped from the buffer) as well as the list of metrics that were rejected, e.g. due to serialization errors or similar (i.e. can also be dropped from the buffer). All metrics in the batch that belong to neither of these lists should be kept in the buffer and re-issued for writing with the next batch!

So either the error explicitly provides them (which we might want in the future), or we need to infer the metrics to keep from their absence from both of the other lists...

Member

I think the problem with relying on the error to explicitly provide the accepted metrics is that it's not something we can count on: if all metrics are accepted, there would be no error. I think it makes the most sense to have the error only give information about which metrics had an error (retryable or otherwise); if a metric isn't mentioned in the error, it can be assumed to be accepted. Does that sound reasonable?

Member Author

To be honest, I would take another view: all metrics that should be dropped from the buffer should be mentioned in the error, so we have an implicit fallback of "what's not mentioned should be kept", which is the safe option IMO. That's what is currently done: all "accepted" and all "rejected" metrics are in there, so everything else should be kept.

Member

I think that makes a lot of sense from a code-flow perspective. My only hangup is with this being called an "error" when it describes metrics that did not fail but were accepted properly (and in many cases may contain only accepted metrics). Maybe it makes sense for this not to be an error but some other explicit return type that could contain an error field? But if you don't have an issue with this, then I'm fine with it.
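
One possible shape for such an explicit return type is sketched below. The type `WriteResult` and its methods are hypothetical and only illustrate the idea of carrying the accepted and rejected lists next to an optional error; nothing here is taken from the PR itself.

```go
package main

// WriteResult is a hypothetical return type for this sketch. It reports the
// outcome per metric and carries an error only if something went wrong.
type WriteResult struct {
	Accepted []int // batch indices written successfully
	Rejected []int // batch indices that failed fatally
	Err      error // nil if no metric had an error
}

// Droppable returns all batch indices that can be removed from the buffer,
// accepted indices first, followed by rejected ones.
func (r WriteResult) Droppable() []int {
	out := make([]int, 0, len(r.Accepted)+len(r.Rejected))
	out = append(out, r.Accepted...)
	out = append(out, r.Rejected...)
	return out
}

// Complete reports whether the whole batch was accepted without error.
func (r WriteResult) Complete(batchSize int) bool {
	return r.Err == nil && len(r.Accepted) == batchSize
}
```

A fully successful write is then an ordinary `WriteResult` with a nil `Err`, which avoids returning an "error" that only lists accepted metrics.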

Co-authored-by: Dane Strandboge <[email protected]>
Labels
docs Issues related to Telegraf documentation and configuration descriptions ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review.

2 participants