Proposal: Allow Output Plugins to optionally control their own OK/Retry/Error metrics #6141
Background
Fluent Bit Flushing and Metrics Mechanism
In Fluent Bit, logs are ingested from inputs and then batched into chunks of roughly 2 MB. These are buffered and then sent to outputs. An output flush event is supposed to either succeed or fail to send the entire chunk. Flush events are ideally independent of each other, which allows concurrent workers to send multiple chunks at a time.
So for each chunk, an output must return one of FLB_OK, FLB_RETRY, or FLB_ERROR. These return values are counted and exposed as the Fluent Bit Prometheus metrics: https://docs.fluentbit.io/manual/administration/monitoring
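The per-chunk accounting described above can be sketched roughly as follows. This is not Fluent Bit's actual code: the `flush_result` enum stands in for FLB_OK / FLB_RETRY / FLB_ERROR, and `output_counters` and `count_flush` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for FLB_OK / FLB_RETRY / FLB_ERROR (values are illustrative). */
enum flush_result { FLUSH_OK, FLUSH_RETRY, FLUSH_ERROR };

/* Hypothetical per-output counters, one per possible flush result. */
struct output_counters {
    uint64_t ok;
    uint64_t retries;
    uint64_t errors;
};

/* Record the result of one chunk flush: exactly one counter moves per chunk. */
static void count_flush(struct output_counters *c, enum flush_result r)
{
    assert(c != NULL);
    switch (r) {
    case FLUSH_OK:    c->ok++;      break;
    case FLUSH_RETRY: c->retries++; break;
    case FLUSH_ERROR: c->errors++;  break;
    }
}
```

The point of the sketch is that the unit being counted is always the chunk, regardless of what the plugin did with the data.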
S3 Output Buffering
The S3 plugin has its own buffering mechanism, because it serves a unique use case. Customers expect large files, potentially several GB in size, in their S3 bucket. Tons of little 2 MB files, one created for each chunk, would be a poor user experience.
Thus, AWS implemented custom buffering in out_s3. The plugin accepts data from flush events and buffers it on the file system to build large files, so most flush events do not send any data to S3. As a result, the plugin returns FLB_OK for every flush unless there was an error writing to the buffer files. This means that the Prometheus metrics for S3 are meaningless.
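A simplified model of this behavior shows why the chunk-level result says nothing about uploads. It is only a sketch under stated assumptions, not the out_s3 implementation: `s3_sketch`, `sketch_flush`, and the `upload_works` parameter (which simulates whether the S3 upload succeeds) are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins for FLB_OK / FLB_RETRY / FLB_ERROR (values are illustrative). */
enum flush_result { FLUSH_OK, FLUSH_RETRY, FLUSH_ERROR };

/* Hypothetical plugin state. */
struct s3_sketch {
    size_t   buffered_bytes;   /* data staged on the file system */
    size_t   upload_size;      /* target file size before an upload is attempted */
    unsigned upload_failures;  /* failures the engine never hears about */
};

/* Handle one flush event. The return value reflects only the local buffer
 * write, never the outcome of the upload. */
static enum flush_result sketch_flush(struct s3_sketch *s, size_t chunk_bytes,
                                      bool buffer_write_ok, bool upload_works)
{
    assert(s != NULL);
    if (!buffer_write_ok) {
        return FLUSH_ERROR;
    }
    s->buffered_bytes += chunk_bytes;
    if (s->buffered_bytes >= s->upload_size) {
        if (upload_works) {
            s->buffered_bytes = 0;   /* file uploaded, buffer released */
        } else {
            s->upload_failures++;    /* data is kept for a later attempt */
        }
    }
    return FLUSH_OK;
}
```

In this model, three 2 MB flushes against a 6 MB target with a failing upload all return the OK stand-in, while `upload_failures` ends at 1.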
Problem Statement
Because S3 has its own buffering and retry strategy, the Prometheus output metrics are useless for customers. Actually, they are worse than useless: they are misleading. In most cases, the metrics for S3 will show only success, since the plugin almost always returns FLB_OK. Thus, customers may think their uploads are all succeeding even though they are not.
Goal
S3 output metrics in the Prometheus endpoint should be meaningful and useful to users. S3 customers do not think in terms of failed chunks; they think in terms of failed file uploads.
Proposal: Allow Output Plugins to optionally control their own OK/Retry/Error metrics
Currently, there is code that increments the error, retry and success metrics when flush events are completed here:
The proposal is that Fluent Bit supports an output flag, FLB_OUTPUT_OMIT_METRICS, which, if set, will bypass the metric counter code above. Instead, the plugin will have the option of incrementing the cmetric counters itself. This would allow the S3 output to increment the counters based on file uploads.
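One way the flag could behave is sketched below. Everything here is an assumption made for illustration: `SKETCH_OUTPUT_OMIT_METRICS` stands in for the proposed FLB_OUTPUT_OMIT_METRICS (its bit value is arbitrary), plain integer fields stand in for the cmetrics counters, and `engine_on_flush_done` and `plugin_on_upload_done` are hypothetical function names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the proposed FLB_OUTPUT_OMIT_METRICS flag (bit value is arbitrary). */
#define SKETCH_OUTPUT_OMIT_METRICS (1u << 0)

/* Local stand-ins for FLB_OK / FLB_RETRY / FLB_ERROR (values are illustrative). */
enum flush_result { FLUSH_OK, FLUSH_RETRY, FLUSH_ERROR };

/* Hypothetical output instance: flags plus simple counters. */
struct output_sketch {
    unsigned flags;
    uint64_t ok;
    uint64_t retries;
    uint64_t errors;
};

/* Engine side: count the flush result unless the output opted out. */
static void engine_on_flush_done(struct output_sketch *o, enum flush_result r)
{
    assert(o != NULL);
    if (o->flags & SKETCH_OUTPUT_OMIT_METRICS) {
        return;  /* the plugin owns its counters */
    }
    switch (r) {
    case FLUSH_OK:    o->ok++;      break;
    case FLUSH_RETRY: o->retries++; break;
    case FLUSH_ERROR: o->errors++;  break;
    }
}

/* Plugin side: count in the unit users care about, here one file upload. */
static void plugin_on_upload_done(struct output_sketch *o, bool upload_succeeded)
{
    assert(o != NULL);
    if (upload_succeeded) {
        o->ok++;
    } else {
        o->errors++;
    }
}
```

With the flag set, any number of flush completions leaves the counters untouched, and only upload outcomes move them. Outputs that do not set the flag keep today's per-chunk behavior.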
Usefulness for other plugins
This could potentially be useful for other outputs as well. Many outputs have to make multiple requests to send a single chunk. For example, if an API (e.g. AWS CloudWatch PutLogEvents) has a 1 MB max payload size, then each ~2 MB chunk could take 2+ requests to upload.
Output plugin developers may prefer to decide that the meaning of the retry/success metrics for their plugin is based on API calls, not chunks. Essentially, the argument here is that end-users do not think in terms of chunks, since it's an internal Fluent Bit concept, and thus the Prometheus metrics are maximally useful if they are tied to end-user needs.
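The request count in that example is a ceiling division. The helper below is a hypothetical illustration, not part of any plugin. It gives a lower bound only, since it ignores per-event overhead and the fact that records cannot be split at arbitrary byte offsets.

```c
#include <assert.h>
#include <stddef.h>

/* Minimum number of requests needed to send chunk_bytes when each request
 * carries at most max_payload_bytes (ceiling division). */
static size_t requests_for_chunk(size_t chunk_bytes, size_t max_payload_bytes)
{
    assert(max_payload_bytes > 0);
    return (chunk_bytes + max_payload_bytes - 1) / max_payload_bytes;
}
```

For a chunk of exactly 2 MB and a 1 MB limit this yields 2 requests. One extra byte pushes it to 3, which is why the text says "2+".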