wip - filter_ratelimit: Add record rate limiter #433
Conversation
I didn't notice #410, but I don't think it's complete yet. Anyway, we have a requirement in our production systems to limit logs on a per-pod basis (in Kubernetes), so we can protect our ES cluster and other pods from a rogue application that starts spamming logs heavily. It uses the token bucket algorithm, which is simple to code and reason about. I had to modify the hashtable implementation to store any arbitrary object, so I could store the bucket data for quick retrieval. My work on this also motivated #431 to limit total hash table size. Let me know what you think; I will try this out in our own systems as well to get some feedback.
Changed it to
I'll rename "Events" to "Records" as well to match the other plugins.
Quick benchmarks (updated using null output), single bucket:
baseline:
with filter:
Around a 3% increase.
Force-pushed from c3cac52 to e13d732.
Add a filter to rate limit records per configurable bucket field. The benefit of this is that the overall logging infrastructure can be protected from a rogue logging source (e.g. an application that spams a high volume of messages in a short period of time).

Rate limiting is implemented using a token-based algorithm. Some n tokens per second are added, with a total burst limit of q. This limits messages to n messages per second on average, with a maximum of q in any single second.

Configuration:

Bucket_Key [filename-field]
Field to use for grouping messages into a rate-limited bucket. Rate limits apply to the bucket. Buckets are independent from each other.

Records_Per_Second 10
Average number of records per second allowed.

Records_Burst 20
Max number of records in a second allowed.

Initial_Records_Burst 100
Max number of records to allow on startup. Useful when a lot of log messages are expected to load at startup from new log files.

Max_Buckets 256
Number of expected active buckets. If too small, rate limiting won't function very well.
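To make the description concrete, here is a minimal Python sketch of the per-bucket token bucket scheme described above. It is not the PR's actual C implementation; the class and method names (`TokenBucket`, `RateLimitFilter`, `filter`) are illustrative only.

```python
import time


class TokenBucket:
    """Token bucket: `rate` tokens are added per second, capped at `burst`."""

    def __init__(self, rate, burst, now):
        self.rate = rate
        self.burst = burst
        self.tokens = burst  # start full so initial traffic passes
        self.last = now

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class RateLimitFilter:
    """Keep a record if its bucket (keyed by Bucket_Key) still has tokens."""

    def __init__(self, bucket_key, records_per_second, records_burst):
        self.bucket_key = bucket_key
        self.rps = records_per_second
        self.burst = records_burst
        self.buckets = {}

    def filter(self, record, now=None):
        now = time.monotonic() if now is None else now
        key = record.get(self.bucket_key, "")
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = self.buckets[key] = TokenBucket(self.rps, self.burst, now)
        return bucket.allow(now)  # True = keep record, False = drop
```

Each bucket is independent: exhausting the tokens for one log source never affects another, which is the core protection the filter is after.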
I'm thinking a better approach may be some time delay before rate limiting kicks in. That should let the tail plugin finish gathering all prior logs and send them through (I suppose even that isn't perfect due to backpressure, but I think it will be a safer approach). So I could introduce
Also thinking the dropped-messages log line throttling should be time based - it's still really easy to cause log spam with it in my testing.
- Replace Initial_Record_Burst with Initial_Delay, which delays rate limiting for the specified number of seconds. This provides a better mechanism to avoid throttling logs at startup.
- Throttle the dropped-record log message by time, so the log volume of the filter is independent of incoming log volume.
- Add an option Log_Period_Seconds so the logging period is configurable.
- Ensure records_burst is >= records_per_second.
- Improve test coverage.
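The two behavioural changes above (Initial_Delay and time-based drop logging) can be sketched as a thin wrapper around any rate limiter. This is an illustrative sketch, not the PR's code; `DelayedRateLimiter` and its parameters are hypothetical names standing in for the Initial_Delay and Log_Period_Seconds options.

```python
class DelayedRateLimiter:
    """Skip rate limiting for `initial_delay` seconds after startup, and
    emit the 'dropped N records' log line at most once per `log_period`
    seconds, so the filter's own log volume is bounded by time, not by
    incoming record volume."""

    def __init__(self, limiter_allow, initial_delay, log_period, now):
        self.allow = limiter_allow  # callable(now) -> bool, the real limiter
        self.start = now
        self.initial_delay = initial_delay
        self.log_period = log_period
        self.dropped = 0
        self.last_log = now

    def filter(self, now):
        if now - self.start < self.initial_delay:
            return True  # startup window: let everything through
        if self.allow(now):
            return True
        self.dropped += 1
        if now - self.last_log >= self.log_period:
            # At most one log line per log_period, summarizing all drops.
            print(f"rate limit exceeded, dropped {self.dropped} records")
            self.dropped = 0
            self.last_log = now
        return False
```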
Ok made some improvements in the last commit:
I'm not sure what will happen in case you have a limit of 5 msg per second, but get 7 messages every 3rd second. Will you drop the overage if there is no burst left, or keep it?
I've experimented with this plugin; here is the problem I've found, or maybe I did not get how to configure it, please advise:
config
Thanks for testing it @onorua. This plugin is token based (not leaky bucket), with a drop policy if out of tokens. My idea is we rely on normal backpressure if there is a network limitation.
Yeah, that's as expected: if you exceed "burst" it will drop records for that bucket. Can you let me know what you had in mind instead? My goal here is to prevent single log sources from causing disruption to other sources. Maybe it doesn't match what you want. Do you want a single rate limit, relying on fluent-bit's normal backpressure mechanism?
Can you rely on normal network backpressure here? ES should slow down and fluent-bit will naturally slow down number of messages sent.
If there is no burst capacity left, it will drop the messages.
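To make the answer to the 5-per-second / 7-messages-every-3rd-second question concrete, here is a small simulation of the token bucket's batch behaviour. The function name and shape are illustrative, not from the PR:

```python
def token_bucket_run(rate, burst, batches):
    """Simulate a token bucket. `batches` is a list of (time, n_records)
    arrivals; returns how many records of each batch are kept."""
    tokens, last, kept = float(burst), 0.0, []
    for now, n in batches:
        # Refill for the elapsed time, capped at the burst size.
        tokens = min(burst, tokens + (now - last) * rate)
        last = now
        ok = min(n, int(tokens))  # pass as many records as we have tokens
        tokens -= ok
        kept.append(ok)
    return kept
```

With rate=5 and burst=5, batches of 7 records every 3 seconds lose 2 records each time, even though the average rate (7/3 ≈ 2.3/s) is well under the limit; with burst=10 the whole batch fits and nothing is dropped. This is why the filter requires records_burst >= records_per_second and why burst should cover the largest expected batch.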
Maybe we need a better policy than dropping records, but I need to think on it. A problem at least is that fluent-bit can't distinguish different log flows (afaict), so it can't slow down messages for just a single log file. Maybe some sort of fair queuing would work with a leaky bucket rate limit. But it's quite a different approach. And we'd still need a way of pausing log ingestion per log source.
I'm thinking we could modify the tail input plugin to have a buffer per file rather than a global buffer. Then we could have a leaky bucket which is keyed on the Path_Key for instance, and rely on the backpressure to stop file ingestion per file. |
In our case we use fluent-bit as a forwarder, which means it has no relation to tail or anything of the sort. What we really need is the leaky bucket, which would keep leaking even if it is overflowed 100% of the time.
What I have discovered is that for forwarding purposes neither GCRA nor token buckets work well, because the client fluent-bit/fluentd sends data in batches every several seconds (the default is 5, I believe). Because you consider fluent-bit only as ingesting, while in our case it is ingesting + forwarding, we have slightly different expectations as well as a different approach.
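The contrast with the token bucket above can be sketched as a queue-based leaky bucket: instead of dropping the overage, records are queued and drained at a fixed rate, shaping the output even when input arrives in large flush batches. This is an illustrative sketch of the concept, not anything in fluent-bit; the names are made up.

```python
import collections


class LeakyBucket:
    """Leaky bucket as a queue: records are accepted on arrival and
    drained at a fixed rate, so output is smoothed rather than dropped
    (until the queue itself overflows)."""

    def __init__(self, rate, capacity):
        self.rate = rate          # records emitted per second
        self.capacity = capacity  # max queued records before overflow
        self.queue = collections.deque()

    def offer(self, record):
        if len(self.queue) >= self.capacity:
            return False  # queue full: only now is anything lost
        self.queue.append(record)
        return True

    def drain(self, seconds):
        # Emit up to rate * seconds queued records, oldest first.
        out = []
        for _ in range(int(self.rate * seconds)):
            if not self.queue:
                break
            out.append(self.queue.popleft())
        return out
```

A 12-record flush batch into a 5/s leaky bucket comes out as 5, 5, 2 over three seconds, which is the traffic-shaping behaviour being asked for here, at the cost of buffering and latency.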
Hmm, I forgot about the flushing behaviour. I wonder what the purpose of that is over sending records as soon as possible. I think I see what you want now, though - some sort of traffic shaping on the output. For me it's more about policy enforcement, so probably different things. I think if this filter could "pause" ingestion/forwarding instead of dropping, then you could sort of get that traffic shaping, although you'd still have bursts from the flushes.
@onorua I'm thinking now that I'll be happy with Mem_Buf_Limit, which will throttle logs anyway. E.g. if your flush period is 5s and Mem_Buf_Limit is 5Mi, you'll throttle logs to 1Mi/s.
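The arithmetic above (buffer limit divided by flush period gives the effective throughput cap) corresponds to a configuration along these lines; the path and the exact values here are illustrative, not from this thread:

```
[SERVICE]
    Flush 5

[INPUT]
    Name          tail
    Path          /var/log/containers/*.log
    Mem_Buf_Limit 5MB
```

With at most 5MB buffered per 5-second flush cycle, ingestion for this input is effectively capped near 1MB/s; once the buffer fills, the tail input pauses rather than dropping records.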
Mem_Buf_Limit and the new filter_throttle address the need. Closing this for now.