Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature request] Recalculating flush jitter #1296

Closed
kostasb opened this issue May 30, 2016 · 13 comments · Fixed by #1298
Closed

[feature request] Recalculating flush jitter #1296

kostasb opened this issue May 30, 2016 · 13 comments · Fixed by #1298

Comments

@kostasb
Copy link

kostasb commented May 30, 2016

Scope:

Change the behavior of Telegraf's "flush_jitter" to make it sleep for a random,newly assigned, amount of time before flushing each output.


original topic:
Add a flush_randomize_jitter configuration parameter for the agent, that will force recalculation of the flush interval (FlushInterval.Duration) for the next buffer flush cycle.
This will enable the use of flush interval jitter in the same Telegraf agent, while the current flush_jitter implementation calculates the flush interval only once at daemon startup using the flush_jitter parameter, but then maintains the calculated flush interval for all onward cycles.

@sparrc
Copy link
Contributor

sparrc commented May 30, 2016

I would prefer to change the behavior of telegraf, sleeping for a random (newly assigned) amount before flushing each output. This is currently the behavior of collection_jitter.

because of the architecture there is no way for collection_jitter to behave the same way that flush_jitter currently does, so I'd rather have them be consistent and make flush_jitter behave the same as collection_jitter.

@kostasb
Copy link
Author

kostasb commented May 31, 2016

My only consideration is about users in the field currently using flush_jitter for its original purpose to avoid having multiple agents send data at the exact same time, while flushing happens at fixed intervals.
A change of flush_jitter behavior after an upgrade for these existing installations may be confusing to users as they will see writes coming in at random times.

Adding the new behavior for jitter as a new feature & config parameter would enable us to support both solutions.

@sparrc
Copy link
Contributor

sparrc commented May 31, 2016

@kostasb currently flush_jitter does jitter the interval by a random amount, but for each instance it is a fixed length that gets set at startup.

This would be just to change the behavior to reget the jitter each interval.

@kostasb
Copy link
Author

kostasb commented May 31, 2016

@sparrc That is my consideration: users who currently use jitter to introduce some fixed delay that is consistent for each flush, will be getting a random delay for each flush once this behavior is changed.

For example, in case there are data points emitted to Telegraf without a timestamp this will result in random timestamps inside the jitter window, while with the current fixed jitter the interval would be consistent for every write. This might affect some users, which is why I suggested keeping both behaviors.

@sparrc
Copy link
Contributor

sparrc commented May 31, 2016

Telegraf doesn't emit metrics without a timestamp, and I don't think users would be relying on a "fixed" flushing interval that changes any time the service is restarted or reloaded. It seems to me like it makes more sense to make the two jitter options consistent rather than adding more flags to get them to do the same thing.

@daviesalex
Copy link

@sparrc +1. Anybody who wants this will really want true random (random on each flush).

@kostasb
Copy link
Author

kostasb commented May 31, 2016

@sparrc As long as we add a note in documents for the change in jitter behavior and the use cases affected for data points arriving to Telegraf without timestamps ( e.g. udp and tcp input plugins ) that sounds good to me.

@sparrc
Copy link
Contributor

sparrc commented May 31, 2016

@kostasb yes, will do :),

Telegraf adds timestamps when it adds points, so any udp or tcp points would get timestamps as soon as they hit the telegraf listener.

@kostasb
Copy link
Author

kostasb commented May 31, 2016

@sparrc Thanks for clarifying that, then the change shouldn't really affect any real world cases.

@daviesalex
Copy link

The current implementation isnt really achieving a terribly random effect. Here is real world data from a random sample of 1,000 servers running telegraf plotting the interval out of the log file on startup:

image

Looked at another way:

image

The hope is that a random item per node per 10 seconds will be far better distributed to achieve the goal (which is a overall uniform, or close to uniform, distribution). Equally important, it will also eliminate long term biases for some nodes.

Thanks guys!

@kostasb
Copy link
Author

kostasb commented May 31, 2016

@sparrc Will it be possible for jitter to go down to ms precision or will we stick with seconds ?

sparrc added a commit that referenced this issue May 31, 2016
use a common function between collection_jitter and flush_jitter. which
creates the same behavior between the two options.

going forward, both jitters will be random sleeps that get re-evaluated
at runtime for every interval (previously only collection_jitter did
this)

also fixes behavior so that both jitters will exit in the event of a
process exit.

closes #1296
@sparrc
Copy link
Contributor

sparrc commented May 31, 2016

it will be nanoseconds

sparrc added a commit that referenced this issue May 31, 2016
use a common function between collection_jitter and flush_jitter. which
creates the same behavior between the two options.

going forward, both jitters will be random sleeps that get re-evaluated
at runtime for every interval (previously only collection_jitter did
this)

also fixes behavior so that both jitters will exit in the event of a
process exit.

closes #1296
sparrc added a commit that referenced this issue May 31, 2016
use a common function between collection_jitter and flush_jitter. which
creates the same behavior between the two options.

going forward, both jitters will be random sleeps that get re-evaluated
at runtime for every interval (previously only collection_jitter did
this)

also fixes behavior so that both jitters will exit in the event of a
process exit.

closes #1296
sparrc added a commit that referenced this issue May 31, 2016
use a common function between collection_jitter and flush_jitter. which
creates the same behavior between the two options.

going forward, both jitters will be random sleeps that get re-evaluated
at runtime for every interval (previously only collection_jitter did
this)

also fixes behavior so that both jitters will exit in the event of a
process exit.

closes #1296
sparrc added a commit that referenced this issue May 31, 2016
use a common function between collection_jitter and flush_jitter. which
creates the same behavior between the two options.

going forward, both jitters will be random sleeps that get re-evaluated
at runtime for every interval (previously only collection_jitter did
this)

also fixes behavior so that both jitters will exit in the event of a
process exit.

closes #1296
sparrc added a commit that referenced this issue May 31, 2016
use a common function between collection_jitter and flush_jitter. which
creates the same behavior between the two options.

going forward, both jitters will be random sleeps that get re-evaluated
at runtime for every interval (previously only collection_jitter did
this)

also fixes behavior so that both jitters will exit in the event of a
process exit.

closes #1296
@daviesalex
Copy link

Thanks, @sparrc. We will get this into our environment ASAP and let you know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants