Telegraf stops publishing metrics via wavefront output, all plugins take too long to collect - no meaningful error message #3710
This sounds like the wavefront plugin is waiting on a network response without the proper timeout. @puckpuck Can you take a look at this one? It is important that outputs return from the Write call in a timely manner.
@pberlowski 2 things
@danielnelson Wavefront uses net.Dial("tcp", uri) with default timeouts. Other output plugins do something very similar, so I'm not sure whether that is the issue here, but it would make sense to force a 3s (or 5s) timeout. Also, looking at the stacktrace, it seemed like the HTTP listener was more of the culprit here: it had about 250 goroutines tied to it. Happy to investigate further as needed.
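For illustration, a minimal Go sketch of the suggested change, assuming the plugin currently dials the proxy with a plain net.Dial; the helper name and proxy address below are invented for the example:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// dialProxy is a hypothetical helper (not the actual plugin code) contrasting
// the default net.Dial call with an explicit connect timeout, as suggested
// in the comment above.
func dialProxy(uri string) (net.Conn, error) {
	// Original approach: no explicit timeout, so the dial only fails when
	// the operating system gives up on the TCP connect.
	//   conn, err := net.Dial("tcp", uri)

	// Suggested approach: fail fast after 5 seconds.
	return net.DialTimeout("tcp", uri, 5*time.Second)
}

func main() {
	// "wavefront-proxy.example.com:2878" is a placeholder address.
	conn, err := dialProxy("wavefront-proxy.example.com:2878")
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()
}
```

Note that net.DialTimeout only bounds connection establishment; bounding the send itself needs a separate write deadline, as discussed below.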
@puckpuck While the outputs are sending, telegraf stops reading metrics from the internal channel, so a blocked Write eventually fills it.
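As a rough sketch of the failure mode being described, assuming nothing about telegraf's actual internals beyond "metrics flow through a buffered channel": once the consumer stops draining, producers block, which is consistent with every input reporting that collection took too long.

```go
package main

import (
	"fmt"
	"time"
)

// A minimal sketch (not telegraf's actual internals) of the backpressure
// described above: once the buffered channel is full, every producer send
// blocks until the consumer drains it again.
func main() {
	metrics := make(chan string, 3) // deliberately tiny buffer for the demo

	// Consumer: stands in for an output whose Write hangs on the network.
	go func() {
		for m := range metrics {
			fmt.Println("writing", m)
			time.Sleep(10 * time.Second) // stuck waiting for a response
		}
	}()

	// Producer: stands in for an input plugin's collection loop.
	for i := 0; i < 10; i++ {
		start := time.Now()
		metrics <- fmt.Sprintf("metric-%d", i) // blocks once the buffer fills
		if d := time.Since(start); d > time.Second {
			fmt.Printf("send blocked for %s; collection would miss its interval\n", d)
		}
	}
}
```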
Added a 5s deadline to writes in the output plugin, so if Write ever takes more than 5s the connection will be closed and writing will resume. This may cause metrics to be re-written on a subsequent output plugin Write, since there is no way that I know of to tell telegraf which (if any) of the metrics within the batch were successfully written. @pberlowski please still follow through with the steps I asked about previously so we can track this internally as well.
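A hedged sketch of what such a write deadline can look like in Go; the function and its error handling are illustrative, not the actual plugin code:

```go
package outputsketch

import (
	"net"
	"time"
)

// writeWithDeadline is a rough sketch of the kind of change described above;
// the real plugin code may look different. conn is the TCP connection to the
// proxy and payload is one batch of metrics in Wavefront line format.
func writeWithDeadline(conn net.Conn, payload []byte) error {
	// Abort the write if it has not completed within 5 seconds.
	if err := conn.SetWriteDeadline(time.Now().Add(5 * time.Second)); err != nil {
		return err
	}
	if _, err := conn.Write(payload); err != nil {
		// On a deadline (or any other) error the connection is closed, so
		// the next flush reconnects; the whole batch may then be re-sent,
		// since there is no per-metric acknowledgement.
		conn.Close()
		return err
	}
	return nil
}
```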
Hey guys, many thanks for the blazing fast response!

@puckpuck We're investigating the network configuration separately. The interaction between our telegraf and the Wavefront proxy was indeed ticketed with Wavefront, and we'll follow up as soon as we have a full picture. We'll also look into upgrading to 1.5.1 ASAP; we consider that a likely root cause of this issue. At the same time, I feel uneasy about ending up in such a deadlock in the event of a network disruption or similar condition. As stated in the ticket, I'd expect a warning message in the telegraf log and the standard point-buffering behavior to minimize the impact. Is this a correct assumption?

@danielnelson Is there a way to output a descriptive log message to communicate that the stoppage might be caused by the full accumulator channel? Or maybe just a "channel full" warning of some sort? That would help debug the issue and ease ticketing/communication around such a condition.
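Purely as an illustration of the kind of "channel full" warning being asked for, a non-blocking send can detect a full buffer before falling back to the normal blocking send; this is a hypothetical sketch, not telegraf's implementation:

```go
package channelwarn

import "log"

// queueMetric is a hypothetical sketch of the "channel full" warning asked
// about above. It tries a non-blocking send first, logs a warning if the
// buffer is full, and then blocks as before so no metric is dropped.
func queueMetric(metrics chan<- string, m string) {
	select {
	case metrics <- m:
		// Buffer had room; nothing to report.
	default:
		log.Printf("W! metric channel is full; the output may be blocked or too slow")
		metrics <- m
	}
}
```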
@pberlowski You should have also seen an error message like the "skipping flush" warning in the logs.
It isn't always a problem if the channel fills, so we wouldn't want to log that unconditionally; however, we do have plans to address this issue by adding a system to ack messages on output. We will also likely change the Output interface in the future to pass in a …
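To make the ack idea concrete, one hypothetical shape such an interface could take is sketched below; this is not telegraf's real Output interface, nor necessarily what the maintainers have planned:

```go
package acksketch

// Metric stands in for telegraf's metric type in this sketch.
type Metric struct {
	Name   string
	Fields map[string]interface{}
}

// AckingOutput is purely illustrative: one possible shape for an output that
// acknowledges which metrics in a batch were delivered, so only the remainder
// would need to be retried or re-buffered.
type AckingOutput interface {
	// WriteBatch reports the indexes of the metrics that were written
	// successfully, along with any error for the rest.
	WriteBatch(metrics []Metric) (written []int, err error)
}
```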
@danielnelson The "skipping flush" warning message had not come up in the logs; we were going straight into the "took too long to collect" loop. I'll be watching for any changes in this area. Thanks for the good communication.
Bug report
After a certain amount of time (usually 5-10 minutes) under a load test, the telegraf agent stops sending metrics onward via the wavefront output. In addition, the log fills up with "took too long to collect" messages from all plugins.
No relevant error message is available in the logs.
This issue seems similar to #3629; however, no aggregation plugin is in use.
There is a suspicion that the output may be too slow, even though the buffer doesn't seem to overfill and drop metrics.
Load test description
Measurements are submitted via http_listener from a load-generator host that replays a snapshot of data at a selectable throughput. At 5000 points (fields) per second, the error appears within minutes of starting the agent.
Relevant telegraf.conf:
http-proxy.conf.txt
telegraf.conf.txt
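The attached files carry the real settings; purely as a hypothetical sketch, the two sections most relevant to this report would look roughly like the following (addresses, ports, and timeouts are placeholders):

```toml
# Illustrative sketch only; see the attached http-proxy.conf.txt and
# telegraf.conf.txt for the actual configuration. All values are placeholders.
[[inputs.http_listener]]
  service_address = ":8186"
  read_timeout = "10s"
  write_timeout = "10s"

[[outputs.wavefront]]
  host = "wavefront-proxy.example.com"
  port = 2878
```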
System info:
Telegraf version: 1.4.5
OS: CentOS 7.3
Expected behavior:
Telegraf logs a meaningful error describing the reason for the overload and keeps buffering metrics.
Actual behavior:
Telegraf silently ceases publishing metrics and then starts logging failures to collect on time.
Additional info:
Stacktrace:
telegraf.stacktrace.txt
Hardware and software metrics screenshots for the failure window: