[inputs.vsphere] Telegraf exit on init if conn to vsphere fails #10523
Comments
Hi,
Here is the error message from your logs, showing the connection failure immediately at Telegraf startup:
To be clear this is expected behavior. If a connection is not made, then we exit. If we were to ignore the connection issue, you could possibly never get metrics or think that since Telegraf continued to run, everything is in a good state and therefore no action is required on your part. #10078 is about possibly adding additional retry mechanisms for connections. Based on your expected behavior I am not sure there is anything further to add here. Could you explain more of what you were after? Thanks |
I'll try to explain
then copy vcenter_vm.conf to vcenter_vm1.conf and change url to localhost
Proposed solution: Telegraf exposes metrics for the state of every input plugin, e.g. curl http://telegraf-host:xxx/metrics
|
I agree with @v98765 that this is very unexpected behavior. I would have expected the vsphere plugin to exit itself, not the entire telegraf process. This has precedent: other plugins handle it as I would expect, by writing to the log both in test and in production. The following plugins do this:
docker
exec
http
prometheus
redfish
In other words, if my vsphere happens to not be accessible, the very last thing I want to have happen is to immediately lose all of my metrics and page the entire team for what looks like a massive outage. Is this something where a PR would be accepted or usually handled by the internal team? |
I'm a bit confused over what the desired behavior is supposed to be. My code is just passing the error back to the caller, which seems to cause the exit if the error is non-nil. What is the desired behavior here? Also, I noticed that it crashed with a nil-pointer error in test mode. I pulled the latest code from "master" and the error disappeared. @powersj Can you give me some guidance on how to address this? If the vsphere plugin handles this the way it's supposed to, maybe we should close this? |
If you are connecting to an external service, then plugins should error on init if you cannot connect. I know this is less than ideal for some, but as I have explained to others before, we have seen time and time again where telegraf "just starts up" and users think everything is golden, when in fact metrics are not getting collected. This results in angry users.
We have agreed that this can be less than ideal in situations where services or networks are flaky or timing is off. As such we have allowed for plugin-by-plugin options that allow for starting the plugin even if issues are found. The MongoDB and I think a couple of other plugins now have options to ignore connection errors. We would accept a PR that adds an option to ignore connection failures on start to resolve this issue. It must be an option that is off by default. |
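To make the requested option concrete, here is a minimal Go sketch of the behavior described above: fail on a connection error by default, and continue only when a per-plugin setting is explicitly enabled. This is not Telegraf's actual plugin API. The `Input` type, the `IgnoreConnectErrorOnStart` field, and `StartAll` are hypothetical names chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Input is a hypothetical, simplified stand-in for an input plugin.
// It is NOT the real telegraf.Input interface.
type Input struct {
	Name string
	// IgnoreConnectErrorOnStart is a hypothetical opt-in setting.
	// Its zero value (false) keeps the default fail-on-start behavior.
	IgnoreConnectErrorOnStart bool
	// Connect performs the plugin-specific connection attempt.
	Connect func() error
}

// Simulated connection outcomes used by the example.
var errDown = errors.New("connection refused")

func alwaysUp() error   { return nil }
func alwaysDown() error { return errDown }

// StartAll connects every input in order. A connection error aborts startup
// unless that input opted in to ignoring it, in which case the error is
// logged and the input's name is returned in ignored.
func StartAll(inputs []Input, logf func(format string, args ...any)) (ignored []string, err error) {
	for _, in := range inputs {
		cerr := in.Connect()
		if cerr == nil {
			continue
		}
		if in.IgnoreConnectErrorOnStart {
			logf("E! [%s] connection failed on start, continuing: %v\n", in.Name, cerr)
			ignored = append(ignored, in.Name)
			continue
		}
		return ignored, fmt.Errorf("starting input %s: %w", in.Name, cerr)
	}
	return ignored, nil
}

func main() {
	logf := func(format string, args ...any) { fmt.Printf(format, args...) }
	inputs := []Input{
		{Name: "cpu", Connect: alwaysUp},
		{Name: "vsphere", Connect: alwaysDown},
	}

	// Default: the failed connection is returned as a startup error.
	_, err := StartAll(inputs, logf)
	fmt.Println("default:", err)

	// Opt-in: the failure is logged and startup continues.
	inputs[1].IgnoreConnectErrorOnStart = true
	ignored, err := StartAll(inputs, logf)
	fmt.Println("opt-in:", ignored, err)
}
```

Keeping the flag's zero value as "fail on start" means an existing configuration behaves exactly as before unless the user changes it.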
Adding a bit of clarification to why I added to this thread, hopefully to help others who may be in a similar situation. What we were running into was one of our vcenters falling offline (reasons are unimportant) which then caused a running Telegraf instance to fail, leading to all of our metrics going down. To us, this was much, much more drastic as it was all data lost, not just the one input. In fact, we first assumed a core routing problem on the network. It never crossed our minds that Telegraf would exit if one of its inputs wasn't online if it had been working properly. Our solution was to fork Telegraf, have it not exit on errors like this, and have inputs give booleans on if they were alive or not (basically mimicking the Ping input). Simple monitoring on that boolean now immediately points us to the misbehaving input without affecting any other metric gathering. A bit more work to maintain from upstream but this usage fits our needs much better. |
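The "alive boolean per input" approach described above could look roughly like the following Go sketch. It is an assumption-laden illustration, not the code from the fork: the `Metric` type, the `input_health` measurement name, the `alive` field, and `GatherWithHealth` are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Metric is a hypothetical, simplified metric type (not telegraf.Metric).
type Metric struct {
	Name   string
	Tags   map[string]string
	Fields map[string]any
}

// Simulated gather functions used by the example.
func okGather() ([]Metric, error) {
	return []Metric{{Name: "cpu", Fields: map[string]any{"usage": 12.5}}}, nil
}

func failingGather() ([]Metric, error) {
	return nil, errors.New("vcenter unreachable")
}

// GatherWithHealth runs one input's gather function and always appends a
// health metric whose "alive" field records whether the gather succeeded.
// A failing input yields only the health metric instead of stopping the process.
func GatherWithHealth(input string, gather func() ([]Metric, error)) []Metric {
	metrics, err := gather()
	health := Metric{
		Name:   "input_health", // hypothetical measurement name
		Tags:   map[string]string{"input": input},
		Fields: map[string]any{"alive": err == nil},
	}
	if err != nil {
		health.Fields["error"] = err.Error()
		return []Metric{health}
	}
	return append(metrics, health)
}

func main() {
	for _, m := range GatherWithHealth("cpu", okGather) {
		fmt.Println(m.Name, m.Tags, m.Fields)
	}
	for _, m := range GatherWithHealth("vsphere", failingGather) {
		fmt.Println(m.Name, m.Tags, m.Fields)
	}
}
```

An alert on `alive == false`, grouped by the `input` tag, would then point at the misbehaving input while the other inputs keep reporting.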
If Telegraf had already started and been collecting metrics for some time, and then crashed, panicked, or exited, then that is a bug. Instead of maintaining a fork, would you be willing to put up a PR so we can see what changes you had to make? This is different from the original report, so ideally a new issue would be better to capture this. |
I'll try my best to get a PR (and open a new issue) but the codebase is fairly deeply integrated with our internal systems so I'll have to create a sample version. The quick one to demonstrate would be the mqtt_consumer where, on It's been quite some time and many, many commits since we've worked on this, apologies I can't give more info quickly. It is also extremely possible that I'm crossing some wires here so if something doesn't add up, feel free to assume it is a mistake on the part of my memory. |
@mlschuh So you're saying that the plugin crashes if it can't connect to vCenter AFTER the initialization? I tried this by simply dropping the network connection to vCenter after a few minutes and it worked as expected. It was throwing this error (obviously), but kept running and was able to resume as soon as I restored the network connection. I tested this with 1.25.0.
Could you provide more information on the conditions when this happens along with a log (preferably)? I agree with @powersj that this warrants a separate issue as this would be a very serious bug! |
@prydin My team has refreshed my memory on this. I can summarize by saying it was other Inputs (like MQTT) that would stop Telegraf, which would then not restart because of vcenter not being accessible. A cascading set of circumstances that do not really warrant any further attention. I will experiment with the mqtt consumer and open an issue if I can reproduce the exit on the current release. Thank you for engaging with me on this, sorry for the confusion. |
telegraf config with 2 inputs plugins
config test
start telegraf
process exited |
Relevant telegraf.conf
Logs from Telegraf
System info
centos 8
Docker
No response
Steps to reproduce
Expected behavior
At least a log message to STDERR, on init, about the failed connection to a vCenter server (with no retry mechanism)
Actual behavior
telegraf.service: Main process exited
Additional info
same issue, but mongodb plugin
#10078