Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

input.vsphere: unable to connect after vCenter reboot #6054

Closed
davidaugustin1 opened this issue Jun 27, 2019 · 4 comments · Fixed by #6085
Closed

input.vsphere: unable to connect after vCenter reboot #6054

davidaugustin1 opened this issue Jun 27, 2019 · 4 comments · Fixed by #6085
Labels
area/vsphere bug unexpected problem or unintended behavior
Milestone

Comments

@davidaugustin1
Copy link

Relevant telegraf.conf:

# # Read metrics from VMware vCenter
 [[inputs.vsphere]]
#   ## List of vCenter URLs to be monitored. These three lines must be uncommented
#   ## and edited for the plugin to work.
   vcenters = [ "https://vc01.my.company/sdk" , "https://vc07.my.company/sdk" ]
   username = "user"
   password = "password"
   interval = "30s" 
#
#   ## VMs
#   ## Typical VM metrics (if omitted or empty, all metrics are collected)
   vm_metric_include = [
     "cpu.demand.average",
     "cpu.idle.summation",
     "cpu.latency.average",
     "cpu.readiness.average",
     "cpu.ready.summation",
     "cpu.run.summation",
     "cpu.usagemhz.average",
     "cpu.used.summation",
     "cpu.wait.summation",
     "mem.active.average",
     "mem.granted.average",
     "mem.latency.average",
     "mem.swapin.average",
     "mem.swapinRate.average",
     "mem.swapout.average",
     "mem.swapoutRate.average",
     "mem.usage.average",
     "mem.vmmemctl.average",
     "net.bytesRx.average",
     "net.bytesTx.average",
     "net.droppedRx.summation",
     "net.droppedTx.summation",
     "net.usage.average",
     "power.power.average",
     "virtualDisk.numberReadAveraged.average",
     "virtualDisk.numberWriteAveraged.average",
     "virtualDisk.read.average",
     "virtualDisk.readOIO.latest",
     "virtualDisk.throughput.usage.average",
     "virtualDisk.totalReadLatency.average",
     "virtualDisk.totalWriteLatency.average",
     "virtualDisk.write.average",
     "virtualDisk.writeOIO.latest",
     "sys.uptime.latest",
     "guestfilesystem.freespace.total",
   ]
#   # vm_metric_exclude = [] ## Nothing is excluded by default
#   # vm_instances = true ## true by default
#
#   ## Hosts
#   ## Typical host metrics (if omitted or empty, all metrics are collected)
   host_metric_include = [
     "cpu.coreUtilization.average",
     "cpu.costop.summation",
     "cpu.demand.average",
     "cpu.idle.summation",
     "cpu.latency.average",
     "cpu.readiness.average",
     "cpu.ready.summation",
     "cpu.swapwait.summation",
     "cpu.usage.average",
     "cpu.usagemhz.average",
     "cpu.used.summation",
     "cpu.utilization.average",
     "cpu.wait.summation",
     "disk.deviceReadLatency.average",
     "disk.deviceWriteLatency.average",
     "disk.kernelReadLatency.average",
     "disk.kernelWriteLatency.average",
     "disk.numberReadAveraged.average",
     "disk.numberWriteAveraged.average",
     "disk.read.average",
     "disk.totalReadLatency.average",
     "disk.totalWriteLatency.average",
     "disk.write.average",
     "mem.active.average",
     "mem.latency.average",
     "mem.state.latest",
     "mem.swapin.average",
     "mem.swapinRate.average",
     "mem.swapout.average",
     "mem.swapoutRate.average",
     "mem.totalCapacity.average",
     "mem.usage.average",
     "mem.vmmemctl.average",
     "net.bytesRx.average",
     "net.bytesTx.average",
     "net.droppedRx.summation",
     "net.droppedTx.summation",
     "net.errorsRx.summation",
     "net.errorsTx.summation",
     "net.usage.average",
     "power.power.average",
     "storageAdapter.numberReadAveraged.average",
     "storageAdapter.numberWriteAveraged.average",
     "storageAdapter.read.average",
     "storageAdapter.write.average",
     "sys.uptime.latest",
   ]
    cluster_metric_include = [] ## if omitted or empty, all metrics are collected
    datastore_metric_include = [] ## if omitted or empty, all metrics are collected
   datacenter_metric_include = [] ## if omitted or empty, all metrics are collected
    object_discovery_interval = "300s"
    timeout = "20s"

System info:

OS: openSUSE Tumbleweed
Telegraf version: telegraf-1.10.2_linux_amd64

Steps to reproduce:

  1. Run Telegraf
  2. Reboot vCenter

Expected behavior:

vSphere Plugin reconnects to vCenter and continues metric collection

Actual behavior:

vSphere Plugin tries to reauthenticate with no success until I restart Telegraf
After restarting Telegraf, everything is working normal

Additional info:

This one gets repeatet until Telegraf is restarted

E! [input.vsphere]: Error in discovery for vc07.my.company: ServerFaultCode: The object 'vim.view.ContainerView:session[52ede638-4de8-d8f8-7797-aae997059bb7]524e4d5e-9d65-b101-c451-c7b3211b9a6c' has already been deleted or has not been completely created

Log:
(stripped down to vc07 as vc01 is still working)


2019-06-25T12:37:10Z I! [input.vsphere]: Client session seems to have time out. Reauthenticating!

2019-06-25T12:37:10Z D! [input.vsphere] Discovering resources for datastore

2019-06-25T12:37:20Z E! Error in plugin [inputs.vsphere]: took longer to collect than collection interval (30s)
2019-06-25T12:37:30Z E! Error in plugin [inputs.vsphere]: Post https://vc07.my.company/sdk: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

2019-06-25T12:37:30Z D! [input.vsphere] Discovering resources for datacenter
2019-06-25T12:37:30Z D! [input.vsphere] Discovering resources for cluster

2019-06-25T12:37:50Z I! [input.vsphere]: Client session seems to have time out. Reauthenticating!
2019-06-25T12:38:00Z E! Error in plugin [inputs.vsphere]: took longer to collect than collection interval (30s)
2019-06-25T12:38:10Z E! Error in plugin [inputs.vsphere]: Post https://vc07.my.company/sdk: context deadline exceeded

2019-06-25T12:38:20Z D! [input.vsphere]: Discover new objects for vc07.my.company
2019-06-25T12:38:20Z D! [input.vsphere] Discovering resources for cluster

2019-06-25T12:38:21Z I! [input.vsphere]: Client session seems to have time out. Reauthenticating!
2019-06-25T12:38:28Z E! Error in plugin [inputs.vsphere]: ServerFaultCode: The session is not authenticated.
2019-06-25T12:38:28Z I! [input.vsphere]: Client session seems to have time out. Reauthenticating!

2019-06-25T12:38:30Z D! [input.vsphere]: Collecting metrics for 43 objects of type host for vc07.my.company
2019-06-25T12:38:30Z D! [input.vsphere] Discovering resources for host
2019-06-25T12:38:30Z D! [input.vsphere] Discovering resources for vm
2019-06-25T12:38:30Z D! [input.vsphere] Discovering resources for datastore
2019-06-25T12:38:30Z D! [input.vsphere] Discovering resources for datacenter
2019-06-25T12:38:33Z D! [input.vsphere]: Collecting metrics for 224 objects of type vm for vc07.my.company
2019-06-25T12:38:37Z D! [input.vsphere]: Collecting metrics for 3 objects of type cluster for vc07.my.company
2019-06-25T12:38:50Z D! [input.vsphere]: Collecting metrics for 0 objects of type host for vc07.my.company
2019-06-25T12:38:50Z D! [input.vsphere]: Collecting metrics for 0 objects of type vm for vc07.my.company
2019-06-25T12:39:20Z D! [input.vsphere]: Collecting metrics for 0 objects of type vm for vc07.my.company
2019-06-25T12:39:20Z D! [input.vsphere]: Collecting metrics for 0 objects of type datastore for vc07.my.company
2019-06-25T12:39:20Z D! [input.vsphere]: Collecting metrics for 0 objects of type datacenter for vc07.my.company
2019-06-25T12:39:20Z D! [input.vsphere]: Collecting metrics for 0 objects of type host for vc07.my.company
2019-06-25T12:41:51Z D! [input.vsphere]: Discover new objects for vc07.my.company
2019-06-25T12:41:51Z D! [input.vsphere] Discovering resources for cluster
2019-06-25T12:41:51Z D! [input.vsphere] Discovering resources for host
2019-06-25T12:41:51Z D! [input.vsphere] Discovering resources for vm
2019-06-25T12:41:51Z D! [input.vsphere] Discovering resources for datastore
2019-06-25T12:41:51Z D! [input.vsphere] Discovering resources for datacenter
2019-06-25T12:41:54Z D! [input.vsphere] Discovering resources for vm
2019-06-25T12:42:02Z D! [input.vsphere] Discovering resources for datastore
2019-06-25T12:42:20Z D! [input.vsphere]: Collecting metrics for 0 objects of type host for vc07.my.company
2019-06-25T12:42:20Z D! [input.vsphere]: Collecting metrics for 0 objects of type vm for vc07.my.company
2019-06-25T12:42:22Z D! [input.vsphere] Discovering resources for datacenter
2019-06-25T12:42:23Z D! [input.vsphere] Discovering resources for cluster
2019-06-25T12:42:23Z D! [input.vsphere] Discovering resources for host
2019-06-25T12:42:50Z D! [input.vsphere]: Collecting metrics for 0 objects of type host for vc07.my.company
2019-06-25T12:42:50Z D! [input.vsphere]: Collecting metrics for 0 objects of type vm for vc07.my.company

2019-06-25T12:45:50Z I! [input.vsphere]: Client session seems to have time out. Reauthenticating!
2019-06-25T12:45:50Z E! Error in plugin [inputs.vsphere]: expected element type <Envelope> but have <html>
2019-06-25T12:46:20Z I! [input.vsphere]: Client session seems to have time out. Reauthenticating!
2019-06-25T12:46:20Z E! Error in plugin [inputs.vsphere]: expected element type <Envelope> but have <html>
2019-06-25T12:46:50Z I! [input.vsphere]: Client session seems to have time out. Reauthenticating!
2019-06-25T12:46:50Z E! Error in plugin [inputs.vsphere]: expected element type <Envelope> but have <html>
2019-06-25T12:46:51Z I! [input.vsphere]: Client session seems to have time out. Reauthenticating!
2019-06-25T12:46:51Z E! [input.vsphere]: Error in discovery for vc07.my.company: expected element type <Envelope> but have <html>

2019-06-25T12:46:54Z D! [input.vsphere] Discovering resources for vm
2019-06-25T12:47:02Z D! [input.vsphere] Discovering resources for datastore

2019-06-25T12:47:20Z I! [input.vsphere]: Client session seems to have time out. Reauthenticating!
2019-06-25T12:47:20Z E! Error in plugin [inputs.vsphere]: expected element type <Envelope> but have <html>

2019-06-25T12:47:23Z D! [input.vsphere] Discovering resources for datacenter
2019-06-25T12:47:23Z D! [input.vsphere] Discovering resources for cluster
2019-06-25T12:47:23Z D! [input.vsphere] Discovering resources for host

2019-06-25T12:48:05Z I! [input.vsphere]: Client session seems to have time out. Reauthenticating!
2019-06-25T12:48:05Z E! Error in plugin [inputs.vsphere]: Post https://vc07.my.company/sdk: dial tcp ServerIP:443: connect: connection refused
2019-06-25T12:48:20Z I! [input.vsphere]: Client session seems to have time out. Reauthenticating!
2019-06-25T12:48:20Z E! Error in plugin [inputs.vsphere]: Post https://vc07.my.company/sdk: dial tcp ServerIP:443: connect: connection refused
2019-06-25T12:48:50Z I! [input.vsphere]: Client session seems to have time out. Reauthenticating!
2019-06-25T12:48:50Z E! Error in plugin [inputs.vsphere]: 503 Service Unavailable (Failed to connect to endpoint: [N7Vmacore4Http16LocalServiceSpecE:0x000055ebf62f4c00] _serverNamespace = /sdk action = Allow _port = 8085)
2019-06-25T12:49:20Z I! [input.vsphere]: Client session seems to have time out. Reauthenticating!
2019-06-25T12:49:20Z E! Error in plugin [inputs.vsphere]: 503 Service Unavailable (Failed to connect to endpoint: [N7Vmacore4Http16LocalServiceSpecE:0x000055ebf62f4c00] _serverNamespace = /sdk action = Allow _port = 8085)
2019-06-25T12:49:50Z I! [input.vsphere]: Client session seems to have time out. Reauthenticating!
2019-06-25T12:49:50Z E! Error in plugin [inputs.vsphere]: 503 Service Unavailable (Failed to connect to endpoint: [N7Vmacore4Http16LocalServiceSpecE:0x000055ebf62f4c00] _serverNamespace = /sdk action = Allow _port = 8085)
2019-06-25T12:50:20Z I! [input.vsphere]: Client session seems to have time out. Reauthenticating!

2019-06-25T12:50:20Z D! [input.vsphere]: Collecting metrics for 0 objects of type datastore for vc07.my.company
2019-06-25T12:50:20Z D! [input.vsphere]: Collecting metrics for 0 objects of type datacenter for vc07.my.company
2019-06-25T12:50:20Z D! [input.vsphere]: Collecting metrics for 0 objects of type cluster for vc07.my.company
2019-06-25T12:50:20Z D! [input.vsphere]: Collecting metrics for 0 objects of type host for vc07.my.company
2019-06-25T12:50:20Z D! [input.vsphere]: Collecting metrics for 0 objects of type vm for vc07.my.company
2019-06-25T12:51:51Z D! [input.vsphere]: Discover new objects for vc07.my.company
2019-06-25T12:51:51Z D! [input.vsphere] Discovering resources for host

2019-06-25T12:51:51Z E! [input.vsphere]: Error in discovery for vc07.my.company: ServerFaultCode: The object 'vim.view.ContainerView:session[52ede638-4de8-d8f8-7797-aae997059bb7]524e4d5e-9d65-b101-c451-c7b3211b9a6c' has already been deleted or has not been completely created

2019-06-25T12:51:54Z D! [input.vsphere] Discovering resources for cluster
2019-06-25T12:51:54Z D! [input.vsphere] Discovering resources for host
2019-06-25T12:52:02Z D! [input.vsphere] Discovering resources for vm
2019-06-25T12:52:10Z D! [input.vsphere] Discovering resources for datastore
2019-06-25T12:52:20Z D! [input.vsphere]: Collecting metrics for 0 objects of type host for vc07.my.company
2019-06-25T12:52:20Z D! [input.vsphere]: Collecting metrics for 0 objects of type vm for vc07.my.company
2019-06-25T12:52:30Z D! [input.vsphere] Discovering resources for datacenter
2019-06-25T12:52:50Z D! [input.vsphere]: Collecting metrics for 0 objects of type host for vc07.my.company

@davidaugustin1
Copy link
Author

Some additional info:
Here are the log entrys from vCenter by that time.
At 3:31:10 PM I restarted Telegraf manually

"User cannot logon since the user is already logged on","Information","Tuesday, June 25, 2019 2:38:18 PM",,"","VSPHERE.LOCAL\telegraf","vim.event.AlreadyAuthenticatedSessionEvent"
"User cannot logon since the user is already logged on","Information","Tuesday, June 25, 2019 2:38:20 PM",,"","VSPHERE.LOCAL\telegraf","vim.event.AlreadyAuthenticatedSessionEvent"
"User cannot logon since the user is already logged on","Information","Tuesday, June 25, 2019 2:38:28 PM",,"","VSPHERE.LOCAL\telegraf","vim.event.AlreadyAuthenticatedSessionEvent"
"User VSPHERE.LOCAL\telegraf@telegrafIP logged in as Go-http-client/1.1","Information","Tuesday, June 25, 2019 2:38:30 PM",,"","VSPHERE.LOCAL\telegraf","vim.event.UserLoginSessionEvent"
"User VSPHERE.LOCAL\telegraf@telegrafIP logged in as Go-http-client/1.1","Information","Tuesday, June 25, 2019 2:50:20 PM",,"","VSPHERE.LOCAL\telegraf","vim.event.UserLoginSessionEvent"
"User VSPHERE.LOCAL\telegraf@telegrafIP logged in as Go-http-client/1.1","Information","Tuesday, June 25, 2019 3:31:10 PM",,"","VSPHERE.LOCAL\telegraf","vim.event.UserLoginSessionEvent"

@danielnelson danielnelson added the bug unexpected problem or unintended behavior label Jul 1, 2019
@danielnelson
Copy link
Contributor

danielnelson commented Jul 1, 2019

@prydin What do you think? Possibly an issue with govmomi?

@prydin
Copy link
Contributor

prydin commented Jul 1, 2019

This looks like an issue with the code that attempts to reauthenticate against vCenter if it loses connectivity.

This line

2019-06-25T12:51:51Z E! [input.vsphere]: Error in discovery for vc07.my.company: ServerFaultCode: The object 'vim.view.ContainerView:session[52ede638-4de8-d8f8-7797-aae997059bb7]524e4d5e-9d65-b101-c451-c7b3211b9a6c' has already been deleted or has not been completely created

makes me think that it holds onto some reference to an ephemeral object in the old instance of vCenter. I'd have to check the code that handles the reconnection. I'll get back to you once I've done that.

@prydin
Copy link
Contributor

prydin commented Jul 1, 2019

Yeah... It's a bug in the plugin code. Turns out the connection needs to be torn down one step further to guarantee that no references are reused. We'll try to get this fixed over the next few days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/vsphere bug unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants