Lighty.io could not handle many incoming netconf callhome connections #1655
Hi, the configuration for every call-home device is hard-coded here: https://github.com/opendaylight/netconf/blob/a2d875954fbf4a82dc0f3b91fa360dfa96ee213a/apps/callhome-provider/src/main/java/org/opendaylight/netconf/callhome/mount/CallHomeMountSessionContext.java#L78. You can try to create "real" devices with that configuration to see whether they connect and disconnect repeatedly in a high-latency network. If they do, you can find the values that make the connection stable. Then, to solve this bug, an issue report in the ODL netconf project is needed. Best
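To illustrate what "that configuration" covers, here is a rough sketch built with the generated netconf-node-topology bindings. The setter names and numeric wrapper types follow the generated code and may vary between ODL versions, and the values below are only examples, not the ones hard-coded in the linked class:

```java
import org.opendaylight.yang.gen.v1.urn.opendaylight.netconf.node.topology.rev150114.NetconfNode;
import org.opendaylight.yang.gen.v1.urn.opendaylight.netconf.node.topology.rev150114.NetconfNodeBuilder;
import org.opendaylight.yangtools.yang.common.Uint16;
import org.opendaylight.yangtools.yang.common.Uint32;

public final class CallHomeNodeConfigSketch {

    // Example values only - the linked class hard-codes its own values, which may differ.
    public static NetconfNode highLatencyTunedNode() {
        return new NetconfNodeBuilder()
                // Allow more time to establish the session over a ~180 ms RTT link.
                .setConnectionTimeoutMillis(Uint32.valueOf(60_000))
                // Give schema download and other RPCs more headroom.
                .setDefaultRequestTimeoutMillis(Uint32.valueOf(120_000))
                // 0 means "retry forever" instead of giving up after a few attempts.
                .setMaxConnectionAttempts(Uint32.valueOf(0))
                .setBetweenAttemptsTimeoutMillis(Uint16.valueOf(5_000))
                // Periodic keep-alive so idle sessions are not torn down.
                .setKeepaliveDelay(Uint32.valueOf(120))
                .build();
    }
}
```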
Actually, the server configuration uses the Netty bootstrap defaults, so those are what would need to be updated. The code referenced by Ivan is used to create the topology node entry that tracks the device status.
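For illustration, these are the kinds of knobs a Netty server bootstrap exposes. This is plain Netty API shown as an assumption of where such tuning would happen, not the actual lighty.io call-home wiring:

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelOption;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;

public final class CallHomeServerBootstrapSketch {

    public static ServerBootstrap tunedServerBootstrap() {
        return new ServerBootstrap()
                .group(new NioEventLoopGroup(1), new NioEventLoopGroup())
                .channel(NioServerSocketChannel.class)
                // A larger accept backlog helps when thousands of devices call home at once.
                .option(ChannelOption.SO_BACKLOG, 4096)
                // Options applied to each accepted (per-device) connection.
                .childOption(ChannelOption.TCP_NODELAY, true)
                .childOption(ChannelOption.SO_KEEPALIVE, true)
                .childHandler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(final SocketChannel ch) {
                        // The SSH/NETCONF call-home handlers would be added to ch.pipeline() here.
                    }
                });
    }
}
```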
@sekobey can you please provide logs so we can investigate why the devices have been disconnected?
Additionally, are you able to test with lighty.io version 15.2? That way you can be sure you have the fix for https://jira.opendaylight.org/browse/NETCONF-832. In version 15.1, call-home devices do not send keep-alive RPCs to keep the connection up.
Hi experts, there are a few questions about the code pasted above.
At time t0, would the "connecting" log mean that the underlying TCP connection (from device to controller), the SSH connection, and the NETCONF connection are all OK, but that an issue thereafter (probably while the YANG models/schemas are being downloaded from the device) causes the device to move from the connecting state to the disconnected state?
@ihrasko as requested, the logs will be updated as well within a few hours today, for a more in-depth analysis of this.
Hi @ihrasko,
UnableToConnect:
java.io.IOException: Broken pipe
2023-11-28 13:29:50.770 INFO 1 o.n.c.m.CallhomeStatusReporter : Setting failed status for callhome device id:Uri{_value=2150001479}.
Disconnect:
2023-11-28 13:31:07.658 INFO 1 .m.n.s.CallHomeListenerService : Netconf device 3600010593 connecting...
2023-11-28 13:33:22.976 INFO 1 .n.c.m.CallHomeMountDispatcher : Removing Uri{_value=3600010593} from Netconf Topology.
2023-11-28 13:33:23.005 INFO 1 .n.l.NetconfDeviceCommunicator : RemoteDevice{3600010593}: Session went down: End of input
For point 1: there is no way to guarantee that your devices are connecting/connected/unconnected at the same time, i.e. that their status in the operational data store is updated at the same moment. For example if you check
For point 2: connection termination after 10 minutes has been resolved in https://jira.opendaylight.org/browse/NETCONF-832 (lighty.io 15.2.0).
For point 3: the schema resolution is run for each device in a separate thread (it is like that by implementation).
Thanks @ihrasko. We are running tests now with version 15.2.0 and will see if that helps. I appreciate the responses to my queries.
@fatihozkan51200 From the logs I assume that the device has never been successfully connected (even in the "reconnecting" scenario). It suddenly gets removed, which can happen when:
What is the state of 'http://[ODL_IP_ADDRESS]:8181/rests/data/odl-netconf-callhome-server:netconf-callhome-server'? Are you able to see 4500 devices there? What are the logs from one of the connecting devices?
Yes, IMO the solution here could really be to make the call-home device defaults configurable.
Hi,
Describe the bug
One of our microservices (running in a docker container) is using lighty.io v15.1. This service is responsible for handling incoming device connections. Devices perform a netconf callhome connection after the SSH connection between the device and lighty.io is established. When we run 4500 devices and the microservice in the same AWS network (network latency around 0.386 ms), all devices connect to the service successfully within 10 minutes. But when we moved the microservice to another AWS region where network latency is around 180 ms, the devices could not all connect in a reasonable time; it takes hours to connect all of them. When we first run 1500 devices, they connect in 10 minutes; after that we run another 1500 devices, and again they connect in 10 minutes, and so on. However, when we run all 4500 devices at the same time, they cannot connect within 1 or 2 hours.
All the device SSH public keys are loaded into the Allowed Devices store of lighty.io at the startup of the microservice, so we see that the SSH connections are established correctly. We overrode the onDataTreeChanged method of the DataTreeChangeListener interface to observe changes in the topology tree while the netconf callhome connection between a device and the microservice is being established, as follows:
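(The sketch below is a simplified illustration rather than our exact code; it assumes the mdsal binding-api DataTreeChangeListener signature and the netconf-node-topology NetconfNode augmentation used by this lighty.io/ODL generation.)

```java
import java.util.Collection;
import org.opendaylight.mdsal.binding.api.DataTreeChangeListener;
import org.opendaylight.mdsal.binding.api.DataTreeModification;
import org.opendaylight.yang.gen.v1.urn.opendaylight.netconf.node.topology.rev150114.NetconfNode;
import org.opendaylight.yang.gen.v1.urn.opendaylight.netconf.node.topology.rev150114.NetconfNodeConnectionStatus.ConnectionStatus;
import org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CallHomeListenerService implements DataTreeChangeListener<Node> {

    private static final Logger LOG = LoggerFactory.getLogger(CallHomeListenerService.class);

    @Override
    public void onDataTreeChanged(final Collection<DataTreeModification<Node>> changes) {
        for (final DataTreeModification<Node> change : changes) {
            final Node node = change.getRootNode().getDataAfter();
            if (node == null) {
                // Node entry removed from the netconf topology.
                continue;
            }
            final NetconfNode netconfNode = node.augmentation(NetconfNode.class);
            if (netconfNode == null || netconfNode.getConnectionStatus() == null) {
                continue;
            }
            final String nodeId = node.getNodeId().getValue();
            final ConnectionStatus status = netconfNode.getConnectionStatus();
            if (status == ConnectionStatus.Connecting) {
                LOG.info("Netconf device {} connecting...", nodeId);
            } else if (status == ConnectionStatus.Connected) {
                LOG.info("Netconf device {} connected", nodeId);
            } else {
                // UnableToConnect, or the session went down.
                LOG.info("Netconf device {} disconnected", nodeId);
            }
        }
    }
}
```

Such a listener is typically registered on the network-topology/topology=topology-netconf subtree through the DataBroker.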
When we run 4500 devices at the same time, we see "Netconf device connecting..." and "Netconf device disconnected" logs repeated many times for the same device in the microservice logs in the high-latency network (~180 ms). We do not see these logs in the low-latency network (~0.386 ms).
What can be the reason? Is there a configuration parameter in lighty.io to handle this situation, such as increasing some timeout value? Maybe this is happening because of the network latency?
Thank you.
Branch
lighty.io branch 15.1
Thank you for creating your report! We will get back to you as soon as possible. Please note that our support on GitHub is based on a non-guaranteed best effort.
If you are in a hurry and have an inquiry regarding commercial support, please use this contact form: https://pantheon.tech/contact-us