[velux] hub discovery; representation properties; socket lock up issues #8777
Conversation
Travis tests were successful.
Hey @andrewfg,
@gs4711 feedback appreciated
I discovered (no pun intended) that the Velux hub discovery is very "flaky": sometimes it works as it should, and sometimes not. So this commit implements the "correct" way of getting the IPv4 address, and only if that fails does it fall back to my prior "work around" solution; this works under (almost) all circumstances.
Hello Andrew, thanks for pushing it further! Currently starting from a greenfield setup, I do not see any discovery happening:
Is there anything I have to declare beforehand?
Guenther, you don't need to do anything special, but it often takes some time to discover the hub(s). The KLF advertises itself as a service via mDNS. You can check if the KLF is advertising itself by means of an mDNS browser app (sometimes known as a Bonjour browser) -- see below a couple of screenshots from apps on Apple and Android devices.
Sorry for the inconvenience, but I'm seeing a complete loss of connectivity ...
The latest successful steps are (with some lines of failure):
The bridge is still working but no longer responding on the API.
I forgot: after restarting the binding, everything is fine again. Therefore, there is still a bug within the communication of the binding.
Hi @gs4711, from your logs it looks like you are testing an earlier build. Depending on which version of OH you are testing on, the latest jar files are here. Note that the v2.5.x version is called
Guenther, I am not 100% sure what is going wrong, but as far as I can tell, the KLF is not fully compliant with the mDNS specifications: it seems that sometimes its mDNS response contains both a host name VELUX_KLF_LAN_XXX and an IP address 192.168.x.xxx, but sometimes it seems to only return the host name. In the case that it does not return an IP address, I am also asking regular DNS if it has a lookup for the IP address of the respective host name. However if both mDNS and regular DNS are unable to resolve the host name to an IP address, then the discovery fails.
PS I think it may also depend on your network router etc. -- some routers, and some clients on the network, implement an mDNS cache and mDNS forwarding. If that is the case then the forwarder may resolve the IP address from its cache, even if the KLF is currently not showing its IP address. Or something like that.
Another idea is that instead of putting the IP address in the discovered hub's Configuration Parameter, we could perhaps put the host name in there. That would postpone the IP address resolution problem from the time of discovery to the time of connection. But I am not sure if that helps very much. => What do you think?
Andrew,
from my point of view the announcements of the KLF200 look good:
I have checked it for a while and it seems to stay stable.
Regards,
Guenther
… On 30.10.2020 at 23:43, Andrew Fiddian-Green ***@***.***> wrote:
Digging into the advertisements, I can recognize the KLF, but it is not recognized by the binding:
Any idea?
Guenther, I am not 100% sure what is going wrong, but as far as I can tell, the KLF is not fully compliant with the mDNS specifications: it seems that sometimes its mDNS response contains both a host name VELUX_KLF_LAN_XXX and an IP address 192.168.x.xxx, but sometimes it seems to only return the hostname. In the case that it does not return an IP address, I am also asking regular DNS if it has a lookup for the IP address of the respective host name. However if both mDNS and regular DNS are unable to resolve the host name to an IP address then the discovery fails. => Can you please check if your mDNS browser resolves an IP address (your screen shot did not show anything)?
Just the next try…
2020-10-31 21:17:25.636 [INFO ] [.dashboard.internal.DashboardService] - Started Dashboard at http://192.168.45.126:8080
2020-10-31 21:17:25.639 [INFO ] [.dashboard.internal.DashboardService] - Started Dashboard at https://192.168.45.126:8443
2020-10-31 21:17:26.109 [TRACE] [ery.VeluxHubMdnsDiscoveryParticipant] - createResult(): called..
2020-10-31 21:17:26.111 [TRACE] [ery.VeluxHubMdnsDiscoveryParticipant] - getIpAddress(): called for device 'VELUX_KLF_WIFI_F86D'
2020-10-31 21:17:26.113 [TRACE] [ery.VeluxHubMdnsDiscoveryParticipant] - getIpAddress(): gateway 'VELUX_KLF_WIFI_F86D' ignored (not ethernet)
2020-10-31 21:17:26.116 [TRACE] [ery.VeluxHubMdnsDiscoveryParticipant] - createResult(): called..
2020-10-31 21:17:26.117 [TRACE] [ery.VeluxHubMdnsDiscoveryParticipant] - getIpAddress(): called for device 'VELUX_KLF_LAN_54BD'
2020-10-31 21:17:26.196 [TRACE] [ery.VeluxHubMdnsDiscoveryParticipant] - getIpAddress(): gateway 'VELUX_KLF_LAN_54BD' ignored (no ip address)
2020-10-31 21:17:27.944 [INFO ] [thome.model.lsp.internal.ModelServer] - Started Language Server Protocol (LSP) service on port 5007
2020-10-31 21:17:31.839 [TRACE] [ery.VeluxHubMdnsDiscoveryParticipant] - createResult(): called..
2020-10-31 21:17:31.841 [TRACE] [ery.VeluxHubMdnsDiscoveryParticipant] - getIpAddress(): called for device 'VELUX_KLF_LAN_54BD'
2020-10-31 21:17:31.842 [TRACE] [ery.VeluxHubMdnsDiscoveryParticipant] - getIpAddress(): gateway 'VELUX_KLF_LAN_54BD' ignored (no ip address)
2020-10-31 21:17:31.843 [TRACE] [ery.VeluxHubMdnsDiscoveryParticipant] - createResult(): called..
2020-10-31 21:17:31.845 [TRACE] [ery.VeluxHubMdnsDiscoveryParticipant] - getIpAddress(): called for device 'VELUX_KLF_WIFI_F86D'
2020-10-31 21:17:31.846 [TRACE] [ery.VeluxHubMdnsDiscoveryParticipant] - getIpAddress(): gateway 'VELUX_KLF_WIFI_F86D' ignored (not ethernet)
2020-10-31 21:19:06.334 [TRACE] [rnal.discovery.VeluxDiscoveryService] - VeluxDiscoveryService(without Bridge) just initialized.
The initial discovery seems to work well … but the IP resolution fails within the Java code, whereas other tools resolve the correct address.
Correct. The OH MDNS discovery service is based on the JMDNS Java library (partly written by @kaikreuzer). So my conclusion is that either a) the KLF is not compliant with the mDNS specification, or b) the JMDNS library is not compliant with the specification. But -- just to complicate things further -- the two things do sometimes work together and resolve the IP address. It is very strange. PS I was thinking about writing my own mDNS resolver, or alternatively modifying the JMDNS library. But that would be a lot of work, and it goes way beyond this PR. :)
Perfect extension. Minor issues with discovery (probably either device malfunctions or common library misimplementation).
I am going to try to fix the mDNS discovery issues. So I will set the WIP flag for a couple more days.
=> Can't set the flags any more :(
No, you have a real build error.
That's weird. The error is on
There have been recent changes to the null checker that have implicitly added the annotations to the standard library. In general you would be correct, that
Your local build probably succeeds because you haven't merged the main branch recently. I don't suggest you do, since it would be more of a headache than it is worth atm, but just make the null checks to please Jenkins for now.
Signed-off-by: Andrew Fiddian-Green <[email protected]>
Ok the build worked this time. |
Many thanks Connor :)
Hi Andrew, I have just experienced this after the binding had been running without problems for one month. As I could not reconnect, the option for reboot did not work.
@EjvindHald thank you for the feedback. By contrast, my binding has been running perfectly. :)
PS are you using the v3 or v2.5 build? The v3.0.0 of the binding had some further TCP socket improvements beyond what I put in the last v2.5 build.
Yes, your stuff is working very well. I am using 2.5. Do you know if this is an accumulating problem with the hub, which could be solved by a weekly automatic reboot?
I have seen no evidence of that. How does your KLF200 get its IP address? Is it static, or via DHCP? If the latter, I wonder if there might be an issue with DHCP renewals? Perhaps setting the IP address as ‘reserved’ in your DHCP server would help? Just an idea...
It has a static IP address.
Ok, I have no more ideas. Let us see how OH3 goes...
So the KLF200 died again after only 2 days, so it is not an accumulating thing. Instead, it happens after a reboot of OH 2.5.10, which I am using. Same pattern as last time: when OH comes up again after the reboot, it cannot access the KLF200. A power cycle of the KLF200 is the only solution when this happens. I hope to be on OH 3 in a few days. I am currently testing the compatibility.
@EjvindHald a plain reboot is a "hard" reboot that does not let OH shut down cleanly, so it is very likely to leave the sockets in a zombie state. You need to issue a stop command to the openhab service first, and let that complete before you issue the reboot command.
I am simply restarting the Docker container which is running OH. I would expect that to be the same as restarting the OH service as you write, and not a reboot of the OS.
I suggest the following:
OH2:
OH3:
PS I am interested to know what makes you expect that?
Docker is a way of virtualizing a service. So normally a service like openhab equals one container. Therefore, a restart of a container would, as far as I can see, be the same as your "systemctl stop openhab.service" which is used for stopping an ordinary service. When I stop the container, I get this in the openhab log:
java.lang.NullPointerException: null
2020-12-25 16:44:38.926 [INFO ] [io.openhabcloud.internal.CloudClient] - Shutting down openHAB Cloud service connection 2020-12-25 16:44:38.935 [ERROR] [me.core.internal.events.EventHandler] - Dispatching/filtering event for subscriber 'org.eclipse.smarthome.core.events.EventSubscriber' failed: null java.lang.NullPointerException: null
2020-12-25 16:44:38.938 [ERROR] [me.core.internal.events.EventHandler] - Dispatching/filtering event for subscriber 'org.eclipse.smarthome.core.events.EventSubscriber' failed: null java.lang.NullPointerException: null
2020-12-25 16:44:38.954 [INFO ] [io.openhabcloud.internal.CloudClient] - Disconnected from the openHAB Cloud service (UUID = 1f7d7507-b5e1-4552-9aec-174da4e81ec0, base URL = http://localhost:8079)
2020-12-25 16:44:39.027 [INFO ] [basic.internal.servlet.WebAppServlet] - Stopped Basic UI
I do not know why I get the Java errors for EventHandler. But I do not see any message saying Velux is shut down or similar. This is openhab 2.5.11 - I still do not know the behaviour in 3.0.
EDIT: Here are the messages from the stop of the OH3 container: So still no Velux-friendly sign-off message....
EDIT2: There is a Velux graceful shutdown - it is working!
@EjvindHald I don’t know anything about Docker, so I can’t really help you. But my gut feeling is that a thing like that has no way of knowing how to shut down OH in a proper manner. And I have already given you my own suggestion above. But if you want more expert advice on the proper way with Docker, then I think you need to open a thread elsewhere, where the Docker experts are likely to see it.
@andrewfg: With your high expertise, you are probably right. I will try to figure out if the openHAB guys can maybe create the Docker openhab image differently, so it caters for this. The link is here and here.
@andrewfg: Thanks for all your contributions and support.
…es (openhab#8777)

* [velux] set explicit timeouts & keepalives on socket
* [velux] implement mdns service
* [velux] fix representation property names
* [velux] fix representation properties
* [velux] finalize mdns
* [velux] spotless
* [velux] use both mDNS and regular DNS to resolve ip addresses
* [velux] complete class rewrite using asynchronous polling thread
* [velux] refactor bridgeDirectCommunicate to simplify looping
* [velux] asynchronous polling means Thread.sleep no longer needed
* [velux] faster synch of actuator changes
* [velux] use single thread executor instead of thread pool
* [velux] faster synch of actuator changes
* [velux] shut down task executor

Signed-off-by: Andrew Fiddian-Green <[email protected]>
This PR implements a number of improvements to the Velux binding.
1. The binding could not auto-discover Velux hubs
A new `VeluxBridgeFinder` class has been added to discover Velux KLF200 hubs. Note: the Velux mDNS hub discovery is "flaky" under JmDNS, and there are issues with MDNSDiscoveryService, so this enhancement implements its own custom mDNS service resolver (a rough sketch of the general idea follows below).
The `ReadMe.md` file has been updated to say that auto-discovery of hubs is supported.
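For background, here is a minimal, hypothetical sketch (not the binding's actual `VeluxBridgeFinder` code) of how a custom resolver can issue a one-shot mDNS query for the hub's `.local` host name and read the raw response; the class name, timeout and buffer size are assumptions for illustration only:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.util.Arrays;

// Hypothetical one-shot mDNS resolver sketch: send an 'A' record query for e.g.
// "VELUX_KLF_LAN_54BD.local" to the mDNS multicast group and return the raw DNS
// response for the caller to parse.
public class MdnsQuerySketch {

    private static final String MDNS_GROUP = "224.0.0.251";
    private static final int MDNS_PORT = 5353;

    public static byte[] queryIpv4Address(String hostName) throws IOException {
        // Build a standard DNS query: 12 byte header + QNAME + QTYPE A + QCLASS IN
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeShort(0); // transaction id (0 is conventional for mDNS)
        out.writeShort(0); // flags: standard query
        out.writeShort(1); // QDCOUNT = 1 question
        out.writeShort(0); // ANCOUNT
        out.writeShort(0); // NSCOUNT
        out.writeShort(0); // ARCOUNT
        for (String label : hostName.split("\\.")) {
            out.writeByte(label.length());
            out.writeBytes(label);
        }
        out.writeByte(0); // end of QNAME
        out.writeShort(1); // QTYPE = A (IPv4 address)
        out.writeShort(1); // QCLASS = IN
        byte[] request = bos.toByteArray();

        // A query sent from an ephemeral port is a legacy "one-shot" query, so the
        // responder replies unicast to this socket.
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.setSoTimeout(3000); // assumed 3 second timeout
            socket.send(new DatagramPacket(request, request.length,
                    InetAddress.getByName(MDNS_GROUP), MDNS_PORT));
            byte[] buffer = new byte[1024];
            DatagramPacket response = new DatagramPacket(buffer, buffer.length);
            socket.receive(response); // caller parses the answer section for the A record
            return Arrays.copyOf(buffer, response.getLength());
        }
    }
}
```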
2. The binding had missing or wrong representation property names

The wrong representation property names meant that re-discovered Things were not properly matched up with already discovered Things in the Inbox.
These errors and omissions have been corrected.
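To illustrate the role of the representation property (a generic sketch, not the binding's actual discovery code; the `serial` property key, the `klf200` thing type id and the OH3 package names are assumptions), a discovery result nominates one property so the Inbox can match a re-discovered result with an existing one instead of creating a duplicate:

```java
import org.openhab.core.config.discovery.DiscoveryResult;
import org.openhab.core.config.discovery.DiscoveryResultBuilder;
import org.openhab.core.thing.ThingTypeUID;
import org.openhab.core.thing.ThingUID;

// Hypothetical example: "serial" is declared as the representation property, so a
// re-discovered hub with the same serial updates the existing Inbox entry.
public class DiscoveryResultSketch {
    public static DiscoveryResult example(String serial) {
        ThingTypeUID thingTypeUID = new ThingTypeUID("velux", "klf200"); // assumed type id
        ThingUID thingUID = new ThingUID(thingTypeUID, serial);
        return DiscoveryResultBuilder.create(thingUID)
                .withThingType(thingTypeUID)
                .withProperty("serial", serial)
                .withRepresentationProperty("serial")
                .withLabel("Velux KLF200 Hub")
                .build();
    }
}
```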
3. The binding suffered from lock-ups in its TCP communication to the hub
If the binding went offline via a catastrophic failure, the Velux hub might be left with a large amount of un-sent (or un-acknowledged) data in its write buffer, and its side of the socket might get into a "zombie" state. This would prevent the binding from re-connecting when the binding was restarted. Note: this seems to be a bug in the hub, which I have reported to Velux, but so far without response from them.
The following has been implemented:

* Explicit timeouts and `keepalive` are now set on the socket (see the sketch below).
* The `Closeable` interface is implemented in a number of places, so shutdown code gets called if the class instance gets destroyed.

Note: These changes ameliorate the problem, but possibly do not completely eliminate it, since the underlying cause lies within the hub itself.
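A minimal sketch of the two measures above, assuming an illustrative class name and timeout value (this is not the binding's actual connection class):

```java
import java.io.Closeable;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Illustrative only: explicit connect/read timeouts plus TCP keepalive on the
// socket, and Closeable so the shutdown code runs via try-with-resources.
public class HubConnectionSketch implements Closeable {

    private static final int TIMEOUT_MSECS = 5000; // assumed value

    private final Socket socket = new Socket();

    public HubConnectionSketch(String ipAddress, int port) throws IOException {
        socket.connect(new InetSocketAddress(ipAddress, port), TIMEOUT_MSECS); // bounded connect
        socket.setSoTimeout(TIMEOUT_MSECS); // bounded blocking reads
        socket.setKeepAlive(true);          // let TCP probe for a dead peer
    }

    @Override
    public void close() throws IOException {
        socket.close(); // guaranteed cleanup when used in try-with-resources
    }
}
```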
4. The binding would block on read calls
The binding used blocking socket read calls to fetch data from the hub. This meant that the whole binding could sit in a blocked state for tens of seconds or until incoming data arrived.
The following has been implemented:

* An asynchronous polling thread now performs the blocking socket reads and places incoming messages on a queue, so the rest of the binding never blocks on the socket (see the sketch below).
* As a consequence, the previous `Thread.sleep()` calls are no longer needed.
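A rough sketch of the asynchronous polling approach, with hypothetical names (not the binding's actual classes):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative only: a dedicated polling thread performs the blocking socket reads
// and pushes received data onto a queue, so callers only ever do non-blocking
// queue operations.
public class AsyncReceiverSketch implements Runnable {

    private final InputStream in;
    private final BlockingQueue<byte[]> receiveQueue = new LinkedBlockingQueue<>();

    public AsyncReceiverSketch(InputStream in) {
        this.in = in;
    }

    @Override
    public void run() {
        byte[] buffer = new byte[512];
        try {
            int count;
            // the only blocking read in the system happens here, on this thread
            while ((count = in.read(buffer)) > 0) {
                receiveQueue.offer(Arrays.copyOf(buffer, count));
            }
        } catch (IOException e) {
            // socket closed or timed out => the polling thread simply terminates
        }
    }

    /** Non-blocking fetch used by callers instead of reading the socket directly. */
    public byte[] pollMessage() {
        return receiveQueue.poll();
    }
}
```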
5. The binding wrongly used the `available()` method on its SSL socket

A little-known "feature" of SSL sockets is that `available()` always returns `0` on them. So the binding's "receive-only" calls were never actually processing incoming messages, and the data would therefore backlog in the read buffer (see item 3 above) until the next "send" call could be made.

The asynchronous message queue (see item 4) also solves this problem, since `available()` can now check for entries in the queue rather than (not) checking for data in the socket buffer. This could shorten response times by 10-50 seconds.
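For illustration, a method like the following could be added to the hypothetical `AsyncReceiverSketch` from item 4 (the method name is an assumption), turning the availability check into a queue check instead of a socket check:

```java
// With an SSL socket, in.available() always reports 0, so "is there anything to
// process?" is answered from the receive queue rather than the socket buffer.
public boolean isMessageAvailable() {
    return !receiveQueue.isEmpty();
}
```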
6. The `bridgeDirectCommunicate()` method was not transparent

While writing, testing and debugging the above changes, I felt the `bridgeDirectCommunicate()` method to be written in a hard-to-understand form, having two nested `while()` loops, labelled `break` statements, and multiple levels of `continue` and `break` statements. Furthermore the logger messages were hard for an outsider (like me) to understand.

The `bridgeDirectCommunicate()` method has been refactored to flatten the loop structure, simplify the functioning of `break` and `continue` statements, and clarify the logger messages.
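A hedged before/after sketch of the general refactoring idea (illustrative helper methods only, not the binding's actual code): nested loops with a labelled break are flattened into a single loop driven by an explicit state flag.

```java
import java.util.Random;

// Illustrative only: the helpers stand in for real send/receive logic.
public class LoopRefactorSketch {

    private final Random random = new Random();

    private boolean hasWork()    { return random.nextBoolean(); }   // stand-in
    private boolean fatalError() { return random.nextInt(10) == 0; } // stand-in
    private void doWork()        { /* send or receive one message */ }

    public void before() {
        outer:
        while (true) {
            while (hasWork()) {
                if (fatalError()) {
                    break outer; // hard to follow once the method grows
                }
                doWork();
            }
        }
    }

    public void after() {
        boolean running = true;
        while (running) {            // single, flat loop
            if (fatalError()) {
                running = false;     // one obvious exit condition
            } else if (hasWork()) {
                doWork();
            }
        }
    }
}
```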
7. The task scheduler was not executing in the right order

The task scheduler was based on a multi-thread pool, but since the tasks were anyway constrained through the single IO socket pipeline, it meant that tasks would sometimes execute in the wrong order.
The task scheduler has been changed from a multi-thread pool to a single thread executor.
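A minimal sketch of the change (the original pool size and the class name are assumptions): a single-thread executor guarantees that submitted tasks run one at a time, in submission order, which matches the single IO socket pipeline to the hub.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative only.
public class SchedulerSketch {

    // Before (hypothetical): a pool lets tasks overtake each other on different threads.
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    // After: one worker thread => strict FIFO execution order.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    public void submitCommand(Runnable command) {
        executor.submit(command);
    }

    public void shutdown() {
        pool.shutdown();
        executor.shutdown(); // cf. the "[velux] shut down task executor" commit
    }
}
```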
8. The OH UI was not being updated fast enough
The UI would display the old position of actuators until the actuator had finally stopped moving in a new position.
The UI is now updated immediately to display the target position as soon as the actuator starts to move in that direction.
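As a rough illustration of that behaviour (a generic openHAB handler sketch with assumed class, helper and package names, not the binding's actual handler), the target position is pushed to the UI as soon as the command is accepted, rather than waiting for the movement to finish:

```java
import org.openhab.core.library.types.PercentType;
import org.openhab.core.thing.ChannelUID;
import org.openhab.core.thing.Thing;
import org.openhab.core.thing.ThingStatus;
import org.openhab.core.thing.binding.BaseThingHandler;
import org.openhab.core.types.Command;

// Hypothetical handler sketch.
public class ActuatorHandlerSketch extends BaseThingHandler {

    public ActuatorHandlerSketch(Thing thing) {
        super(thing);
    }

    @Override
    public void initialize() {
        updateStatus(ThingStatus.ONLINE); // minimal stub for the sketch
    }

    @Override
    public void handleCommand(ChannelUID channelUID, Command command) {
        if (command instanceof PercentType) {
            PercentType targetPosition = (PercentType) command;
            sendMoveCommandToHub(channelUID, targetPosition);
            // Optimistically show the target position right away; the real position
            // is confirmed later by the hub's status notifications.
            updateState(channelUID, targetPosition);
        }
    }

    private void sendMoveCommandToHub(ChannelUID channelUID, PercentType target) {
        // hypothetical: enqueue the position command towards the KLF200
    }
}
```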
9. User request: the remote reboot feature should be implemented
See this issue
A ThingAction for remote reboot has been implemented.
10. User request: the binding should indicate if windows are manually opened
Users on the community forum requested that when windows are manually opened, the binding should inform them. Furthermore, see this issue
The window position is now displayed as `UNDEF` when manually opened. And in the case that the position is invalid, yet the target position is valid, the window position is taken from the target position.
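A small sketch of that fallback logic under assumed types and method names (openHAB's `UnDefType.UNDEF` and `PercentType`; the validity checks are hypothetical):

```java
import org.openhab.core.library.types.PercentType;
import org.openhab.core.types.State;
import org.openhab.core.types.UnDefType;

// Hypothetical helper: choose what the position channel should display.
public final class PositionStateSketch {

    private PositionStateSketch() {
    }

    /**
     * @param currentPercent current position 0..100, or negative if the hub reports
     *                       it as unknown (e.g. after manual operation)
     * @param targetPercent  target position 0..100, or negative if unknown
     */
    public static State displayState(int currentPercent, int targetPercent) {
        if (currentPercent >= 0 && currentPercent <= 100) {
            return new PercentType(currentPercent);
        }
        if (targetPercent >= 0 && targetPercent <= 100) {
            // current position invalid but target valid => fall back to the target
            return new PercentType(targetPercent);
        }
        return UnDefType.UNDEF; // e.g. the window was manually opened
    }
}
```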
11. If the user changed Configuration Parameters via PaperUI they were not saved

Changes to Configuration Parameters are now saved.
12. The GW_SET_NODE_VELOCITY_REQ/CFM command set has been removed from the API
The GW_SET_NODE_VELOCITY_REQ/CFM command set has been removed from the KLF200 API specification.
The code in the binding that would have called this command set is not called, and the respective classes have been marked as `deprecated`. Nevertheless these classes have NOT been deleted from the binding 'just in case' the function might be added back again in a future API release.

13. User request: Somfy actuators wrongly display undefined position
See this issue
Wherever possible, if the current position is undefined, the actuator will display the target position instead. Furthermore, the issue and a possible user workaround are explained in the readme file.