What to do if a device does not reply? #1415

AlCalzone (Maintainer) · 2021-01-16T01:05:59Z

Again a case of specs vs. reality. The specs mandate that for every GET-type command the device MUST reply with the corresponding Report. However, we're often in a situation where we know the device has received the command, but it simply does not reply. There are several possible reasons, e.g. something is not supported, or the response gets lost on the way back.

Experience has shown that even many of the most basic queries (e.g. switch value, or sensor value for a supported CC) time out.

@hanskroner:

Do we need to expect a timeout for every get request?
What if queries depend on each other, e.g. during an interview?
Are there any other smart things we can do in this case?
What if the device's behavior contradicts itself? The issue "[bug] Aeotec Smart Switch 6 - Fails to complete Interview" (#1369) is a good case of this:

16:18:31.265 CNTRLR « [Node 006] node info received
                      supported CCs:
                      · Z-Wave Plus Info
                      · Binary Switch
                      · Multilevel Switch
                      · Color Switch
                      · Configuration
                      · All Switch
                      · Meter
                      · Clock
                      · Association
                      · Association Group Information
                      · Manufacturer Specific
                      · Version
                      · Firmware Update Meta Data
                      · Powerlevel
                      controlled CCs:
                      · Device Reset Locally
                      · Hail

It advertises support for the Z-Wave+ CC (even the manual says so), but does not reply when we query it:

16:18:49.544 CNTRLR » [Node 006] querying Z-Wave+ information...
16:18:49.551 SERIAL » 0x0109001306025e01259c07                                            (11 bytes)
16:18:49.551 DRIVER » [Node 006] [REQ] [SendData]
                      │ transmit options: 0x25
                      │ callback id:      156
                      └─[ZWavePlusCCGet]
16:18:49.555 SERIAL « [ACK]                                                                   (0x06)
16:18:49.561 SERIAL « 0x0104011301e8                                                       (6 bytes)
16:18:49.562 SERIAL » [ACK]                                                                   (0x06)
16:18:49.564 DRIVER « [RES] [SendData]
                        was sent: true
16:18:49.721 SERIAL « 0x010700139c00001067                                                 (9 bytes)
16:18:49.721 SERIAL » [ACK]                                                                   (0x06)
16:18:49.722 DRIVER « [REQ] [SendData]
                        callback id:     156
                        transmit status: OK

^ message acknowledged, but then... crickets...
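
For readers following the SERIAL lines in the log above, here is a minimal sketch of how such a frame can be unpacked. It assumes the classic Serial API framing (SOF byte 0x01, length, type, function ID, data, then an XOR checksum seeded with 0xff) and the SendData request layout (node ID, payload length, payload, transmit options, callback ID). The function names are made up for illustration and are not taken from zwave-js.

```typescript
// Parse a hex string such as "0x0109..." into an array of bytes.
function parseHex(frame: string): number[] {
  const hex = frame.replace(/^0x/, "");
  const bytes: number[] = [];
  for (let i = 0; i < hex.length; i += 2) {
    bytes.push(parseInt(hex.slice(i, i + 2), 16));
  }
  return bytes;
}

// XOR over every byte after SOF and before the checksum, seeded with 0xff.
function hasValidChecksum(bytes: number[]): boolean {
  let sum = 0xff;
  for (let i = 1; i < bytes.length - 1; i++) sum ^= bytes[i];
  return sum === bytes[bytes.length - 1];
}

interface SendDataRequest {
  nodeId: number;
  payload: number[]; // command class ID, command ID, parameters...
  transmitOptions: number;
  callbackId: number;
}

// Decode a host -> controller SendData request (type 0x00, function 0x13).
function decodeSendDataRequest(frame: string): SendDataRequest {
  const bytes = parseHex(frame);
  if (bytes[0] !== 0x01 || !hasValidChecksum(bytes)) {
    throw new Error("malformed frame");
  }
  if (bytes[2] !== 0x00 || bytes[3] !== 0x13) {
    throw new Error("not a SendData request");
  }
  const payloadLength = bytes[5];
  return {
    nodeId: bytes[4],
    payload: bytes.slice(6, 6 + payloadLength),
    transmitOptions: bytes[6 + payloadLength],
    callbackId: bytes[7 + payloadLength],
  };
}
```

Applied to the first SERIAL line of the log (0x0109001306025e01259c07), this yields node 6, payload 0x5e 0x01 (Z-Wave Plus Info, Get), transmit options 0x25 and callback ID 156, which matches the DRIVER lines that follow it.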

OZW just ignores all of these failures - zwave-js aborts the interview (mainly because I trusted the specs back when I wrote this behavior).

hanskroner · 2021-01-16T12:46:19Z

Do we need to expect a timeout for every get request?

Yes. You're dealing with RF communication and a stateless protocol - there are no guarantees. Keep in mind this is as true for SET requests as it is for GET requests: if a device tries to switch a power switch on via a SET command, but the switch detects over-current or over-temperature, it's free to refuse the request and not change state. Unless the request was encapsulated with the Supervision Command Class, the sender won't know the reason for the refusal. Some devices might choose to send a REPORT via the Lifeline with the unchanged state and/or issue a notification/alarm via the Notification Command Class to let the controller know about the issue, but it's impossible for the library (which doesn't have any context) to know that this was a result of the outgoing SET command.
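
As a rough illustration of "expect a timeout for every GET", the sketch below treats a missing report as an ordinary outcome instead of an exception. `getWithTimeout` and the `send` callback are hypothetical names for this example only; they are not part of the zwave-js API.

```typescript
// Hypothetical transport hook: sends a GET and resolves with the matching
// report, if one ever arrives.
type SendGet<T> = () => Promise<T>;

// Resolves with the report, or with `undefined` if none arrives in time.
async function getWithTimeout<T>(
  send: SendGet<T>,
  timeoutMs: number,
): Promise<T | undefined> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<undefined>((resolve) => {
    timer = setTimeout(() => resolve(undefined), timeoutMs);
  });
  try {
    // Whichever settles first wins: the report or the timeout.
    return await Promise.race([send(), timeout]);
  } finally {
    // Do not leave the timer running once the race is decided.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

With this shape, every caller has to handle the `undefined` case explicitly, which makes "the device stayed silent" a visible part of the type rather than a rare error path.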

What if queries depend on each other, e.g. during an interview?

Depending on the Command Class in question, this might mean that the entire interview must be halted. For example, if something like this were to happen for the S2 Command Class during the "secure inclusion" process, the entire "secure inclusion" process is aborted. The device must be manually excluded by the user, and the user must then perform a new attempt at inclusion. Other failures are softer - it should, for example, be possible to continue without knowing the on/off state of a bulb during the initial interview, defaulting to an "unknown" state.
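
The hard vs. soft distinction could be sketched roughly as follows. The `critical` flag, the step names, and the outcome shape are assumptions made for this example; which queries count as critical is a policy decision that this thread does not settle.

```typescript
interface InterviewStep {
  name: string;
  // Hypothetical flag: true if later steps cannot proceed without this answer.
  critical: boolean;
  // Resolves to `undefined` when the device does not reply.
  query: () => Promise<unknown>;
}

type InterviewOutcome =
  | { status: "complete"; values: Record<string, unknown>; unanswered: string[] }
  | { status: "aborted"; failedStep: string };

async function runInterview(steps: InterviewStep[]): Promise<InterviewOutcome> {
  const values: Record<string, unknown> = {};
  const unanswered: string[] = [];
  for (const step of steps) {
    const value = await step.query();
    if (value !== undefined) {
      values[step.name] = value;
    } else if (step.critical) {
      // Hard failure: stop here and tell the caller which step failed.
      return { status: "aborted", failedStep: step.name };
    } else {
      // Soft failure: carry on with an explicit "unknown" placeholder.
      values[step.name] = "unknown";
      unanswered.push(step.name);
    }
  }
  return { status: "complete", values, unanswered };
}
```

Returning the list of unanswered steps alongside the values keeps the soft failures visible to the client instead of hiding them behind defaults.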

Are there any other smart things we can do in this case?

In these situations, trying to be smart usually causes more harm than good. It's better to be honest - something failed and the library doesn't have the context to deal with it. Let the client know what went wrong; it's likely in a better position to deal with the exception. As the library matures, and with input from clients using it, some cases that can be handled might come up - but this is a safer starting point.

What if the device's behavior contradicts itself? The issue "[bug] Aeotec Smart Switch 6 - Fails to complete Interview" (#1369) is a good case of this:

For the sake of this discussion, let's assume that zwave-js is at this point Z-Wave Certified and has the "it does the right thing"™ stamp of approval. You're facing a device that is in breach of certification requirements. Even if the manufacturer gets notified of this and releases an OTA firmware update, you still have to deal with the device in front of you. You have options. In this case, the library could decide that after 3 failed attempts at getting the information, it'll give up and set its values to a sensible, generic default (if there is one). Another option would be "hiding" that Command Class from clients. A third possibility is having device-specific workarounds, where these values are known in advance and provided out-of-band to the library.

Whatever the workaround, it's important to surface the problem to the client. It, or its users, then have the opportunity to let the device manufacturer and the Z-Wave Alliance know about the non-compliance so it can get corrected.
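
A minimal sketch of how the three options above could be combined while still surfacing where a value came from. `queryWithFallback`, the `override` and `fallback` options, and the precedence between them are assumptions for illustration, not existing zwave-js behavior.

```typescript
type Resolution<T> =
  | { source: "device"; value: T } // the device actually answered
  | { source: "override"; value: T } // device-specific value provided out-of-band
  | { source: "default"; value: T } // generic default after repeated silence
  | { source: "hidden" }; // nothing usable: hide the Command Class

async function queryWithFallback<T>(
  // Resolves to `undefined` when the device does not reply.
  query: () => Promise<T | undefined>,
  opts: { attempts?: number; override?: T; fallback?: T },
): Promise<Resolution<T>> {
  // Assumption: a known workaround wins over asking a device known to misbehave.
  if (opts.override !== undefined) {
    return { source: "override", value: opts.override };
  }
  const attempts = opts.attempts ?? 3;
  for (let i = 0; i < attempts; i++) {
    const value = await query();
    if (value !== undefined) return { source: "device", value };
  }
  if (opts.fallback !== undefined) {
    return { source: "default", value: opts.fallback };
  }
  return { source: "hidden" };
}
```

Because the result carries its `source`, a client can tell a real report from a guess and can flag the non-compliant device to its users.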

zwave-js aborts the interview (mainly because I trusted the specs back when I wrote this behavior).

"Trust, but verify." Even if the certification process were somehow infallible and never let a single non-compliance slip through, a library would still need to deal with older devices, manufactured at a time when a certain requirement might not have existed.
