Skip to content
This repository has been archived by the owner on Oct 4, 2021. It is now read-only.

Protocol / Queue handling issue #151

Closed
susisstrolch opened this issue Jul 16, 2019 · 74 comments
Closed

Protocol / Queue handling issue #151

susisstrolch opened this issue Jul 16, 2019 · 74 comments
Labels
bug Something isn't working

Comments

@susisstrolch
Copy link

Today my Logic Analyzer arrived, so I found the time to look what's going on on EMS-Bus.
Environment:

  • RC35 (DeviceID:0x10 ProductID:86 Version:01.11)
  • MC10 (DeviceID:0x08 ProductID:72 Version:02.07)
  • EMS-ESP version 1.8.1b17
    • external power supply
    • tx_mode = 0

Issue 1 - EMS-ESP continues sending, even not being polled again

1. MC10:	08 0B 16 00 FF 5A 64 00 06 FA 0A 01 0F 64 64 02 
	        08 F8 0F 0F 0F 0F 1E 05 04 09 09 00 6D 00
2. ESP:	0B 00
3. MC10:	   0B 00 89 00 85
4. ESP:		  0B 88 14 00 20 E4 00
<200ms pause>
  1. MC10 sends a MC10Parameter message to us.
  2. We acknowledge with "0B " -> ok, fini
  3. MC10 echos our acknowledge???
    The from MC10 looks fine, it's approx. 1.035ms
  4. We continue to poll MC10, which is simply ignored by the busmaster by not sending an echo.
    Selection_023

Issue 2 - pretty similiar to Issue 1

1. RC35:	10 0B 3E 00 00 00 00 7D 00 00 00 00 00 00 00 00
	        00 11 05 00 E9 00
2. ESP:	0B 00
3. RC35:	   0B 00
4. ESP:		  0B 90 3D 00 20 80 00
5. RC35:		     0B 90 7A
<200ms pause>
  1. RC35 sends a HK1MonitorMessage to us
  2. We respond with "0B " -> ok, fini
  3. RC35 echos our acknowledge
  4. We start sending w/o request
  5. Busmaster stops echo after 2nd byte and stays silent for 200ms.
    Selection_020
@susisstrolch susisstrolch added the bug Something isn't working label Jul 16, 2019
@proddy
Copy link
Collaborator

proddy commented Jul 16, 2019

interesting, if you use tx_mode=2 (your implementation) does it start to clear up a little?

@susisstrolch
Copy link
Author

That will be the next step.. first, I have to create triggers for the LA - especially on Rx / Tx BRK.
digitalWrite doesn't work in emsuart - that leads to reboot loops from watchdog.

@proddy
Copy link
Collaborator

proddy commented Jul 17, 2019

getting digitalWrite to work in an ISR is tricky. although it is atomic its slow. I've managed to do this before in earlier versions using the LED to show BRKs using

WRITE_PERI_REG(PERIPHS_GPIO_BASEADDR + (state ? 4 : 8), (1 << EMSESP_Status.led_gpio)); // 4 is on, 8 is off

@susisstrolch
Copy link
Author

susisstrolch commented Jul 17, 2019

Jep.. it was more a "which one to use and how/where to mark the actions".
Anyway...
Added two markers to visualize whats going on. One is related to Rx, the other to Tx.
Short Tx mark is set when we fill the Tx FIFO. Long Tx mark is set when we generate a BRK, after waiting for Tx-FIFO empty.
Rx mark (long) is set as soon as we detect a BRK on Rx.

Now, Iet's have a look at the protocol problem (still in tx_mode 0):

  1. MC10 sends a MC10Parameter message to us.
    Selection_025

Selection_026

  1. At the blue marker you see that ESP acknowledges with a 0B <BRK> - which is imho wrong, because the command isn't a poll request. So we should answer with ACK or NACK.
  2. EMS acknowledges with 0B <BRK> - so far ok - and, because the <BRK terminates a communication - EMS assumes that the bus is free.
    Selection_027
  3. ESP however continues to send - which is politely ignored by EMS.
    Selection_028

So, I suggest the following extensions to protocol handler:
If we are targeted by the message AND if bit7 isn't set we have a direct telegram, which should be ACKed with the 01 - and after that by sending a 0B <BRK.
In addition, ESP should wait for the next poll request before sending the next telegram.

The tx_mode 2 is my next target...

@susisstrolch
Copy link
Author

Uups - seems my JPEG rendering is a bit too conservative...

@susisstrolch
Copy link
Author

susisstrolch commented Jul 18, 2019

Ok, here's the next one...
Switching to alternate Tx/Rx is much too late after reboot. The EMS-Bus keeps pulled low to long.
Green Marker: echo from reboot
Blue Marker: Tx moved over to alternate Tx
As you can see the EMS is pulled down for >4s after reboot...
Selection_030

Zoomed in before blue marker:
Selection_031

Imho switching over to D8/D7 should occur before WIFI and other stuff...

@proddy
Copy link
Collaborator

proddy commented Jul 18, 2019

the uart being initialized and the interrupts enabled at the last minute was by design as to not interfere with anything else that was happening during bootup (serial transmissions, wifi, mqtt, websockets etc). It seemed the sensible thing to do at the time as I didn't want to fill up the Rx buffers. The logic is in line 1569 of ems-esp.cpp. It should be safe to move it forward as you suggested although I'm not sure what advantage it will have?

@proddy
Copy link
Collaborator

proddy commented Jul 18, 2019

btw how you finding the dslogic? I have one too from https://www.dreamsourcelab.com/product/dslogic-plus but found it quite cumbersome to use. I often switch over to a 4 EUR USB logic analyzer and using Saleae's free software

@susisstrolch
Copy link
Author

Advantage is not to block the EMS bus - there's nothing on the line during this time...

@proddy
Copy link
Collaborator

proddy commented Jul 18, 2019

I see. I would have expected it wouldn't pull down the line as the D7/D8 pins are not activated though ?

@susisstrolch
Copy link
Author

Next one... found in tx_mode 2...
Between receiving a "ITS_TO_ME" telegram and processing we could be interrupted / delayed by WIFI/MQTT transmission.
Currently, the tx doesn't care about the Rx status. This will lead to overlapping / misleading operations on the EMS bus.
Green marker: EMS is sending a poll request
Dark green marker: Poll request is processed
Overview:
EMS sends a poll request which is handled ~130ms after the request. Meanwhile, the next device is sending, which gets disturbed by us.

Screenshot from 2019-07-18 23-28-05
Request details:
Screenshot from 2019-07-18 23-29-42

Response disturbing bus:
Screenshot from 2019-07-18 23-31-41

IMHO, Tx should honour the Rx-Status - simply fail if Rx != IDLE...

@susisstrolch
Copy link
Author

susisstrolch commented Jul 18, 2019

btw how you finding the dslogic

It's a really great tool - needs a bit time to find out how to work, but after that you find anything you like!!

IMG_5959

IMG_5960

@proddy
Copy link
Collaborator

proddy commented Jul 18, 2019

jealous !

@susisstrolch
Copy link
Author

I see. I would have expected it wouldn't pull down the line as the D7/D8 pins are not activated though ?

I also don't understand why they are pulled down. My first thought was that it happens in the Arduino core - at least the traces looks alike...

@susisstrolch
Copy link
Author

susisstrolch commented Jul 18, 2019

jealous !

Playground :) accepted by Susi :))
(won't talk about the 3D printer in background )

IMG_5961

@susisstrolch
Copy link
Author

Somehow we must handle the duedate of the received package.
Maybe by a timestamp, If delta > 100ms we won't process the Tx anymore...

@proddy
Copy link
Collaborator

proddy commented Jul 18, 2019

jealous !

Playground :) accepted by Susi :))
(won't talk about the 3D printer in background )

IMG_5961

3D printer's on my wish list too. Velleman make some nice kits.

@susisstrolch
Copy link
Author

Yep... but it needs some tweeks until it's really usable...
F.e. the sliders are suboptimal...
Simply ping me if you go to K8800...

@susisstrolch
Copy link
Author

susisstrolch commented Jul 18, 2019

Totaly off-topic:
iPhone8p holder for the bicycle.. printed in PLA+Carbon...
IMG_5962
[fixed horrible typo]

@susisstrolch susisstrolch mentioned this issue Jul 21, 2019
@bbqkees
Copy link

bbqkees commented Jul 25, 2019

I see. I would have expected it wouldn't pull down the line as the D7/D8 pins are not activated though ?

I also don't understand why they are pulled down. My first thought was that it happens in the Arduino core - at least the traces looks alike...

Is the EMS bus itself actually blocked?
If I recall correctly when the TX input is floating or 0 the TX transistor is active (so the current is drawn through the resistors).
Could be the I/O of the ESP is pulled low at boot so the analyzer sees a logic 0 at those pins but the EMS bus itself might be just fine.

@susisstrolch
Copy link
Author

susisstrolch commented Jul 25, 2019 via email

@proddy
Copy link
Collaborator

proddy commented Jul 25, 2019

GPIO15 (D8) which we use for Tx is automatically pulled high by the ESP8266 during boot up, which causes the block on the EMS bus. In the code its forced down so there's only a very short window now (3 bytes I think).

@susisstrolch
Copy link
Author

susisstrolch commented Jul 25, 2019

GPIO15 (D8) which we use for Tx is automatically pulled high by the ESP8266 during boot up

No... GPIO15 has a pull down (10k) on (nearly) all boards (Wemos, NodeMCU).
The adapter board itself is inverting the Tx signal - pulling current from EMS-Bus when Tx (GPIO15) is low.
So, during boot the EMS-Bus was pulled low for (my impression) much to long.

@bbqkees
Copy link

bbqkees commented Jul 25, 2019

Yes indeed.
But you only block slave TX for a short moment, the EMS master can still send uninterrupted.

@susisstrolch
Copy link
Author

@proddy we have a protocol handling issue when sending to EMS bus. It happens with tx_mode 0 and also with tx_mode 2.
Look at those traces:

(00:09:16.948) ems_parseTelegram: 00 01: 8B
(00:09:16.948) emsuart_tx_buffer: 00 01: 0B 90 3E 00 20 8C
(00:09:16.964) ems_parseTelegram: 00 02: 0B 90 3E 00 20 8C
(00:09:16.964) 0B 90 3E 00 20 8C
(00:09:16.964) echo:telegram: 0B 90 3E 00 20 (CRC=8C) #data=1
(00:09:17.078) ems_parseTelegram: 00 02: 10 0B 3E 00 00 00 00 7D 00 00 00 00 00 00 00 00 00 00 05 00 AD
(00:09:17.078) 10 0B 3E 00 00 00 00 7D 00 00 00 00 00 00 00 00 00 00 05 00 AD
(00:09:17.080) emsuart_tx_buffer: 00 01: 0B
(00:09:17.085) ems_parseTelegram: 00 01: 0B
(00:09:17.085) emsuart_tx_buffer: 00 01: 0B 90 3D 00 20 80
[557087] ** error sending buffer: BRK
(00:09:17.297) ems_parseTelegram: 00 01: 89 00 86 90 

(00:09:16.948) ems_parseTelegram: 00 01: 8B Bus master sends poll request
(00:09:16.948) emsuart_tx_buffer: 00 01: 0B 90 3E 00 20 8C We send a query for 0x10
(00:09:16.964) ems_parseTelegram: 00 02: 0B 90 3E 00 20 8C We get the echo
(00:09:17.078) ems_parseTelegram: 00 02: 10 0B .... We process the data
(00:09:17.080) emsuart_tx_buffer: 00 01: 0B We tell Busmaster we're ready
(00:09:17.085) ems_parseTelegram: 00 01: 0B Busmaster echo
(00:09:17.085) emsuart_tx_buffer: 00 01: 0B 90 3D 00 20 80 We send again, w/o invitation/poll
[557087] ** error sending buffer: BRK Bus master signals "shut up"

The issue is that we interpret the echo as a poll request because of masking out the "poll bit" from Target ID.

@proddy
Copy link
Collaborator

proddy commented Jul 26, 2019

funny, I noticed that too this morning when I was trying to debug why the 0x19 MonitorSlow messages were getting blocked (buffer of 32 was too small). I saw many telegrams with a length of 1 and a value of 0x19 which was a recursive echo->poll

@susisstrolch
Copy link
Author

susisstrolch commented Jul 26, 2019 via email

@proddy
Copy link
Collaborator

proddy commented Jul 26, 2019

great, thanks Juergen. Amazing how you can code in these extreme heats. Shall I take a look at fixing the echo/poll issue or have you already squished that bug?

@proddy
Copy link
Collaborator

proddy commented Jul 29, 2019

wonderful! I'll test tonight and reach out the the HT3 guys to see if they can test

@proddy
Copy link
Collaborator

proddy commented Jul 30, 2019

Been running the txmode2 branch for 12hrs now with no crashes and hardly any corrupt telegrams. Looking good!

@susisstrolch
Copy link
Author

susisstrolch commented Jul 30, 2019 via email

@proddy
Copy link
Collaborator

proddy commented Jul 30, 2019

with 1.8.1 I would get reboots every 10-15 mins with tx_mode 2, but none with tx_mode 0 so I don't expect its telnet/mqtt related. you can switch all that off (set mqtt_ip and set publish_time 0) and it still happens. Anyway seems better in 1.9

@susisstrolch
Copy link
Author

We should really try the watchdog feed before each timeconsuming function call.
Because it‘s the software watchdog which gets triggered we don‘t get a stacktrace.
So we must iterate by try and error.

@proddy
Copy link
Collaborator

proddy commented Jul 30, 2019

yes, we need to keep the code in ISRs non-blocking and highly optimized. emsuart_tx_buffer() has grown in complexity quite significantly since 1.7 with many loops and race conditions.

@proddy
Copy link
Collaborator

proddy commented Jul 30, 2019

Or disable wdt before the Tx is called from ems.cpp with ESP.wdtDisable()and enable it after the acknowledgement poll has been received with ESP.wdtEnable(0)

@susisstrolch
Copy link
Author

I‘ll upload a log in jabber mode - there you can see that Tx/Rx aren‘t the bottlenecks.

@susisstrolch
Copy link
Author

And here an interesting arcticle about soft-wtd:

https://www.sigmdel.ca/michel/program/esp8266/arduino/watchdogs_en.html#ESP8266_WDT_TIMEOUT

@susisstrolch
Copy link
Author

pushed a new release of txmode2 branch which injects wdtfeed() in MyESP.loop.

@proddy
Copy link
Collaborator

proddy commented Jul 31, 2019

In all versions up to 1.8.0 I had this line in MyESP.loop():

yield(); // ...and breath

which somehow got lost after 1.8.1. I think it does the same as the wdtfeed() no?

@susisstrolch
Copy link
Author

susisstrolch commented Jul 31, 2019 via email

@bbqkees
Copy link

bbqkees commented Aug 1, 2019

So I've been running the latest build from yesterday evening in the txmode2 branch.
At first I got lots of reboots (every 2 minutes or so) but now its been running for 10h without hickups (jack powered, not bus powered).

@proddy
Copy link
Collaborator

proddy commented Aug 1, 2019

It‘s still in. But yield does more than calming the WTD - it also cares about WIFI stuff. wtdFeed only restarts/resets the HW and SW watchdog and doesn’t have any further overhead. Sent by mobile device

Am 31.07.2019 um 13:23 schrieb Paul @.***>: In all versions up to 1.8.0 I had this line in MyESP.loop(): yield(); // ...and breath which somehow got lost after 1.8.1. I think it does the same as the wdtfeed() no? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

There is also a delay(1) in the ems-esp.cpp loop I used to calm down the wifi after seeing how ESPhome and other projects did this because of ardunio 2.5.0. It might no longer be necessary?

@susisstrolch
Copy link
Author

delay() is also doing the WIFI and SWDT handling, so it shouldn't hurt at all.
But I can try to remove it...

@susisstrolch
Copy link
Author

Found that one: letscontrolit/ESPEasy#2477

@proddy
Copy link
Collaborator

proddy commented Aug 2, 2019

Found that one: letscontrolit/ESPEasy#2477

nice, didn't know you could do that. Let's add that too as it'll help us find the root cause for the WDT resets.

@bbqkees
Copy link

bbqkees commented Aug 2, 2019

Possibly unrelated but ran txmode2 firmware for 20+ hours (with an open but idle Telnet session) without problems.
However, after doing 'log v' it rebooted after a few minutes.

@proddy
Copy link
Collaborator

proddy commented Aug 2, 2019

that's good news for @susisstrolch's new tx code. the logv does a lot of string manipulation (as I avoid using the String library and sprintf() ) so most probably its a memory error I need to look into.

@proddy
Copy link
Collaborator

proddy commented Aug 2, 2019

@bbqkees @susisstrolch also unrelated - I uploaded my latest web version under the newweb branch if you want to play with it. Still need to refine a few things but I think its stable. Look carefully at the CHANGELOG on how to build it because the build scripts have also changed. I'm off now and will pick things up when I'm back in a week.

@bbqkees
Copy link

bbqkees commented Aug 5, 2019

Ok will try.
The txmode2 build from last week is still running here uninterrupted at 3 days (bus powered).
Telnet session still active.

@susisstrolch
Copy link
Author

susisstrolch commented Aug 5, 2019 via email

@bbqkees
Copy link

bbqkees commented Aug 8, 2019

@proddy Took me some time because of lots of build errors but I was able to build the firmware in the end.

The newweb branch needs two additional libraries: ESPAsyncUDP and ESPAsyncWebServer. (PIO Home->Find libraries->type name-> install)
Maybe its Windows or just my particular setup but the gulp build did not work with the 'debug' parameter.
So did node gulp command in the correct folder and after that went through fine and the compacted web code files were added to the 'webh' folder, the build in pio went Ok.

The new web interface looks great, will test it over the weekend.

@proddy
Copy link
Collaborator

proddy commented Aug 8, 2019 via email

@bbqkees
Copy link

bbqkees commented Aug 8, 2019

Yes, the new one with the 'pre:scripts/buildweb.py' etc.

I had the same issue when you added f.i. OneWire initially.

@proddy
Copy link
Collaborator

proddy commented Aug 8, 2019

That's strange. In theory if you remove the whole .pio folder/directory it should go fetch all the libs. I'll try on a fresh win install this weekend

@proddy
Copy link
Collaborator

proddy commented Aug 11, 2019

@bbqkees tested on a fresh win install with platformio 4 and you shouldn't need to download anything manually. As soon as the platformio.ini file is there it will automatically fetch the latest libraries. We can look at your config next week.

@susisstrolch I merged the txmode2 branch into the newweb branch and it's been running fine for the last 8hrs. Eventually I'll move all this into dev which is 1.9.0 so let me know if you're planning any further changes. We still need to test with Junkers.

@proddy
Copy link
Collaborator

proddy commented Aug 27, 2019

closing for now. txmode2 merged into dev.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants