Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dev 1153 ??? #90

Open
osos opened this issue Jan 22, 2018 · 32 comments
Open

dev 1153 ??? #90

osos opened this issue Jan 22, 2018 · 32 comments

Comments

@osos
Copy link

osos commented Jan 22, 2018

What does:

dev 1153

ets Jan 8 2013,rst cause:4, boot mode:(1,6)

wdt reset

mean?

get this rather random. After a "dev 1153" message the device resets after a short while.

I cant seem to find the cause.

@3bb
Copy link

3bb commented Mar 20, 2018

I have the same problem after some socket requests.

@FayeY
Copy link
Collaborator

FayeY commented Apr 16, 2018

Could you provide any test project or test logs for reproducing the problem and debugging?

@alxAgu
Copy link

alxAgu commented Jul 6, 2018

Getting the same (error?) message and then a wdt reset.

@eriksl
Copy link

eriksl commented Sep 7, 2018

Watchdog reset. Control not given back to the SDK code in time. Don't have the CPU run in your code for more than 100-1000 milliseconds. 100 milliseconds for fluent WLAN operation, 1 s if you really can't avoid it. Use tasks in task queues.

@alxAgu
Copy link

alxAgu commented Jan 27, 2019

@eriksl, wdt resets don't usually show that particular debug message (dev 1153). A reset by the watchdog doesn't seem to be the cause, but rather an unintended consequence? The dev 1153 shows up right before the system hangs, making the watchdog kick in after a few secs.

After some tests I noticed I can replicate this if:

  1. There is a gpio interrupt triggering at 50hz or more.
  2. The interrupt routine lasts at least 3ms.
  3. The module is connected to an AP (even if no sockets connected)

The bug is rather random at this point, some time takes over 15 mins to show up, sometimes just a few secs after boot up. If you either disable interrupts OR disconnect from the AP, the error stops.

Any ideas?

A.

@eriksl
Copy link

eriksl commented Jan 27, 2019

That could very well be possible. I have had issues with GPIO interrupts for ages and at some time I gave up. It could be either software (bug in the SDK code that gets called before our ISR) or a hardware issue. The GPIO interrupt handles bouncing and noise very badly. It looks like the GPIO interupt isn't masked until the user ISR is called which can lead to a race where the ISR is called multiple times for one event. And that might trash the stack or something else bad.

I worked around this by adding a hardware debouncer and stop using GPIO interrupts, I now poll the pins using a 100 Hz timer which is quick enough for me.

@FayeY
Copy link
Collaborator

FayeY commented Jan 29, 2019

Hi @alxAgu , could you share the test code so we can reproduce the issue and help debug it? Thanks.

@alxAgu
Copy link

alxAgu commented Jan 29, 2019

Hi @alxAgu , could you share the test code so we can reproduce the issue and help debug it? Thanks.

Thanks @FayeY, I will try to get something ready soon (it's a private code so I need to prepare a stripped-down version). In the meantime, is there any explanation for the meaning of this message? It looks like it was put there for a reason, It would be very helpful to know what triggers it.

@alxAgu
Copy link

alxAgu commented Jan 30, 2019

@FayeY I'm attaching the test code. It takes 5 to 20 min for the error to show up. Thanks.

DEV_1153-MCVE.zip

Update: Forgot to mention that the interrupt pin should be attached to a clock source of at least 100hz.

@alxAgu
Copy link

alxAgu commented Feb 7, 2019

Is anyone from espressif even checking this? @FayeY? :(

@cnlohr
Copy link

cnlohr commented Feb 7, 2019

It's Chinese New Year, @alxAgu - most companies will come back online in the next few days, but almost all companies are off during this period.

@alxAgu
Copy link

alxAgu commented Feb 20, 2019

Just following up on this. I don't want this bug to remain ignored.

I'm not completely sure at this stage (more testing needed) but I noticed some routers increase the chances of the bug showing up, as it does to have an active TCP connection.

A while back I thought that the cause of this may be an interrupt being triggered while the processor was busy doing some obscure networking magic, so I decided to disable interrupts between espconn_send and sent_callbacks but no luck. In fact, the MCVE example I uploaded last month doesn't even create a TCP connection. So, we are back to square one.

I'm convinced this bug is related to another one I reported directly on espressif's forums since its behavior is VERY similar. Unfortunately, it also remains unanswered: https://bbs.espressif.com/viewtopic.php?f=7&t=25689

Can someone from espressif please let us know if this is even being looked at? @FayeY? Anyone?

Thanks.

@FayeY
Copy link
Collaborator

FayeY commented Feb 21, 2019

Hi,

Yes, we have came back to work after the Chinese new year.
This issue has been forwarded to our internal engineers, we will feedback to you as soon as there is any update.

Thanks.

@alxAgu
Copy link

alxAgu commented Feb 21, 2019 via email

@alxAgu
Copy link

alxAgu commented Mar 7, 2019

Hi, just checking on this and humbly requesting to keep it somewhere in your calendar so it won't get pushed back to the end of the queue. Thanks and looking forward to the updates.

@FayeY
Copy link
Collaborator

FayeY commented Mar 8, 2019

Hi @alxAgu ,

Yes, we do reproduced the reset issue with the test code you provided, but not the "dev 1153", but the "dev 1159" reset instead. We are further analyzing it now, will feedback to you as soon as there is any update.

Thanks.

@alxAgu
Copy link

alxAgu commented Mar 10, 2019

Hi @alxAgu ,

Yes, we do reproduced the reset issue with the test code you provided, but not the "dev 1153", but the "dev 1159" reset instead. We are further analyzing it now, will feedback to you as soon as there is any update.

Thanks.

Thanks @FayeY, It is reassuring to hear you guys are working on it. I've never seen the "dev 1159" reset, the only one that some times shows up and it is probably related is the "mac 985" I reported in your forums a while ago.

A.

@FayeY
Copy link
Collaborator

FayeY commented Mar 15, 2019

Hi @alxAgu ,

Would you like to have a try with our latest nonOS SDK? It may solve your problem.
But on the other hand, it is always suggested to use our latest ESP8266 RTOS SDK instead, the ESP8266 RTOS SDK will be the one we mainly focus on.

Thanks.

@alxAgu
Copy link

alxAgu commented Mar 15, 2019

Hi @FayeY, thanks heaps for the update, I will try to have a look this week and report back!

I understand the RTOS SDK is ideal for new developments but I think Espressif needs to realise that there are companies with products out there that depend on the NONOS SDK and migrating a production code to a complete new SDK style isn't something to take lightly. Espressif should really re-evaluate its decision and provide some sort of LTS policy. Leaving companies in the limbo after having spent hundreds of thousands of dollars in your modules doesn't fill us with a lot of confidence. How can Espressif assure any of us that after investing the time and significant resources required to carry out a migration to the RTOS SDK you are not going to decide to do the same again and stop supporting it?

Anyway, sorry for the rant, it wasn't directed at you personally. I appreciate your help and your team's efforts @FayeY.

A.

@eriksl
Copy link

eriksl commented Mar 16, 2019

It's not only about new vs. existing projects. Some of us simply don't need and don't want the overhead of the RTOS SDK and/or want more control than RTOS SDK offers. The ESP8266 is a microcontroller after all, it's not an Arduino development board or a Raspberry with an operating system.

@alxAgu
Copy link

alxAgu commented Mar 20, 2019

Hi @FayeY, I ran a few tests using the latest version from the release/v3.0.0 branch and I'm afraid the problem persists. This time, rather than dev 1153, I'm getting a dev 1163 message.

The bug still appears randomly, taking from 4min to 4h to show up. In a 12h period, it rebooted 7 times. I'm using the same test project I posted here. This is the debug output (after just 4 minutes):

2019-03-21 05:09:36	Doing stuff x118
2019-03-21 05:09:38	Doing stuff x119
2019-03-21 05:09:39	dev 1163
2019-03-21 05:09:47	
2019-03-21 05:09:47	 ets Jan  8 2013,rst cause:4, boot mode:(3,6)
2019-03-21 05:09:47	
2019-03-21 05:09:47	wdt reset
2019-03-21 05:09:47	load 0x40100000, len 2592, room 16 
2019-03-21 05:09:47	tail 0
2019-03-21 05:09:47	chksum 0xf3
2019-03-21 05:09:47	load 0x3ffe8000, len 764, room 8 
2019-03-21 05:09:47	tail 4
2019-03-21 05:09:47	chksum 0x92
2019-03-21 05:09:47	load 0x3ffe82fc, len 676, room 4 
2019-03-21 05:09:47	tail 0
2019-03-21 05:09:47	chksum 0x22
2019-03-21 05:09:47	csum 0x22
2019-03-21 05:09:47	
2019-03-21 05:09:47	2nd boot version : 1.7(5d6f877)
2019-03-21 05:09:47	SPI Speed : 80MHz
2019-03-21 05:09:47	SPI Mode : QIO
2019-03-21 05:09:47	SPI Flash Size & Map: 32Mbit(1024KB+1024KB)
2019-03-21 05:09:47	jump to run user2 @ 101000
2019-03-21 05:09:47	
2019-03-21 05:09:47	��������������������������$�ǒ��n�;�������ܒ����������;��;�������8�������r���n������������������������l�<����n����l`��>�r�8���l� ����
2019-03-21 05:09:47	
2019-03-21 05:09:47	**** System started *****
2019-03-21 05:09:47	mode : sta(aa:bb:cc:dd:ee:ff)
2019-03-21 05:09:47	add if0
2019-03-21 05:09:47	WPA2 ENTERPRISE VERSION: [v2.0] disable
2019-03-21 05:09:47	scandone
2019-03-21 05:09:49	Doing stuff x0
2019-03-21 05:09:50	scandone
2019-03-21 05:09:51	state: 0 -> 2 (b0)
2019-03-21 05:09:51	state: 2 -> 3 (0)
2019-03-21 05:09:51	state: 3 -> 5 (10)
2019-03-21 05:09:51	add 0
2019-03-21 05:09:51	aid 8
2019-03-21 05:09:51	cnt 
2019-03-21 05:09:51	
2019-03-21 05:09:51	connected with xxxxxxxxxx, channel 10
2019-03-21 05:09:51	dhcp client start...
2019-03-21 05:09:51	connect	to	ssid	xxxxxxxxxx,	channel	10
2019-03-21 05:09:51	Doing stuff x1
2019-03-21 05:09:53	Doing stuff x2
2019-03-21 05:09:54	ip:192.168.2.133,mask:255.255.255.0,gw:192.168.2.1
2019-03-21 05:09:54	Connected to AP!
2019-03-21 05:09:54	Arming i2c bus...
2019-03-21 05:09:54	I2C Initialized, Free memory: 42808
2019-03-21 05:09:55	Doing stuff x3
2019-03-21 05:09:57	Doing stuff x4

@Frogmore42
Copy link

I have been working on adding 9-bit serial to Tasmota and have been seeing random infrequent watchdog timer resets. I am using a modified version of software serial. Since it is running at 19,200 baud, and it stays in the ISR for a whole transaction (start, 9 bits, stop), the time in the ISR is about 521 us.

I turned on the internal SDK logging and got this:

[2019-03-25 10:36:16.659] AOSInput: FreeRam 18288, aos_is_serialRx @:0x224CA25D took:0x00000C80, available: 0
[2019-03-25 10:36:16.659] dev 1153
[2019-03-25 10:36:23.769]
[2019-03-25 10:36:23.770] ets Jan 8 2013,rst cause:4, boot mode:(3,7)
[2019-03-25 10:36:23.778]
[2019-03-25 10:36:23.778] wdt reset
[2019-03-25 10:36:23.795] load 0x4010f000, len 1384, room 16
[2019-03-25 10:36:23.804] tail 8
[2019-03-25 10:36:23.804] chksum 0x2d
[2019-03-25 10:36:23.804] csum 0x2d
[2019-03-25 10:36:23.804] v00000000
[2019-03-25 10:36:23.809] ~ld
[2019-03-25 10:36:23.854]
[2019-03-25 10:36:23.858] SDK:2.2.1(cfd48f3)/Core:unspecified=0/lwIP:STABLE-2_1_2_RELEASE/glue:1.1-2-ga501b57/BearSSL:6778687
[2019-03-25 10:36:23.869]

Just before the "dev 1153" message from the SDK, you can see heap is fine at 18K and it has just enabled the interrupt. There is a continual stream (with pauses) of 19.2K serial traffic coming in. So, most definitely fast interrupts play poorly with the WiFi stack (or so it would seem).

So, what exactly does dev 1153 mean?

@Frogmore42
Copy link

Doing more research it appears there was a very similar issue that was been reported before (with no known success then either, unfortunately).
https://bbs.espressif.com/viewtopic.php?t=2615

Hopefully someone who has access to the source will be able to say what is going wrong and even better provide an updated SDK that fixes it.

@Frogmore42
Copy link

and more here: https://bbs.espressif.com/viewtopic.php?t=1205

This one indicates staying in an ISR more than 50us can cause this problem. Since my scenario has an ISR that is at least 10 times that length, I guess it is not surprising that it fails sometimes (takes many hours to fail, usually).

So, probably the bottom line is don't use an ISR to do anything major (which is generally good advice). But that means that software serial at 19.2K is definitely out for the esp8266. Since the esp8266 does not support 9-bit hardware serial, I will have to switch to my Particle Photon which does have 9-bit serial capable hardware with a decent buffer and can easily keep up with 19.2K.

@eriksl
Copy link

eriksl commented Mar 26, 2019

One should never place more than a handful of lines of code in an ISR handler, on any platform. It's not what they're for.

For sending data you shouldn't need an interrupt. For receiving, I guess you're using the pin change interrupt. Change that interrupt into someting that sets a flag and handle the bitstream outside of the ISR. The ESP8266 is probably quick enough to do that, in time, for 19.2 Kbps.

My preferred solution would be to add a small piece of hardware that splits the 9 bits into 8 + 1, doubling the data rate (which the ESP8266 won't have any issue with) and then you can use one of the two builtin UARTS, which are quite versatile.

Also probably the I2S interface could be abused for this purpose, which also relieves you from handling the timing.

@Frogmore42
Copy link

@eriksl in general I agree with your statement about ISR length. In practice, many people choose to ignore it for both good and not so good reasons.

I did not write the software serial code in question. When I saw it, I was surprised at what it was doing. I looked for other implementations, but found none that claimed to work at 19.2.

While I have other solutions to my problem, I wanted to be sure my code was not the cause of the issue I was seeing. After a few days, I became convinced it was an incompatibly. I finally got a message from the core SDK that correlated with the WDT event. That message led me here. Since I had seen many others having very hard to identify reliability issues with interrupts and WDT events, I placed pointers here to the underlying issue.

Now, the issue could be resolved by saying, "don't do that." That is a possibly reasonable resolution.

Now that I know the cause, I have some options. The code has been running for almost 24 hours, since the last watchdog event. It typically lasts at least several hours. Since the thing I am measuring doesn't change very frequently, missing some data is not a big issue. I believe with the current implementation getting 99.9% uptime is possible. Getting 99.999% of the data that is "interesting" is also likely. Is that good enough for me? Probably. Is it an "excellent" solution? Not really.

As far as using interrupts for each bit transition instead of staying in the ISR for the whole "byte", the people who have tried it say it worked okay to 9600 or so. So for slow serial that would work fine. If you have code for a solution (for the esp8266) that works at 19200, please share it.

As far as the HW solution you proposed, that does not sound simple. The Particle Photon uses an MCU that has HW 9-bit serial support. It has no problem keeping up with 19200 9-bit serial. But, it doesn't have anything like Tasmota and I am not sure I want to take on porting Tasmota to it.

There is always a choice to make.

@alxAgu
Copy link

alxAgu commented Mar 26, 2019

@FayeY, have you guys been able to confirm that the bug was not fixed by the latest commits?

@FayeY
Copy link
Collaborator

FayeY commented Apr 2, 2019

Hi,

ISRs should occupy very little time (in microseconds only). Otherwise it may block the WiFi handler and cause the dev 1163 reset. It is suggested that you can post a message from the ISR and handle it in another task. In this case, the ISR may not occupy too long time to cause the problem.

Thanks.

@alxAgu
Copy link

alxAgu commented Apr 2, 2019

Hi,

ISRs should occupy very little time (in microseconds only). Otherwise it may block the WiFi handler and cause the dev 1163 reset. It is suggested that you can post a message from the ISR and handler it in another task. In this case, the ISR may not occupy too long time to cause the problem.

Thanks.

That's rather disappointing, it basically means that your devices cannot be used as I2C slaves.
Is there any way to temporarily pause/disable your WIFI handler during the ISR as a workaround @FayeY?

A.

@Frogmore42
Copy link

Having the ability to pause Wi-Fi communications would be very desireable. Clearly, if paused too long Wi-Fi connection would have to be restarted. But, that would be far better than a watchdog event.

Or, knowing that now would be a bad time to be in an ISR for an extended period would enable short circuiting the ISR. Losing data is bad, but a watchdog event is far worse.

@eriksl
Copy link

eriksl commented Apr 2, 2019

Making an I2C master using bit-banging is easy. A slave is another story. Most microcontrollers have hardware support for that, because it's very difficult to do, due to strict timing demands. Even with ATTiny's it's difficult, with their "limited" I2C hardware support ("USI").

I think the only way to do this is to combine pin change interrupts (otherwise you WILL lose a transition, sooner or later, or a start or stop condition) and timer interrupts. And then you won't have to hog the system and the watchdog timer / reset is no longer an issue.

I can't repeat this often enough, the ESP8266 is NOT suitable for the familiar microcontroller programs that run in a tight loop (and often with interrupts disable). Impossible!

Having said that, very much is still possible, if approached in a smart way, like a PWM implementation in software.

You may want to consider using the ESP32, which has a dedicated cpu core for user code or adding a microcontroller that does have proper I2C slave support (ATMega for example).

@alxAgu
Copy link

alxAgu commented Apr 7, 2019

I can confirm that limiting the time spent inside in the ISR to something in the order of microseconds (tested with 60 to 80us) does prevent these error messages (dev 1153, dev 1163) from showing up.

At least now we know why it happens, so hope it helps someone else.

Thanks to @FayeY and the rest of the team for their help!

A.

P.S I do hope you guys reconsider the decision to retire the NONOS_SDK.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants