Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add optional deferMillis parameter to System::restart() #1611

Merged
merged 1 commit into from
Feb 11, 2019

Conversation

mikee47
Copy link
Contributor

@mikee47 mikee47 commented Feb 9, 2019

A system restart may be requested from within a deeply-nested callback routine, such as on network transfer completion. This happens in the Basic_rBoot sample application, which I have found occasionally crashes at the end of an update cycle; this is probably more likely in debug mode where there is more going on.

In such scenarios it is best to defer the actual restart request so that the callback can unwind and complete fully. Therefore, this modification will always defer a restart request, using either the task queue (if deferMillis == 0) or a system timer if a longer delay is required by the application.

A system restart may be requested from within a deeply-nested callback routine, such as on network transfer completion. This happens in the `Basic_rBoot` sample application, which I have found occasionally crashes at the end of an update cycle; this is probably more likely in debug mode where there is more going on.

In such scenarios it is best to defer the actual restart request so that the callback can unwind and complete fully. Therefore, this modification will _always_ defer a restart request, using either the task queue (if deferMillis == 0) or a system timer if a longer delay is required by the application.
@mikee47
Copy link
Contributor Author

mikee47 commented Feb 10, 2019

See also #1523. There are other framework calls which would benefit from this approach, e.g.#1598

@mikee47
Copy link
Contributor Author

mikee47 commented Feb 10, 2019

I found with testing rBoot sample that no delay is actually required, deferring the call is sufficient. I could simplify this PR... thoughts?

@mikee47
Copy link
Contributor Author

mikee47 commented Feb 10, 2019

Probably fix for #1275

@slaff
Copy link
Contributor

slaff commented Feb 11, 2019

I could simplify this PR... thoughts?

Can you elaborate on that?

@mikee47
Copy link
Contributor Author

mikee47 commented Feb 11, 2019

I could simplify this PR... thoughts?

Can you elaborate on that?

Well, only the question of whether it's necessary or useful to have the deferMillis parameter at all - could just use the task queue if that is all that is required to fix the problem. Can always leave it up to the user application to use timers, etc. if required.

Further thought: if one's application has various other network callbacks queued (e.g. websocket broadcasts) then that would be a good reason to have an additional delay.

@slaff
Copy link
Contributor

slaff commented Feb 11, 2019

What bothers me about this PR is that it fixes an issue that we don't quite well understand. Is the SDK causing the segfault, is the UART code causing it or it is something else?

I am having a suspicion that we are proposing a wrong fix. The following is true for ESP8266 and NodeMCU in particular:
If you flash the device and call software restart having the UART lines up it will cause a WDT reset. Similar to the following:

Then I reasoned that as I was using GPIO as an output, maybe the hardware wasn't reseting it to an input on a soft reset, so the reset would depend upon the state of GPIO#0.

So, I set GPIO#0 high before doing the restart and it works!

From here esp8266/Arduino#1722

What I am thinking is that just setting GIO0 to high will be enough. Can you try this?

@mikee47
Copy link
Contributor Author

mikee47 commented Feb 11, 2019

I suspect that's a different, unrelated issue. I'll use the basic_rBoot sample as a specific example. Here's what happens with a straight call to system_restart():

26229500 staticOnMessageComplete: Execution queue: 0, http://192.168.1.68:80/rom0.bin
26230111 
Firmware download finished!
26235083  - item: 0, addr: 2000, len: 307872 bytes
In callback...
Firmware updated, rebooting to rom 0...
Restarting...

state: 5 -> 0 (0)
rm 0
pm close 7
26300490 TCP connection error: -8
del if0
usl
26301301 Firmware updated.

***** Fatal exception 28

Now with the deferred restart:

11839537 staticOnMessageComplete: Execution queue: 0, http://192.168.1.68:80/rom0.bin
11840147 
Firmware download finished!
11845128  - item: 0, addr: 2000, len: 307760 bytes
In callback...
Firmware updated
Swapping from rom 1 to rom 0.
Restarting...

11902755 Firmware updated.
11903240 TCP connected
11903366 HttpConnection::onReadyToSendData: waitingQueue.count: 0
11903545 Nothing in the waiting queue
11908313 TCP onReadyToSendData: 0
11913276 TCP received: 813 bytes
11917263 HttpConnection::onReadyToSendData: waitingQueue.count: 0
11926034 Nothing in the waiting queue
11931114 TCP onReadyToSendData: 1
state: 5 -> 0 (0)
rm 0
pm close 7
11940916 TCP connection error: -8
del if0
usl
*P*VETT*uU+UZUTTQTQQjP*uZTUQTj*QEuP+TQTQQBUuZT(EQREj*QREQQujUKZVVUU(TTU�UEJPQZUJZEEQQu],QuVUPQota1 not set
ota2 not set
>> 0x3JQRQrf cal sector: 1019
freq trace enable 0
rf[112] : 00
98375 spiffs disabled

My immediate impression is that the rug is getting pulled out from under the network stack and leading to the crash. Doubtless there is work to be done to specifically address why the Exception 28 occurs - I will take a look at that in due course - but this PR simply aims to make the transition a bit cleaner and improve reliability of the operation.

@slaff
Copy link
Contributor

slaff commented Feb 11, 2019

My immediate impression is that the rug is getting pulled out from under the network stack and leading to the crash.

I would have expected that the Espressif guys have already taken care of this, but without the source code I cannot tell. Will try asking @me-no-dev and @igrr if they can shed some light.

@igrr
Copy link

igrr commented Feb 11, 2019

I haven't looked at the SDK code in a while, but I think calling system_restart from an LwIP callback is not a great idea, as that call will tear down the network interface. Deferring that using a task queue seems to be the correct approach.

@slaff
Copy link
Contributor

slaff commented Feb 11, 2019

could just use the task queue if that is all that is required to fix the problem.

I. If this is the only case where restart segfaults

@mikee47 With the information from @igrr I would suggest the following - remove the new code from System::restart and add a new method called restart in the rBoot client that has the task queue code. The method needs to have a description explaining why it is here and why it should be called instead of System::restart. And finally the sample(s) should be updated accordingly.

II. If there are more places where restart segfaults

Then we could think about adding System::restart to a queue. But this might hide the real cause of the problem.

What do you think?

@mikee47
Copy link
Contributor Author

mikee47 commented Feb 11, 2019

The SDK API docs. say of system_restart: 'The ESP8266 will not restart immediately. Please do not call other functions after calling this API.'. That would seem to imply that other system functions are affected, not just the network APIs.

I think using the 'does it segfault?' approach is perhaps risky here - it may only do it sometimes! The risk of masking problem code could be mitigated by disabling the behaviour during testing - perhaps another compiler flag.

For user applications, we should provide an API which makes life as easy as possible for user applications. That means always deferring the call, and providing the option to include an additional delay so their own stuff can finish up if necessary.

So I'm voting in favour of the general case which this PR addresses.

@slaff
Copy link
Contributor

slaff commented Feb 11, 2019

So I'm voting in favour of the general case which this PR addresses.

I had a second thought on this and have to agree with you. This will bring stability without sacrificing performance or resource usage so it's fine.

@slaff slaff added this to the 3.7.2 milestone Feb 11, 2019
@slaff slaff merged commit 76b288c into SmingHub:develop Feb 11, 2019
@mikee47 mikee47 deleted the feature/system_restart_deferral branch February 11, 2019 20:36
@slaff slaff mentioned this pull request Feb 27, 2019
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants