-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WiFi not recovering from wait for available TCP send buffer #118
Comments
Hi @GregTerrell, Thanks for letting us know, do you you think you can supply a minimal sketch that can be used to reproduce the problem? While working on #77, I might have run into this condition in f205d69. In any case, if there's a simple sketch that can reproduce this condition that would be great. |
I will see what I can do. I was testing with Azure IotHub client and remote_monitoring. But now that I know exactly how the failure happened, maybe I can create a node.js server and a brief Arduino sketch. It would be late this week or this weekend at the earliest. |
This went up in priority, when I discovered other sections of our application send more headers and hit the 16 outstanding sends quickly. I have uploaded a test configuration that can exercise the issue, there is a sketch and a node.js file. There are differences between the exact event sequences between the attached test jig (below) and testing against the Azure and the remote_monitoring example, but the test below can causes the failure within a couple minutes (see embedded comment). There a perceivable latency differences between the Node.js host and IotHub sends. Note: in my testing I have debugging enabled in the Atmel code; the marker showing the prerequisite for the failure are the Failed to alloc rx size console messages. (APP)(DBG)[hif_send][424]Failed to alloc rx size Additionally, I have a print macro added to the write() block in WiFiClient.cpp to console echo the number of socket.c sends() required to push the frame on the wire. The example below shows 2 attempts (1 for the failed alloc, 1 suceeded); the output is suppressed for attempts=1. (APP)(DBG)[hif_send][424]Failed to alloc rx size
|
Test Sketch
|
Test Node.js
|
Thanks, I'm able to reproduce with the code you provided with my Mac and a MKR1000. It fails for me on count 21. |
If I add the following:
to the end of There's an outstanding item with the WINC1500 firmware/ASF code (sort of related to #32), where not reading received data from a socket blocks all operations. With the sketch you supplied I think around count 16 or 21 the receive buffer will fill up an block all SPI communications with the WINC1500 causing the behaviour we are seeing. Could you please try adding that snippet to the end of After this, the original issue you are reporting might still exist, just need to find a another simple sketch to reproduce it ... |
I just saw your comment in the sketch:
I'd still be interest in testing without printing the response to |
So just add the...
to empty the read buffer? I can start a test to run overnight. |
@GregTerrell correct, thanks. I'm wondering if the issues you are seeing could be related to #80 (comment) as well. For that issue, "lots of ARP request cause failure" - this was also seemed random. Let's see how your overnight tests go, then we can decide if we should open a support ticket with Atmel support (link is in #80 (comment)). |
The test last night ran for about 8 hours without issue, the tests with Azure could not make it that long (about 6.5 hours was the longest). I restarted this AM without the 2sec delay between iterations and turned off M2M debug (common\include)... still running without any sign of degrading in buffer acquisition. The differences between my test setup and the Azure code are... -The Azure test always paused at the 17th send in the HTTP transaction, pausing until a send callback was received; the test doesn't stall there (this was visible with the M2M debugging output). Apparently this is not a very good test jig. I'll go back to the drawing board and think about what source changes could better exercise the issue. @sandeepmistry Thanks |
Over the weekend I attempted to adjust the “simple test” to recreate the buffer blocking issue without the Azure code… with no luck. My production code exercises the buffer exhaustion hard, sending 30-40 frames per event; I thought that this was all the condition required but there apparently something else contributing to the issue that I don’t understand. So, based on your comment about potentially seeing this situation while working on I am starting two tests momentarily to further test long term stability one with diagnostic printf() output and one without printf() in the WiFi code. Issue\workaround with Rework socket buffer layer #77…
I don’t know if the WINC behavior is an Atmel “defect”, but it certainly appears to me that the retry approach you have implemented in the sockets layer eliminates the blocking caused by the tight retry loop in the Atmel reference code. Thanks @sandeepmistry I will update after some results come back from the two tests I am starting now. |
I do see similar behavior and It became much more stable after I started using Since I do not use Azure but my own Server backend, I always get to think that it's my own work creating this issue ... I'll certainly investigate further. I have 2 processes running. One sending a JSONhttp post and One to download files. Good to know I am not alone ... i was about to write a WDTreset routine when my server stops answering ... ;-( |
@GregTerrell thanks for the update, looking forward to your test results. We can definitely incorporate a smaller change from #77 to wait for a @ericbaril72 |
After several days of testing… more information, not sure if a solution. Throughout the week I tested our application with your WiFi101-socket-buffer library (issue #77). The first test were encouraging with no failures after extended periods (24+ hours for a typical 6 hour failure interval). But, these tests had a large number of printf statements for debugging. As testing progressed a number of printf statements were commented out; the failure reappeared: solid WINC lights. The code was tight looping in the socketBufferSendWait() in the code block below.
I added a timeout to socketBufferSendWait() after 500 milliseconds (below) and a retry (3 attempts) to the send logic in WiFiClient::write(). Based on the log, the send attempt is never acknowledged with a sendCallback; I don’t have a promiscuous wifi card to Wireshark with, so I don’t know if the send gets to the network. None the less, the WINC never interrupts for the callback. Further, the WINC is now in an unstable state and has stopped processing further send requests. I didn’t check if the WINC was responsive to pings and have started a new test to determine that.
From WiFiClient::write()
At this point I am have begun working on a reset function that can be invoked when the m2m_wifi_handle_events() endless loop is detected. Any guidance on possible soft recovery\reset logic patterns with the WINC would appreciated, but not necessarily expected. Thanks, |
@GregTerrell I need to take closer look at the comments above, however this feels like it's something we should let Atmel/Microchip support know about via a support ticket. What do you think? |
I agree, the chip becomes unstable when this occurs. The device driver is endlessly waiting for a WINC async event (callback) that never arrives and the WINC doesn't seem to be recoverable without drastic actions. FYI... I just created an account on the Atmel site. |
@sandeepmistry were you waiting on me to create a support ticket with Atmel\Microchip? I haven't yet. Currently working around this by resetting the Feather when detecting the endless BufferNotAvailable loop or a newly detected (may or may not be related) issue with DNS resolving failing. The resets are happening 4-6 times a day, testing since Saturday (Dec 17th). I ended up resetting the entire Feather M0 board, because I could not get the WiFi stack to function after resetting only the WINC1500 chip as follows...
DNS resolution would not function following the reset sequence, so just rebooting the whole device. Please let me know if I should notify Atmel and create support case. I will update this issue accordingly. |
I think I'm having the same issue with my Feather M0 + ATWINC1500. While reading at full speed, and sending just 1 byte each second, it only takes 8 seconds to hang the microcontroller. I prepared a small sketch that reproduces the issue with a TCP connection. This problem has become pretty much fatal to my project, a wifi-enabled serial terminal. A long-running "find /" interrupted by control-c has about a 20% chance of hanging the microcontroller. Oh, and XON/XOFF flow control is totally unusable because the terminal sends XOFF when it can't keep up writing to the screen, which is coincidentally exactly when the ATWINC1500 is reading so fast that a write will hang it--so the XOFF kills the microcontroller (maybe 20% of the time). The only work-around I have now is to rate-limit the stream on the server side so the ATWINC1500 never receives data too fast. This is an ugly hack, and a modern wireless device should really be able to keep up with a serial terminal from the 1980s. I don't know much about ATWINC1500 chip, but is the working theory that there's a firmware issue? Or is this still possibly a bug in the WiFi101 library? |
Based on my discoveries this appears to be a firmware issue. I believe that @sandeepmistry is in the same camp, but I am guessing he is traveling for the holidays since I haven't heard a confirmation on that from him. I have attempted to build in recovery into the WiFi101 code, without success. The WINC1500 is frozen when the issue I described is encountered, only a hard reset brings it back to life. After my issue is encountered, the WINC chip will not answer pings, what do you see. BTW it is great that you have a test sketch, I can only duplicate in a full application environment after 3-7 hours. Stay tuned, if I create a ticket with Atmel on the issue pertaining to firmware I will post the ticket ID here. There is a planned firmware update coming in early 2017 (see issue #90). |
@GregTerrell apologies for the delay, I was heads down on some other stuff. Yes, please go ahead and open a support case with, and provide the case number here. You can also a link to this issue when you open the support case. Thanks. |
Atmel/Microchip issue number: 00049418 |
Just adding another voice to the chorus. I'm in the same, or a closely related, boat. @GregTerrell Can I ask how you implemented the programmatic reset of the Feather M0? |
@GregTerrell thanks for reporting, please post any updates you receive from Atmel/Microchip here. |
I tied A0 (analog out) to the Feather Reset pin, then in code output 0v, see below. I attempted to use a digital out GPIO first but could not find a method to initialize the pin on load correctly (I was following some suggestions from Instructables.com), also calling address: 0 function doesn't work. My approach to recovering is likely not the best, but I needed to move on as the WINC issue was killing my project's timeline. I have a simple library that supports creating a "recovery callback" but you could simply call a method by name or insert the GPIO action directly.
|
Thanks for the info, Greg. I'll give that a shot. (I, too, went down the "jmp 0" path, briefly). |
@GregTerrell the reset code you mentioned above is also present in the library see: https://github.com/arduino-libraries/WiFi101/blob/master/src/bsp/source/nm_bsp_arduino.c#L149-L155 This gets called on start up, I'm surprised the additional reset code is needed. Could you trying adding a printout to see if it is trigger during the end + begin sequence? |
I had previously only tried a WINC reset then wifi.begin(), which didn't work. Name resolution failed following this reset, so I resorted to HW reset of the feather. After your suggestion of wifi.end() I added that to my original soft reset, it worked so I moved on. Don't know if additional reset has any benefit. I am traveling this week without access to my lab. I will test next week when I return home. FYI... got a "thank you we will look into it" from Atmel yesterday. |
Well, the reset isn't working for me :( Unless its also locking up my Arduino101.... The connection light on my ATWINC1500 board goes out, and I stop getting MQTT messages from it. The 30min timer kicks in, and I run the above reset with 100ms delays between turning the RST pin from LOW to HIGH....but it never reconnects. |
@sandeepmistry the additional WINC reset in my code is not required. It was there due to the path of events to get a reset to work, but commenting it out with simply a WiFi.end() followed by WiFi.begin() is sufficient. On another note Atmel believes that issue is related to a known problem with the TCP window code in firmware. I just received a NIC capable of promiscuous packet capture and will be working on getting some additional data requested in the next couple days. |
I don't know if this is helpful, but here's a pcap of a 5 second hang using the sketch and netcat process that I posted earlier. Feel free to forward to Atmel. |
Thanks! I will
Greg Terrell
Pelogical
517/515-2026
<mailto:[email protected]> [email protected]
<http://www.pelogical.com/> www.pelogical.com
From: Shaw Terwilliger [mailto:[email protected]]
Sent: Wednesday, January 18, 2017 6:10 PM
To: arduino-libraries/WiFi101 <[email protected]>
Cc: Greg Terrell <[email protected]>; Mention <[email protected]>
Subject: Re: [arduino-libraries/WiFi101] WiFi not recovering from wait for available TCP send buffer (#118)
I don't know if this is helpful, but here's a pcap of a 5 second hang using the sketch and netcat process that I posted earlier. Feel free to forward to Atmel.
atwinc1500_hang.pcap.gz <https://github.com/arduino-libraries/WiFi101/files/715280/atwinc1500_hang.pcap.gz>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub <#118 (comment)> , or mute the thread <https://github.com/notifications/unsubscribe-auth/ANE6Q2ggxpQWcbWuKAa9Ey0OWAK6xQpkks5rTpvYgaJpZM4LDRkj> . <https://github.com/notifications/beacon/ANE6Q_HOYWI0BLVPn-w6ZhsXSYmFerQwks5rTpvYgaJpZM4LDRkj.gif>
|
@sterwill thank you for the PCAP file! I received a message from Atmel this morning that this is...
I have not attempted to flash any devices with the firmware they forward. I am not certain how to make this available to others here, or even if I am allowed to. I believe the "official release in Q1 2017" version the technician spoke of is the 19.5.x release mentioned in other WiFi issues. Seems this becoming a very anticipated release. |
Has anyone observed that the wifi loss is strongly dependent on the use of WiFi.maxLowPowerMode() ? When using this I observe wifi loss mostly within the first hour after reset . When I don't use WiFi.maxLowPowerMode() wifi loss will occure after a day. |
@GregTerrell @erniberni @john3exonets @walendo any news on this? There's PR build of the IDE that contains the new v19.5.2 WINC1500 firmware: arduino/Arduino#6069 (comment). Any one up for testing it with the master version of the library to see if this has been fixed? |
We can test later this week (Thu or Fri).
Greg Terrell
517/940-4050 office
[email protected]<mailto:[email protected]>
[LooUQ-Logo(40-cloudNoFill-trans)]
Enabling Your Internet of Things
From: Sandeep Mistry [mailto:[email protected]]
Sent: Monday, March 13, 2017 4:27 PM
To: arduino-libraries/WiFi101 <[email protected]>
Cc: Greg Terrell <[email protected]>; Mention <[email protected]>
Subject: Re: [arduino-libraries/WiFi101] WiFi not recovering from wait for available TCP send buffer (#118)
@GregTerrell<https://github.com/GregTerrell> @erniberni<https://github.com/erniberni> @john3exonets<https://github.com/john3exonets> @walendo<https://github.com/walendo> any news on this?
There's PR build of the IDE that contains the new v19.5.2 WINC1500 firmware: arduino/Arduino#6069 (comment)<arduino/Arduino#6069 (comment)>. Any one up for testing it with the master version of the library to see if this has been fixed?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#118 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/ANE6Q4z-1xaEui3oj3Gg-B4nfJKs7glZks5rlaZ8gaJpZM4LDRkj>.
|
The new firmware seems to fix the problem that caused my simple test sketch to hang in under 10 seconds, but when I use an SSL connection I'm still seeing the same problem. A quick peek at the network traffic shows some window scaling, then a hard lock of the microcontroller just like before. I'll try to prepare a small SSL test sketch to make reproducing it easy. |
@sterwill thanks for trying and the update! We should open a new support ticket with Atmel/Microchip for this. |
I attempted to test with the new firmware last Friday. So far, not much success. I was not able to connect with our cloud instance at all (using a Adafruit Feather M0). At this point I don’t know what is the root cause, but ran out of time. Will continue to test, but the next 2 weeks are very busy. |
I couldn't reproduce the issue I reported yesterday with a basic sketch connecting to openssl's "s_server" feeding it endless data, so it's possible my application's lock-up is unrelated to the firmware. This is good news for us all! I'll do some more testing later this week. |
@GregTerrell Did you have any custom SSL certificates installed on the M0 before upgrading the firmware? I had to reinstall my certs after upgrading. |
Thanks for the suggestion. The cert used was in the "installed" list under the last firmware (Azure, Baltimore), but that was one of the first things I was going to check. I only had time to flash, try basic test, then call it. |
@sandeepmistry the updated firmware looks promising, been running for 13 hours straight (with 19.4.4 running 6 hours was unlikely). Memory looks stable... encouraging. I struggled to get the tests to run at all until I updated the WiFi101 library via Arduino IDE to v 0.14.0. I was beginning to think there was a special lib version that I needed to grab. I will setup a longer test over the weekend and hopefully Monday there is another issue to close. |
@GregTerrell great, thanks for the update! In the meantime, IDE v1.8.2 has been release with the new firmware update tool to accompany the v0.14.0 WiFi101 library. The goal of v0.14.0 is still to remain backwards compatible with f/w 19.4.4 and 19.3.0 as well. |
@sandeepmistry the backward compatibility with 19.4.4 and 19.3.0 is very important, so good! I was simply mentioning that I could not get 19.5.2 to work at all (would not send) until I moved to WiFi101 v 0.14.0. Glad both firmware updater and wifi 0.14.0 are in the IDE install. |
@sandeepmistry I am marking this issue closed. With the current IDE\WiFi101 (at least 0.14.0) and the Atmel 19.5.2 firmware the WINC hang issue is no longer repeatable. I have a client that has now been running for 72 hours with complete stability. Prior to the new firmware 6 hours was the most you could have hoped for. |
How do I update the winc1500 to the newest Firmware? does the update also work with e.g. the adafruit winc1500b breakout board? |
@GregTerrell awesome, thanks for the follow up and closing!
Download IDE 1.8.2 and follow https://www.arduino.cc/en/Tutorial/FirmwareUpdater It should work with the Adafruit board, but we haven't tested it. Please contact Adafruit if you run into issues. |
@GregTerrell @sandeepmistry Thanks. I tried to update my ATWINC1500-MR210PB but got a an 'error while erasing flash memory'. Any idea how to solve it? |
@marcel12bell I would suggest you open a new issue here: https://github.com/arduino-libraries/WiFi101-FirmwareUpdater-Plugin/issues - and provide more details on your setup and the steps you are following. |
While developing a solution based on the Azure IoT Hub I detected a problem where the WIFI driver would appear locked and the WINC activity light would stay on solid. For testing purposes I switch to use the remote_monitorning example project (from Microsoft in the Azure-Samples/iot-hub-c-m0wifi-getstartedkit github repository). For reference, I was using a Adafruit Feather M0 WIFI development board. The issue typically took 5-8 hours of running in order to appear, but occasionally happened shortly after test start.
The issue was initiated by the azure-iot-arduino library’s event send function. When an event is sent to iothub, azure-iot-arduino performs 18 sends to complete sending of the event via http (consisting of the http request, the headers and body [this can be improved, see workaround below] ). After 16 (of the 18) sends, the WINC code hif_send() (in m2m_hif.c) will typically start to loop waiting for a dma_addr != 0. The loop exits on a send callback receipt or 1000 requests for a dma_addr. So effectively this should implement a TCP sliding window of 16 frames. If the hif_send() times out (for-loop 1000 dma_addr requests), the layer above (WiFiClient::write() in WiFiClient.cpp) tries to send again (infinite loop) waiting for a buffer aka dma_addr. I found that once the buffer wait starts, it will require about 160-180 loops to start receiving send callbacks at open buffers for sending. I my test the events where 10 seconds apart, so all 16 send buffers would be available at the next event.
All good, most of the time. But, it seems under some condition (usually hours into the test) the send callback for any of the 16 outstanding TCP sends is never processed, a needed buffer is never freed and consequently the send window is never re-opened. The WINC appears locked up with a solid activity light as the WiFi code endlessly loops waiting for a buffer.
Does this appear to be a WINC firmware\code issue to report to Atmel (if so, how does one report this to Atmel)? The WIFI code at this point is pure Atmel reference based on their Atmel-42420-WINC1500-Software-Design-Guide_UserGuide document.
Workaround (implemented in azure-iot-hub):
Within the azure-iot-hub library (adapter\httpapi_compact.c, SendHeadsToXIO() )… consolidate sending separate TCP sends for each header and the following CRLF into a send of single string containing the header text and the terminating CRLF. This reduces the number of TCP sends for an event posting from 18 to 11 (this could be reduced to 10 with further refactoring). This keeps the WINC from ever hitting the 16 buffer limit. Prior to implementing this workaround, I could never exceed 8 hours of run time before lockup; my test has now been executing for 40 hours.
The workaround above seems to have fixed my use case (now running for 36 hours, never made it past 8 hours before workaround), but other applications sending more frequently are likely to fail without such an easily implemented workaround.
Thanks,
Greg
The text was updated successfully, but these errors were encountered: