Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WiFi not recovering from wait for available TCP send buffer #118

Closed
GregTerrell opened this issue Dec 3, 2016 · 58 comments
Closed

WiFi not recovering from wait for available TCP send buffer #118

GregTerrell opened this issue Dec 3, 2016 · 58 comments
Labels
type: imperfection Perceived defect in any part of project

Comments

@GregTerrell
Copy link

While developing a solution based on the Azure IoT Hub I detected a problem where the WIFI driver would appear locked and the WINC activity light would stay on solid. For testing purposes I switch to use the remote_monitorning example project (from Microsoft in the Azure-Samples/iot-hub-c-m0wifi-getstartedkit github repository). For reference, I was using a Adafruit Feather M0 WIFI development board. The issue typically took 5-8 hours of running in order to appear, but occasionally happened shortly after test start.

The issue was initiated by the azure-iot-arduino library’s event send function. When an event is sent to iothub, azure-iot-arduino performs 18 sends to complete sending of the event via http (consisting of the http request, the headers and body [this can be improved, see workaround below] ). After 16 (of the 18) sends, the WINC code hif_send() (in m2m_hif.c) will typically start to loop waiting for a dma_addr != 0. The loop exits on a send callback receipt or 1000 requests for a dma_addr. So effectively this should implement a TCP sliding window of 16 frames. If the hif_send() times out (for-loop 1000 dma_addr requests), the layer above (WiFiClient::write() in WiFiClient.cpp) tries to send again (infinite loop) waiting for a buffer aka dma_addr. I found that once the buffer wait starts, it will require about 160-180 loops to start receiving send callbacks at open buffers for sending. I my test the events where 10 seconds apart, so all 16 send buffers would be available at the next event.

All good, most of the time. But, it seems under some condition (usually hours into the test) the send callback for any of the 16 outstanding TCP sends is never processed, a needed buffer is never freed and consequently the send window is never re-opened. The WINC appears locked up with a solid activity light as the WiFi code endlessly loops waiting for a buffer.

Does this appear to be a WINC firmware\code issue to report to Atmel (if so, how does one report this to Atmel)? The WIFI code at this point is pure Atmel reference based on their Atmel-42420-WINC1500-Software-Design-Guide_UserGuide document.

Workaround (implemented in azure-iot-hub):
Within the azure-iot-hub library (adapter\httpapi_compact.c, SendHeadsToXIO() )… consolidate sending separate TCP sends for each header and the following CRLF into a send of single string containing the header text and the terminating CRLF. This reduces the number of TCP sends for an event posting from 18 to 11 (this could be reduced to 10 with further refactoring). This keeps the WINC from ever hitting the 16 buffer limit. Prior to implementing this workaround, I could never exceed 8 hours of run time before lockup; my test has now been executing for 40 hours.

The workaround above seems to have fixed my use case (now running for 36 hours, never made it past 8 hours before workaround), but other applications sending more frequently are likely to fail without such an easily implemented workaround.

Thanks,
Greg

@sandeepmistry
Copy link
Contributor

Hi @GregTerrell,

Thanks for letting us know, do you you think you can supply a minimal sketch that can be used to reproduce the problem?

While working on #77, I might have run into this condition in f205d69.

In any case, if there's a simple sketch that can reproduce this condition that would be great.

@sandeepmistry sandeepmistry added the status: waiting for information More information must be provided before work can proceed label Dec 5, 2016
@GregTerrell
Copy link
Author

I will see what I can do. I was testing with Azure IotHub client and remote_monitoring. But now that I know exactly how the failure happened, maybe I can create a node.js server and a brief Arduino sketch. It would be late this week or this weekend at the earliest.

@GregTerrell
Copy link
Author

This went up in priority, when I discovered other sections of our application send more headers and hit the 16 outstanding sends quickly. I have uploaded a test configuration that can exercise the issue, there is a sketch and a node.js file. There are differences between the exact event sequences between the attached test jig (below) and testing against the Azure and the remote_monitoring example, but the test below can causes the failure within a couple minutes (see embedded comment). There a perceivable latency differences between the Node.js host and IotHub sends. Note: in my testing I have debugging enabled in the Atmel code; the marker showing the prerequisite for the failure are the Failed to alloc rx size console messages.

(APP)(DBG)[hif_send][424]Failed to alloc rx size
(APP)(DBG)[hif_send][424]Failed to alloc rx size
(APP)(DBG)[hif_send][424]Failed to alloc rx size

Additionally, I have a print macro added to the write() block in WiFiClient.cpp to console echo the number of socket.c sends() required to push the frame on the wire. The example below shows 2 attempts (1 for the failed alloc, 1 suceeded); the output is suppressed for attempts=1.

(APP)(DBG)[hif_send][424]Failed to alloc rx size
(APP)(DBG)[m2m_ip_cb][433]send callback sock=0, sessionID=1
(APP)(DBG)[m2m_ip_cb][433]send callback sock=0, sessionID=1
(LOOUQ)(DBG)[write][216][WiFiClient.cpp] write COMPLETED: send attempts=2

size_t WiFiClient::write(const uint8_t *buf, size_t size)
{
	sint16 err;

	if (_socket < 0 || size == 0) {
		setWriteError();
		return 0;
	}

	// Network led ON (rev A then rev B).
	m2m_periph_gpio_set_val(M2M_PERIPH_GPIO16, 0);
	m2m_periph_gpio_set_val(M2M_PERIPH_GPIO5, 0);

	m2m_wifi_handle_events(NULL);
	int attempts = 1;

	while ((err = send(_socket, (void *)buf, size, 0)) < 0) {
		// Exit on fatal error, retry if buffer not ready.
		if (err != SOCK_ERR_BUFFER_FULL) {
			setWriteError();
			m2m_periph_gpio_set_val(M2M_PERIPH_GPIO16, 1);
			m2m_periph_gpio_set_val(M2M_PERIPH_GPIO5, 1);
			return 0;
		}

		if (attempts % 1000 == 0) {
			LOOUQ_DBG("[WiFiClient.cpp] writing socket=%d, size=%d, sends@:%d", _socket, size, attempts);
		} 
		m2m_wifi_handle_events(NULL);
		attempts ++;
	}
	
	// Network led OFF (rev A then rev B).
	m2m_periph_gpio_set_val(M2M_PERIPH_GPIO16, 1);
	m2m_periph_gpio_set_val(M2M_PERIPH_GPIO5, 1);

	if (attempts > 1) {
		LOOUQ_DBG("[WiFiClient.cpp] write COMPLETED: send attempts=%d", attempts);
	}
	return size;
}

@GregTerrell
Copy link
Author

Test Sketch

// targeting Feather M0
#include <WiFi101.h>
static WiFiClient client;              // for WiFi101  

static const char ssid[] = "LOOUQ";
static const char pass[] = "grouperT3ch";
char server[] = "192.168.10.75"; // server IP
int port = 8081;

#define BUFF_SIZE (1024)
char buf[BUFF_SIZE];
  
void setup() {
  // setup serial and wifi
  delay(10000);                                      // wait for FeatherM0 serial start
  Serial.begin(9600);
  Serial.println("Serial interface started.");
  Serial.println("Initializing Feather M0 WINC");
  pinMode(2, OUTPUT);                             // Enable Adafruit FeatherM0 chip, set pins for FeatherM0: CS=8, IRQ=7, RST=4, EN=2
  digitalWrite(2, HIGH);
  WiFi.setPins(8, 7, 4, 2);
  int status = WL_IDLE_STATUS;     // the Wifi radio's status
  while ( status != WL_CONNECTED) {
    Serial.print("Attempting to connect to WPA SSID: ");
    Serial.println(ssid);
    // Connect to WPA/WPA2 network:
    status = WiFi.begin(ssid, pass);
    // wait 5 seconds for connection:
    delay(5000);
  }
  Serial.println("Connected to WiFi");
  delay(2000);

  Serial.print("Connecting to ");
  Serial.print(server);
  Serial.print("  port: ");
  Serial.println(port);
  
  if (client.connect(server, port)) {
    Serial.print("Connected...");
    Serial.println(server);

    int testCnt = 0;
    while(1) {
      
      sendRequest(testCnt);

      /* with the read block below commented, "Failed to alloc rx size" events will
       * appear within the first 10 test cycles and progress to full failure by test
       * cycle 20-30
       * if uncommented, runs can fail in 2 minutes or 20 hours
      */
//      while (client.available()) {
//        char c = client.read();
//        Serial.write(c);
//      }
//      if (!client.connected()) {
//        Serial.println();
//        Serial.println("disconnecting from server.");
//        client.stop();
//      }

      testCnt ++;
      Serial.print("Test cycle: ");
      Serial.println(testCnt);
      delay(1000);
    }
  }
  else {
    Serial.println("Connection failed!");
  }  
}

void loop() {
}

void sendRequest(int testCnt) {
  // send request: request verb, headers (>13), CRLF delim, body
  client.println("POST http://192.168.10.75:8081/ HTTP/1.1");

  client.println("iothub-to: /devices/featherm0-2/messages/events");
  client.println("Authorization: SharedAccessSignature sr=preview.azure-devices.net/devices/featherm0-2&sig=opkb%%3d&se=1479415635&skn=");
  client.println("Accept: application/json");
  client.println("Connection: Keep-Alive");
  client.println("User-Agent: iothubclient/1.0.1");
  client.println("Content-Type: application/octet-stream");
  client.println("iothub-app-iotOS-version: v-1.0");
  client.println("iothub-app-message-type: telemetry");
  client.println("iothub-app-event-class: user");
  client.println("iothub-app-event-name: TempPressHumid");
  client.println("iothub-app-display-value: Temperature=73.40");
  client.println("iothub-messageid: featherm0-2-16139");
  client.println("iothub-app-header0: TempPressHumid");
  client.println("iothub-app-header1: TempPressHumid");
  client.println("iothub-app-header2: TempPressHumid");
  client.println("iothub-app-header3: TempPressHumid");
  client.println("iothub-app-header4: TempPressHumid");
  client.println("iothub-app-header5: TempPressHumid");
  client.println("iothub-app-header6: TempPressHumid");
  client.println("iothub-app-header7: TempPressHumid");
  client.println("iothub-app-header8: TempPressHumid");
  client.println("iothub-app-header9: TempPressHumid");
  client.println("iothub-correlationid:");
  snprintf(buf, sizeof(buf), "Test-Count: %d", testCnt);
  client.println(buf);
  client.println("Host: loouq-iotos-preview.azure-devices.net");
  client.println("Content-Length: 112");
  
  client.write("\r\n",(size_t)2);
  client.println("{\"DEVICEID\":\"featherm0-2\", \"Telemetry\":{\"Temperature\":73.400002, \"BarPressure\":28.877970, \"Humidity\":27.944336}}");
}

@GregTerrell
Copy link
Author

Test Node.js

var http = require('http');

http.createServer(function (request, response) {
    console.log();
    console.log("request recvd: method=%s, url=%s", request.method, request.url);

    request.on('error', function(err) {
        console.error(err);
    });

    var headers = request.headers;
    var hdrCnt = 0;
    for(var hdr in headers) {
        ++ hdrCnt;
        console.log("  Header[%d]=%s:%s", hdrCnt, hdr, headers[hdr]);
    }
    if (request.method == 'POST') {
        var jsonString = '';
        request.on('data', function (data) {
            jsonString += data;
        });
        request.on('end', function () {
            console.log(JSON.parse(jsonString));
        });
    }
    console.log("\r\n");

    // Every request gets the same "Hello Connect" response.
    response.writeHead(200, {"Content-Type": "text/plain"});
    response.end("Received IotRequest");

}).listen(8081);
console.log("Listening on port 8081");

@sandeepmistry sandeepmistry added type: imperfection Perceived defect in any part of project and removed status: waiting for information More information must be provided before work can proceed labels Dec 8, 2016
@sandeepmistry
Copy link
Contributor

Thanks, I'm able to reproduce with the code you provided with my Mac and a MKR1000. It fails for me on count 21.

@sandeepmistry
Copy link
Contributor

If I add the following:

while (client.available()) {
   client.read();
}

to the end of sendRequest in the sketch it no longer hangs (count was at over 230).

There's an outstanding item with the WINC1500 firmware/ASF code (sort of related to #32), where not reading received data from a socket blocks all operations. With the sketch you supplied I think around count 16 or 21 the receive buffer will fill up an block all SPI communications with the WINC1500 causing the behaviour we are seeing.

Could you please try adding that snippet to the end of sendRequest and seeing if that also makes things better?

After this, the original issue you are reporting might still exist, just need to find a another simple sketch to reproduce it ...

@sandeepmistry sandeepmistry added the status: waiting for information More information must be provided before work can proceed label Dec 8, 2016
@sandeepmistry
Copy link
Contributor

I just saw your comment in the sketch:

      /* with the read block below commented, "Failed to alloc rx size" events will
       * appear within the first 10 test cycles and progress to full failure by test
       * cycle 20-30
       * if uncommented, runs can fail in 2 minutes or 20 hours
      */
//      while (client.available()) {
//        char c = client.read();
//        Serial.write(c);
//      }
//      if (!client.connected()) {
//        Serial.println();
//        Serial.println("disconnecting from server.");
//        client.stop();
//      }

I'd still be interest in testing without printing the response to Serial to see if it might be USB related.

@GregTerrell
Copy link
Author

So just add the...

while (client.available()) {
   client.read();
}

to empty the read buffer?

I can start a test to run overnight.

@sandeepmistry
Copy link
Contributor

@GregTerrell correct, thanks.

I'm wondering if the issues you are seeing could be related to #80 (comment) as well. For that issue, "lots of ARP request cause failure" - this was also seemed random.

Let's see how your overnight tests go, then we can decide if we should open a support ticket with Atmel support (link is in #80 (comment)).

@GregTerrell
Copy link
Author

The test last night ran for about 8 hours without issue, the tests with Azure could not make it that long (about 6.5 hours was the longest). I restarted this AM without the 2sec delay between iterations and turned off M2M debug (common\include)... still running without any sign of degrading in buffer acquisition.

The differences between my test setup and the Azure code are...

-The Azure test always paused at the 17th send in the HTTP transaction, pausing until a send callback was received; the test doesn't stall there (this was visible with the M2M debugging output).
-The Session variable stays a 1 (Azure it incremented)

Apparently this is not a very good test jig. I'll go back to the drawing board and think about what source changes could better exercise the issue.

@sandeepmistry Thanks

@GregTerrell
Copy link
Author

Over the weekend I attempted to adjust the “simple test” to recreate the buffer blocking issue without the Azure code… with no luck. My production code exercises the buffer exhaustion hard, sending 30-40 frames per event; I thought that this was all the condition required but there apparently something else contributing to the issue that I don’t understand.

So, based on your comment about potentially seeing this situation while working on
Rework socket buffer layer #77, I took a deeper look at your retry logic when encountering the buffer not available conditions and your incorporation of the m2m_wifi_handle_events() function. I decided to try your new socket code with my code and Azure. Although I initially found an issue, that I applied a change to (see below), I had good results… running for 24 hours successfully (best run previously was about 6.5 hours).

I am starting two tests momentarily to further test long term stability one with diagnostic printf() output and one without printf() in the WiFi code.

Issue\workaround with Rework socket buffer layer #77
I found that when the WINC buffer exhaustion occurs the retry logic unsets the SOCKET_BUFFER_FLAG_CONNECTED while setting the SOCKET_BUFFER_FLAG_SENDING. I could not verify that the SOCKET_BUFFER_FLAG_* byte was intended to be an exclusive flag register, so I modified the
socketBufferSendWait() to OR in the SENDING flag, thus preserving the CONNECTED status. This made sense to me, as the SENDING device should be still CONNECTED.

sint8 socketBufferSendWait(SOCKET sock)
{
	SOCKET s;

	// check if any sockets are full, if so fail because m2m_wifi_handle_events
	// won't be able to run
	for (s = 0; s < MAX_SOCKET; s++) {
		if (socketBufferIsFull(s)) {
			return 0;
		}
	}
	//original: gastrSocketBuffer[sock].flag = SOCKET_BUFFER_FLAG_SENDING;
        // OR in the SENDING flag 
	gastrSocketBuffer[sock].flag |= SOCKET_BUFFER_FLAG_SENDING;

	while (gastrSocketBuffer[sock].flag & SOCKET_BUFFER_FLAG_SENDING) {
		m2m_wifi_handle_events(NULL);
	}
	return 1;
}

I don’t know if the WINC behavior is an Atmel “defect”, but it certainly appears to me that the retry approach you have implemented in the sockets layer eliminates the blocking caused by the tight retry loop in the Atmel reference code.

Thanks @sandeepmistry I will update after some results come back from the two tests I am starting now.

@ericbaril72
Copy link

ericbaril72 commented Dec 13, 2016

I do see similar behavior and It became much more stable after I started using
client.flush().

Since I do not use Azure but my own Server backend, I always get to think that it's my own work creating this issue ... I'll certainly investigate further.

I have 2 processes running. One sending a JSONhttp post and One to download files.
Within a certain period, the activity LED stays ON, the files are no longer being downloaded BUT
my JSONhttp post are still working .

Good to know I am not alone ... i was about to write a WDTreset routine when my server stops answering ... ;-(

@sandeepmistry
Copy link
Contributor

@GregTerrell thanks for the update, looking forward to your test results. We can definitely incorporate a smaller change from #77 to wait for a SOCKET_MSG_SEND or SOCKET_MSG_SENDTO event from the WINC1500. Will need to check to see if what the performance on write throughput is.

@ericbaril72 client.flush() currently flushes the RX buffer, which has a big impact the current WINC1500 <-> MCU protocol structure. Ignoring received socket data blocks communication between the 2 as discussed in #32.

@GregTerrell
Copy link
Author

After several days of testing… more information, not sure if a solution. Throughout the week I tested our application with your WiFi101-socket-buffer library (issue #77). The first test were encouraging with no failures after extended periods (24+ hours for a typical 6 hour failure interval). But, these tests had a large number of printf statements for debugging. As testing progressed a number of printf statements were commented out; the failure reappeared: solid WINC lights. The code was tight looping in the socketBufferSendWait() in the code block below.

while (gastrSocketBuffer[sock].flag & SOCKET_BUFFER_FLAG_SENDING) {
m2m_wifi_handle_events(NULL);
}

I added a timeout to socketBufferSendWait() after 500 milliseconds (below) and a retry (3 attempts) to the send logic in WiFiClient::write(). Based on the log, the send attempt is never acknowledged with a sendCallback; I don’t have a promiscuous wifi card to Wireshark with, so I don’t know if the send gets to the network. None the less, the WINC never interrupts for the callback. Further, the WINC is now in an unstable state and has stopped processing further send requests. I didn’t check if the WINC was responsive to pings and have started a new test to determine that.

sint8 socketBufferSendWait(SOCKET sock)
{
	SOCKET s;

	// check if any sockets are full, if so fail because m2m_wifi_handle_events
	// won't be able to run
	for (s = 0; s < MAX_SOCKET; s++) {
		if (socketBufferIsFull(s)) {
			printf("socketBufferSendWait(%d) detected socket(%d) full. \r\n", sock, s);
			return 0;
		}
	}

	//original: gastrSocketBuffer[sock].flag = SOCKET_BUFFER_FLAG_SENDING;
	gastrSocketBuffer[sock].flag |= SOCKET_BUFFER_FLAG_SENDING;

	#define SEND_WAIT	(500)

	unsigned long beginWait = millis();
	unsigned long waitAt = beginWait;
	while (gastrSocketBuffer[sock].flag & SOCKET_BUFFER_FLAG_SENDING && waitAt - beginWait < SEND_WAIT) {
		m2m_wifi_handle_events(NULL);
		waitAt = millis();
	}
	if (gastrSocketBuffer[sock].flag & SOCKET_BUFFER_FLAG_SENDING)
	{
		//gastrSocketBuffer[sock].flag &= ~SOCKET_BUFFER_FLAG_SENDING;
		printf("!! socketBufferSendWait(%d) m2m_wifi_handle_events() wait timeout \r\n", sock);
		return 0;
	}
	printf("<< socketBufferSendWait(%d) m2m_wifi_handle_events() completed \r\n", sock);

	return 1;
}

From WiFiClient::write()

    #define MAX_RETRIES		(3)
	int retries = 0;

	while ((err = send(_socket, (void *)buf, size, 0)) < 0) {
		// Exit on fatal error, retry if buffer not ready.
		// or can't wait for send event because a recv is pending
		if (err != SOCK_ERR_BUFFER_FULL || socketBufferSendWait(_socket) == 0) {
 			if (retries > MAX_RETRIES)
			{
				setWriteError();
				m2m_periph_gpio_set_val(M2M_PERIPH_GPIO16, 1);
				m2m_periph_gpio_set_val(M2M_PERIPH_GPIO5, 1);
				return 0;
			}
			printf("WiFiClient:send(%d) retry=%d \r\n", _socket, retries);
			retries ++; 
			m2m_wifi_handle_events(NULL);
		}
	}

At this point I am have begun working on a reset function that can be invoked when the m2m_wifi_handle_events() endless loop is detected. Any guidance on possible soft recovery\reset logic patterns with the WINC would appreciated, but not necessarily expected.

Thanks,
Greg

@sandeepmistry
Copy link
Contributor

@GregTerrell I need to take closer look at the comments above, however this feels like it's something we should let Atmel/Microchip support know about via a support ticket. What do you think?

@GregTerrell
Copy link
Author

I agree, the chip becomes unstable when this occurs. The device driver is endlessly waiting for a WINC async event (callback) that never arrives and the WINC doesn't seem to be recoverable without drastic actions. FYI... I just created an account on the Atmel site.

@GregTerrell
Copy link
Author

GregTerrell commented Dec 21, 2016

@sandeepmistry were you waiting on me to create a support ticket with Atmel\Microchip? I haven't yet.

Currently working around this by resetting the Feather when detecting the endless BufferNotAvailable loop or a newly detected (may or may not be related) issue with DNS resolving failing. The resets are happening 4-6 times a day, testing since Saturday (Dec 17th).

I ended up resetting the entire Feather M0 board, because I could not get the WiFi stack to function after resetting only the WINC1500 chip as follows...

  • attempted WINC reset
  • delay
  • wifi.init()
  • wifi.begin()

DNS resolution would not function following the reset sequence, so just rebooting the whole device.

Please let me know if I should notify Atmel and create support case. I will update this issue accordingly.

@sterwill
Copy link

I think I'm having the same issue with my Feather M0 + ATWINC1500. While reading at full speed, and sending just 1 byte each second, it only takes 8 seconds to hang the microcontroller. I prepared a small sketch that reproduces the issue with a TCP connection.

This problem has become pretty much fatal to my project, a wifi-enabled serial terminal. A long-running "find /" interrupted by control-c has about a 20% chance of hanging the microcontroller. Oh, and XON/XOFF flow control is totally unusable because the terminal sends XOFF when it can't keep up writing to the screen, which is coincidentally exactly when the ATWINC1500 is reading so fast that a write will hang it--so the XOFF kills the microcontroller (maybe 20% of the time).

The only work-around I have now is to rate-limit the stream on the server side so the ATWINC1500 never receives data too fast. This is an ugly hack, and a modern wireless device should really be able to keep up with a serial terminal from the 1980s.

I don't know much about ATWINC1500 chip, but is the working theory that there's a firmware issue? Or is this still possibly a bug in the WiFi101 library?

@GregTerrell
Copy link
Author

Based on my discoveries this appears to be a firmware issue. I believe that @sandeepmistry is in the same camp, but I am guessing he is traveling for the holidays since I haven't heard a confirmation on that from him. I have attempted to build in recovery into the WiFi101 code, without success. The WINC1500 is frozen when the issue I described is encountered, only a hard reset brings it back to life.

After my issue is encountered, the WINC chip will not answer pings, what do you see. BTW it is great that you have a test sketch, I can only duplicate in a full application environment after 3-7 hours. Stay tuned, if I create a ticket with Atmel on the issue pertaining to firmware I will post the ticket ID here. There is a planned firmware update coming in early 2017 (see issue #90).

@sandeepmistry
Copy link
Contributor

@GregTerrell apologies for the delay, I was heads down on some other stuff.

Yes, please go ahead and open a support case with, and provide the case number here. You can also a link to this issue when you open the support case.

Thanks.

@GregTerrell
Copy link
Author

Atmel/Microchip issue number: 00049418

@walendo
Copy link

walendo commented Dec 28, 2016

Just adding another voice to the chorus. I'm in the same, or a closely related, boat.

@GregTerrell Can I ask how you implemented the programmatic reset of the Feather M0?

@sandeepmistry
Copy link
Contributor

@GregTerrell thanks for reporting, please post any updates you receive from Atmel/Microchip here.

@GregTerrell
Copy link
Author

I tied A0 (analog out) to the Feather Reset pin, then in code output 0v, see below. I attempted to use a digital out GPIO first but could not find a method to initialize the pin on load correctly (I was following some suggestions from Instructables.com), also calling address: 0 function doesn't work.

My approach to recovering is likely not the best, but I needed to move on as the WINC issue was killing my project's timeline.

I have a simple library that supports creating a "recovery callback" but you could simply call a method by name or insert the GPIO action directly.

// in application .ino
void wifiRecover(const char * errMessage)
{
  Serial.println("");
  Serial.println("");
  Serial.println("*** Executing application errorRecover()!");
  Serial.print("*** Error Reported: ");
  Serial.println(errMessage);
  Serial.println("*** Reseting Feather...");
  Serial.println("");

  #if defined(OLED_ENABLED)
    // enable the OLED display
    display.clearDisplay();
    display.writeLine(2, "Rebooting");
    display.display();
  #endif
  
  // let display streams get out
  delay(250);

  analogWrite(A0, 0);
}
// in setup()
  /* Register WINC recovery function */
  iotQiRecovery_RegisterCb(wifiRecover);
// in WiFiClient.cpp
// requires  #include <iotQi_Recovery.h> if using callback library

size_t WiFiClient::write(const uint8_t *buf, size_t size)
{
	sint16 err;

	if (_socket < 0 || size == 0) {
		setWriteError();
		return 0;
	}

	// Network led ON (rev A then rev B).
	m2m_periph_gpio_set_val(M2M_PERIPH_GPIO16, 0);
	m2m_periph_gpio_set_val(M2M_PERIPH_GPIO5, 0);

	m2m_wifi_handle_events(NULL);

    //iotQi Recovery setup
	int sendAttempts = 1;

	while ((err = send(_socket, (void *)buf, size, 0)) < 0) {
		// Exit on fatal error, retry if buffer not ready.
		if (err != SOCK_ERR_BUFFER_FULL) {
			setWriteError();
			m2m_periph_gpio_set_val(M2M_PERIPH_GPIO16, 1);
			m2m_periph_gpio_set_val(M2M_PERIPH_GPIO5, 1);
			return 0;
		}

	    //iotQi Recovery test condition
		if (sendAttempts > 1000) {
			iotQiRecovery_RecoverError("[WifiClient.cpp]WiFiClient::write() too many sends with buffer full.");
		} 
		sendAttempts ++;

		m2m_wifi_handle_events(NULL);
	}
	
	// Network led OFF (rev A then rev B).
	m2m_periph_gpio_set_val(M2M_PERIPH_GPIO16, 1);
	m2m_periph_gpio_set_val(M2M_PERIPH_GPIO5, 1);

	return size;
}

Recovery callback repo

@walendo
Copy link

walendo commented Dec 29, 2016

Thanks for the info, Greg. I'll give that a shot. (I, too, went down the "jmp 0" path, briefly).

@sandeepmistry
Copy link
Contributor

@GregTerrell the reset code you mentioned above is also present in the library see: https://github.com/arduino-libraries/WiFi101/blob/master/src/bsp/source/nm_bsp_arduino.c#L149-L155

This gets called on start up, I'm surprised the additional reset code is needed. Could you trying adding a printout to see if it is trigger during the end + begin sequence?

@sandeepmistry sandeepmistry added Waiting for Atmel and removed status: waiting for information More information must be provided before work can proceed labels Jan 3, 2017
@GregTerrell
Copy link
Author

I had previously only tried a WINC reset then wifi.begin(), which didn't work. Name resolution failed following this reset, so I resorted to HW reset of the feather. After your suggestion of wifi.end() I added that to my original soft reset, it worked so I moved on. Don't know if additional reset has any benefit.

I am traveling this week without access to my lab. I will test next week when I return home.

FYI... got a "thank you we will look into it" from Atmel yesterday.

@john3exonets
Copy link

Well, the reset isn't working for me :( Unless its also locking up my Arduino101.... The connection light on my ATWINC1500 board goes out, and I stop getting MQTT messages from it. The 30min timer kicks in, and I run the above reset with 100ms delays between turning the RST pin from LOW to HIGH....but it never reconnects.

@GregTerrell
Copy link
Author

@sandeepmistry the additional WINC reset in my code is not required. It was there due to the path of events to get a reset to work, but commenting it out with simply a WiFi.end() followed by WiFi.begin() is sufficient.

On another note Atmel believes that issue is related to a known problem with the TCP window code in firmware. I just received a NIC capable of promiscuous packet capture and will be working on getting some additional data requested in the next couple days.

@sterwill
Copy link

I don't know if this is helpful, but here's a pcap of a 5 second hang using the sketch and netcat process that I posted earlier. Feel free to forward to Atmel.

atwinc1500_hang.pcap.gz

@GregTerrell
Copy link
Author

GregTerrell commented Jan 18, 2017 via email

@GregTerrell
Copy link
Author

@sterwill thank you for the PCAP file!

I received a message from Atmel this morning that this is...

From our analysis it is the same issue which I have mentioned earlier, TCP Window update issue.

We are already aware of this issue and its fixed in a later internal release version of the firmware. Please find attached the v19.4.8 firmware of WINC1500 in which we have fixed this issue.

Note: WINC1500 FW v19.4.8 is an internal release, please use this only for the development and testing purpose (not recommended for production). There won't be much support for issues related to this firmware (if any), as we are in development phase of newer firmware for official release in Q1 2017. So for production requirements we would recommend you to wait for the official release.

I have not attempted to flash any devices with the firmware they forward. I am not certain how to make this available to others here, or even if I am allowed to. I believe the "official release in Q1 2017" version the technician spoke of is the 19.5.x release mentioned in other WiFi issues. Seems this becoming a very anticipated release.

@erniberni
Copy link

Has anyone observed that the wifi loss is strongly dependent on the use of WiFi.maxLowPowerMode() ? When using this I observe wifi loss mostly within the first hour after reset . When I don't use WiFi.maxLowPowerMode() wifi loss will occure after a day.
I am sending 300-400 bytes data every 15min to a database and do a programmed reset after loosing the wifi. I am also measuring the cpu temperature and as the cpu temperature is 6-8 degC higher for a short time after reset I see the resets directly on the temperature graph.

@sandeepmistry
Copy link
Contributor

@GregTerrell @erniberni @john3exonets @walendo any news on this?

There's PR build of the IDE that contains the new v19.5.2 WINC1500 firmware: arduino/Arduino#6069 (comment). Any one up for testing it with the master version of the library to see if this has been fixed?

@GregTerrell
Copy link
Author

GregTerrell commented Mar 13, 2017 via email

@sterwill
Copy link

The new firmware seems to fix the problem that caused my simple test sketch to hang in under 10 seconds, but when I use an SSL connection I'm still seeing the same problem. A quick peek at the network traffic shows some window scaling, then a hard lock of the microcontroller just like before.

I'll try to prepare a small SSL test sketch to make reproducing it easy.

@sandeepmistry
Copy link
Contributor

@sterwill thanks for trying and the update! We should open a new support ticket with Atmel/Microchip for this.

@GregTerrell
Copy link
Author

I attempted to test with the new firmware last Friday. So far, not much success. I was not able to connect with our cloud instance at all (using a Adafruit Feather M0). At this point I don’t know what is the root cause, but ran out of time. Will continue to test, but the next 2 weeks are very busy.

@sterwill
Copy link

sterwill commented Mar 20, 2017

I couldn't reproduce the issue I reported yesterday with a basic sketch connecting to openssl's "s_server" feeding it endless data, so it's possible my application's lock-up is unrelated to the firmware. This is good news for us all!

I'll do some more testing later this week.

@sterwill
Copy link

@GregTerrell Did you have any custom SSL certificates installed on the M0 before upgrading the firmware? I had to reinstall my certs after upgrading.

@GregTerrell
Copy link
Author

Thanks for the suggestion. The cert used was in the "installed" list under the last firmware (Azure, Baltimore), but that was one of the first things I was going to check. I only had time to flash, try basic test, then call it.

@GregTerrell
Copy link
Author

@sandeepmistry the updated firmware looks promising, been running for 13 hours straight (with 19.4.4 running 6 hours was unlikely). Memory looks stable... encouraging.

I struggled to get the tests to run at all until I updated the WiFi101 library via Arduino IDE to v 0.14.0. I was beginning to think there was a special lib version that I needed to grab. I will setup a longer test over the weekend and hopefully Monday there is another issue to close.

@sandeepmistry
Copy link
Contributor

@GregTerrell great, thanks for the update!

In the meantime, IDE v1.8.2 has been release with the new firmware update tool to accompany the v0.14.0 WiFi101 library. The goal of v0.14.0 is still to remain backwards compatible with f/w 19.4.4 and 19.3.0 as well.

@GregTerrell
Copy link
Author

@sandeepmistry the backward compatibility with 19.4.4 and 19.3.0 is very important, so good! I was simply mentioning that I could not get 19.5.2 to work at all (would not send) until I moved to WiFi101 v 0.14.0. Glad both firmware updater and wifi 0.14.0 are in the IDE install.

@GregTerrell
Copy link
Author

@sandeepmistry I am marking this issue closed.

With the current IDE\WiFi101 (at least 0.14.0) and the Atmel 19.5.2 firmware the WINC hang issue is no longer repeatable. I have a client that has now been running for 72 hours with complete stability. Prior to the new firmware 6 hours was the most you could have hoped for.

@mamama1
Copy link

mamama1 commented Mar 27, 2017

How do I update the winc1500 to the newest Firmware? does the update also work with e.g. the adafruit winc1500b breakout board?

@sandeepmistry
Copy link
Contributor

@GregTerrell awesome, thanks for the follow up and closing!

@mamama1

How do I update the winc1500 to the newest Firmware? does the update also work with e.g. the adafruit winc1500b breakout board?

Download IDE 1.8.2 and follow https://www.arduino.cc/en/Tutorial/FirmwareUpdater

It should work with the Adafruit board, but we haven't tested it. Please contact Adafruit if you run into issues.

@marcel12bell
Copy link

@GregTerrell @sandeepmistry Thanks. I tried to update my ATWINC1500-MR210PB but got a an 'error while erasing flash memory'. Any idea how to solve it?

@sandeepmistry
Copy link
Contributor

@marcel12bell I would suggest you open a new issue here: https://github.com/arduino-libraries/WiFi101-FirmwareUpdater-Plugin/issues - and provide more details on your setup and the steps you are following.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: imperfection Perceived defect in any part of project
Projects
None yet
Development

No branches or pull requests

10 participants