Wifi Access Point Mode - HttpListener hangs #1335

Open
Alex-111 opened this issue Jul 28, 2023 · 36 comments

@Alex-111

Alex-111 commented Jul 28, 2023

Target name(s)

ESP32-S3 DevkitC-1

Firmware version

latest - 1.8.1.370 ESP32-S3

Was working before? On which version?

No response

Device capabilities

No response

Description

When setting up a SoftAP the HttpListener sometimes does not accept any new requests.

How to reproduce

I started with the provided sample code "WifiAP" and would like to set up a simple Wi-Fi access point with a very basic web server. To test, I connected my Android phone to the SSID and requested http://192.168.4.1 in the browser. The first request seems to work...

But especially when I connect my smartphone to another SSID and then return to my nanoFramework AP, no requests are accepted anymore and the browser just hangs.

It seems that the socket listener simply does not return anymore.

Here is my sample code: https://github.com/Alex-111/WiFiAPTest/tree/master

Expected behaviour

I would expect HttpListener to always accept web requests, regardless of whether I connect my smartphone to another Wi-Fi network and later reconnect it to the SoftAP.

Screenshots

No response

Additional information

No response

@Alex-111 Alex-111 changed the title Wifi Access Point Mode - HttpListener hangs ESP32 - Wifi Access Point Mode - HttpListener hangs Jul 28, 2023
@Alex-111 Alex-111 changed the title ESP32 - Wifi Access Point Mode - HttpListener hangs Wifi Access Point Mode - HttpListener hangs Jul 28, 2023
@josesimoes
Member

@Ellerbach wondering if this is somehow related (or similar) to the fix you made the other day on the webserver...

@Ellerbach
Member

I tested the code and it works as expected for me:
First request: I opened a browser and went to the 192.168.4.1 page.
Then I connected to another SSID.
Then I connected back to MySsid.
Then I went to the page again:

image

@Ellerbach
Member

I've repeated this multiple times with different procedures (closing the browser before leaving, leaving it open, refreshing, etc.), and it always worked as expected.
So I'm closing this issue. This may be specific to the browser or the phone.

@Alex-111
Author

Alex-111 commented Aug 4, 2023

@Ellerbach Thanks for testing!

Please could you tell me more about your setup:
What firmware do you use? What device do you use? Maybe it is specific to a special device/firmware combination?

@alberk8

alberk8 commented Aug 4, 2023

I have the same issue as @Alex-111 and I am using an Android phone.

@Alex-111
Author

Alex-111 commented Aug 4, 2023

@josesimoes As more people have this issue, I think we should investigate a little more before closing it?

@josesimoes
Member

@Alex-111 : @Ellerbach owns this issue, up to him. 😉

@Ellerbach
Member

Firmware: ESP32_REV0-1.8.1.419
Device: ESP32 (a basic one)
Phone: iPhone

So let me reopen the issue, I'll try with other devices then.

@Ellerbach Ellerbach reopened this Aug 4, 2023
@Ellerbach
Member

I've tried this time with ESP32-S3
Firmware: ESP32_S3-1.8.1.375
Phone: iPhone

Still works as expected!

@Ellerbach
Member

Just tried with an Android phone (Samsung) and it also works as expected. I tried with the ESP32-S3.
Same scenario: connecting to the SSID, confirming that I want to use the network without internet, going to 192.168.4.1, getting the page. Then connecting to another SSID, doing something, connecting back to MySsid, again confirming that I want to use the network without internet, going to 192.168.4.1, and the page loads perfectly.

So I'm really not sure what's happening with both @alberk8 and @Alex-111 but I cannot reproduce your problem with ESP32, ESP32-S3, iPhone and Android!

@alberk8

alberk8 commented Aug 5, 2023

Are you closing and opening the web page again?

To replicate

  1. Connect to nf AP
  2. Open browser to http://192.168.4.1
    (web page loads)
  3. Change to another AP and wait for a few seconds
  4. Change back to nf AP.
  5. Go back to the page in step 2 and refresh. On Android I just swipe down.
  6. The page will be loading.........

@Alex-111
Author

Alex-111 commented Aug 5, 2023

Are you closing and opening the web page again?

To replicate

  1. Connect to nf AP
  2. Open browser to http://192.168.4.1
    (web page loads)
  3. Change to another AP and wait for a few seconds
  4. Change back to nf AP.
  5. Go back to the page in step 2 and refresh. On Android I just swipe down.
  6. The page will be loading.........

Take particular care at step 5: sometimes the page appears as expected because of the browser cache, but you still see the loading indicator, i.e. the browser cannot get data...
Also, the debug output no longer shows any request.
Maybe it is also related to the hardware configuration. I think @alberk8 and I are using a device without PSRAM -> (ESP32-S3-DevkitC-1 in my case)

@alberk8

alberk8 commented Aug 6, 2023

Additional context: if I wait long enough, around 5 minutes, there is an error. A new listener is then created and the page refreshes without issue. The same thing also happens when I run the app on ESP32 or ESP32_S3, with or without PSRAM.

listener.GetContext()
Get Context 1, this is next line after the _listener.GetContext()
    ++++ Exception System.Net.Sockets.SocketException - 0x00000000 (4) ++++
    ++++ Message:
    ++++ System.Net.InputNetworkStreamWrapper::Read_HTTP_Line [IP: 015a] ++++
    ++++ System.Net.HttpListenerRequest::ParseHTTPRequest [IP: 000d] ++++
    ++++ System.Net.HttpListenerContext::get_Request [IP: 000d] ++++
    ++++ WifiAP.WebServerSimple::RunServer [IP: 0031] ++++
Request:
Process Request Ends
    ++++ Exception System.Net.Sockets.SocketException - CLR_E_FAIL (4) ++++
    ++++ Message:
    ++++ System.Net.Sockets.NativeSocket::send [IP: 0000] ++++
    ++++ System.Net.Sockets.Socket::Send [IP: 0018] ++++
    ++++ System.Net.Sockets.NetworkStream::Write [IP: 0051] ++++
    ++++ System.Net.HttpListenerResponse::SendHeaders [IP: 003f] ++++
    ++++ System.Net.HttpListenerResponse::Close [IP: 0010] ++++
    ++++ WifiAP.WebServerSimple::RunServer [IP: 0031] ++++
System.Net.Sockets.SocketException: Exception was thrown: System.Net.Sockets.SocketException

@Ellerbach
Member

Are you closing and opening the web page again?

I did, with several variations:

  • closing the page and reopening
  • coming back and just refresh
  • closing the page, the browser, coming back reopening a page

All worked as expected! The ESP32 device I'm using does not have PSRAM, it's the very basic one; the ESP32-S3 is a DevKit-M.
It works fine with Edge as the browser on both iPhone and Android! So I'm sorry, but I really can't reproduce this :-( Being able to would make things much easier!

@Alex-111
Author

Alex-111 commented Aug 6, 2023

Yes. It is very strange that it works without issues on your side, but I have exactly the same situation as @alberk8.
So let's think again: what is the difference?

My setup:
image

The packages I use:

packages\nanoFramework.Iot.Device.DhcpServer.1.2.300\lib\Iot.Device.DhcpServer.dll
packages\nanoFramework.CoreLibrary.1.14.2\lib\mscorlib.dll
packages\nanoFramework.ResourceManager.1.2.13\lib\nanoFramework.ResourceManager.dll
packages\nanoFramework.Runtime.Events.1.11.6\lib\nanoFramework.Runtime.Events.dll
packages\nanoFramework.Runtime.Native.1.6.6\lib\nanoFramework.Runtime.Native.dll
packages\nanoFramework.System.Collections.1.5.18\lib\nanoFramework.System.Collections.dll
packages\nanoFramework.System.Text.1.2.37\lib\nanoFramework.System.Text.dll
packages\nanoFramework.System.Device.Gpio.1.1.28\lib\System.Device.Gpio.dll
packages\nanoFramework.System.Device.Wifi.1.5.54\lib\System.Device.Wifi.dll
packages\nanoFramework.System.IO.Streams.1.1.38\lib\System.IO.Streams.dll
packages\nanoFramework.System.Net.1.10.52\lib\System.Net.dll
packages\nanoFramework.System.Net.Http.Server.1.5.97\lib\System.Net.Http.dll
packages\nanoframework.System.Net.Sockets.TcpClient.1.1.52\lib\System.Net.Sockets.TcpClient.dll
packages\nanoFramework.System.Threading.1.1.19\lib\System.Threading.dll
packages\nanoFramework.Windows.Storage.1.5.33\lib\Windows.Storage.dll
packages\nanoFramework.Windows.Storage.Streams.1.14.24\lib\Windows.Storage.Streams.dll

The situation is the same with or without the debugger attached...

@Ellerbach Any idea what else we could check?

@Alex-111
Author

Alex-111 commented Aug 7, 2023

@Ellerbach @alberk8
I've done some new tests and want to share my observations:

  • With an old iPhone it works almost every time. It is very difficult to replicate the issue there.
  • I also changed the code to allow two concurrent connections and then connected with the Android phone and the iPhone in parallel. On the Android phone the behavior is the same as before (hanging), but the iPhone now hangs much more often. It seems that it also hangs when the Android phone disconnects and the iPhone then gets the "faulty session".
  • The most interesting thing: with both the Android phone and the iPhone connected to the SSID, if the iPhone hangs you can immediately end that hang by making a request from the Android phone. It seems the second connection triggers the "AutoResetEvent" in the HttpListener and then both requests are processed...

Unfortunately I still do not know what exactly causes the hang. But maybe some of you can investigate the native code. To me it really looks like the socket Accept does not return.

Any idea what happens in this socket code, if there are two requests in parallel? Is it ensured that no request is lost?

image

@Ellerbach
Member

Any idea what happens in this socket code, if there are two requests in parallel? Is it ensured that no request is lost?

The sample is written in a very simple way and is not meant to scale. Use the "real" WebServer nuget to get everything working with multiple parallel requests. That comes at the cost of size. The sample shows how to set up the device in the typical case where one single phone connects one single time :-) And where you can retry by just rebooting the device.

Glad you figured out a way. A PR to improve the robustness of the sample is always welcome, btw!

@Alex-111
Author

Alex-111 commented Aug 7, 2023

@Ellerbach I'm aware of the drawbacks of this simple webserver, but regardless of which webserver I use, the issue stays....

I also tried your webserver nuget, but when looking at the code of the full featured webserver there is no difference. Both use the HttpListener, which in my opinion has the same problem in this case.... There is the same "_listener.GetContext()" which just does not return in that case... i.e. this has nothing to do with the webserver itself...

@Ellerbach
Member

I also tried your webserver nuget, but when looking at the code of the full featured webserver there is no difference. Both use the HttpListener, which in my opinion has the same problem in this case.... There is the same "_listener.GetContext()" which just does not return in that case... i.e. this has nothing to do with the webserver itself...

Let me look at this as well then. Note that on the ESP side there is also some bad socket behavior that is related to Espressif, nothing we can change. Here is an example:

  • Set up an ESP as a server, and request an HTTP page where you keep the connection open
  • Do the request on the client side
  • Stop, and don't close the socket on the client side
  • The ESP is blocked: you can't do another request later from the same IP address, and it can fully block all the networking

In this scenario, that's related to how things are managed on the Espressif side. It is totally independent of anything on the nano side, unfortunately. So you'll see some side effects like this one that you cannot control. This is done differently on devices like the STM32.

Those devices are not meant to be highly scalable as web servers or socket servers, but rather to handle one connection, or at best a few.
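The client side of the scenario listed above can be approximated with a small desktop .NET helper. This is only a sketch for reproducing the steps, not nanoFramework code and not taken from the sample: `StalledClient` and `OpenAndStall` are hypothetical names, and the host and port are whatever the device under test exposes.

```csharp
using System;
using System.Net.Sockets;
using System.Text;

public static class StalledClient
{
    // Connects, sends one keep-alive HTTP request, and then returns the
    // still-open TcpClient WITHOUT reading the response or closing the
    // socket. As long as the caller keeps the returned object alive, the
    // server sees a peer that asked for a page and then went silent.
    public static TcpClient OpenAndStall(string host, int port)
    {
        TcpClient client = new TcpClient();
        client.Connect(host, port);

        byte[] request = Encoding.ASCII.GetBytes(
            "GET / HTTP/1.1\r\n" +
            "Host: " + host + "\r\n" +
            "Connection: keep-alive\r\n" +
            "\r\n");

        client.GetStream().Write(request, 0, request.Length);

        // Deliberately no Read() and no Close() here.
        return client;
    }
}
```

Assumed usage against the SoftAP address mentioned in this thread: `TcpClient stalled = StalledClient.OpenAndStall("192.168.4.1", 80);`, then keep `stalled` referenced and try a normal browser request from the same machine.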

@Alex-111
Author

Alex-111 commented Aug 7, 2023

@Ellerbach thanks for your answer. This sounds really similar to the issue we have here. But isn't there a way to work around this? E.g. maybe it is possible to set a timeout for the blocking call, so that it does not block forever.

Imagine you have an IoT device which can be configured via SoftAP. If anybody connects and just goes away without closing the socket connection, then we would be forced to reset the device. This is really not what we want...

@Ellerbach
Member

Imagine you have an IoT device which can be configured via SoftAP. If anybody connects and just goes away without closing the socket connection, then we would be forced to reset the device. This is really not what we want...

You definitely can add a timeout, that's totally possible. Still, at a lower level, there are some things that can break. For example, I've been using an ESP based device flashed with WLED (I'm using it for notifications). If I use this device for the tests we're running here (I've tried ;-)), it ends up fully blocked. Nothing I can do except reboot it. And it's native C, directly using the Espressif API. A timeout will help in your scenario, but again, those devices are far from perfect! Adding a watchdog, disposing everything through a timer, things like this are definitely good practice in all cases!
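The watchdog idea mentioned above could look roughly like the following. It is a minimal sketch built only on `System.Threading.Timer`; `RequestWatchdog`, `Arm` and `Disarm` are hypothetical names, and what the stall callback actually does (stopping and recreating the listener, rebooting the device) is left to the caller. That the same code builds unchanged for nanoFramework is an assumption.

```csharp
using System;
using System.Threading;

// One-shot watchdog: Arm() before a call that may block, Disarm() once it
// returns. If Disarm() is not reached within timeoutMs, onStall runs on a
// timer thread.
public sealed class RequestWatchdog : IDisposable
{
    private readonly Timer _timer;
    private readonly int _timeoutMs;

    public RequestWatchdog(int timeoutMs, Action onStall)
    {
        _timeoutMs = timeoutMs;

        // Created stopped; Timeout.Infinite means "do not fire yet".
        _timer = new Timer(state => onStall(), null, Timeout.Infinite, Timeout.Infinite);
    }

    // Starts (or restarts) the countdown; fires once, no periodic repeat.
    public void Arm()
    {
        _timer.Change(_timeoutMs, Timeout.Infinite);
    }

    // Cancels a pending countdown.
    public void Disarm()
    {
        _timer.Change(Timeout.Infinite, Timeout.Infinite);
    }

    public void Dispose()
    {
        _timer.Dispose();
    }
}
```

A typical pattern would be to call `Arm()` right before writing the response and `Disarm()` right after, so the callback only runs when the write did not come back in time.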

@Alex-111
Author

Alex-111 commented Aug 7, 2023

@Ellerbach I updated my repro to try to stop the HttpListener on WIFI disconnection. Is this what you mean I should do on timeout? Dispose the HttpListener under some conditions? Or is there another timeout parameter I'm not aware of?

My sample repro works better with this new logic, but there are still some situations where it just blocks, even if I dispose the HttpListener and create it again after a WIFI client connects....

If this is really the best we can get, then I would have expected a little more reliability... Not sure if this could go beyond a hobby project in that case?

Another thought: Couldn't we open a ticket at Espressif, if this is a known issue?

@Ellerbach
Member

Yes, you basically have to play with all this. You can also add a big try {} catch {} in the Main function with a global recovery mechanism.
If you want, you can also periodically restart the webserver. Things like this.

Another thought: Couldn't we open a ticket at Espressif, if this is a known issue?

I'm sure one is open among the 1K+ issues ;-) https://github.com/espressif/esp-idf/issues
There are 57 open issues just about sockets, and some seem very similar to the problem I describe.

@networkfusion
Member

IDF has been updated since the last comments. Is this still blocked?

@Ellerbach
Member

As it's been 3 months since the last feedback on this issue, I'm closing it. If the problem persists, feel free to reopen it.

@Alex-111
Author

Alex-111 commented Jun 3, 2024

@Ellerbach
@AdrianSoundy
It could not be tested because of #1488.
But now, with the latest firmware, it still stops responding after a few requests in my tests with ESP_S3. Tested with the WifiAP project from the samples. It seems ticket 1488 is still not 100% fixed, so we cannot debug code with Visual Studio at the moment. I will test more when debugging works again.

@Ellerbach Ellerbach reopened this Jun 3, 2024
@Ellerbach
Member

So, reopening the issue. Thanks for providing updates.

@Alex-111
Author

Alex-111 commented Jun 10, 2024

@Ellerbach

Now #1493 is fixed and I did some further tests with my S3 and the WIFIAP sample code. When it hangs, it always blocks at this line and does not return from writing to the stream. To make it block I just have to refresh the webpage (with "pull to refresh") from my Android phone about 2 or 3 times. After this it hangs completely and has to be rebooted:

image

Any ideas why this could happen? It feels like a deadlock.

Edit: I left the debugger running and found out that after some minutes, maybe 10 or 15, the blocking code (writing to the stream) returns with:
++++ Exception System.Net.Sockets.SocketException - CLR_E_FAIL (4) ++++
++++ Message:
++++ System.Net.Sockets.NativeSocket::send [IP: 0000] ++++
++++ System.Net.Sockets.Socket::Send [IP: 0018] ++++
++++ System.Net.Sockets.NetworkStream::Write [IP: 0051] ++++
++++ System.Net.OutputNetworkStreamWrapper::Write [IP: 0022] ++++
++++ WifiAP.WebServer::OutPutByteResponse [IP: 001d] ++++
++++ WifiAP.WebServer::ProcessRequest [IP: 0070] ++++
++++ WifiAP.WebServer::RunServer [IP: 003b] ++++
Exception thrown: 'System.Net.Sockets.SocketException' in System.Net.dll
An unhandled exception of type 'System.Net.Sockets.SocketException' occurred in System.Net.dll

@Ellerbach
Member

It definitely requires some investigation, and it will require instrumenting the web server for debugging.
If you are willing to, here is what I have in mind:

@Alex-111
Author

@Ellerbach Meanwhile I had a look at the code and it seems to block here:

image

From my understanding this is not directly related to the webserver, but to the HttpListener.

response is of type HttpListenerResponse, and in this line the data is written directly to the stream, which seems to be a NetworkStream -> Socket behind the scenes. So I fear we are here already on the native side?

@Ellerbach
Member

So I fear we are here already on the native side?

Check first on the WebServer side: there may be a way to prevent this from happening, for example if the stream is not properly disposed or something like that. Then, yes, it's about following the rabbit hole the same way through the HTTP stack and then the native code.

@Alex-111
Author

@Ellerbach
I just tried to debug the managed side and pulled System.Net and System.Net.Http. I could get it to compile after upgrading all nuget packages, and I could also deploy it via Visual Studio. But the debugger does not attach anymore. There is also no error in the Visual Studio logs.

On the serial line I see the following output. It seems that the native code doesn't start anymore:

ESP-ROM:esp32s3-20210327
Build:Mar 27 2021
rst:0x1 (POWERON),boot:0x8 (SPI_FAST_FLASH_BOOT)
SPIWP:0xee
mode:DIO, clock div:1
load:0x3fce3818,len:0x1380
load:0x403c9700,len:0x4
load:0x403c9704,len:0xba4
load:0x403cc700,len:0x2c5c
SHA-256 comparison failed:
Calculated: 4020fa8290bd1c9845aee04dd4720555b4e4e5abf4f130e917b7f0c9a86e863e
Expected: ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
Attempting to boot anyway...
entry 0x403c98f0
I (45) boot: .NET nanoFramework 2nd stage bootloader ESP-IDF v5.1.3
I (45) boot: build Jun 10 2024 08:56:51
I (45) boot: chip revision: v0.1
I (49) boot.esp32s3: Boot SPI Speed : 80MHz
I (54) boot.esp32s3: SPI Mode : DIO
I (59) boot.esp32s3: SPI Flash Size : 8MB
I (63) boot: Enabling RNG early entropy source...
I (69) boot: Partition Table:
I (72) boot: ## Label Usage Type ST Offset Length
I (80) boot: 0 nvs WiFi data 01 02 00009000 00006000
I (87) boot: 1 phy_init RF data 01 01 0000f000 00001000
I (95) boot: 2 factory factory app 00 00 00010000 001a0000
I (102) boot: 3 deploy Unknown data 01 84 001b0000 002e0000
I (110) boot: 4 config Unknown data 01 82 00490000 00200000
I (117) boot: End of partition table
I (121) esp_image: segment 0: paddr=00010020 vaddr=3c0d0020 size=24f78h (151416) map
I (147) esp_image: segment 1: paddr=00034fa0 vaddr=3fc99e00 size=03ea8h ( 16040) load
I (149) esp_image: segment 2: paddr=00038e50 vaddr=40374000 size=071c8h ( 29128) load
I (157) esp_image: segment 3: paddr=00040020 vaddr=42000020 size=ce614h (845332) map
I (256) esp_image: segment 4: paddr=0010e63c vaddr=4037b1c8 size=0ebech ( 60396) load
I (265) esp_image: segment 5: paddr=0011d230 vaddr=600fe000 size=00064h ( 100) load
I (275) boot: Loaded app from partition at offset 0x10000
I (275) boot: Disabling RNG early entropy source...

Any ideas? Is this the right way to debug the managed framework code?

@alberk8

alberk8 commented Jun 14, 2024

@Alex-111, you should be able to debug as usual via VS once the app is deployed. It is easier (faster) to get support if you go to the nF Discord server.

@Ellerbach
Member

@Alex-111 all libs should be up to date, as we have automations for that. So I'm not sure what's happening!

@AdrianSoundy
Member

AdrianSoundy commented Jun 14, 2024

The WiFiAP sample lacks some error handling, which shows up when you quickly refresh the page.

There should be error handling in webserver.cs in ProcessRequest():
a try/catch around the main switch,
and another try/catch around the response.Close();

Maybe response.Close(); should do its own exception handling internally to make sure the socket handle is closed.

In my testing this eliminates both the problems caused by the exceptions and the hangs from uncaught exceptions.
You will always get exceptions when refreshing pages, as the socket can be closed by the browser while a response is being written.

Maybe some more testing can be done with this change and the sample updated.

Blocking on the write can mean the browser is no longer reading the socket but still has it open.
These sorts of things time out eventually. You need a web server that can process multiple requests. For WiFiAP to handle that, ProcessRequest() would need to run on a separate worker thread.
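The error handling and worker thread suggested above might be sketched as follows. This is an illustration under stated assumptions, not the sample's actual webserver.cs: `SafeRequestHandler`, its method names and the placeholder page body are hypothetical, and the code is written against the standard `System.Net` HttpListener types, on the assumption that the nanoFramework versions expose the same members.

```csharp
using System;
using System.Diagnostics;
using System.Net;
using System.Text;
using System.Threading;

public static class SafeRequestHandler
{
    // Handles one request. No exception escapes to the accept loop, and
    // response.Close() is always attempted, inside its own try/catch.
    public static void ProcessRequest(HttpListenerContext context)
    {
        HttpListenerResponse response = context.Response;

        try
        {
            switch (context.Request.HttpMethod)
            {
                case "GET":
                    // Hypothetical placeholder page.
                    byte[] body = Encoding.UTF8.GetBytes("<html><body>OK</body></html>");
                    response.ContentType = "text/html";
                    response.ContentLength64 = body.Length;
                    response.OutputStream.Write(body, 0, body.Length);
                    break;

                default:
                    response.StatusCode = 405; // Method Not Allowed
                    break;
            }
        }
        catch (Exception ex)
        {
            // Expected now and then, e.g. the browser drops the socket mid-write.
            Debug.WriteLine("Request failed: " + ex.Message);
        }
        finally
        {
            try
            {
                response.Close();
            }
            catch (Exception ex)
            {
                Debug.WriteLine("Close failed: " + ex.Message);
            }
        }
    }

    // Runs the handler on its own thread so one stalled client does not
    // keep the accept loop from calling GetContext() again.
    public static void HandleOnWorker(HttpListenerContext context)
    {
        new Thread(() => ProcessRequest(context)).Start();
    }
}
```

In an accept loop this would be called as `SafeRequestHandler.HandleOnWorker(listener.GetContext());`. A write that blocks would still occupy its worker thread until the socket times out, so this only keeps other requests from being held up behind it.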

@Alex-111
Author

Alex-111 commented Jun 18, 2024

WebServer.cs.txt
@AdrianSoundy
Thanks for the hints. I tried them all:

The try/catch does not catch any exceptions in my case. It still just hangs in ...Outputstream.Write.

I tried to go deeper into the nanoFramework libraries, but after referencing System.Net and System.Net.Http as source code the debugger does not attach anymore. It just fails after deploying....

EDIT:
I also tried to execute response.Close() in another thread when the ...Outputstream.Write hangs. But the "hanging" is not released.... This very hacky test is attached as a file in my comment -> see webserver.cs.txt
