Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When accepting a socket connection and ERROR_NETNAME_DELETED occurs, leads this into a closing of the serving socket (BaseProactorEventLoop) #93821

Open
Cacsjep opened this issue Jun 14, 2022 · 18 comments
Labels
topic-asyncio type-bug An unexpected behavior, bug, or error

Comments

@Cacsjep
Copy link

Cacsjep commented Jun 14, 2022

Hey Guys

Application Description

We discover on our uvicorn server sometimes that the listening socket is closing.
The Uvicorn get image uploads from cameras and sometimes or often releated to mobile network conditions the connection is lost.
And when a connection is lost on remote site, this error occurs sometimes.

I am not very deep into such low level python code so I tried it hard to find the root couse and this are my results.
I hope this results show's that there is sometings and we can talk about that or you can give me hints or help do resolve this issue.

thank you in advance

Bug/Error that occurs

The Error occurs on ov.getresult() (asyncio.windows_events.py:560)
OSError: [WinError 64] The specified network name is no longer available.

After that the server is unresponsive and the listening socket is closed. In that case we need to restart the service.

To me it looks like this happens when ov.getresult() is called and the remote host is disconnected.
When I just wrap this function in a try\except block the listening socket is not closing and uvicorn detects a connection lost.

Please let me know why just wrapping this into a try block resolve this issue,
because I don't know ;-)

Original

def finish_accept(trans, key, ov):
     ov.getresult() # Here happen the OSError
     # Use SO_UPDATE_ACCEPT_CONTEXT so getsockname() etc work.
     buf = struct.pack('@P', listener.fileno())
     conn.setsockopt(socket.SOL_SOCKET,
         _overlapped.SO_UPDATE_ACCEPT_CONTEXT, buf)
     conn.settimeout(listener.gettimeout())
     return conn, conn.getpeername()

Mod (No socket closings)

def finish_accept(trans, key, ov):
     try:
         ov.getresult()
     except OSError as Error:
         print("This helps", Error)
     # Use SO_UPDATE_ACCEPT_CONTEXT so getsockname() etc work.
     buf = struct.pack('@P', listener.fileno())
     conn.setsockopt(socket.SOL_SOCKET,
         _overlapped.SO_UPDATE_ACCEPT_CONTEXT, buf)
     conn.settimeout(listener.gettimeout())
     return conn, conn.getpeername()

Error without the try\catch block

Task exception was never retrieved
future: <Task finished name='Task-248' coro=<IocpProactor.accept.<locals>.accept_coro() done, defined at C:\Program Files\Python310\lib\asyncio\windows_events.py:570> exception=OSError(22, 'The specified network name is no longer available', None, 64, None) created at C:\Program Files\Python310\lib\asyncio\tasks.py:636>
source_traceback: Object created at (most recent call last):
  File "Y:\code_projects\pol\pol\server\test\brotle.py", line 31, in <module>
    asyncio.run(serve(Starlette(debug=True, routes=routes), config), debug=True)
  File "C:\Program Files\Python310\lib\asyncio\runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "C:\Program Files\Python310\lib\asyncio\base_events.py", line 633, in run_until_complete
    self.run_forever()
  File "C:\Program Files\Python310\lib\asyncio\windows_events.py", line 321, in run_forever
    super().run_forever()
  File "C:\Program Files\Python310\lib\asyncio\base_events.py", line 600, in run_forever
    self._run_once()
  File "C:\Program Files\Python310\lib\asyncio\base_events.py", line 1888, in _run_once
    handle._run()
  File "C:\Program Files\Python310\lib\asyncio\events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "C:\Program Files\Python310\lib\asyncio\proactor_events.py", line 838, in loop
    f = self._proactor.accept(sock)
  File "C:\Program Files\Python310\lib\asyncio\windows_events.py", line 580, in accept
    tasks.ensure_future(coro, loop=self._loop)
  File "C:\Program Files\Python310\lib\asyncio\tasks.py", line 615, in ensure_future
    return _ensure_future(coro_or_future, loop=loop)
  File "C:\Program Files\Python310\lib\asyncio\tasks.py", line 636, in _ensure_future
    return loop.create_task(coro_or_future)
Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\asyncio\windows_events.py", line 573, in accept_coro
    await future
  File "C:\Program Files\Python310\lib\asyncio\windows_events.py", line 819, in _poll
    value = callback(transferred, key, ov)
  File "C:\Program Files\Python310\lib\asyncio\windows_events.py", line 561, in finish_accept
    ov.getresult()
OSError: [WinError 64] The specified network name is no longer available

Minimal Example to reproduce

  1. Start Server
  2. Start Client

Server (Minimal Reproducible Example)

from starlette.applications import Starlette
from starlette.responses import JSONResponse
from starlette.routing import Route

async def test_case(request):
    return JSONResponse({'hello': 'world'})

routes = [
    Route("/", endpoint=test_case, methods=['POST'])
]

##### Uvicorn Test
import uvicorn
uvicorn.run(
    Starlette(debug=True, routes=routes),
    reload=False,
    log_level="trace",
    host="0.0.0.0",
    port=8088,
)


##### Hypercorn Test
#import asyncio
#from hypercorn.config import Config
#from hypercorn.asyncio import serve

#config = Config()
#config.bind = ["localhost:8088"]
#asyncio.run(serve(Starlette(debug=True, routes=routes), config), debug=True)

Client (Minimal Reproducible Example)

import threading
import requests
import time
import sys


def test():
    for i in range(100):
        requests.post("http://127.0.0.1:8088/")

if __name__ == "__main__":
    for i in range(100):
        thread = threading.Thread(target=test, args=(), daemon=True)
        thread.start()
    time.sleep(1)
    sys.exit()

Tested with:
Uvicorn [x]
Hypercorn [x]

Environment
Python 3.10.5 (tags/v3.10.5:f377153, Jun 6 2022, 16:14:13) [MSC v.1929 64 bit (AMD64)] on win32

Linked PRs

@Cacsjep Cacsjep added the type-bug An unexpected behavior, bug, or error label Jun 14, 2022
@kumaraditya303
Copy link
Contributor

Have you reported this to uvicorn/starlette bug tracker? Can you provide a reproducer with only asyncio and no third-party packages? Have you tested main branch?

@Cacsjep
Copy link
Author

Cacsjep commented Jun 15, 2022

I will try to create a reproducer with only asncio, I just testet latest released version and not main branch.

When a OS error happen on the finish_accept function then BaseProactorEventLoop._start_serving.loop falls into that OSError except block and close the listener sock at line 864.

But its not the listening sock that produce this OS Error it is the connection socket, so an OS Error on connection sock leads into closing the listening sock.

Connection socket after this issue happen:
<socket.socket fd=848, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('127.0.0.1', 8088), raddr=(0, b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')> (0, b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')

While i am trying to create a reproducer with asyncio could you tell me, is that a expected behaviour?

Thank u

@Cacsjep
Copy link
Author

Cacsjep commented Jun 15, 2022

Tested with latest main branch and also can reproduce this error.
Here is a reproducer with asyncio, hope that helps.

Server

import asyncio


class Serve(asyncio.Protocol):
    def connection_made(self, transport):
        self.transport = transport

    def data_received(self, data):
        self.transport.write(b"OK")

async def main():
    loop = asyncio.get_running_loop()
    server = await loop.create_server(lambda: Serve(), '127.0.0.1', 8088)

    async with server:
        await server.serve_forever()

asyncio.run(main(), debug=True)

Client

import threading
import requests
import sys

def create():
    for i in range(100):
        requests.post("http://127.0.0.1:8088/", data="acs")

if __name__ == "__main__":
    for i in range(100):
        thread = threading.Thread(target=create, args=(), daemon=True)
        thread.start()
    sys.exit()

@Cacsjep
Copy link
Author

Cacsjep commented Jun 16, 2022

Latest Findings

When AcceptEx encounter an ERROR_NETNAME_DELETED with WSAGetLastError() (overlapped.c),
this leads into an OSError when ov.getresult() is called at windows_events.py:576, this OSError is then catched up by BaseProactorEventLoop proactor_events.py:856 and leads into closing the listening sock.

The "real" underlying error that on my case occurs is a "WSAECONNRESET" while accepting the connection socket.

How can we deal with WSAECONNRESET or ERROR_NETNAME_DELETED errors when accpeting a socket without closing the listening socket, a easy fix is just closing accept socket and raise a ConnectionResetError that is cached from BaseProactorEventLoop serving loop.

--- a/Lib/asyncio/windows_events.py
+++ b/Lib/asyncio/windows_events.py
@@ -577,7 +577,13 @@ def accept(self, listener):
         ov.AcceptEx(listener.fileno(), conn.fileno())
 
         def finish_accept(trans, key, ov):
-            ov.getresult()
+            try:
+                ov.getresult()
+            except OSError as exc:
+                if exc.winerror in (_overlapped.ERROR_NETNAME_DELETED,
+                                    _overlapped.ERROR_OPERATION_ABORTED):
+                    conn.close()
+                    raise ConnectionResetError(*exc.args)
+                raise
             # Use SO_UPDATE_ACCEPT_CONTEXT so getsockname() etc work.
             buf = struct.pack('@P', listener.fileno())
             conn.setsockopt(socket.SOL_SOCKET,
index ddb9daca02..4b3d24c4e8 100644
--- a/Lib/asyncio/proactor_events.py
+++ b/Lib/asyncio/proactor_events.py
@@ -854,6 +854,8 @@ def loop(f=None):
                 if self.is_closed():
                     return
                 f = self._proactor.accept(sock)
+            except ConnectionResetError:
+                self.call_soon(loop)
             except OSError as exc:
                 if sock.fileno() != -1:
                     self.call_exception_handler({

@Cacsjep Cacsjep changed the title windows_events.py ov.getresult() in finish_accept, returns a OS Error that is not catched and results in a socket closing When accepting a socket connection and ERROR_NETNAME_DELETED occurs, leads this into a closing of the serving socket (BaseProactorEventLoop) Jun 16, 2022
@kumaraditya303
Copy link
Contributor

Can you post a diff rather than copying functions as it will easier to apply your changes? Also seems like #18199 is similar to this issue.

@Cacsjep
Copy link
Author

Cacsjep commented Jun 17, 2022

Sure, I updated my latest comment. Yes looks similar.

@gvanrossum
Copy link
Member

The PR linked above (gh-18199) is more appropriate for the issue in recv() and friends, gh-83037. The current issue is about accept(). A separate fix needs to be developed.

Assuming the best reaction to ERROR_NETNAME_DELETED is indeed retrying, there would be a number of options. We could special-case that error in proactor_events.py::BaseProactorEventLoop._start_serving.loop, or we could silently retry in windows_events.py::accept.finish_accept.

I have another niggling suspicion. There is a background Task involved here, created by

        tasks.ensure_future(coro, loop=self._loop)

in windows_events.py::accept. It is known that sometimes such tasks get garbage-collected before they've completed (e.g. #96323). But really, I don't think it is the cause of @iUnknwn's problem, based on the logs he sent me privately (those show that task as completed with exception=OSError(22, 'The specified network name is no longer available', None, 64, None)).

Anyway, if anyone knows how to reproduce getting ERROR_NETNAME_DELETED from an accept() call, please let us know!

@gvanrossum
Copy link
Member

@kumaraditya303 Have you ever heard of ERROR_NETNAME_DELETED?

@kumaraditya303
Copy link
Contributor

I remember this, it is a very annoying error ERROR_NETNAME_DELETED and hard to reproduce, I don't remember much now and haven't used asyncio on Windows in a while.

@gvanrossum
Copy link
Member

Could it be caused by IIS in reverse proxy mode?

@kumaraditya303
Copy link
Contributor

One thing I remember is that when I was getting this I looked at how other programming languages handle it and IIRC golang just retries in this case. This might have changed as it's been a while since then.

@kumaraditya303
Copy link
Contributor

Could it be caused by IIS in reverse proxy mode?

I don't know. Maybe @eryksun can shed some light on this issue?

@eryksun
Copy link
Contributor

eryksun commented Mar 8, 2023

Winsock special cases several status codes returned by the AFD device that implements socket files. For example:

NTSTATUS Winsock Error Generic Windows Error
STATUS_LOCAL_DISCONNECT WSAECONNABORTED ERROR_NETNAME_DELETED
STATUS_CONNECTION_RESET WSAECONNRESET ERROR_NETNAME_DELETED
STATUS_REMOTE_DISCONNECT WSAECONNRESET ERROR_NETNAME_DELETED

If you're using asynchronous AcceptEx() and rely on the last error that's set by GetOverlappedResult(), then you'll get the generic Windows error ERROR_NETNAME_DELETED instead of the specialized Winsock error. The Overlapped type could support getting the last NT status that's set in overlapped.Internal. This could be used to disambiguate ConnectionAbortedError and ConnectionResetError in this case.

@gvanrossum
Copy link
Member

Either way it means the connection was reset or disconnected, it's the same to asyncio. So it seems accept() should probably just retry without doing anything (there's no connection it can return). Perhaps in debug mode it could log something at the debug level. For recv() and friends we should discuss this in gh-83037.

@gvanrossum
Copy link
Member

gvanrossum commented May 23, 2023

Looking for a volunteer to implement the proposed fix from my previous comment -- accept() should just retry instead of closing the socket, when ERROR_NETNAME_DELETED occurs. The code from gh-18199 can serve as an example.

@fercod
Copy link

fercod commented Jul 5, 2023

Latest Findings

When AcceptEx encounter an ERROR_NETNAME_DELETED with WSAGetLastError() (overlapped.c), this leads into an OSError when ov.getresult() is called at windows_events.py:576, this OSError is then catched up by BaseProactorEventLoop proactor_events.py:856 and leads into closing the listening sock.

The "real" underlying error that on my case occurs is a "WSAECONNRESET" while accepting the connection socket.

How can we deal with WSAECONNRESET or ERROR_NETNAME_DELETED errors when accpeting a socket without closing the listening socket, a easy fix is just closing accept socket and raise a ConnectionResetError that is cached from BaseProactorEventLoop serving loop.

--- a/Lib/asyncio/windows_events.py
+++ b/Lib/asyncio/windows_events.py
@@ -577,7 +577,13 @@ def accept(self, listener):
         ov.AcceptEx(listener.fileno(), conn.fileno())
 
         def finish_accept(trans, key, ov):
-            ov.getresult()
+            try:
+                ov.getresult()
+            except OSError as exc:
+                if exc.winerror in (_overlapped.ERROR_NETNAME_DELETED,
+                                    _overlapped.ERROR_OPERATION_ABORTED):
+                    conn.close()
+                    raise ConnectionResetError(*exc.args)
+                raise
             # Use SO_UPDATE_ACCEPT_CONTEXT so getsockname() etc work.
             buf = struct.pack('@P', listener.fileno())
             conn.setsockopt(socket.SOL_SOCKET,
index ddb9daca02..4b3d24c4e8 100644
--- a/Lib/asyncio/proactor_events.py
+++ b/Lib/asyncio/proactor_events.py
@@ -854,6 +854,8 @@ def loop(f=None):
                 if self.is_closed():
                     return
                 f = self._proactor.accept(sock)
+            except ConnectionResetError:
+                self.call_soon(loop)
             except OSError as exc:
                 if sock.fileno() != -1:
                     self.call_exception_handler({

Thanks. This avoids my Uvicorn API (FastAPI) from crashing. Although, the following error continues to appear on the terminal:
OSError: [WinError 64] The specified network name is no longer available
ConnectionResetError: [WinError 64] The specified network name is no longer available

It happens to me when making multiple API requets in a short time to an endpoint which fetches data from another external API.

StanHatko added a commit to StanHatko/cpython that referenced this issue Sep 13, 2024
…ETED occurs, leads this into a closing of the serving socket

This is done with the patches described by fercod in the GitHub thread python#93821.
These patches were successfully tested by me and worked properly.
The error propagates properly without closing the original socket and without causing the server to hang.
@CoreyBird
Copy link

I stumbled upon this solution after discovering it myself for the same issue mentioned. For me, this bug manifest itself by causing my server to hang up, requiring a hard restart to restore it's ability to serve again.

There are other ov.getresult() calls (one in accept, one in connect, and one in accept_pipe) - given we can get a client side closing the socket at any point I think it would make sense to catch the errors consistently for each state of the IocpProactor class.

+1 for solving the immediate issue. If we can get this in it would be a step in the right direction.

@cybergrind
Copy link
Contributor

Just in case someone needs a quick drop-in workaround without waiting for an update: https://gist.github.com/cybergrind/9cbbdc94503548d74dc4d5d3ed99248c

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-asyncio type-bug An unexpected behavior, bug, or error
Projects
Status: Todo
Development

No branches or pull requests

8 participants