[discussion] [potential heresy] global connection pools and global tasks #125
Why not generally make the `with` statement a requirement? Even for HTTP/1 and 1.1 it's nice to clean up the residual resource pool and drop connections/caches.
@RonnyPfannschmidt Good question :-). Cleaning up resources is generally a good thing! Though... do you actually make a habit of closing your `Session` objects? But even if it's a good idea in general, there are also some reasons why it might be nicer to keep it optional, rather than mandatory:
Conceptually, a divide could be drawn between configuring and acquiring a resource manager/pool, and in many of the easy cases people care about obtaining acquired resources. I believe this can play nice with atexit cleanup and async-aware contextual variables, so the core library would use the explicit style and surround it with a convenient shell.
I'm a fan of the approach @njsmith has proposed here. I think it's appropriate to think long and hard before relying on an implicit background task, and to design that task very carefully for robustness, but I believe that as long as those things are done (which in this discussion they sure seem to be) an implicit background task isn't inherently evil or complicating.
Well, does anyone have a good answer for when to start it, what to do with it when the main loop isn't running, and which nursery it would belong to?
You could start it on first use. As described, in Trio this would be a system task, cancelled automatically when the `trio.run` call exits.
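A minimal sketch of that "start it on first use" idea, using Trio's system-task API (the connection and pump objects here are hypothetical, not part of any existing library):

```python
import trio


async def _pump(connection):
    # Hypothetical background task: shuttle bytes between the socket and the
    # HTTP/2 state machine. It must be written so that it never raises.
    ...


def ensure_pump_started(connection):
    # Lazily start one background pump per connection, the first time the
    # connection is actually used. System tasks are cancelled automatically
    # when the run shuts down, so there's no explicit nursery to manage.
    if not getattr(connection, "_pump_started", False):
        trio.lowlevel.spawn_system_task(_pump, connection)
        connection._pump_started = True
```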
lovely, thanks
@njsmith Did some words fall off in your original post?
@jtrakk whoops, thanks, fixed
FWIW I do rather like one feature of requiring `with`.
Heh, I looked into the Fetch API today, and it turns out that this comment I wrote was hilariously naive. Browser cookie handling logic has special cases everywhere to figure out when to use which cookies. For example, see: https://developer.mozilla.org/en-US/docs/Web/API/Request/credentials
Here's a brain dump from me for this issue.

Some initial components of a "pool key":
Proxying is a pain because the different proxying methods being supported (HTTP->HTTP, HTTP->HTTPS, HTTPS->HTTP, HTTPS->HTTPS, SOCKS) apply their proxy logic at different points.
All CONNECT tunnel proxy methods need to be able to react to an HTTP version change via TLS ALPN. I'd recommend all (X)->HTTPS be X=HTTP/1.1, since the connection is so short-lived and the logic is simpler with HTTP/1.X for upgrading to a TCP tunnel. I guess it doesn't really matter that the version changes, because the HTTP connection, even if it's the same version of HTTP, has to be restarted anyways. (I think h11 goes into a `SWITCHED_PROTOCOL` state here.)
Yeah, proxies are going to have some special case logic; no way around it. Let's imagine we have functions
Yeah, for these I think you treat it as a plain HTTP/HTTPS connection to the proxy when looking it up in the pool/establishing it:
I think these work the same for connection establishment. SOCKS is kind of similar: since it's a connection-level thing, it needs to be described in the pool key.
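To make the "pool key" idea concrete, here's a rough sketch of the kind of hashable key the pool might use for lookups – the fields are illustrative, not a settled list:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen so instances are hashable and usable as dict keys
class PoolKey:
    scheme: str                        # "http" or "https"
    host: str
    port: int
    verify_tls: bool = True            # per-request TLS settings must match the connection
    client_cert: Optional[str] = None
    socks_proxy: Optional[str] = None  # connection-level, so it has to be part of the key
```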
The state's called `SWITCHED_PROTOCOL`.

But: if your proxy speaks HTTP/2, then it's possible and I guess desirable to establish a single HTTP/2 connection, and then tunnel multiple CONNECT requests through that connection simultaneously. This is probably not too hard once we have the infrastructure, because everything composes.

This does lead to a weird situation where you may end up having some "connections" in the pool that actually represent tunnels over other "connections" in the pool. But I think that's like... fine? I haven't quite thought through how the lifecycle management should work to abstract over HTTP/1.1 vs HTTP/2, since "returning a connection to the pool" isn't really a thing in HTTP/2. It might be trivial though.
So for this case how would the
So then we'd have to hand back a "SocketStream" that's really just an interface containing an HTTP/2 stream. Lots of moving parts there. Have to worry about writes "timing out" if the proxy doesn't give back enough credits to the HTTP/2 stream.
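One way to picture that hand-back: wrap the single tunneled HTTP/2 stream in an object that quacks like an ordinary byte stream, so TLS and another HTTP connection can be layered on top. This is only a sketch; `tunnel` is a hypothetical object exposing send/receive/close for one stream:

```python
import trio


class TunnelStream(trio.abc.Stream):
    """Present one tunneled HTTP/2 stream as a plain bidirectional byte stream."""

    def __init__(self, tunnel):
        self._tunnel = tunnel

    async def send_all(self, data):
        # Real code would block here until the proxy grants flow-control credit,
        # which is where the "writes timing out" worry comes in.
        await self._tunnel.send_data(data)

    async def wait_send_all_might_not_block(self):
        pass  # sketch: ignore flow control for now

    async def receive_some(self, max_bytes=65536):
        return await self._tunnel.receive_data(max_bytes)

    async def aclose(self):
        await self._tunnel.aclose()
```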
I'm worried about handling cases where a single HTTP/2 proxy connection ends up representing tunnels to different endpoints. Have to think about all the scenarios that the pool can get locked up over, especially when available resources are low. Which connection do we evict to cause the least damage?
We'll have to worry about resources within the HTTP2Connection, as that will affect whether that connection can be used for a "new" request, i.e. tracking how many concurrent streams are being used. We'll somehow have to signal to the pool which stream_id was allocated to a given request so it knows which stream to close when the connection is "returned". Maybe the Request needs to be a parameter to get_connection() and put_connection()?
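As a purely illustrative sketch of that bookkeeping (`get_connection()`/`put_connection()` are the hypothetical pool methods being discussed, not an existing API):

```python
class PooledHTTP2Connection:
    """Track which HTTP/2 streams are in use so the pool knows whether this
    connection can take another request, and which stream to close when a
    request is "returned"."""

    def __init__(self, max_concurrent_streams=100):
        self.max_concurrent_streams = max_concurrent_streams
        self._stream_by_request = {}  # request -> stream_id

    def has_capacity(self):
        return len(self._stream_by_request) < self.max_concurrent_streams

    def allocate(self, request, stream_id):
        # Called from get_connection(request) once a stream has been opened.
        self._stream_by_request[request] = stream_id

    def release(self, request):
        # Called from put_connection(request): returns the stream to close.
        return self._stream_by_request.pop(request)
```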
Maybe the abstraction we want to represent a single request/response cycle at a low level is something like:

```python
class OneHTTPTransaction(ABC):
    async def send_request_data(self, data) -> None:
        ...

    async def send_eof(self, trailers: Headers):
        ...

    async def receive_response_headers(self) -> Tuple[Code, Reason, Headers]:
        ...

    async def receive_response_data(self, max_nbytes) -> bytes:
        ...

    def get_response_trailers(self) -> Headers:
        ...

    async def aclose(self):
        ...


class ConnManager(ABC):
    async def start_one_http_transaction(self, method, target_url, request_headers, conn_config) -> OneHTTPTransaction:
        ...
```

(Obviously all these names are placeholders.) So the idea is that for HTTP/1.1, closing or exhausting the `OneHTTPTransaction` is what frees the underlying connection up for reuse. And then inside `ConnManager`, the implementation could look something like:

```python
async def start_one_http_transaction(self, method, target_url, request_headers, conn_config):
    return await self._start_one(method, target_url, request_headers, conn_config)


async def _start_one(self, method, target_url, request_headers, conn_config, *, override_target=None):
    if is_http(target_url) and wants_http_style_proxy(conn_config):
        assert override_target is None
        proxy_url, proxy_headers, proxy_config = get_http_style_proxy_info(conn_config)
        # Recurse, and presumably next time we'll hit the branch below.
        return await self._start_one(
            method, proxy_url, combine(proxy_headers, request_headers), proxy_config,
            override_target=target_url,
        )
    else:
        pool_key = compute_pool_key(target_url, conn_config)
        try:
            return await self._start_transaction_using_existing_key(pool_key, ...)
        except KeyError:
            # Need to set up a new connection
            if wants_http_style_proxy(conn_config):
                assert is_https(target_url)  # http URLs w/ http proxies shouldn't make it here
                proxy_url, proxy_headers, proxy_config = get_http_style_proxy_info(conn_config)
                # Recurse to get the transport object
                connect_transaction = await self._start_one("CONNECT", proxy_url, proxy_headers, proxy_config)
                ... = await connect_transaction.receive_response_headers()
                check_for_connect_error(...)
                # Run any initial handshake, and add the connection to the appropriate pool
                await self._setup_new_connection(pool_key, connect_transaction, ...)
            else:
                tcp_conn = await self._make_tcp_connection(...)  # I guess this handles SOCKS internally
                await self._setup_new_connection(pool_key, tcp_conn, ...)
            # Now we've added the connection to the pool, so we can try this again and it will succeed:
            return await self._start_transaction_using_existing_key(pool_key, ...)


async def _start_transaction_using_existing_key(self, pool_key, ...):
    if pool_key in self._http2_conns:
        conn = self._http2_conns[pool_key]
        return await conn.start_transaction(target_url, method, request_headers)
    else:
        while pool_key in self._http11_conns:
            # Remove the connection from the pool entirely while it's in use
            conn = self._http11_conns.pop(pool_key)
            if not still_alive(conn):
                conn.close()
                continue
            await conn.start_transaction(target_url, method, request_headers)
            return conn
        raise KeyError
```

It's not simple, and I'm sure it could be expressed more clearly with some judicious refactoring. But I think that basic flow can handle all the wacky combinations of different proxy types, layering of HTTP/X on top of HTTP/Y, etc., and it's only about a page of code or so.
Hmm, I'm not sure this is actually our problem? If the proxy decides they want to reject some of our connections, they can do that equally well over HTTP/1.1 or HTTP/2. And the whole point of HTTP/2's flow control stuff is to make sure that one stream can't "lock up" other streams by hogging the underlying TCP connection.
I'm saying that if we constrain resources in the pool to be TCP streams instead of "HTTP lifecycles" then using LRU/LFU won't be optimal because a single TCP stream could be housing many HTTP lifecycles. Do we go the route of "you can't evict an HTTPConnection while a lifecycle is active"?
Ohh, yeah, cache eviction policy is a whole other topic, right.

For HTTP/1.1 it's simpler: either a connection is in use, or it's sitting idle in the pool. If it's in use, we obviously aren't going to evict it. If it's sitting idle in the pool, we might decide to evict it based on some policy like a timeout or based on a limit on how many idle connections we're willing to retain.

I'm actually not sure what the best eviction policy is... urllib3's is a bit quirky (IIRC it's separate pools per request target, each pool has its own LRU policy, and then there's a second layer of LRU on the pools themselves?). It works, but it seems more like an accidental side-effect of some implementation choices, not necessarily something that needs to be copied exactly. That's kind of a separate discussion though. Let's worry about it later.

I think we can say though that whatever eviction policy we use, for HTTP/1.1 it will only consider idle connections. For active connections, the user has both the knowledge and control they need to manage resources appropriately for their use case.

So then generalizing to HTTP/2... I guess the same principle applies? Obviously the bookkeeping will be a bit more complicated, because the underlying connection only becomes "idle" when there are no live streams. But I think it works out. For example, consider the most complex case, of HTTP/2 tunneled through an HTTP/2 proxy:
So I think once we have the mechanisms to handle regular non-proxied HTTP/2, then those same mechanisms will be sufficient to handle the complicated nesting cases too.
The biggest difference between HTTP/1.1 and HTTP/2 here is that for HTTP/1.1, you don't really need to keep track of active connections, so you can use "in the pool" as your marker of idle-ness. For HTTP/2, though, you have to use some kind of central data structure to track all connections, whether they're currently active or not. So idle-vs-active needs to be tracked separately. It might make sense to do this for HTTP/1.1 too though – e.g. maybe when the pool shuts down you want to forcibly close all connections, even the active ones.
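A minimal sketch of that bookkeeping – a central table that knows about every connection and only treats stream-free ones as idle (and therefore evictable). The structure is invented for illustration:

```python
import time


class TrackedConnection:
    def __init__(self, conn):
        self.conn = conn
        self.live_streams = 0                 # 0 means idle
        self.idle_since = time.monotonic()


class ConnectionTable:
    def __init__(self):
        self._conns = []                      # every connection, active or idle

    def add(self, conn):
        tc = TrackedConnection(conn)
        self._conns.append(tc)
        return tc

    def stream_opened(self, tc):
        tc.live_streams += 1

    def stream_closed(self, tc):
        tc.live_streams -= 1
        if tc.live_streams == 0:
            tc.idle_since = time.monotonic()  # now eligible for eviction

    def evictable(self, max_idle):
        now = time.monotonic()
        return [tc for tc in self._conns
                if tc.live_streams == 0 and now - tc.idle_since > max_idle]
```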
Here's a convenient thing about current Python HTTP clients: when you do create a `Session`, you can just do `session = Session()`. You can use `with Session() as session`, but it's not required. This flexibility can be very convenient – in fact trio's own github bot uses it.

Here's an annoying thing about HTTP/2 and HTTP/3: you really need a background task to handle the shared connection. (And that issue doesn't even talk about cancellation safety, which is also a huge issue that I don't see how to solve without background tasks.) According to the principles of structured concurrency, that means that if you want to support HTTP/2 or HTTP/3, then you must use a `with` block to manage sessions, which is a usability regression compared to older HTTP/1.1-only clients.

We could potentially restrict this regression to just async clients, since sync clients will probably never support HTTP/2 or HTTP/3. But that's pretty awkward, since our goal is to keep the async and sync APIs matching as close as possible. Another option: we could support both `async_session = AsyncSession()` and `async with AsyncSession() as async_session`... but the first session is restricted to HTTP/1.1, and you have to write it the second way if you want your session to use HTTP/2 or HTTP/3. But, this would suck, because it's a really subtle distinction that would create a really surprising behavioral difference.

Here's an annoying thing about current Python HTTP clients: you have to explicitly create and use a `Session` to make things efficient. If you just call `requests.get(...)`, then it can't reuse connections. Wouldn't it be nice if things were magically efficient for everyone? Golang handles this by having a global Client object (= their version of a `Session`) that gets used by bare functions like `http.get(...)`. We could do that too! (Well, one global session for the sync API, and one session-per-async-loop for the async API.) However... you can't have an implicitly constructed global session, if sessions require a `with`. So if you want to make HTTP efficient by default out-of-the-box, then your sessions have to support the bare constructor form.

So... I'm wondering if this is a place where we should commit heresy against the structured concurrency dogma, and use globally-scoped tasks to manage HTTP/2 and HTTP/3 connections. (And then if the answer is yes, we should also consider whether to have a global connection pool.)
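For illustration, the "magically efficient by default" idea in API terms might look something like this (a sketch; `Session` here stands in for whatever the real session class ends up being):

```python
_default_session = None


def _get_default_session():
    # One implicitly-constructed global session for the sync API.
    global _default_session
    if _default_session is None:
        _default_session = Session()   # bare constructor: no `with` required
    return _default_session


def get(url, **kwargs):
    # Bare module-level function, like Go's http.Get, reusing pooled connections.
    return _get_default_session().request("GET", url, **kwargs)
```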
Heresy: historical context
Back in the 1970s, the folks arguing for structured programming were originally super dogmatic about it. Structured programs weren't allowed to include constructs like `break` or `continue` or `return`, because those can jump out of multiple nested blocks at once. That means they broke Dijkstra's black-box rule, and were considered `goto`s in disguise. Instead you were supposed to make sure your loop exit condition covered all possible ways the loop could exit, and arrange your functions so that you always fell off the bottom eventually.

Of course today this sounds pretty silly. It's certainly possible to write some confusing code using these constructs, but they're nowhere near as bad as goto: their effects are local and pretty restricted, and avoiding them is often worse than the disease. E.g. if you encounter a fatal error inside a nested loop, you'd have to write a bunch of duplicated boilerplate conditions to carefully ferry it out of the loop and down to the bottom of your function, and that's also super error prone and confusing.

But I'm sympathetic to those early structured programming evangelists. The only reason we know that these constructs are worth it is because they made a serious attempt to go without them. And I'm sure there were dozens of other cases where `goto` proponents insisted that some kind of `goto` was the right solution, and only a super-dogmatic insistence on structure could force them to actually try the structured version... and discover that in fact it was better.

So the lessons I take from this are:
So let's try to put those lessons to work.
So how bad would this heresy be exactly?
There's a few reasons to suspect that using background tasks to pump HTTP/2 and HTTP/3 connections might not be that bad.
First, the idea is to keep the exact same public API as we had back in the old days with HTTP/1.1 and purely synchronous operations. So right there we know that this shouldn't mess up the control flow that our users experience, because the control flow shouldn't change at all. (If we can do it right.)
And second, the housekeeping work that HTTP/2 and HTTP/3 connections need is basically the same as the housekeeping work that the kernel already does on TCP connections. Every TCP connection has its own concurrent tasks running in the kernel to manage the TCP state machine, track ACKs, manage flow control, etc. When we call `send` or `recv` on a socket, that's actually just passing some request off to these concurrent tasks, which they perform on their own schedule. But we mostly don't have to think about this – it doesn't cause any horrible concurrency problems. So hopefully HTTP/2 and HTTP/3 should work the same way.

So that's promising, but let's work through the details and try to distill out some principles.
I think the fundamental reason it works for TCP, is that the `send`/`recv` API is very careful to preserve the illusion that these really are happening immediately when you call them. That's what we want here too: users shouldn't even be able to tell whether their requests are using HTTP/1.1 or HTTP/2.

Crucially, this includes error handling. If the kernel detects an error, like a TCP RST packet or something, it doesn't crash; it holds onto the error and reports it the next time we do a call on the socket. That's what we want for HTTP requests too. And for TCP, the semantics of `send`/`recv` are such that if there's an error on a socket that we never look at, then it's OK for the kernel to discard that error. Fortunately, that also applies to HTTP/2 and HTTP/3: if some background connection dies at a moment when we aren't actually using it for an HTTP request, then that's not really an error; we just silently discard that connection and open a new one later if we need to. The fact that multiple requests might share the same connection is designed to be entirely invisible, an implementation detail.

And the kernel is written very carefully to make sure it never ever crashes. Furthermore: if the kernel does crash, it takes down our program with it. However, the reverse isn't true: if our program crashes, then the kernel keeps going. Since the kernel is strictly more reliable than our code, that makes it safe for us to hand off work to it to happen later. For example, after you `close` a socket, the kernel might keep working on it for a while, to send any data that you passed to `send` earlier. If our program exits after calling `close`, that data isn't lost. OTOH if the kernel does lose the data, it's probably because the whole system died, in which case we would have lost the data no matter what we did. So the hand-off doesn't affect reliability.

Twisted and asyncio also like to pump data in the background, but they don't have this property: if you hand them some data to send, and then your program exits or crashes, their event loop stops, and the data that you thought you sent might not actually be sent. For our background tasks, we could try to force the program to keep running until they finish their work, but probably an easier way to handle this is to make sure that if a user sends something, that call doesn't report success until the background task has handed the data off to the kernel. (This is also the simplest way to handle backpressure.) And of course we'd need to be very very careful that they can never ever crash, and we'd need a backup plan for if that fails and they crash anyway.
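A sketch of the "don't report success until the data has reached the kernel" idea, with the background pump confirming each send (Trio-flavored; the channel and socket objects are assumptions, not a real API):

```python
import trio


async def send_via_pump(to_pump, data):
    # Caller side: hand the data to the background task, then wait for it to
    # confirm that the kernel has accepted it. Waiting also gives backpressure.
    done = trio.Event()
    await to_pump.send((data, done))
    await done.wait()


async def pump_task(from_callers, sock_stream):
    # Background side: actually push the bytes into the kernel, then confirm.
    async for data, done in from_callers:
        await sock_stream.send_all(data)
        done.set()

# Usage sketch:
#   to_pump, from_callers = trio.open_memory_channel(0)
#   nursery.start_soon(pump_task, from_callers, sock_stream)
#   await send_via_pump(to_pump, b"some bytes")
```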
But if we do all that..... I actually can't think of any problems that these background tasks would cause? This isn't exactly a tight list of reusable principles yet, but it's a bit closer than we were.
So what kind of architecture are we talking about?
I'm imagining something like Trio's "system task" concept, which gives a way to spawn a task into a global "system" nursery. They have some special properties that are all appropriate here: for example, they're protected from `KeyboardInterrupt`, which is kinda important for that "never raise an exception" thing (but means they have to be careful not to ever get stuck in an infinite loop or anything).

Emulating some of these details on top of asyncio is probably a hassle, but I'm going to assume the Trio semantics for now and figure out how to compromise for other libraries later.
So given that, we can silently create a background task whenever an HTTP/2 or HTTP/3 connection is made, and automatically kill it again when the connection is closed. That means that in principle, we could implement pretty much any of the classic `Session` patterns.

However, I think the best architecture if we can manage it would be to have a single loop-global connection pool. The idea would be that it stores as little configuration state as possible; all it should hold is:

This is the stuff that you want to share as globally as possible, because the more widely it's shared the better the chances that any given request in your program can skip expensive connection setup or use a more efficient protocol.
But that means we can't associate configuration with the pool. Instead, we'd associate configuration with individual requests. For example: you don't disable certificate validation on the connection pool, like urllib3 does; instead you disable it on individual requests. And part of the connection pool's job is to keep track of which connections have certificate validation enabled or disabled, and match up requests with an appropriate connection. (urllib3 already does something similar, but slightly more convoluted: it stores configuration on individual `ConnectionPool`s, but then the `PoolManager` class does matchmaking between requests and pools based on the per-request configuration settings – see `_key_fields` and associated stuff in `poolmanager.py`. So this is taking the `PoolManager` approach and doubling down on it.)
Session
object once, and then re-use it multiple times. But in this approach, theSession
would only be a collection of default configuration that it applies to requests made through it, before delegating the actual request handling to the global pool. That would makeSession
lifecycle management super trivial :-).We might also want to expose the option of making new pools, for weird cases and for testing. So maybe the layering would be:
Session
holds configuration, including a reference to which low-level HTTP transport object to use to fulfill individual requests.Session
to use themI'm not 100% sure whether we'd want to expose the low-level transport object as part of the public API – #124 has lots of musing on this topic. But that seems like the basic architecture we'd want internally, at the least.
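A sketch of that layering, with `Session` as a thin bundle of defaults in front of a transport/pool (all names and parameters are illustrative; `get_global_pool()` is a hypothetical helper):

```python
class Session:
    def __init__(self, *, transport=None, verify=True, headers=None, cookies=None):
        # No connections live here: just defaults plus a pointer to a transport.
        self.transport = transport if transport is not None else get_global_pool()
        self.defaults = {"verify": verify, "headers": headers or {}, "cookies": cookies}

    async def request(self, method, url, **options):
        merged = {**self.defaults, **options}
        # The pool/transport does the real work: match the request's settings
        # to an existing connection, or create a new one.
        return await self.transport.request(method, url, **merged)
```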
Like any global cache, we'd have to be very careful to make sure that the pool can't accidentally leak state in unexpected ways. For example, for the sync version of the pool, we'd have to make sure that it was both thread-safe and fork-safe. That seems pretty doable though.
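For the sync pool, "thread-safe and fork-safe" could be handled with something as simple as a lock plus a PID check (a sketch, not a full solution – e.g. it ignores `os.register_at_fork`):

```python
import os
import threading


class GlobalPoolSlot:
    def __init__(self, pool_factory):
        self._pool_factory = pool_factory
        self._lock = threading.Lock()   # thread safety
        self._pid = None                # fork detection
        self._pool = None

    def get(self):
        with self._lock:
            if self._pool is None or self._pid != os.getpid():
                # First use, or we're in a forked child: don't reuse the
                # parent's sockets, start a fresh pool instead.
                self._pool = self._pool_factory()
                self._pid = os.getpid()
            return self._pool
```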
Another interesting challenge will be figuring out where to stash the "global" connection pool. For the sync backend, we can just use a global variable. For Trio, we can use a `RunVar`. For asyncio and Twisted... maybe we need a `WeakKeyDict` mapping loops/reactors to connection pools, or something? I'm sure we can figure something out.

That sounds elegant and all, but does reality agree?
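Sketches of those stashing options for the async backends (assuming a hypothetical `ConnectionPool` class):

```python
import weakref

import trio

# Trio: one pool per call to trio.run(), stored in a RunVar.
_trio_pool = trio.lowlevel.RunVar("http_pool")


def get_trio_pool():
    try:
        return _trio_pool.get()
    except LookupError:
        pool = ConnectionPool()   # hypothetical pool class
        _trio_pool.set(pool)
        return pool


# asyncio: one pool per event loop, via a weak-keyed mapping so pools
# don't outlive their loops.
_asyncio_pools = weakref.WeakKeyDictionary()


def get_asyncio_pool():
    import asyncio
    loop = asyncio.get_event_loop()
    if loop not in _asyncio_pools:
        _asyncio_pools[loop] = ConnectionPool()
    return _asyncio_pools[loop]
```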
Currently libraries like requests and urllib3 don't have the kind of careful-and-systematic separation of configuration from the connection pool that I'm envisioning above. Could it actually work in practice?
Here's the state that requests keeps on `Session` objects:

Other things I can think of:
A lot of this is assorted "configuration" stuff that would fit very naturally in my description above.
We'd also want to think hard about whether to make any of this stuff mutable, because if there's a global default `Session` and its configuration is mutable, then that creates all kinds of potential for folks to get themselves into trouble.

The tricky cases:
Connection caching policy obviously has to be configured on the connection cache. Arguably this is a good thing... the main reason for policy settings here is to avoid allocating too many fds, and fds are a process-global resource, so it kinda makes sense to set this policy at a process-global level. But is that good enough in practice, or do people need the option to create private connection pools with their own policy, inside particular modules?
I'm not quite sure what to do with cookies. They can mostly fit into the "configuration" model, if we expose a "cookie jar" type, and let you specify which cookie jar to use on a per-request basis. But what should we do when there's no cookie jar specified: use a process-global jar? Don't store cookies at all? And when you create a `Session` without specifying a jar, does it default to using the global settings, or should it create a new cookie jar?

Go's approach is that when you make an `http.Client`, you can either specify a `CookieJar`, or set it to their version of `None`. If it's `None`, then all the default cookie handling is disabled (so not only does it not automatically add cookies to requests or save cookies from responses, it doesn't even strip cookies when following redirects! not sure what the reasoning there was). And the default global `Client` has the `CookieJar` set to `None`.

I can see arguments either way! On the one hand, having a global `CookieJar` would be a vector for state to leak between different parts of your program in potentially-surprising ways. On the other hand, a lot of Python programs are quick scripts to mess with HTTP APIs and we want to make that beginner path as smooth as possible – and how often do you really want to throw away cookies? Browsers have a global cookie jar and it works out OK for them. Though it wouldn't be terrible to say that by default cookies are ignored, and if you want automatic cookie handling then you have to create a `Session`. And I guess browsers do have stuff like incognito windows and facebook containers. And if you have two tasks interacting with some API on behalf of different users, you definitely don't want to accidentally mix up their cookies. So I'm leaning towards making the global session use a null cookiejar that's always empty.
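To illustrate the "null cookiejar" idea, here's one possible shape for it, building on the stdlib jar (a sketch only; the per-request `cookies=` parameter is hypothetical):

```python
from http.cookiejar import CookieJar


class NullCookieJar(CookieJar):
    """A jar that never stores or sends anything: a possible default for the
    implicit global session."""

    def extract_cookies(self, response, request):
        pass  # drop any Set-Cookie headers on the floor

    def add_cookie_header(self, request):
        pass  # never attach cookies to outgoing requests


def request(method, url, *, cookies=None, **options):
    jar = cookies if cookies is not None else NullCookieJar()
    ...  # hand `jar` to the machinery that builds and sends the request
```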