
[WIP] first draft of more generic timeouts throughout the request lifecycle #2842

Closed
wants to merge 1 commit

Conversation


@N-Coder N-Coder commented Mar 16, 2018

What do these changes do?

I combined the TimeoutHandle class used for the current uber (read) timeout and its timer() context with the list of registered Traces into the new RequestLifecycle class. This class now provides a context to execute the request in and methods to notify the Traces of changes to the request state, while also keeping track of the various timeouts between states throughout the whole request lifecycle.

The main changes lie within the new lifecycle.py file:
The attr.s class RequestTimeouts is used for configuring the various timeouts. Additionally, attr.ib metadata is used to define the signals where a timeout should be started or ended, where applicable. This makes the attribute definitions analogous to the timeout table in my comment to the issue.
Unfortunately, some of the proposed timeouts can't be defined using the currently available signals. In those cases, we could either add further signals, or add code for tracking the specific timeout explicitly.
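For illustration, a stripped-down sketch of what such a metadata-driven definition could look like (uber_timeout and its request_start/request_end bracket follow the draft; the second attribute is purely illustrative):

import attr

@attr.s(frozen=True)
class RequestTimeouts:
    # spans the whole request, bracketed by the request_start / request_end signals
    uber_timeout = attr.ib(
        type=float, default=None,
        metadata={"start": "request_start", "end": "request_end"})
    # illustrative second timeout, bracketed by the DNS resolution signals
    dns_resolve_timeout = attr.ib(
        type=float, default=None,
        metadata={"start": "dns_resolvehost_start", "end": "dns_resolvehost_end"})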

The RequestLifecycle class now uses this meta-information, together with a list of all signals and their parameter classes, to build the various send_* methods dynamically (if you don't like this dynamic generation, the methods could also be "precompiled" statically, but I didn't want to put in all that writing effort right now). The new send_* methods now do the following:

  1. they cancel any active timeout that should be ended by the currently emitted signal
  2. they notify all registered Traces in parallel using asyncio.gather
  3. they start and register all timeouts that begin with the currently emitted signal

As a result, the time spent tracing signals between a timeout's start and end signals counts towards the encompassing timeout, while the tracing of the start and end signals themselves is not counted. As the trace callbacks are executed in parallel by asyncio.gather, the impact should be minimal when using sensible tracing functions.

On the topic of merging the timeout config with the session or default config:
The attr.s class RequestTimeouts is frozen, preventing any direct modification and thus requiring the user to create and assign a new instance to make changes. This can be done either by directly calling the class constructor, thereby ignoring all session and default configs that have been set, or by using attr.evolve(session.timeouts, overwritten_timeout=12.3) to reuse the existing config and only overwrite some specific values.
If you deem the chance too high that users now accidentally overwrite the default config by always creating a new instance, we could also try to hide the constructor and let users use either session.defaulted_timeout_config(**kwargs) for the attr.evolve version or session.new_timeout_config(**kwargs) for the one without defaults.
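For example, assuming the session exposes the timeouts config as a writable attribute as described above:

import attr

# ignore any session/default config and start from scratch:
session.timeouts = RequestTimeouts(uber_timeout=30)

# or reuse the existing config and only overwrite specific values:
session.timeouts = attr.evolve(session.timeouts, uber_timeout=30)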

Are there changes in behavior for the user?

These changes should be 100% backwards-compatible for users who only use the interface defined by ClientSession. The old read_timeout and conn_timeout parameters to the __init__ function are translated to the new RequestTimeouts config object, and the timeout parameter can either be a new config object or, as previously, a number that will be merged into the config object.
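Roughly, the translation inside ClientSession.__init__ works along these lines (a simplified sketch rather than the literal diff; the connection_acquiring_timeout field name is made up for illustration):

if isinstance(timeout, RequestTimeouts):
    timeouts = timeout
else:
    # a bare number keeps its old meaning and is merged into the default config
    timeouts = attr.evolve(DEFAULT_TIMEOUTS, uber_timeout=timeout)

if read_timeout is not None:
    timeouts = attr.evolve(timeouts, uber_timeout=read_timeout)
if conn_timeout is not None:
    # hypothetical field name for the old connection timeout
    timeouts = attr.evolve(timeouts, connection_acquiring_timeout=conn_timeout)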

To reflect the merge of the timer and the traces, I changed the parameters of the __init__ method and some other request-lifecycle-related methods on the ClientRequest, ClientResponse and Connector objects.
To circumvent these breaking changes, lifecycle.request_timer_context can be used as a drop-in replacement for the old timer (compare client_reqrep:680, where I already did this to avoid any changes to native code) and [lifecycle] can be used as a drop-in replacement for the old traces array.
If you use these drop-ins starting from client:254, you can drop all the changes to client_reqrep.py and connector.py, as those changes just rename parameters and replace for-loops with direct calls.
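Concretely, the two drop-ins amount to this at those call sites (sketch):

timer = lifecycle.request_timer_context   # drop-in for the old TimerContext
traces = [lifecycle]                      # drop-in for the old traces list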

Related issue number

See #2768 for discussion.

Checklist

Nothing done here yet, as this is just a draft intended as material for discussion. I had to make a PR to allow inline comments within the code. As soon as everything is settled, I'll make a proper PR.

  • I think the code is well written
  • Unit tests for the changes exist
  • Documentation reflects the changes
  • If you provide code modification, please add yourself to CONTRIBUTORS.txt
    • The format is <Name> <Surname>.
    • Please keep alphabetical order, the file is sorted by names.
  • Add a new news fragment into the CHANGES folder
    • name it <issue_id>.<type> for example (588.bugfix)
    • if you don't have an issue_id change it to the pr id after creating the pr
    • ensure type is one of the following:
      • .feature: Signifying a new feature.
      • .bugfix: Signifying a bug fix.
      • .doc: Signifying a documentation improvement.
      • .removal: Signifying a deprecation or removal of public API.
      • .misc: A ticket has been closed, but it is not of interest to users.
    • Make sure to use full sentences with correct case and punctuation, for example: "Fix issue with non-ascii contents in doctest text files."

@N-Coder N-Coder force-pushed the lifecycle_timeouts branch from 29dcf7b to fb6089d on March 16, 2018 12:48
@pfreixes
Contributor

Yep, the work looks pretty interesting, and it looks like we can build a complete timeout system on top of the signal flow that will give great granularity to the user! So, first of all, thanks for the first draft and keep pushing for this!

I have a general comment about the MR regarding how it implements the RequestLifecycle on top of the tracing system. Remember that the aiohttp.tracing.Trace class is an internal class that is not directly used by the client, so you have almost full freedom to change it.

Why am I saying this?

In this MR the RequestLifecycle behaves as a proxy for the Trace class, implementing different strategies for that such as the SIGNALS inverted dictionary or the runtime injection of signals via setattr. IMHO they are weak, implicit and might make for an error-prone system.

My suggestion here would be: get rid of the Trace class and implement all of the signal methods explicitly on the RequestLifecycle class. It should help you with at least the following things:

  1. Get rid of the _tm function at the start
  2. Get rid of the SIGNALS inverted index
  3. Get rid of the setattr.
  4. Call the listeners explicitly, if there are any, in the methods implemented in RequestLifecycle.

What do you think? @asvetlov ?

Also, keep in mind that with this change we are forced to call the specific lifecycle.send_X function at each specific part of the code, instead of saving some Python operations - calling is expensive - by checking whether any traces are configured. Is everybody fine with this?

@N-Coder
Author

N-Coder commented Mar 16, 2018

I agree that defining the methods more explicitly in production code and getting rid of the dynamic setattr would be a good idea. Additionally, having everything explicitly listed in tracing.py, while lifecycle.py is all implicit, is unsatisfactory. I already said that

if you don't like this dynamic generation, the methods could also be "precompiled" statically, but I didn't want to make all that writing effort right now.

So I agree with points 3 and 4 of your list, but I'd strongly suggest keeping 1 and 2:

  1. The _tm function was just because I was too lazy to write {"start": start, "end": end} all the time and wanted to catch typos. So imagine uber_timeout = attr.ib(type=float, default=None, metadata={"start": "request_start", "end": "request_end"}) standing there. This brings the definition of the possible timeout values and the lifecycle states they span very close together, making this logic easy to understand, document, maintain and validate, as it is visually similar to the tabular representation. If this were "precompiled" into the code of each respective send_* function, the link between timeouts and their respective lifecycle events would be a lot harder to see, let alone maintain.
  2. The SIGNALS list allows some introspection of the various available signals, allowing generic code to work with them and preventing unnecessary duplication and redundancy, which is another major source of errors. Imagine making a change to the send_* functions, but forgetting one of them. As the duplicated code from sender in each of them is going to be very long, this problem might also be very hard to spot.

I'm still in favour of defining each of the send methods explicitly, but here's how I'd do that without ending up with too much redundant, duplicated code: using collectors and dispatchers.
All RequestLifecycle.send_* methods would be explicitly defined:

async def send_request_start(self, method, url, headers):
    return await self._send_signal(Signal.REQUEST_START, TraceRequestStartParams(method, url, headers))

async def send_request_end(...

Here, a normal member method _send_signal would take the place of the respective sender instances. It would collect all the different send calls, their signal and their parameters in a generic way. Signal could be made an enum, still allowing introspection and generic code, without the danger of typos in stringly-typed code.

async def _send_signal(self, signal, params):
    # record timestamp
    self._signal_timestamps[signal] = time.time()

    # cancel all running timeout handles registered to end with this signal
    while self._set_timeouts[signal]:
        timeout_handle = self._set_timeouts[signal].pop()
        timeout_handle.cancel()

    # send on_signal to all trace listeners
    await asyncio.gather(*(
        trace_config.dispatch_signal(signal, self._session, trace_context, params)
        for trace_config, trace_context in self._trace_configs
    ))

    # start all timeouts configured to begin with this signal and register their
    # handles under their end signal (_starting_timeouts: mapping from start signal
    # to the configured (end signal, timeout) pairs, distinct from the handle
    # registry used above)
    for end, timeout in self._starting_timeouts[signal]:
        assert isinstance(self.request_timer_context, TimerContext)
        at = ceil(self._loop.time() + timeout)
        handle = self._loop.call_at(at, self.request_timer_context.timeout)
        self._set_timeouts[end].append(handle)

Notice how only the parameters and the call to trace_config (now without getattr) changed from the initial sender. As we dropped the Trace class, the signal would go to TraceConfig directly. After receiving a generic signal, the TraceConfig could then dispatch it to its individual handlers:

class TraceConfig:
    def __init__(self, trace_config_ctx_factory=SimpleNamespace):
        self._on_request_start = Signal(self)
        self._on_request_chunk_sent = Signal(self)
        ...

    async def dispatch_signal(self, signal, session, trace_context, params):
        if signal == Signal.REQUEST_START:
            await self.on_request_start.send(session, trace_context, params)
        elif signal == Signal.REQUEST_END:
            await self.on_request_end.send(session, trace_context, params)
        elif signal == ...

    @property
    def on_request_start(self):
        return self._on_request_start

    @property
    def on_request_chunk_sent(...

If you don't like the long elif chain, one could also use a shorter dict, depending on how dynamic you like it.
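For example, along these lines (a sketch; in real code the enum would need a name that doesn't clash with aiohttp's Signal class, and the map would be built once in __init__):

    async def dispatch_signal(self, signal, session, trace_context, params):
        dispatch_map = {
            Signal.REQUEST_START: self.on_request_start,
            Signal.REQUEST_CHUNK_SENT: self.on_request_chunk_sent,
            # ... one entry per signal
        }
        await dispatch_map[signal].send(session, trace_context, params)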

With this approach, all the important public functions would still be explicitly defined, but the slightly more complicated glue code between them would not be duplicated.


# send on_signal to all trace listeners
params = params_class(*args, **kwargs)
await asyncio.gather(
Contributor


default gather is kinda dangerous: after the first exception the unfinished coroutines will continue executing un-parented, is this what you want? If so, I'd add a comment explaining why it's OK. I think it would be weird to return to the caller and then have the signals hit ;)

Contributor

@thehesiod thehesiod Mar 17, 2018


two other issues:

  1. Do we need to wait for receipt of each signal, or should these all just be ensure_future calls? If we want the receipt, does send need a timeout? ;)
  2. I don't think we want to raise out of here if a send raises?

Author


See my comment below; I don't have a strong opinion about how exactly the signal should be sent. I just saw it as a faster version of the old for trace: await trace.send. There's also the return_exceptions parameter, or, as an alternative, asyncio.wait with its return_when=ALL_COMPLETED, FIRST_COMPLETED or FIRST_EXCEPTION parameter, depending on which exact behaviour we want. If you don't want to discuss this now, we could just as well stay with the old sequential approach for this PR.
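For reference, the variants under discussion look roughly like this (a sketch using the old per-Trace call style; traces, method, url and headers come from the surrounding client code):

coros = [trace.send_request_start(method, url, headers) for trace in traces]

# current draft behaviour: fail on the first exception, while already-started
# callbacks keep running un-parented
await asyncio.gather(*coros)

# alternative: never raise out of the send, collect exceptions as results
# results = await asyncio.gather(*coros, return_exceptions=True)

# alternative: explicit completion semantics via asyncio.wait
# done, pending = await asyncio.wait(coros, return_when=asyncio.FIRST_EXCEPTION)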

@@ -41,7 +42,7 @@


# 5 Minute default read and connect timeout
DEFAULT_TIMEOUT = 5 * 60
DEFAULT_TIMEOUTS = RequestTimeouts(uber_timeout=5 * 60)
Contributor


sounds like this should be request_timeout instead of uber

Author


I thought the old timeout worked like the uber_timeout is working now, so I thought this would be best suited for keeping backwards compatibility. If changing the way things work is okay, I'd be happy to use a more sensible timeout for the default. ;)

headers,
e
)
lifecycle.clear_timeouts()
Contributor


Can we do this as part of a context? It's too easy to get wrong otherwise.

Author

@N-Coder N-Coder Mar 17, 2018


I'm not sure whether it is required at all; clear_timeouts is analogous to the old handle.cancel, which was e.g. also used in resp.connection.add_callback(handle.cancel). As timeouts are automatically cancelled once the end signal is emitted, we probably don't need any clean-up any more. But I'm not 100% sure whether there were any side effects through resp.connection.add_callback that I'm not aware of right now.

Contributor


Cool, yeah, let's figure that out.

for trace in traces:
await trace.send_dns_resolvehost_start(host)
if lifecycle:
await lifecycle.send_dns_resolvehost_start(host)
Contributor


I think it would be better to have a context class for stuff like this that's a no-op if lifecycle is None, and otherwise does the right thing at start and end; it would make things cleaner too.

Author


Agree, we could either make a specific with lifecycle.send_dns_resolvehost_context(): for each pair of start/end signals, or even use a generic with lifecycle.enter_context(start="send_dns_resolvehost_start", end="send_dns_resolvehost_end", exception="send_dns_resolvehost_exception").
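A rough sketch of such a helper (the name _SignalContext is made up, the exception signal is left out for brevity, and getattr is used only to keep the sketch generic - a per-pair method like send_dns_resolvehost_context() would avoid it):

class _SignalContext:
    """Brackets a block with a start and an end lifecycle signal;
    a plain no-op when lifecycle is None."""

    def __init__(self, lifecycle, start, end, *args):
        self._lifecycle, self._start, self._end, self._args = lifecycle, start, end, args

    async def __aenter__(self):
        if self._lifecycle is not None:
            await getattr(self._lifecycle, self._start)(*self._args)

    async def __aexit__(self, exc_type, exc, tb):
        if self._lifecycle is not None:
            await getattr(self._lifecycle, self._end)(*self._args)
        return False

# usage at the resolver call site, roughly:
async with _SignalContext(lifecycle, "send_dns_resolvehost_start",
                          "send_dns_resolvehost_end", host):
    addrs = await self._resolver.resolve(host, port)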

@thehesiod
Contributor

Added a few comments. One thing that worries me (not directly related to this PR) is the non-trivial overhead that tracing will have, because there are so many signals and they're all async functions... given that each async function will trigger an iteration of the main message loop. I realize that making them async allows for more possibilities in the handler, but I'm afraid this flexibility adds a non-trivial cost... anyways, let's keep performance in mind. Let's also make sure we use the fastest "time" function, because calling the regular time() is usually not cheap as it usually involves a kernel call... I believe the fastest implementations use the CPU counter, which would end up being a simple ASM instruction.

@asvetlov
Member

asvetlov commented Mar 17, 2018

Thanks for the PR.
I'll have time for a review in a day or two.

@thehesiod tracing signals should be async functions. If you do something slow in tracing callbacks -- aiohttp cannot help with it. If tracing is not used, it is a no-op.

given each async function will trigger an iteration of the main message loop

Strictly speaking it is not correct: if an async function doesn't block (wait for a future somewhere) -- there is no task switch and no new loop iteration.

loop.time() should be used as the single time source. It can be cheap (Linux has heavy optimizations for it, for example) or it can be expensive -- we shouldn't care; loop.time() speedup is out of the aiohttp project's scope.

@pfreixes
Contributor

Another disruptive comment: I would propose having a PR that does not couple the tracing system with the new timeout feature. True, it helped us get a list of the timeouts that make sense, but IMHO that doesn't mean that both implementations have to be coupled. Why am I saying that?

  1. Having a first MR that does not couple both implementations will make the reviewing much easier.
  2. A simple but functional implementation will give us a pragmatic and real picture of the state of the code base, allowing us to take further decisions without making guesses.
  3. Coupling both implementations is starting to mix different discussions between timeouts and tracing. Doing many things at the same time won't help. Step by step.
  4. Will coupling both things give us the best of both worlds, or are we going to lose some freedom and shoot ourselves in the foot? I'm concerned about this.

To sum up, I would prefer a non-coupled implementation from the very beginning. Thoughts? @N-Coder @thehesiod @asvetlov @kxepal ?

@asvetlov
Member

Agree with @pfreixes

@N-Coder
Author

N-Coder commented Mar 17, 2018

The reason I came up with this solution is that I saw both features in a bigger context, both of them tracking the lifecycle that each and every request goes through. I don't see why both things should be logically separated, as they belong to the same thing - the request lifecycle - which possibly was not as obvious when tracing was first implemented.

Some of the decisions I made are indeed opinionated and take bigger steps without being fully discussed - that's exactly the reason why I created this working draft, to be able to discuss every change I saw as necessary. I thought working on actual code that already does the thing might help find the best solution faster. Fiddling timeouts in between the already comparatively long and complex code and hoping for somebody to clean everything up at some later point did not seem like a good idea to me. This solution also looked a lot cleaner and nicer to me, especially if you see how much code it actually removed from the ClientSession.

If you don't like the discussion about tracing we're having here, that is in my opinion something we can actually defer without building up technical debt. The only discussion we had here related to tracing was how the async callback methods should be called. I thought it would be clearly better to call them in parallel than in sequence (using asyncio.gather, or another function even better suited to our needs). As I didn't touch the tracing.py module at all, we could just as well stay with the old sequential for trace: await trace.send approach (this time implemented only once, in lifecycle.py) and defer all further discussion about tracing until after this PR.

If you want to go back to the drawing board, feel free to do so or propose another way. Unfortunately, I can't see how to build the clean solution that you are proposing without making too much of a mess of the code.

@asvetlov
Member

@N-Coder at first glance, tracing and timeouts are not tightly coupled.
Maybe after carefully re-reading your changes I'll change my mind, but that is my initial reaction.
Unfortunately I'm travelling now; I will review when I have time.

@thehesiod
Contributor

thehesiod commented Mar 18, 2018

@asvetlov oh yeah, I forgot that for now async doesn't hit the message loop unless you block, thanks.

@N-Coder thanks again for all this!

It's worth considering the underlying reason why each is needed (IMO):

  1. timeouts: for limiting potential delays (in this scenario almost always network because interrupting CPU is hard/expensive in cpython)
  2. tracing: for tracking where time is spent (network/cpu)

The commonality is that both want to know about/control the areas that can take a lot of time in aiohttp... so in this scenario they're very much alike; they want to know about/control similar areas.

also, 1) can impact 2) given it can interrupt a trace. 2) similarly can impact 1) if the tracing callbacks take too long.

so I guess the question is:

  1. is one a superset of the other (does one want to strictly interact with more than the other)
  2. or, do they share a similar "base" where they can diverge over time (they want to interact with similar areas, but each has different areas they want to interact with)
  3. Or are they completely divergent, with no overlap of areas of interest.

I think obviously it's not 3.

@asvetlov
Member

This is a kind of review.
First of all, thank you so much @N-Coder.
Very impressive work on mapping out the timeout problem.
While I don't agree with all the design decisions, without this PR we wouldn't even have had a starting point for discussing the feature implementation.

I really like the RequestTimeouts public API.
But I do have complaints about the metadata.
Please let me describe my position.
I used to apply meta-programming everywhere; it reduces the number of code lines, sure.
But later I figured out that saving code lines means nothing from both the code-reading and the debugging perspective.
Mental capacity is limited; an explicit call is very often more obvious than a getattr by attribute name followed by a call. Debugging is obviously complicated as well.
Let's keep the aiohttp code easy to read, even if we need to write more code.

This is also the reason why I want to separate client timeouts from client tracing signals.
They overlap, sure, but not entirely: read_timeout is a timeout between receiving data from the connected peer, and there are no corresponding signal brackets like on_connection_create_start / on_connection_create_end for it.

Moreover, I strongly suggest not using gather for client tracing: the Signal is explicitly and intentionally a list of async callbacks. The list is controlled by the user; if a callback in the list fails -- the whole HTTP request must fail.
Tracing libraries (clients of the aiohttp tracing facility) are responsible for not raising an exception if they are not supposed to.
Let's suppose we have a tracer which modifies HTTP headers. Why? I have no idea. Just as I don't assume that server middleware doesn't modify the request/response.
It means we should apply the tracing callbacks in a user-controlled order.
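In other words, keep the existing sequential pattern rather than gather, e.g.:

# callbacks run one after another, in the order the user registered them;
# an exception from any callback propagates and fails the whole HTTP request
for trace in traces:
    await trace.send_request_start(method, url, headers)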

In the end, thank you again for raising such important questions, not only about the timeouts implementation but about aiohttp design principles in general.

@asvetlov
Member

Done by #2972

@N-Coder thank you very much for the PR, #2972 was impossible without your code.

@asvetlov asvetlov closed this May 13, 2018
@lock

lock bot commented Oct 28, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a [new issue] for related bugs.
If you feel like there are important points made in this discussion, please include those excerpts in that [new issue].
[new issue]: https://github.com/aio-libs/aiohttp/issues/new

@lock lock bot added the outdated label Oct 28, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Oct 28, 2019