CookieJar improvements #7583

Open · 1 of 3 tasks
Dreamsorcerer opened this issue Sep 6, 2023 · 8 comments

@Dreamsorcerer (Member) commented Sep 6, 2023

While looking through #7577, we found a few details that could possibly be improved.

  • It should really output duplicate cookies in the Cookie header (e.g. when cookies with different paths have the same name) to match the recommended behaviour in the spec. This may be fairly complex to refactor, as the cookies/morsels used from http.cookies don't seem to support this. Upon reviewing the documentation, I get the impression that http.cookies is intended for server-side, while http.cookiejar is intended for client-side. So, we may need to replace http.cookies to support this behaviour.
  • Cookies are now ordered by path length, but it should also use creation timestamp as a tie-breaker (point 2: https://www.rfc-editor.org/rfc/rfc6265.html#section-5.4). This is probably most relevant if cookies with the same name and path match (but different domains, like a root domain and subdomain). I don't think we currently store creation timestamp, so we'd probably need to add this.
  • (Implement filter_cookies() with domain-matching and path-matching #7944) .filter_cookies() currently just iterates over all cookies and checks whether each one is valid. This could be made more efficient: because cookies are stored as self._cookies[(domain, path)][name], we could do domain-matching and path-matching on the keys instead of testing every single cookie (see the sketch after the snippet below). The current iteration is simply:
```python
for val in self._cookies.values():
    yield from val.values()
```
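For illustration, a minimal sketch of matching on the keys (the helper names below are placeholders standing in for RFC 6265 matching logic, not aiohttp's actual API):

```python
# Hypothetical sketch: filter on the (domain, path) keys so whole buckets of
# cookies can be skipped without inspecting each morsel individually.
# `cookies` is a {(domain, path): {name: morsel}} mapping like self._cookies;
# `domain_matches` and `path_matches` are assumed matching helpers.
def iter_matching_morsels(cookies, request_host, request_path,
                          domain_matches, path_matches):
    """Yield morsels whose (domain, path) key can apply to the request."""
    for (domain, path), morsels in cookies.items():
        # Skip a whole bucket as soon as its key cannot match the request.
        if domain and not domain_matches(request_host, domain):
            continue
        if not path_matches(request_path, path):
            continue
        yield from morsels.values()
```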
@Rongronggg9 (Contributor)

I profiled my program with yappi and found that .filter_cookies() accounted for 27.5% (23.1 s / 83.9 s) of the total CPU time spent on requests.

As we can see, the preparation before filtering is very expensive.

```python
self._do_expiration()
if not isinstance(request_url, URL):
    warnings.warn(
        "The method accepts yarl.URL instances only, got {}".format(
            type(request_url)
        ),
        DeprecationWarning,
    )
    request_url = URL(request_url)
filtered: Union["SimpleCookie[str]", "BaseCookie[str]"] = (
    SimpleCookie() if self._quote_cookie else BaseCookie()
)
hostname = request_url.raw_host or ""
request_origin = URL()
with contextlib.suppress(ValueError):
    request_origin = request_url.origin()
```

However, not all requests will have cookies in their jar: for example, the initial request, or when the session is only used to request URLs that never send cookies (images, videos, files, etc.).

So I have another suggestion: check whether there are any cookies in the jar at all before doing anything else.
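A minimal sketch of that idea, reusing the internal names visible in the snippet above (`self._cookies`, `self._quote_cookie`); this is an assumption about the shape of the change, not a merged implementation:

```python
def filter_cookies(self, request_url):
    # Hypothetical early exit: an empty jar skips _do_expiration(), the
    # deprecation check and the origin() computation entirely.
    if not self._cookies:
        return SimpleCookie() if self._quote_cookie else BaseCookie()
    ...  # existing preparation and filtering
```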

@Dreamsorcerer (Member, Author)

Open PRs that probably resolve these performance issues: #7784 #7777 #7790

@Rongronggg9 (Contributor)

> Open PRs that probably resolve these performance issues: #7784 #7777 #7790

I see. But they do not eliminate the need to call URL.origin(), which is also expensive, even when the jar is empty. Do you think my suggestion is a good idea? If so, I can open a PR.

@Dreamsorcerer (Member, Author)

If it's an easy change, feel free to make a PR; it's easier for me to evaluate the code.

@bdraco (Member) commented Nov 11, 2023

> I see. But they do not eliminate the need to call URL.origin(), which is also expensive, even when the jar is empty. Do you think my suggestion is a good idea? If so, I can open a PR.

I see origin being expensive in the profile as well. It's much more expensive if it's an IP address instead of a hostname, because it has to recreate the ip_address object. I think you'll need to do another PR for that one.
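A quick way to observe that cost difference (an illustrative micro-benchmark, not taken from the profile above) is to time `URL.origin()` on a hostname URL versus an IP URL:

```python
# Time URL.origin() for a hostname URL vs. an IP-address URL.
from timeit import timeit

from yarl import URL

host_url = URL("http://example.com/some/path")
ip_url = URL("http://127.0.0.1:8080/some/path")

print("hostname origin():", timeit(host_url.origin, number=100_000))
print("ip origin():", timeit(ip_url.origin, number=100_000))
```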

@bdraco (Member) commented Nov 11, 2023

It would be nice if we had a simple benchmark script to compare before and after changes for the cookie jar (probably the url dispatcher as well).

The cookie jar and the url dispatcher tend to be the bottlenecks for large aiohttp installs so anything we can do to improve them will make things scale much better.

@Dreamsorcerer (Member, Author)

There's a benchmarks repo, which I've not looked at yet; maybe if that is dusted off it can be used?
https://github.com/aio-libs/aiohttp-benchmarks

@bdraco (Member) commented Nov 11, 2023

It looks like those are mostly? end-to-end benchmarks. Since we already know where the bottlenecks are, I'd be more interested in something that adds 10000 cookies to the cookie jar and does timing on how long it takes to call filter_cookies. Probably one should have an ip address in the url, and one should have a hostname.

For the url dispatcher, add 5000 resources and see how much time it takes to dispatch to the last one in the list vs the first one in the list.
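A rough sketch of the cookie-jar half of that benchmark (assumed setup, not an official aiohttp benchmark script): fill the jar with 10000 cookies, then time `filter_cookies()` for a hostname URL and an IP URL.

```python
import time
from http.cookies import SimpleCookie

from yarl import URL
from aiohttp import CookieJar

jar = CookieJar(unsafe=True)  # unsafe=True so cookies for IP hosts are accepted
for i in range(10_000):
    cookie = SimpleCookie()
    cookie[f"cookie_{i}"] = "value"
    cookie[f"cookie_{i}"]["domain"] = f"host{i}.example.com"
    cookie[f"cookie_{i}"]["path"] = "/"
    jar.update_cookies(cookie, URL(f"http://host{i}.example.com/"))

# Time filtering against a hostname URL and an IP-address URL.
for url in (URL("http://host0.example.com/"), URL("http://127.0.0.1/")):
    start = time.perf_counter()
    for _ in range(100):
        jar.filter_cookies(url)
    print(url, time.perf_counter() - start)
```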

Rongronggg9 added a commit to Rongronggg9/RSS-to-Telegram-Bot that referenced this issue Nov 12, 2023
Dreamsorcerer pushed a commit that referenced this issue Nov 12, 2023
…he jar is empty or all cookies have expired (#7822)

**This is a backport of PR #7819 as merged into master (dfc3f89).**

## What do these changes do?

The filtering itself and its preparation in `CookieJar.filter_cookies()`
are expensive. Sometimes there are no cookies in the jar, or all cookies
have expired. Skip filtering and its preparation in that case.

Because the empty check is much cheaper than `_do_expiration()`, I think
it deserves to be duplicated before and after calling
`_do_expiration()`.

```console
$ python3.11 -m timeit -s 'from collections import defaultdict; d=defaultdict(foo="bar")' \
> 'if not d: pass'
50000000 loops, best of 5: 8.3 nsec per loop
$ python3.11 -m timeit -s 'from collections import defaultdict; d=defaultdict()' \
> 'if not d: pass'
50000000 loops, best of 5: 8.74 nsec per loop
$ python3.11 -m timeit -s 'from aiohttp import CookieJar; cj = CookieJar()' \
> 'cj._do_expiration()'
200000 loops, best of 5: 1.86 usec per loop
```
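For reference, a rough sketch of the resulting shape of `filter_cookies()` (an approximation of the change described above, not the exact merged diff):

```python
def filter_cookies(self, request_url):
    if not self._cookies:
        # Empty jar: nothing to expire or filter.
        return SimpleCookie() if self._quote_cookie else BaseCookie()
    self._do_expiration()
    if not self._cookies:
        # Everything expired just now: same cheap early exit.
        return SimpleCookie() if self._quote_cookie else BaseCookie()
    ...  # existing URL conversion, origin() and per-cookie filtering
```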

## Are there changes in behavior for the user?

No.

## Related issue number

#7583 (comment)

## Checklist

- [x] I think the code is well written
- [ ] Unit tests for the changes exist
- [ ] Documentation reflects the changes
- [x] If you provide code modification, please add yourself to `CONTRIBUTORS.txt`
  * The format is <Name> <Surname>.
  * Please keep alphabetical order, the file is sorted by names.
- [x] Add a new news fragment into the `CHANGES` folder
  * name it `<issue_id>.<type>` for example (588.bugfix)
  * if you don't have an `issue_id` change it to the pr id after creating the pr
  * ensure type is one of the following:
    * `.feature`: Signifying a new feature.
    * `.bugfix`: Signifying a bug fix.
    * `.doc`: Signifying a documentation improvement.
    * `.removal`: Signifying a deprecation or removal of public API.
    * `.misc`: A ticket has been closed, but it is not of interest to users.
  * Make sure to use full sentences with correct case and punctuation, for example: "Fix issue with non-ascii contents in doctest text files."
xiangxli pushed a commit to xiangxli/aiohttp that referenced this issue Dec 4, 2023
xiangxli added a commit to xiangxli/aiohttp that referenced this issue Dec 4, 2023
silvered-shark added a commit to silvered-shark/RSS-to-Telegram-Bot that referenced this issue Jun 17, 2024