Use double caching for re._compile() #96346

Closed
serhiy-storchaka opened this issue Aug 27, 2022 · 3 comments
Labels: performance, topic-regex, type-feature

Comments

@serhiy-storchaka

The caching algorithm for re._compile() is one of the hottest sites in the stdlib. It has been rewritten many times.

The patch for using lru_cache() was applied and reverted three times! Eventually it turned out to be slower than a simple dict lookup. lru_cache() does not fit this case very well: some results must not be cached, so additional checks have to be performed before calling the cached function, which adds significant overhead.
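
For illustration only, a sketch of the shape the lru_cache() approach takes. compile_pattern() is a hypothetical stand-in for the expensive compilation step, and re.DEBUG stands in for the results that must not be cached:

from functools import lru_cache
import re

def compile_pattern(pattern, flags):
    # Hypothetical stand-in for the expensive compilation step.
    return (pattern, flags)

@lru_cache(maxsize=512)
def _compile_cached(pattern, flags):
    return compile_pattern(pattern, flags)

def _compile(pattern, flags):
    # Every call pays for this check and for the lru_cache wrapper
    # machinery, even on a cache hit.
    if flags & re.DEBUG:
        # DEBUG compilations must not be cached.
        return compile_pattern(pattern, flags)
    return _compile_cached(pattern, flags)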

But the LRU caching algorithm can have an advantage over the current simple algorithm when not all compiled patterns fit in the cache: it evicts entries more intelligently.

I tested with random keys drawn from an exponential distribution. With a cache size of 512, the largest difference is for lambda between 1/70 and 1/80: the LRU caching algorithm has about 3 times fewer misses, 0.16%–0.33% versus 0.5%–1.1% for the current algorithm. For significantly larger lambda (almost no misses) or significantly smaller lambda (tens of percent of misses), both algorithms give almost the same result.

A direct implementation of the LRU caching algorithm using OrderedDict (or just dict) would be slower, because every hit requires moving the found entry to the end. But we can use double caching: a smaller, faster primary cache that does not reorder entries, and a secondary LRU cache that is consulted only when the key is not found in the primary one. This algorithm has the same number of misses as plain LRU caching, but faster hits.
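
A minimal sketch of the double-cache lookup. The names mirror the test code below; compile_pattern() is a hypothetical stand-in for the cached computation, and the defensive exception handling around primary-cache eviction is omitted for brevity:

from collections import OrderedDict

_cache = {}              # primary cache: plain dict, FIFO eviction, no reordering
_cache2 = OrderedDict()  # secondary cache: LRU, reordered on every hit
_MAXCACHE = 512          # size limit of the secondary (LRU) cache
_MAXCACHE2 = 256         # size limit of the primary (FIFO) cache

def compile_pattern(key):
    # Hypothetical stand-in for the expensive computation being cached.
    return [key]

def lookup(key):
    # Fast path: a plain dict hit, with no reordering cost.
    try:
        return _cache[key]
    except KeyError:
        pass
    # Slow path: consult the secondary LRU cache.
    try:
        value = _cache2[key]
        _cache2.move_to_end(key)
    except KeyError:
        value = compile_pattern(key)
        if len(_cache2) >= _MAXCACHE:
            _cache2.popitem(last=False)  # evict the least recently used item
        _cache2[key] = value
    # Promote the result into the primary cache (FIFO eviction).
    if len(_cache) >= _MAXCACHE2:
        del _cache[next(iter(_cache))]  # drop the oldest item
    _cache[key] = value
    return value

On a primary-cache hit this costs a single dict lookup; only primary-cache misses pay the LRU bookkeeping.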

@serhiy-storchaka

The code that I used to count cache misses:

from random import expovariate
from collections import OrderedDict

def test(lambd, _MAXCACHE=512, *, n=10**6):
    # Current algorithm: a plain dict used as a FIFO cache.
    _cache = {}
    m = 0
    for i in range(n):
        key = int(expovariate(lambd))
        try:
            _cache[key]
        except KeyError:
            m += 1
            if len(_cache) >= _MAXCACHE:
                # Drop the oldest item
                try:
                    del _cache[next(iter(_cache))]
                except (StopIteration, RuntimeError, KeyError):
                    pass
            _cache[key] = [key]
    return m/n

def test2(lambd, _MAXCACHE=512, *, n=10**6):
    # Pure LRU: an OrderedDict reordered on every hit.
    _cache = OrderedDict()
    m = 0
    for i in range(n):
        key = int(expovariate(lambd))
        try:
            v = _cache[key]
        except KeyError:
            m += 1
            if len(_cache) >= _MAXCACHE:
                # Drop the least recently used item
                _cache.popitem(last=False)
            _cache[key] = [key]
        else:
            _cache.move_to_end(key)
    return m/n

def test3(lambd, _MAXCACHE=512, _MAXCACHE2=256, *, n=10**6):
    # Double caching: a FIFO dict (_cache, size _MAXCACHE2) in front
    # of an LRU OrderedDict (_cache2, size _MAXCACHE).
    _cache = {}
    _cache2 = OrderedDict()
    m = 0
    for i in range(n):
        key = int(expovariate(lambd))
        try:
            _cache[key]
        except KeyError:
            try:
                v = _cache2[key]
            except KeyError:
                m += 1
                if len(_cache2) >= _MAXCACHE:
                    # Drop the least recently used item
                    _cache2.popitem(last=False)
                _cache2[key] = v = [key]
            else:
                _cache2.move_to_end(key)
            if len(_cache) >= _MAXCACHE2:
                # Drop the oldest item
                try:
                    del _cache[next(iter(_cache))]
                except (StopIteration, RuntimeError, KeyError):
                    pass
                #_cache.clear()
            _cache[key] = v
    return m/n

For example:

$ ./python -i re_cache.py
>>> a=70; x=test(1/a); y=test2(1/a); z=test3(1/a); x, y, x/y, x/z, z/y
(0.005344, 0.001695, 3.152802359882006, 3.29064039408867, 0.9581120943952803)
>>> a=80; x=test(1/a); y=test2(1/a); z=test3(1/a); x, y, x/y, x/z, z/y
(0.011016, 0.003435, 3.2069868995633186, 3.178303519907675, 1.0090247452692866)
>>> a=40; x=test(1/a); y=test2(1/a); z=test3(1/a); x, y, x/y, x/z, z/y
(0.000433, 0.000423, 1.0236406619385343, 1.0212264150943395, 1.0023640661938535)
>>> a=400; x=test(1/a); y=test2(1/a); z=test3(1/a); x, y, x/y, x/z, z/y
(0.493634, 0.445853, 1.1071676090550024, 1.0859027154497298, 1.019582687567427)

Since this involves random simulation, the results vary from run to run by a few percent.

You can see that FIFO and LRU are about equally good (or bad) for large and for small lambda, and that the combined algorithm matches LRU in the number of misses.

It does not work well if the two caches have the same or nearly the same size:

>>> a=80; x=test(1/a); y=test2(1/a); z=test3(1/a, 512, 512); x, y, x/y, x/z, z/y
(0.011229, 0.003424, 3.279497663551402, 1.0396259605592075, 3.154497663551402)
>>> a=80; x=test(1/a); y=test2(1/a); z=test3(1/a, 512, 500); x, y, x/y, x/z, z/y
(0.010768, 0.003442, 3.1284137129575824, 1.8033830179199464, 1.7347472399767576)

But if the FIFO cache is 1/2 or 3/4 the size of the LRU cache, the difference is negligible:

>>> a=80; x=test(1/a); y=test2(1/a); z=test3(1/a, 512, 256); x, y, x/y, x/z, z/y
(0.011022, 0.003399, 3.2427184466019416, 3.265777777777778, 0.9929390997352162)
>>> a=80; x=test(1/a); y=test2(1/a); z=test3(1/a, 512, 384); x, y, x/y, x/z, z/y
(0.010715, 0.00333, 3.2177177177177176, 3.0977161029199194, 1.0387387387387386)

A smaller primary cache reduces the usefulness of the FIFO cache and can increase the average latency. The tests above do not measure latency; they only evaluate the miss rate.
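
(Not part of the tests above: a rough micro-benchmark sketch of the hit-latency difference between the two lookup paths, which is what motivates the FIFO front cache.)

import timeit

setup = """
from collections import OrderedDict
d = {i: None for i in range(512)}
od = OrderedDict((i, None) for i in range(512))
"""

# A plain-dict hit (the FIFO primary cache)...
print(timeit.timeit("d[256]", setup=setup))
# ...versus an LRU hit, which must also reorder the entry.
print(timeit.timeit("od[256]; od.move_to_end(256)", setup=setup))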

@serhiy-storchaka

That was only theory. Here is an artificial example showing that the optimization actually helps:

$ ./python -m timeit -n50000 -s 'from random import expovariate; from re import compile; lambd=1/80' 'compile(r"[az1\0-\uffff]"+str(int(expovariate(lambd))))'

Unpatched: 50000 loops, best of 5: 34.5 usec per loop
Patched: 50000 loops, best of 5: 17.9 usec per loop

The results are unstable, but in this case the patched variant is definitely faster than the baseline. In other examples the difference may not be as significant, and I have no estimate of what effect it has on real-world code.

@ambv

ambv commented Oct 7, 2022

Thanks, Serhiy! ✨ 🍰 ✨

@ambv closed this as completed Oct 7, 2022