
gh-116968: Reimplement Tier 2 counters #117144

Merged

gvanrossum merged 46 commits into python:main from gvanrossum:exp-backoff on Apr 4, 2024

Conversation

gvanrossum
Member

@gvanrossum gvanrossum commented Mar 22, 2024

This introduces a unified 16-bit backoff counter type (_Py_BackoffCounter), shared between the Tier 1 adaptive specialization machinery and the Tier 2 optimizer.

The latter's side exit temperature now uses exponential backoff, and starts initially at 64, to avoid creating side exit traces for code that hasn't been respecialized yet (since the latter only happens after the cooldown counter has reached zero from an initial value of 52).

The threshold value for back-edge optimizations is no longer dynamic; we just use a backoff counter initialized to (16, 4).
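
To make the shape of this concrete, here is a minimal, self-contained sketch (not the actual pycore_backoff.h code) of a 16-bit value/backoff counter with exponential backoff, using the field and helper names that appear later in this thread (value, backoff, make_backoff_counter, backoff_counter_triggers). The bit widths, the exponent cap, and the restart logic are assumptions.

    #include <stdint.h>

    /* Sketch only: 12 bits of countdown plus 4 bits of backoff exponent
       packed into one 16-bit cache entry (the bit widths are an assumption). */
    typedef struct {
        uint16_t value   : 12;  /* counts down to zero */
        uint16_t backoff : 4;   /* exponent used for the next restart */
    } BackoffCounter;

    static inline BackoffCounter
    make_backoff_counter(uint16_t value, uint16_t backoff)
    {
        BackoffCounter c = { .value = value, .backoff = backoff };
        return c;
    }

    /* The counter "triggers" (e.g. try to specialize, or to create a
       side-exit trace) when it reaches zero. */
    static inline int
    backoff_counter_triggers(BackoffCounter c)
    {
        return c.value == 0;
    }

    /* Count down by one on each execution. */
    static inline BackoffCounter
    advance_backoff_counter(BackoffCounter c)
    {
        if (c.value != 0) {
            c.value--;
        }
        return c;
    }

    /* After a failed attempt, wait exponentially longer before trying again:
       bump the exponent (capped at 12 here) and restart at 2**backoff - 1. */
    static inline BackoffCounter
    restart_backoff_counter(BackoffCounter c)
    {
        uint16_t backoff = c.backoff < 12 ? (uint16_t)(c.backoff + 1) : 12;
        return make_backoff_counter((uint16_t)((1 << backoff) - 1), backoff);
    }

In this sketch, a counter initialized to (16, 4) triggers after 16 executions; each failed attempt then roughly doubles the wait (31, 63, 127, ...) up to the 12-bit cap.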

@mdboom
Contributor

mdboom commented Mar 22, 2024

Results of the benchmark run are available at https://github.com/faster-cpython/benchmarking-public/tree/main/results/bm-20240321-3.13.0a5+-716c0c6-JIT: 1% faster geometric mean, 0% faster HPT, 1% less memory. Notably, this strongly improves a lot of the interpreter-heavy benchmarks.

@mdboom
Contributor

mdboom commented Mar 22, 2024

Additionally, concerning the results: it's probably safe to say this PR makes things better, but given that #116206 increased the overall benchmark times from 1h15 to 2h30, they are probably noisier than usual.

@gvanrossum
Member Author

> #116206 increasing the overall benchmark times from 1h15 to 2h30

I missed that. Do we run every benchmark twice for incremental GC? Or did that make CPython twice as slow? Regardless, it seems unfortunate that the benchmarks now take 2h30.

Regarding the benchmark numbers, I'm guessing the improvements come from not wasting so much time on fruitless efforts like in hexiom. And possibly because the JUMP_BACKWARD implementation no longer needs to reference interp->optimizer_backedge_threshold in its fast path.

There's still a lot of cleanup to do in this PR. @markshannon What do you think of my general approach?

@mdboom
Contributor

mdboom commented Mar 22, 2024

> I missed that. Do we run every benchmark twice for incremental GC? Or did that make CPython twice as slow? Regardless, it seems unfortunate that the benchmarks now take 2h30.

Nothing changed in how we run the benchmarks -- #116206 just seems to be a large regression overall, though more than made up for by the follow-up in #117120. Once that's merged we should hopefully have working pystats and closer-to-baseline timings again.

@gvanrossum
Member Author

Exciting!

@ericsnowcurrently
Member

ericsnowcurrently commented Mar 25, 2024

Regarding the WASI failures, "call stack exhausted" means a stack overflow. Our WASI builds have a stack of about 8MB [1][2]. From the notes in the build script, that's derived from the stack size on Linux (ulimit -s).

There are two possibilities with the failures here:

  • the stack is going deeper than before
  • at least some stack frames are larger than before

In either case I'd normally expect a new "call stack exhausted" crash to indicate that we were already close to the limit. I'm not sure how well that does or doesn't apply for this PR. That said, given that the stack size on WASI is meant to be similar to Linux, I'd expect a problem on WASI to manifest on Linux too. Perhaps WASI is simply our canary in a coal mine here?

The next step is probably to take a look at how close we are getting to the stack size on Linux (since we know we're hitting it on WASI). If we're not getting close, then we'll need to see what's so special about WASI here.
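
Not part of the PR, just a rough, hedged sketch of that "how close are we getting on Linux" measurement: read the RLIMIT_STACK soft limit with getrlimit() and compare it against how far the C stack has grown at the point of interest. The probe technique and its placement are assumptions; in practice you would sample at the deepest recursion point of the failing test rather than in main().

    #include <stddef.h>
    #include <stdio.h>
    #include <sys/resource.h>

    /* Very rough sketch: estimate C stack usage on Linux by comparing a local
       variable's address against one captured near the top of the stack. */

    static char *stack_top;   /* address of a local near the top of the stack */

    static size_t
    stack_used(void)
    {
        char here;
        return (size_t)(stack_top - &here);   /* stacks grow downward on Linux */
    }

    int
    main(void)
    {
        char top;
        stack_top = &top;

        struct rlimit rl;
        if (getrlimit(RLIMIT_STACK, &rl) != 0) {
            perror("getrlimit");
            return 1;
        }

        /* In a real experiment, take this measurement at the deepest recursion
           point reached by the failing test, not here. */
        printf("stack limit: %llu bytes, used here: %zu bytes\n",
               (unsigned long long)rl.rlim_cur, stack_used());
        return 0;
    }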

For similar failures see https://github.com/python/cpython/issues?q=is%3Aissue+%22call+stack+exhausted%22. There were cases where lowering the Python recursion limit was the solution. However, I don't think that applies here.

CC @brettcannon

Footnotes

  1. https://github.com/python/cpython/blob/7af063d/Tools/wasm/wasi.py#L285

  2. https://github.com/python/cpython/blob/7af063d/Tools/wasm/wasm_build.py#L332

@ericsnowcurrently
Member

Also note that 226 tests did pass.

@gvanrossum
Member Author

gvanrossum commented Mar 26, 2024

This is still in draft mode. Here's my plan:

  • Create a new (internal) counter API named BACKOFF_COUNTER / backoff_counter
  • Alias the ADAPTIVE_COUNTER APIs to use the new API
  • Use the new API names for the Tier 2 counters
  • Get rid of resume_threshold (it's unused)
  • Investigate and fix the WASI stack overflows

Once that's all done (EDIT: and the tests pass) I'll request reviews.
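
For the second item above (aliasing the ADAPTIVE_COUNTER APIs), a hedged sketch of what such aliases could look like. The macro names come from the review discussion further down; the alias bodies and the struct layout here are assumptions, not the code in this PR.

    #include <stdint.h>

    /* Assumed layout, for illustration only. */
    typedef struct { uint16_t value; uint16_t backoff; } _Py_BackoffCounter;

    /* New (internal) backoff-counter API; definitions omitted here. */
    int backoff_counter_triggers(_Py_BackoffCounter counter);
    _Py_BackoffCounter advance_backoff_counter(_Py_BackoffCounter counter);

    /* Old Tier 1 spellings kept as thin aliases over the new API, so the
       adaptive specialization code keeps compiling unchanged. */
    #define ADAPTIVE_COUNTER_TRIGGERS(COUNTER)  backoff_counter_triggers(COUNTER)
    #define ADVANCE_ADAPTIVE_COUNTER(COUNTER)   ((COUNTER) = advance_backoff_counter(COUNTER))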

This changes a lot of things but the end result is arguably better.
- The initial exit temperature is 64; this must be greater
  than the specialization cooldown value (52) otherwise we might
  create a trace before we have re-specialized the Tier 1 bytecode
- There's now a handy helper function for every counter initialization
@gvanrossum
Member Author

I'm starting 3 benchmarks:

  • Main Linux machine: Tier 1, pystats (to see if I didn't screw up Tier 1)
  • Alt Linux machine: Tier 2, pystats (to see if the Tier 2 exp. backoff is working)
  • Mac/ARM64: JIT, no pystats (to see if this makes the JIT faster or slower)

I also have prepared a blurb, but I'll merge it only when I have something else to merge (to save CI resources):

+Introduce a unified 16-bit backoff counter type (``_Py_BackoffCounter``),
+shared between the Tier 1 adaptive specializer and the Tier 2 optimizer. The
+API used for adaptive specialization counters is changed but the behavior is
+(supposed to be) identical.
+
+The behavior of the Tier 2 counters is changed:
+
+- There are no longer dynamic thresholds (we never varied these).
+- All counters now use the same exponential backoff.
+- The counter for ``JUMP_BACKWARD`` starts counting down from 16.
+- The ``temperature`` in side exits starts counting down from 64.

@@ -477,13 +473,9 @@ write_location_entry_start(uint8_t *ptr, int code, int length)
#define ADAPTIVE_COOLDOWN_VALUE 52
#define ADAPTIVE_COOLDOWN_BACKOFF 0

#define MAX_BACKOFF_VALUE (16 - ADAPTIVE_BACKOFF_BITS)


static inline uint16_t
adaptive_counter_bits(uint16_t value, uint16_t backoff) {
Member Author

I could dispense with adaptive_counter_bits(), using make_backoff_counter() directly below, but I would oppose getting rid of the cooldown() and warmup() helpers, because they are used in several places.
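
A sketch of the kind of helpers this refers to, written on top of make_backoff_counter(). The cooldown constants are the ones in the hunk above; the warmup constants and exact helper names are assumptions.

    /* Sketch only; assumes the _Py_BackoffCounter type, make_backoff_counter(),
       and ADAPTIVE_COOLDOWN_VALUE/BACKOFF defined earlier in this header.
       The warmup constants below are placeholders, not the real values. */
    #define ADAPTIVE_WARMUP_VALUE 1
    #define ADAPTIVE_WARMUP_BACKOFF 1

    static inline _Py_BackoffCounter
    adaptive_counter_warmup(void)
    {
        return make_backoff_counter(ADAPTIVE_WARMUP_VALUE, ADAPTIVE_WARMUP_BACKOFF);
    }

    static inline _Py_BackoffCounter
    adaptive_counter_cooldown(void)
    {
        return make_backoff_counter(ADAPTIVE_COOLDOWN_VALUE, ADAPTIVE_COOLDOWN_BACKOFF);
    }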

@@ -89,7 +89,7 @@ static inline uint16_t uop_get_error_target(const _PyUOpInstruction *inst)

typedef struct _exit_data {
uint32_t target;
int16_t temperature;
_Py_BackoffCounter temperature;
Member Author

temperature is now a bit of a misnomer, since it counts down. Maybe it should be renamed to counter (same as in CODEUNIT)?

Member

I'm fine with either.

Member Author

I'll keep temperature, despite the misnomer -- it stands out and makes it easy to grep for this particular concept.

{
return counter.value == 0;
}

static inline uint16_t
initial_backoff_counter(void)
/* Initial JUMP_BACKWARD counter.
Member Author

There's a lot of boilerplate for these initial values; I followed your lead for adaptive_counter_warmup() and adaptive_counter_cooldown(), more or less.

@gvanrossum
Member Author

gvanrossum commented Apr 4, 2024

Benchmarking results comment (will update as they complete):

  • JIT run -- not faster or slower, but uses 1% less memory. I notice that several benchmarks now have distinctly bimodal outcomes (perhaps due to the benchmark outpacing GC?).
  • Tier 1 run -- everything is good, almost no changes (a few fewer method cache misses, presumably the result of hash randomization)
  • Tier 2 pystats -- I think this looks even slightly better than before: 653k optimization attempts (waaay down), 108k traces created (9% down), 6.3B traces executed (about even), 183B uops executed (1% up).

@markshannon markshannon left a comment

Looks good. Thanks for doing this.

@markshannon
Member

A possible further improvement (for another PR) would be to make the code generator aware of counters.
Then we could change

        specializing op(_SPECIALIZE_TO_BOOL, (counter/1, value -- value)) {
            if (ADAPTIVE_COUNTER_TRIGGERS(counter)) {
                ... 
            ADVANCE_ADAPTIVE_COUNTER(this_instr[1].counter);

to

        specializing op(_SPECIALIZE_TO_BOOL, (counter/1, value -- value)) {
            if (backoff_counter_triggers(counter)) {
                ... 
            advance_backoff_counter(counter);

by having the code generator generate:

    _Py_BackoffCounter *counter = &this_instr[1].counter;

instead of

    uint16_t counter = read_u16(&this_instr[1].cache);

@gvanrossum gvanrossum enabled auto-merge (squash) April 4, 2024 14:33
@gvanrossum gvanrossum merged commit 060a96f into python:main Apr 4, 2024
60 of 61 checks passed
@gvanrossum gvanrossum deleted the exp-backoff branch April 4, 2024 15:06
@bedevere-bot

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot AMD64 Ubuntu NoGIL 3.x has failed when building commit 060a96f.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/all/#builders/1225/builds/1939) and take a look at the build logs.
  4. Check if the failure is related to this commit (060a96f) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/all/#builders/1225/builds/1939

Failed tests:

  • test_math

Summary of the results of the build (if available):

==

Traceback logs:
remote: Enumerating objects: 69, done.        
remote: Counting objects: 100% (69/69), done.        
remote: Compressing objects: 100% (32/32), done.        
remote: Total 36 (delta 33), reused 5 (delta 4), pack-reused 0        
From https://github.com/python/cpython
 * branch                  main       -> FETCH_HEAD
Note: switching to '060a96f1a9a901b01ed304aa82b886d248ca1cb6'.

HEAD is now at 060a96f1a9 gh-116968: Reimplement Tier 2 counters (#117144)
Switched to and reset branch 'main'

make: *** [Makefile:2231: buildbottest] Error 2

@tacaswell
Contributor

This change appears to have broken building scipy:

FAILED: scipy/special/_specfun.cpython-313-x86_64-linux-gnu.so.p/meson-generated__specfun.cpp.o
ccache c++ -Iscipy/special/_specfun.cpython-313-x86_64-linux-gnu.so.p -Iscipy/special -I../scipy/special -I../../../../home/tcaswell/.virtualenvs/py313/lib/python3.13/site-packages/numpy/_core/include -I/home/tcaswell/.pybuild/py313/include/python3.13 -fvisibility=hidden -fvisibility-inlines-hidden -fdiagnostics-color=always -DNDEBUG -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -std=c++17 -O3 -fpermissive -fPIC -DNPY_NO_DEPRECATED_API=NPY_1_9_API_VERSION -MD -MQ scipy/special/_specfun.cpython-313-x86_64-linux-gnu.so.p/meson-generated__specfun.cpp.o -MF scipy/special/_specfun.cpython-313-x86_64-linux-gnu.so.p/meson-generated__specfun.cpp.o.d -o scipy/special/_specfun.cpython-313-x86_64-linux-gnu.so.p/meson-generated__specfun.cpp.o -c scipy/special/_specfun.cpython-313-x86_64-linux-gnu.so.p/_specfun.cpp
In file included from /home/tcaswell/.pybuild/py313/include/python3.13/internal/pycore_code.h:461,
                 from /home/tcaswell/.pybuild/py313/include/python3.13/internal/pycore_frame.h:13,
                 from scipy/special/_specfun.cpython-313-x86_64-linux-gnu.so.p/_specfun.cpp:14948:
/home/tcaswell/.pybuild/py313/include/python3.13/internal/pycore_backoff.h: In function ‘_Py_BackoffCounter make_backoff_counter(uint16_t, uint16_t)’:
/home/tcaswell/.pybuild/py313/include/python3.13/internal/pycore_backoff.h:47:67: error: designator order for field ‘_Py_BackoffCounter::<unnamed union>::<unnamed struct>::backoff’ does not match declaration order in ‘_Py_BackoffCounter::<unnamed union>::<unnamed struct>’
   47 |     return (_Py_BackoffCounter){.value = value, .backoff = backoff};

Confirmed scipy builds with 63bbe77.
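
For context on the error above: C accepts designated initializers in any order, but C++ (which scipy uses to compile this translation unit) requires them to follow declaration order. A minimal illustration with an invented struct layout, not the real _Py_BackoffCounter and not necessarily the fix in 63bbe77:

    #include <stdint.h>

    /* Invented layout: 'backoff' declared before 'value', as the g++ error
       above implies for the real struct. */
    struct counter {
        uint16_t backoff : 4;
        uint16_t value   : 12;
    };

    static inline struct counter
    make_counter_out_of_order(uint16_t value, uint16_t backoff)
    {
        /* Valid C; g++ compiling this as C++ rejects the designator order,
           because '.value' is designated before the earlier-declared '.backoff'. */
        return (struct counter){.value = value, .backoff = backoff};
    }

    static inline struct counter
    make_counter_in_order(uint16_t value, uint16_t backoff)
    {
        /* Designators in declaration order compile under both C and C++. */
        return (struct counter){.backoff = backoff, .value = value};
    }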

diegorusso pushed a commit to diegorusso/cpython that referenced this pull request Apr 17, 2024
Introduce a unified 16-bit backoff counter type (``_Py_BackoffCounter``),
shared between the Tier 1 adaptive specializer and the Tier 2 optimizer. The
API used for adaptive specialization counters is changed but the behavior is
(supposed to be) identical.

The behavior of the Tier 2 counters is changed:
- There are no longer dynamic thresholds (we never varied these).
- All counters now use the same exponential backoff.
- The counter for ``JUMP_BACKWARD`` starts counting down from 16.
- The ``temperature`` in side exits starts counting down from 64.
mpage added a commit to mpage/cpython that referenced this pull request Sep 10, 2024
- Fix a few places where we were not using atomics to (de)instrument
  opcodes.
- Fix a few places where we weren't using atomics to reset adaptive
  counters.
- Remove some redundant non-atomic resets of adaptive counters that
  presumably snuck in as merge artifacts of python#118064 and
  python#117144 landing close together.
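
As an aside on what "using atomics to reset adaptive counters" means in practice, here is a hedged sketch using plain C11 atomics rather than CPython's internal _Py_atomic_* helpers; the field packing and values are placeholders.

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Sketch only: a 16-bit counter cache entry shared between threads in the
       free-threaded build. Both the advance on the execution path and the reset
       on the (de)instrumentation path go through atomic loads and stores;
       otherwise the concurrent plain accesses would be a data race. */
    static _Atomic uint16_t counter_cache;

    static void
    advance_counter(void)
    {
        uint16_t c = atomic_load_explicit(&counter_cache, memory_order_relaxed);
        if (c != 0) {
            atomic_store_explicit(&counter_cache, (uint16_t)(c - 1),
                                  memory_order_relaxed);
        }
    }

    static void
    reset_counter(uint16_t initial)
    {
        atomic_store_explicit(&counter_cache, initial, memory_order_relaxed);
    }

    int
    main(void)
    {
        reset_counter(52);   /* e.g. the adaptive cooldown value from this PR */
        advance_counter();
        printf("counter = %u\n",
               (unsigned)atomic_load_explicit(&counter_cache, memory_order_relaxed));
        return 0;
    }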