gh-116968: Reimplement Tier 2 counters #117144
Conversation
Results of the benchmark are available here: 1% faster geom. mean, 0% faster HPT, 1% less memory. Though notably this strongly improves a lot of the interpreter-heavy benchmarks. https://github.com/faster-cpython/benchmarking-public/tree/main/results/bm-20240321-3.13.0a5+-716c0c6-JIT

Additionally, concerning the results, it's probably safe to say this PR makes things better, but given that #116206 increased the overall benchmark times from 1h15 to 2h30, they are probably noisier than usual.
I missed that. Do we run every benchmark twice for incremental GC? Or did that make CPython twice as slow? Regardless, it seems unfortunate that the benchmarks now take 2h30.

Regarding the benchmark numbers, I'm guessing the improvements come from not wasting so much time on fruitless efforts like in hexiom. And possibly because the

There's still a lot of cleanup to do in this PR. @markshannon What do you think of my general approach?
Nothing changed in how we run the benchmarks -- #116206 just seems to be a large regression overall, though more than made up by the follow-up in #117120. Once that's merged we should hopefully have working pystats and closer-to-baseline timings again.

Exciting!
Regarding the WASI failures, "call stack exhausted" means a stack overflow. Our WASI builds have a stack of about 8MB. From the notes in the build script, that's derived from the stack size on Linux. There are two possibilities with the failures here:
In either case I'd normally expect a new "call stack exhausted" crash to indicate that we were already close to the limit. I'm not sure how well that does or doesn't apply for this PR. That said, given that the stack size on WASI is meant to be similar to Linux, I'd expect a problem on WASI to manifest on Linux too. Perhaps WASI is simply our canary in a coal mine here?

The next step is probably to take a look at how close we are getting to the stack size on Linux (since we know we're hitting it on WASI). If we're not getting close then we'll need to see what's so special about WASI here.

For similar failures see https://github.com/python/cpython/issues?q=is%3Aissue+%22call+stack+exhausted%22. There were cases where lowering the Python recursion limit was the solution. However, I don't think that applies here. CC @brettcannon
Also note that 226 tests did pass. |
This is still in draft mode. Here's my plan:
Once that's all done (EDIT: and the tests pass) I'll request reviews. |
This reverts commit 8f79a60.
This changes a lot of things but the end result is arguably better.
- The initial exit temperature is 64; this must be greater than the specialization cooldown value (52), otherwise we might create a trace before we have re-specialized the Tier 1 bytecode.
- There's now a handy helper function for every counter initialization.
I'm starting 3 benchmarks:
I also have prepared a blurb, but I'll merge it only when I have something else to merge (to save CI resources):

> Introduce a unified 16-bit backoff counter type (``_Py_BackoffCounter``), shared between the Tier 1 adaptive specializer and the Tier 2 optimizer. The API used for adaptive specialization counters is changed but the behavior is (supposed to be) identical.
>
> The behavior of the Tier 2 counters is changed:
>
> - There are no longer dynamic thresholds (we never varied these).
> - All counters now use the same exponential backoff.
> - The counter for ``JUMP_BACKWARD`` starts counting down from 16.
> - The ``temperature`` in side exits starts counting down from 64.
```c
@@ -477,13 +473,9 @@ write_location_entry_start(uint8_t *ptr, int code, int length)
#define ADAPTIVE_COOLDOWN_VALUE 52
#define ADAPTIVE_COOLDOWN_BACKOFF 0

#define MAX_BACKOFF_VALUE (16 - ADAPTIVE_BACKOFF_BITS)

static inline uint16_t
adaptive_counter_bits(uint16_t value, uint16_t backoff) {
```
I could dispense with `adaptive_counter_bits()`, using `make_backoff_counter()` directly below, but I would oppose getting rid of the `cooldown()` and `warmup()` helpers, because they are used in several/many places.
```diff
@@ -89,7 +89,7 @@ static inline uint16_t uop_get_error_target(const _PyUOpInstruction *inst)

 typedef struct _exit_data {
     uint32_t target;
-    int16_t temperature;
+    _Py_BackoffCounter temperature;
```
`temperature` is now a bit of a misnomer, since it counts down. Maybe it should be renamed to `counter` (same as in `CODEUNIT`)?
I'm fine with either.
I'll keep `temperature`, despite the misnomer -- it stands out and makes it easy to grep for this particular concept.
```c
{
    return counter.value == 0;
}

static inline uint16_t
initial_backoff_counter(void)
/* Initial JUMP_BACKWARD counter.
```
There's a lot of boilerplate for these initial values; I followed your lead for `adaptive_counter_warmup()` and `adaptive_counter_cooldown()`, more or less.

Benchmarking results comment (will update as they complete):
Looks good. Thanks for doing this.

A possible further improvement (for another PR) would be to make the code generator aware of counters, changing

```c
specializing op(_SPECIALIZE_TO_BOOL, (counter/1, value -- value)) {
    if (ADAPTIVE_COUNTER_TRIGGERS(counter)) {
        ...
        ADVANCE_ADAPTIVE_COUNTER(this_instr[1].counter);
```

to

```c
specializing op(_SPECIALIZE_TO_BOOL, (counter/1, value -- value)) {
    if (backoff_counter_triggers(counter)) {
        ...
        advance_backoff_counter(counter);
```

by having the code generator generate:

```c
_Py_BackoffCounter *counter = &this_instr[1].counter;
```

instead of
This change appears to have broken building scipy.

Confirmed: scipy builds with 63bbe77.
Introduce a unified 16-bit backoff counter type (``_Py_BackoffCounter``), shared between the Tier 1 adaptive specializer and the Tier 2 optimizer. The API used for adaptive specialization counters is changed but the behavior is (supposed to be) identical.

The behavior of the Tier 2 counters is changed:

- There are no longer dynamic thresholds (we never varied these).
- All counters now use the same exponential backoff.
- The counter for ``JUMP_BACKWARD`` starts counting down from 16.
- The ``temperature`` in side exits starts counting down from 64.
- Fix a few places where we were not using atomics to (de)instrument opcodes.
- Fix a few places where we weren't using atomics to reset adaptive counters.
- Remove some redundant non-atomic resets of adaptive counters that presumably snuck in as merge artifacts of python#118064 and python#117144 landing close together.
This introduces a unified 16-bit backoff counter type (`_Py_BackoffCounter`), shared between the Tier 1 adaptive specialization machinery and the Tier 2 optimizer. The latter's side exit temperature now uses exponential backoff and starts at an initial value of 64, to avoid creating side exit traces for code that hasn't been respecialized yet (since respecialization only happens after the cooldown counter has reached zero from an initial value of 52).

The threshold value for back-edge optimizations is no longer dynamic; we just use a backoff counter initialized to (16, 4).