Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-106529: Make FOR_ITER a viable uop #112134

Merged
merged 17 commits into from
Nov 20, 2023
Merged

Conversation

gvanrossum
Copy link
Member

@gvanrossum gvanrossum commented Nov 15, 2023

Also clean up a few nits in the code generator, and add back an important Make dependency for ceval.o.

@gvanrossum
Copy link
Member Author

gvanrossum commented Nov 15, 2023

Hold on, need to remove merge conflicts from commit history.

@gvanrossum gvanrossum marked this pull request as ready for review November 15, 2023 23:58
@@ -176,7 +175,7 @@ def variable_used_unspecialized(node: parsing.Node, name: str) -> bool:
tokens: list[lx.Token] = []
skipping = False
for i, token in enumerate(node.tokens):
if token.kind == "MACRO":
if token.kind == "CMACRO":
Copy link
Member Author

@gvanrossum gvanrossum Nov 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: This fix resulted in _SPECIALIZE_UNPACK_SEQUENCE becoming a viable uop. It was missing a TIER_ONE_ONLY marker; I've added it back. (The fix is needed to restore the feature that this function doesn't look inside #if TIER_ONE.)

@gvanrossum
Copy link
Member Author

@brandtbucher Let me know how bad this interferes with the JIT branch. The stuff I put in _FOR_ITER_TIER_TWO probably shouldn't be compiled literally into the template. Maybe I should make part of the code at the deoptimize label into a macro? (But the setting of frame->instr_ptr must differ -- we can't use target since that's also used by error exits, since GH-112065.)

@gvanrossum
Copy link
Member Author

@brandtbucher
Copy link
Member

@brandtbucher Let me know how bad this interferes with the JIT branch. The stuff I put in _FOR_ITER_TIER_TWO probably shouldn't be compiled literally into the template. Maybe I should make part of the code at the deoptimize label into a macro? (But the setting of frame->instr_ptr must differ -- we can't use target since that's also used by error exits, since GH-112065.)

So, this doesn't actually look too bad. I think with a tiny bit of reworking, this could be merged into the JIT branch with no pain:

  • Move the line that sets frame->instr_ptr from the exit_trace label into the _EXIT_TRACE instruction, using a macro (like CURRENT_TARGET() or something) to access the target member.
  • Remove the stuff from _FOR_ITER_TIER_TWO that updates the stats, saves the stack pointer, and decrefs the executor, and just goto exit_trace; instead after updating frame->instr_ptr.

@gvanrossum
Copy link
Member Author

Okay, I think I managed to do that. I can now merge into the justin branch with only a single trivial merge conflict (we both added something to the end of ceval_macros.h). Next up of course, how would I redefine CURRENT_TARGET() in templace.c?

@brandtbucher
Copy link
Member

#define CURRENT_TARGET() (target)

@gvanrossum
Copy link
Member Author

#define CURRENT_TARGET() (target)

Yup, and I also needed to move the exit_trace: label one line down (to avoid setting frame->instr_ptr twice).

It seems to work, but unsure how to prove it (it compiles and passes tests, but it would too if the JIT was never invoked).

Anyway, we should probably wait until we've decided that making Tier 2 5% slower is a good idea.

@markshannon
Copy link
Member

There seem to be a few unrelated changes in this PR.
Could you make another PR for the increase to the uop buffer size, extra debugging and change to UNPACK_SEQUENCE?

/* iterator ended normally */
Py_DECREF(iter);
STACK_SHRINK(1);
/* HACK: Emulate DEOPT_IF to jump over END_FOR */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No hacks, please 🙂

The code should look like this:

if (next == NULL) {
    if (_PyErr_Occurred(tstate)) {
        if (!_PyErr_ExceptionMatches(tstate, PyExc_StopIteration)) {
            GOTO_ERROR(error);
        }
        _PyErr_Clear(tstate);
    }
    /* iterator ended normally */
    Py_DECREF(iter);
    STACK_SHRINK(1);
    DEOPT_IF(true);
}

The trace generator can adjust the target, so it points after the END_FOR.

@gvanrossum
Copy link
Member Author

gvanrossum commented Nov 17, 2023

Could you make another PR for the increase to the uop buffer size, extra debugging and change to UNPACK_SEQUENCE?

Sure, see gh-112214. I've merged that into main, since the tests pass, next I'll merge it into this PR and do the other thing you requested.

gvanrossum added a commit that referenced this pull request Nov 17, 2023
- Double max trace size to 256
- Add a dependency on executor_cases.c.h for ceval.o
- Mark `_SPECIALIZE_UNPACK_SEQUENCE` as `TIER_ONE_ONLY`
- Add debug output back showing the optimized trace
- Bunch of cleanups to Tools/cases_generator/
@gvanrossum
Copy link
Member Author

Okay, here's the new version. @markshannon Want to review one more time?

Python/bytecodes.c Outdated Show resolved Hide resolved
Python/optimizer.c Outdated Show resolved Hide resolved
@gvanrossum
Copy link
Member Author

I've applied Brandt's suggestions.

I've also split off some more cleanups and debug improvements into gh-112218 and gh-112220. Once those land I'll merge main once more and this should shrink a bit more (the cases_generator tweaks will be separated out).

@gvanrossum
Copy link
Member Author

gvanrossum commented Nov 18, 2023

Benchmark says 4% slower (with Tier 2):
https://github.com/faster-cpython/benchmarking-public/tree/main/results/bm-20231117-3.13.0a1+-14aea56-PYTHON_UOPS

  • For spectral_norm, one out of five traces is still too long (that must be one hell of a trace :-)
  • Also, spectral_norm no longer uses any unsupported opcodes (nice!)
  • Pystats diff show about 8% more traces executed, 8% more uops executed
  • Overall, FOR_ITER_GEN is now by far the most-used unsupported opcode (74k, vs. 9k for the next most-used, CALL)

@gvanrossum
Copy link
Member Author

FWIW, I have a branch on top of this that makes the uop interpreter 3% faster. The approach is to let the code generator generate code that extracts oparg and operand from the instruction stream as needed (i.e., only for opcodes that use them) rather than the interpreter loop preemptively extracting them for every instruction. I'm not completely happy with it, and it probably would make life a little harder for the JIT template, which doesn't like directly referencing next_uop -- this should probably wait until we're ready to generate the JIT template code separately from the Tier 2 interpreter code.

@brandtbucher
Copy link
Member

Honestly, as long as it’s done in macros, it shouldn’t be too bad. Macros are a great escape hatch in the short term before we start generating bespoke “JIT cases”.

@gvanrossum
Copy link
Member Author

Honestly, as long as it’s done in macros, it shouldn’t be too bad. Macros are a great escape hatch in the short term before we start generating bespoke “JIT cases”.

So it could generate something like

case _FOO_UOP: {
    oparg = CURRENT_OPARG();
    operand = CURRENT_OPERAND();
    ...
}

where the default definitions for those (in ceval_macros.h) would be

#define CURRENT_OPARG() next_uop[-1].oparg
#define CURRENT_OPERAND() next_uop[-1].operand

@brandtbucher
Copy link
Member

Yup!

@gvanrossum gvanrossum merged commit 1995955 into python:main Nov 20, 2023
25 checks passed
@gvanrossum gvanrossum deleted the for-iter-uop branch November 20, 2023 18:09
aisk pushed a commit to aisk/cpython that referenced this pull request Feb 11, 2024
- Double max trace size to 256
- Add a dependency on executor_cases.c.h for ceval.o
- Mark `_SPECIALIZE_UNPACK_SEQUENCE` as `TIER_ONE_ONLY`
- Add debug output back showing the optimized trace
- Bunch of cleanups to Tools/cases_generator/
aisk pushed a commit to aisk/cpython that referenced this pull request Feb 11, 2024
This uses the new mechanism whereby certain uops
are replaced by others during translation,
using the `_PyUop_Replacements` table.
We further special-case the `_FOR_ITER_TIER_TWO` uop
to update the deoptimization target to point
just past the corresponding `END_FOR` opcode.

Two tiny code cleanups are also part of this PR.
Glyphack pushed a commit to Glyphack/cpython that referenced this pull request Sep 2, 2024
- Double max trace size to 256
- Add a dependency on executor_cases.c.h for ceval.o
- Mark `_SPECIALIZE_UNPACK_SEQUENCE` as `TIER_ONE_ONLY`
- Add debug output back showing the optimized trace
- Bunch of cleanups to Tools/cases_generator/
Glyphack pushed a commit to Glyphack/cpython that referenced this pull request Sep 2, 2024
This uses the new mechanism whereby certain uops
are replaced by others during translation,
using the `_PyUop_Replacements` table.
We further special-case the `_FOR_ITER_TIER_TWO` uop
to update the deoptimization target to point
just past the corresponding `END_FOR` opcode.

Two tiny code cleanups are also part of this PR.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants