
Call design for Tier 2 (uops) interpreter #106581

Closed
gvanrossum opened this issue Jul 10, 2023 · 27 comments

Labels: interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage)

gvanrossum commented Jul 10, 2023

(Maybe this is tentative enough that it still belongs in the faster-cpython/ideas tracker, but I hope we're close enough that we can hash it out here. CC @markshannon, @brandtbucher)

(This is a WIP until I have looked a bit deeper into this.)

First order of business is splitting some of the CALL specializations into multiple ops satisfying the uop requirement: either use oparg and no cache entries, or don't use oparg and use at most one cache entry. For example, one of the more important ones, CALL_PY_EXACT_ARGS, uses both oparg (the number of arguments) and a cache entry (func_version). Splitting it into a guard and an action op is problematic: even discounting the possibility of encountering a bound method (i.e., assuming method is NULL), it contains the following DEOPT calls:

            // PyObject *callable = stack_pointer[-1-oparg];
            DEOPT_IF(tstate->interp->eval_frame, CALL);
            int argcount = oparg;
            DEOPT_IF(!PyFunction_Check(callable), CALL);
            PyFunctionObject *func = (PyFunctionObject *)callable;
            DEOPT_IF(func->func_version != func_version, CALL);
            PyCodeObject *code = (PyCodeObject *)func->func_code;
            DEOPT_IF(code->co_argcount != argcount, CALL);
            DEOPT_IF(!_PyThreadState_HasStackSpace(tstate, code->co_framesize), CALL);

If we wanted to combine all this in a single guard op, that guard would require access to both oparg (to dig out callable) and func_version. The fundamental problem is that the callable, which needs to be prodded and poked for the guard to pass, is buried under the arguments, and we need to use oparg to know how deep it is buried.
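
To make the constraint concrete, a combined guard would have to look something like the hypothetical uop below (written in the bytecodes.c DSL). It is illustrative only, and it is exactly the shape the uop rules forbid: it needs oparg (to say how deep the callable is buried) and the func_version cache entry at the same time.

```
// Hypothetical, for illustration only: not a legal uop, because it uses
// both oparg (to locate the callable below the args) and a cache entry.
op(_CHECK_CALL_PY_EXACT_ARGS,
   (func_version/2, method, callable, unused[oparg] --
    method, callable, unused[oparg])) {
    DEOPT_IF(tstate->interp->eval_frame, CALL);
    DEOPT_IF(!PyFunction_Check(callable), CALL);
    PyFunctionObject *func = (PyFunctionObject *)callable;
    DEOPT_IF(func->func_version != func_version, CALL);
    PyCodeObject *code = (PyCodeObject *)func->func_code;
    DEOPT_IF(code->co_argcount != oparg, CALL);
    DEOPT_IF(!_PyThreadState_HasStackSpace(tstate, code->co_framesize), CALL);
}
```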

What if we somehow reversed this so that the callable is on top of the stack, after the arguments? We could arrange for this by adding a COPY n+1 opcode just before the CALL opcode (or its specializations). In fact, this could even be a blessing in disguise, since now we would no longer need to push a NULL before the callable to reserve space for self -- instead, if the callable is found to be a bound method, its self can overwrite the original callable (below the arguments) and the function extracted from the bound method can overwrite the copy of the callable above the arguments. This has the advantage of no longer needing to have a "push NULL" bit in several other opcodes (the LOAD_GLOBAL and LOAD_ATTR families -- we'll have to review the logic in LOAD_ATTR a bit more to make sure this can work).

(Note that the key reason why the callable is buried below the arguments is a requirement about evaluation order in expressions -- the language reference requires that in the expression F(X) where F and X themselves are possibly complex expressions, F is evaluated before X.)

Comparing before and after, currently we have the following arrangement on the stack when CALL n or any of its specializations is reached:

    NULL
    callable
    arg[0]
    arg[1]
    ...
    arg[n-1]

This is obtained by e.g.

    PUSH_NULL
    LOAD_FAST callable
    <load n args>
    CALL n

or

    LOAD_GLOBAL (NULL + callable)
    <load n args>
    CALL n

or

    LOAD_ATTR (NULL|self + callable)
    <load n args>
    CALL n

Under my proposal the arrangement would change to

    callable
    arg[0]
    arg[1]
    ...
    arg[n-1]
    callable

and it would be obtained by

    LOAD_FAST callable  /  LOAD_GLOBAL callable  /  LOAD_ATTR callable
    <load n args>
    COPY n+1
    CALL n

It would (perhaps) even be permissible for the guard to overwrite both copies of the callable if a method is detected, since it would change from

    self.func
    <n args>
    self.func

to

    self
    <n args>
    func

where we would be assured that func has type PyFunctionObject *. (However, I think we ought to have separate specializations for the two cases, since the transformation would also require bumping oparg.)

The runtime cost would be an extra COPY instruction before each CALL; however I think this might actually be simpler than the dynamic check for bound methods, at least when using copy-and-patch.

Another cost would be requiring extra specializations for some cases that currently dynamically decide between function and method; but again I think that with copy-and-patch that is probably worth it, given that we expect that dynamic check to always go the same way for a specific location.
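
Under the proposed layout the version check no longer needs oparg at all, since the duplicated callable sits on top of the stack. A minimal sketch of just that part of the guard (the uop name is invented here for illustration):

```
// Sketch only: after COPY n+1 the duplicated callable is on top of the
// stack, so this guard needs only the func_version cache entry.  The
// argcount and stack-space checks would live in further guards.
op(_GUARD_CALLABLE_VERSION, (func_version/2, callable -- callable)) {
    DEOPT_IF(!PyFunction_Check(callable), CALL);
    PyFunctionObject *func = (PyFunctionObject *)callable;
    DEOPT_IF(func->func_version != func_version, CALL);
}
```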

gvanrossum added the performance and interpreter-core labels Jul 10, 2023
@markshannon

Splitting it into a guard and an action op is problematic

You can have as many guards as you need. As long as they do not perturb the state of the VM, there can be more than one. We have been assuming that there would be one action, but in fact there can be no action (TO_BOOL_BOOL is just a guard with no action, and NOP has no guard or action).

In summary, a straight line instruction should be made up of zero or more guards followed optionally by an action.

@markshannon

I specifically said "straight line instruction" above, because some conditional branching instructions, like FOR_ITER_LIST are tricky in that they have internal branching. The exhausted branch pops values off the stack, but the non-exhausted branch does not.
Fortunately, calls are unconditional.

@gvanrossum

In summary, a straight line instruction should be made up of zero or more guards followed optionally by an action.

Sure, but any individual guard cannot use a cache entry and oparg at the same time. So how do you write the guard that checks the version of the callable Python function? It needs oparg to find the callable on the stack, and the func_version cache entry to verify the version. My solution for that is to have the callable duplicated on top of the stack before this uop sequence starts (see above for details). Is that acceptable?

gvanrossum commented Jul 10, 2023

some conditional branching instructions, like FOR_ITER_LIST are tricky in that they have internal branching. The exhausted branch pops values off the stack, but the non-exhausted branch does not.

Please look at my solution in gh-106542. [Now closed in favor of gh-106696 and gh-106638]

gvanrossum commented Jul 10, 2023

Off-line @markshannon proposed a simpler solution that doesn't require changing all CALL specializations to handle a different stack layout. We will instead add the oparg to the Tier 2 instruction format, so instructions become (opcode, oparg, operand), where operand is some cache entry (if needed). Given that we currently have a 32-bit opcode and a 64-bit operand, adding a 32-bit oparg doesn't actually increase the size of the instruction. (See issue gh-106603.)

(There's also gh-105848 for reference.)
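
As a rough illustration (field names and exact types here are made up, not the actual CPython definition), the widened Tier 2 instruction still fits in 16 bytes because the old layout already contained 4 bytes of alignment padding:

```
#include <stdint.h>

// Illustrative only: one possible layout for a Tier 2 (uop) instruction
// once oparg is added alongside the opcode and the 64-bit operand.
typedef struct {
    uint32_t opcode;   // which micro-op to execute
    uint32_t oparg;    // copied from the Tier 1 instruction's oparg
    uint64_t operand;  // inline cache value (e.g. func_version), if any
} uop_instruction;     // still 16 bytes: the oparg fills former padding
```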

gvanrossum commented Jul 13, 2023

Irrespective of the stack format, there's a bigger problem with projecting a trace through a call. Suppose we are projecting a trace and we encounter CALL_PY_EXACT_ARGS. We'd like to be able to continue the trace into that call, and back, so that if we're calling something like

def add(x, y):
    return x + y

from e.g.

total = 0
for i in range(1000):
    total = add(total, i)

we can add its translation to the trace, rather than ending the trace, and end up with a closed loop:

FOR_ITER_RANGE
STORE_FAST i
# LOAD_GLOBAL NULL + add
LOAD_FAST total
LOAD_FAST i
# CALL_PY_EXACT_ARGS 2
# RESUME
# LOAD_FAST x
# LOAD_FAST y
BINARY_OP_ADD_INT
# RETURN_VALUE
STORE_FAST total
JUMP_TO_TOP

Where the commented-out lines are redundant because we really inline the call.

Apart from the problem of eliding those call-related instructions, isn't it a problem that when we start projecting we don't (necessarily) have access to the function object being called? The cache associated with CALL just has a 32-bit "func_version" field, which is not enough to recover the actual function object or the code object (which we are really after). Working our way back from the CALL specialization to the place where the function being called is loaded (here LOAD_GLOBAL, but could be anything, really) does not seem feasible.

I guess this is where we stop projecting and come back to improve the projected executor once we are about to execute the specialized call -- at that point the function object should be exactly n deep on the stack and thus we can verify its type and version and then get its code object.

@markshannon Does this seem feasible?

@gvanrossum

Perhaps when we end a trace with a call to Python code we can insert a special instruction that causes a new executor to be created using the callable (which is now easily found on the stack) as additional input.

@markshannon

Yes, that should work.

The (slight) downside is that it forces us to execute the superblock in the tier 2 interpreter, which is likely slower than the tier 1 interpreter, as we can't optimize or JIT compile the superblock until it is complete.

We could insert a special instruction at the start of the function, to potentially resume superblock creation.
That would allow us to keep execution in the (faster) tier 1 interpreter, at the cost of extra work on entering the function.

FTR, the approach I originally had in mind was some sort of LRU cache mapping version numbers to functions.
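
For illustration, such a cache could be as simple as a small direct-mapped table keyed by the function version; everything below is a made-up sketch rather than CPython's actual mechanism, and a real implementation would also need to invalidate entries when functions are deallocated or mutated:

```
#include <stddef.h>
#include <stdint.h>

#define FUNC_CACHE_SIZE 256   /* power of two, so indexing is a cheap mask */

typedef struct {
    uint32_t version;         /* func_version as stored in the CALL cache */
    void *func;               /* stand-in for a PyFunctionObject pointer */
} func_cache_entry;

static func_cache_entry func_cache[FUNC_CACHE_SIZE];

/* Record a function when it is specialized, so the trace projector can
   later map a func_version back to the function (and its code object). */
static void
func_cache_store(uint32_t version, void *func)
{
    func_cache[version & (FUNC_CACHE_SIZE - 1)] =
        (func_cache_entry){.version = version, .func = func};
}

/* Returns NULL when the version is unknown or the slot was reused. */
static void *
func_cache_lookup(uint32_t version)
{
    func_cache_entry *e = &func_cache[version & (FUNC_CACHE_SIZE - 1)];
    return (e->func != NULL && e->version == version) ? e->func : NULL;
}
```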

@gvanrossum

Presumably we would execute the first version of the superblock in the Tier 2 interpreter only once. We could also throw the first version away when the projection reaches a specialized call, and instead insert a special executor type that records the information needed to reconstruct that superblock quickly, this time incorporating the callable. Then, when we reach that executor, it finishes the superblock, inserts it where it should go, and it will be reached on the next iteration around the enclosing loop.

@gvanrossum

I'm still pondering the logistics -- the optimizer receives neither the frame nor the jump origin, so it wouldn't know the location where to plug in the new executor. Updating the existing executor seems iffy, another thread could already have started executing it, and I hope to make uop executors variable-length (see gh-106608). I guess I could search for the current executor in co_executors to recover its index, but that feels expensive (O(N^2), except N is bounded).

Logistics for the cache idea don't look entirely straightforward either. We could specialize a lot of calls before we get to incorporate one into a superblock, so the cache would have to be pretty large. But maybe I should play with this first.

(The logic around _Py_next_func_version seems subtle and undocumented. Ultimately that seems to be zero except in _bootstrap_python.)

(Also, there are two calls to _PyFunction_GetVersionForCurrentState() in specialize.c that look like they could be replaced by the function_get_version() helper.)

@gvanrossum

I'm looking into an initial step for this -- just adding CALL_PY_EXACT_ARGS to the superblock, assuming that the Tier 2 interpreter will create, initialize and then return the new frame. For this we need to split CALL_PY_EXACT_ARGS into two uops, a guard and an action. But now we run into another yak to shave -- the generator doesn't yet handle variable-sized stack effects (in this case, args[oparg]) in uops: gh-106812.

gvanrossum added a commit that referenced this issue Jul 17, 2023
…106707)

By turning `assert(kwnames == NULL)` into a macro that is not in the "forbidden" list, many instructions that formerly were skipped because they contained such an assert (but no other mention of `kwnames`) are now supported in Tier 2. This covers 10 instructions in total (all specializations of `CALL` that invoke some C code):
- `CALL_NO_KW_TYPE_1`
- `CALL_NO_KW_STR_1`
- `CALL_NO_KW_TUPLE_1`
- `CALL_NO_KW_BUILTIN_O`
- `CALL_NO_KW_BUILTIN_FAST`
- `CALL_NO_KW_LEN`
- `CALL_NO_KW_ISINSTANCE`
- `CALL_NO_KW_METHOD_DESCRIPTOR_O`
- `CALL_NO_KW_METHOD_DESCRIPTOR_NOARGS`
- `CALL_NO_KW_METHOD_DESCRIPTOR_FAST`
@gvanrossum

(This is mostly a brain dump after getting thoroughly confused and slowly getting my bearings back.)

Now that I've shaved the yak, I'm free to think about how to do calls in Tier 2 again. Previously I wrote:

I'm looking into an initial step for this -- just adding CALL_PY_EXACT_ARGS to the superblock, assuming that the Tier 2 interpreter will create, initialize and then return the new frame. For this we need to split CALL_PY_EXACT_ARGS into two uops, a guard and an action.

It's not that simple though. I can now split CALL_PY_EXACT_ARGS into a guard uop and an action uop, but the action does a bunch of wild and crazy things. (It also duplicates too much code from the guard, but that can wait.)

  • Create a new frame and copy the arguments into it
  • Pop arguments and callable (etc.) off the stack
  • Move next_instr to the next instruction
  • Set the old frame's return_offset to 0
  • Call DISPATCH_INLINED(new_frame), which does even wilder and crazier things:
    • Copy the newly reduced stack_pointer into the old frame
    • Copy the updated next_instr into the old frame
    • Link the old frame into the new frame
    • Set the new frame as the current frame
    • Jump to the start_frame label, where a bit more craziness happens:
      • Check for recursion overflow, error out (to exit_unwind) if the check fails
      • SET_LOCALS_FROM_FRAME(), which sets next_instr and stack_pointer from the new frame

And then we're back at the top of the instruction dispatch loop, starting with the first instruction of the called function's code object (which is where the new frame points after initialization).

The idea is that at this point we can happily continue adding to the superblock from the called function's code object until we run out of space or find an instruction we can't handle. Apart from the issue of finding the code object (a tentative solution suggested above is to use some kind of cache with the function version as key), the _CALL_PY_EXACT_ARGS action uop needs to do some things quite differently in Tier 2 than they are done in Tier 1. It also just becomes a lot of code.

Maybe the trick is to split the CALL_PY_EXACT_ARGS uop into three separate uops (guard, action, and special handling), where the wild and crazy stuff after the new frame has been created is done by the third uop (let's call it INLINE_CALL). But then we'll have to push a frame onto the stack and pop it off, while it isn't an object. (The code generator should support this using a type declaration, but nobody else must look at the stack while it is in this state. We might want to set the low bit of the frame pointer while it's on the stack to warn things off.)

We may have to special-case INLINE_CALL in the superblock creation code (and in the optimizer!). It would be handy if it had the return address/offset as its oparg, for example (similar to SAVE_IP). Or we could insert a SAVE_IP instruction. But the optimizer will still have to be aware of the INLINE_CALL uop, since it completely changes context: what's on the stack, what variables exist.

Interesting times ahead.

@gvanrossum

So I think (but haven't confirmed yet) that we can do something like this. Special-case CALL_PY_EXACT_ARGS in the code generator (!) to translate it into

_CHECK_CALL_PY_EXACT_ARGS (oparg)
_CALL_PY_EXACT_ARGS (oparg)  # Does not include frame->return_offset = 0 or DISPATCH_INLINED()
SAVE_IP (return address)
PUSH_FRAME (0)

We make _CALL_PY_EXACT_ARGS something like

        op(_CALL_PY_EXACT_ARGS, (method, callable, args[oparg] -- new_frame: _PyInterpreterFramePtr)) {
            <original code, less the last two lines, which are moved into PUSH_FRAME>
        }

and PUSH_FRAME is something like

        op(PUSH_FRAME, (new_frame: _PyInterpreterFramePtr --)) {
            frame->return_offset = oparg;  // Hand-wave: this is not the same oparg as for _CALL_PY_EXACT_ARGS
            DISPATCH_INLINED(new_frame);
        }

The SAVE_IP uop is inserted by the code generator by special-casing, ditto for the value of oparg for PUSH_FRAME (in most cases, return_offset is 0, but for SEND_GEN and FOR_ITER_GEN it's the original oparg).

We'll have to introduce something like POP_FRAME to use in RETURN_VALUE. TBD.
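
A rough sketch of what that counterpart might look like in the same DSL, borrowing helper names from the existing frame code; treat this as a sketch under those assumptions rather than the final uop:

```
// Sketch: the inverse of PUSH_FRAME, for use in RETURN_VALUE.
// Unlinks the current (callee) frame, makes the caller current again,
// and pushes the return value onto the caller's stack.
op(POP_FRAME, (retval --)) {
    _PyInterpreterFrame *dying = frame;
    frame = dying->previous;
    _PyEvalFrameClearAndPop(tstate, dying);
    _PyFrame_StackPush(frame, retval);
}
```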

gvanrossum commented Aug 8, 2023

I got this working. I turned it into a PR, gh-107760, to take a break before starting on the cache for code objects indexed by func_version and projecting into the call.

gvanrossum commented Aug 8, 2023

Projecting through the call is also working (gh-107793). And continuing after RETURN also seems within reach (gh-107925).

@gvanrossum

@markshannon

In summary, a straight line instruction should be made up of zero or more guards followed optionally by an action.

This made me think. If we look at CALL_BOUND_METHOD_EXACT_ARGS, this currently fiddles with some stack entries and then jumps to the code for CALL_PY_EXACT_ARGS. As we replace the latter with a macro (see gh-107760):

        macro(CALL_PY_EXACT_ARGS) =
            unused/1 + // Skip over the counter
            _CHECK_PEP_523 +
            _CHECK_FUNCTION_EXACT_ARGS +
            _CHECK_STACK_SPACE +
            _INIT_CALL_PY_EXACT_ARGS +  // Makes the frame
            SAVE_IP +  // Tier 2 only; special-cased oparg
            _PUSH_FRAME;

it seems attractive to replace the former with a similar macro, basically equivalent to CALL_BOUND_METHOD_EXACT_ARGS + CALL_PY_EXACT_ARGS. But we would have at least two actions (one that fiddles with the stack and one that makes the frame), and we'd probably want a guard in between (_CHECK_STACK_SPACE needs to come after the "fiddle" uop, we don't want it to have to recognize a bound method object).

Note that such a macro would bulk up the tier 1 interpreter, but it would not change tier 2.
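
Concretely, such a macro might read something like the sketch below; the two bound-method uop names are invented here for the "fiddle" guard and action, and everything after them reuses the CALL_PY_EXACT_ARGS expansion above:

```
// Sketch only; the bound-method uop names are invented for illustration.
macro(CALL_BOUND_METHOD_EXACT_ARGS) =
    unused/1 +                            // Skip over the counter
    _CHECK_PEP_523 +
    _CHECK_CALLABLE_IS_BOUND_METHOD +     // guard: it really is a bound method
    _INIT_CALL_BOUND_METHOD_EXACT_ARGS +  // action: func/self replace NULL/method
    _CHECK_FUNCTION_EXACT_ARGS +
    _CHECK_STACK_SPACE +
    _INIT_CALL_PY_EXACT_ARGS +            // Makes the frame
    SAVE_IP +                             // Tier 2 only; special-cased oparg
    _PUSH_FRAME;
```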

@markshannon

The reason for the "zero or more guards followed by zero or one actions" rule is to be sure that the VM is always in a consistent state.
Guards don't change the state, so you can have as many as you want.
Actions do change the state, so they must match the behavior of the full instruction.

Bound methods are special though.
Changing the stack from bound-method NULL arg0 ... to func self arg0 ... is a state change, but it leaves the VM in a valid and consistent state. So, uniquely, the guard* action guard* action sequence is correct in this case.

As far as GO_TO_INSTRUCTION is concerned, how difficult would it be to treat it as a jump in tier 1, but as a macro expansion in tier 2?
In other words have CALL_BOUND_METHOD_EXACT_ARGS produce the same code as it does now for tier 1, but be treated as the full expansion you suggest for tier 2?

Apart from the instrumented instructions, CALL_BOUND_METHOD_EXACT_ARGS is the only use of GO_TO_INSTRUCTION, so it probably doesn't much matter which approach you use.

@gvanrossum

We don't have a good mechanism for doing something different in Tier 1 (bytecode) than in Tier 2 (microcode). Given that it's only one instruction I'll just use a macro like I suggested.

Maybe the true constraint is a bit different -- what we care about is being able to jump back from the microcode stream to the bytecode stream when a micro instruction does one of the following:

  • exit (that's only EXIT_TRACE)
  • deoptimize (any guard)
  • go to error (many actions)

The most important case is deoptimize -- it must be able to RE_EXECUTE the bytecode instruction from the start. This requires that the stack is as expected. In the example of bound method, if we deoptimize after the stack fiddling, the specialized bytecode instruction (CALL_BOUND_METHOD_EXACT_ARGS) will also deopt, to CALL, and all should be well, assuming that CALL is okay with the 'self' slot being already filled in.

@brandonardenwalli

Hello, I am trying to learn quickly and follow along in this conversation, but I am having a little trouble finding enough information to do that. I looked at https://docs.python.org/3/search.html?q=tier+1 and https://docs.python.org/3/search.html?q=tier+2 and https://docs.python.org/3/search.html?q=tier and https://docs.python.org/3/search.html?q=uops but the results do not seem to match the conversation here. I also looked at https://docs.python.org/3/search.html?q=interpreter, which seems a little more useful, but I am still not 100% sure, since there are lots of results and none seems to be the obvious corresponding explanation for this conversation. Where should I go to learn more about this and follow this conversation?

I have a few more questions, but I think I might know some of the answers if I can just find the information and read it first.

gvanrossum commented Aug 15, 2023

@brandonardenwalli

Where should I go to learn more about this and follow in this conversation?

The Tier-1 and Tier-2 terms are our invention. The place to start is not docs.python.org but https://github.com/faster-cpython/ideas.

FWIW Tier-1 refers to standard CPython bytecode, including specialized instructions (PEP 659), whereas Tier-2 refers to micro-instructions or micro-ops (often abbreviated as uops), which is a new thing we're introducing for Python 3.13 (and we don't plan to expose it to users, not even users of the C API).

@gvanrossum

I found an interesting issue through #107927. This traces through calls and returns. Now consider the following:

def foo():
    for i in range(10):
        pass
for i in range(10):
    foo()

When converting the outer (bottom) loop to a superblock, we trace into foo(). The bytecode for foo() contains a JUMP_BACKWARD instruction to the top of the inner loop. The superblock creation sees that this doesn't jump to the start of the superblock (which would be the top of the outer loop), so it ends the superblock. This is fine, except now we have a superblock that traces into a function and then stops. The inner loop is then also optimized into a superblock, which is fine, so it's not the end of the world. But the question that's nagging me is, should we perhaps just not create the outer superblock?

gvanrossum added a commit that referenced this issue Aug 16, 2023
* Split `CALL_PY_EXACT_ARGS` into uops

This is only the first step for doing `CALL` in Tier 2.
The next step involves tracing into the called code object and back.
After that we'll have to do the remaining `CALL` specialization.
Finally we'll have to deal with `KW_NAMES`.

Note: this moves setting `frame->return_offset` directly in front of
`DISPATCH_INLINED()`, to make it easier to move it into `_PUSH_FRAME`.
gvanrossum added a commit that referenced this issue Aug 17, 2023
This finishes the work begun in gh-107760. When, while projecting a superblock, we encounter a call to a short, simple function, the superblock will now enter the function using `_PUSH_FRAME`, continue through it, and leave it using `_POP_FRAME`, and then continue through the original code. Multiple frame pushes and pops are even possible. It is also possible to stop appending to the superblock in the middle of a called function, when running out of space or encountering an unsupported bytecode.
gvanrossum added a commit that referenced this issue Aug 24, 2023
…08380)

I was comparing the last preceding poke with the *last* peek,
rather than the *first* peek.

Unfortunately this bug obscured another bug:
When the last preceding poke is UNUSED, the first peek disappears,
leaving the variable unassigned. This is how I fixed it:

- Rename CopyEffect to CopyItem.
- Change CopyItem to contain StackItems instead of StackEffects.
- Update those StackItems when adjusting the manager higher or lower.
- Assert that those StackItems' offsets are equivalent.
- Other clever things.

---------

Co-authored-by: Irit Katriel <[email protected]>
gvanrossum added a commit that referenced this issue Aug 25, 2023
Instead of using `GO_TO_INSTRUCTION(CALL_PY_EXACT_ARGS)` we just add the macro elements of the latter to the macro for the former. This requires lengthening the uops array in struct opcode_macro_expansion. (It also required changes to stacking.py that were merged already.)
@brandtbucher

But the question that's nagging me is, should we perhaps just not create the outer superblock?

FWIW, this is what I did in my initial tracing implementation (only trace closed inner loops). Note that this probably would have avoided the issue with test_opcache and other cases where we trace into polymorphic calls containing loops.

There are upsides and downsides to both approaches, though. It seems like we're moving in the direction of stitching many traces and side-exits together into trees, so I honestly don't think tracing the outer loop is too much of an issue. One can imagine a near future where we jump directly between the outer loop trace and an inner loop trace without kicking it back into tier one each time.

@gvanrossum

Yeah, when the trace execution bumps right into an ENTER_EXECUTOR we could do something to just go directly to that trace on the next iteration. Does your generated machine code have something equivalent to the refcounts that are keeping the executors alive while the Tier 2 interpreter is running?

gvanrossum added a commit that referenced this issue Sep 5, 2023
Also avoid the need for the awkward .clone() call in the argument
to mgr.adjust_inverse() and mgr.adjust().
gvanrossum added a commit that referenced this issue Sep 12, 2023
I must have overlooked this when refactoring the code generator.
The Tier 1 interpreter contained a few silly things like
```
            goto resume_frame;
            STACK_SHRINK(1);
```
(and other variations, some where the unconditional `goto` was hidden in a macro).
vstinner pushed a commit to vstinner/cpython that referenced this issue Sep 13, 2023
@Fidget-Spinner

I have a question for CALL_ALLOC_AND_ENTER_INIT. There are mainly two problems with converting it to uops:

  1. It is not compatible with _INIT_CALL_PY_EXACT_ARGS, as after we create the new frame, we need to fiddle with it to link it back to the shim frame.
  2. It is not compatible with _PUSH_FRAME, because that links the current frame with the old frame. In this case, that would be either linking __init__ with shim frame, or shim frame with originator frame.
    a. The first case would require two _PUSH_FRAMEs, which is not possible due to inconsistent stack effect with Tier 1. Even if we do that, there is the problem of inconsistent stack effect: the second frame (the __init__ frame) requires the current stack arguments, which have been destroyed by the first push of the shim frame.
    b. The second case is not possible because then the real frame to resume to is lost (the interpreter will think we should resume to the shim frame instead of __init__).

I am currently blocked.

@gvanrossum

Might want to ping @markshannon about this on Discord (I just saw him mention in another context that he doesn't always act on GitHub pings).

markshannon commented Jan 10, 2024

What's the question exactly?

inst(CALL_ALLOC_AND_ENTER_INIT, (unused/1, unused/2, callable, null, args[oparg] -- unused))
uses its oparg, but no cache entries, so can be converted to a micro-op.

It would be hard to optimize in that form though. As you said it needs to track two frames.

What is inconsistent with tier 1? CALL_ALLOC_AND_ENTER_INIT is a tier 1 instruction.

@gvanrossum

I think this issue has run its course. We've settled on the following stack layout:

callable
self_or_null
arg[0]
arg[1]
...
arg[N-1]
